SIGMOD文章阅读：Apache Calcite——A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD.

这篇文章在2025年的CMU-15799的Lecture #20中提及，同时Apache Calcite也是2025年的CMU-15799的Project #1所使用的软件

Apache Calcite项目官网：https://calcite.apache.org/

源码的Mirror: https://github.com/apache/calcite

CMU提供的论文下载地址：https://15799.courses.cs.cmu.edu/spring2025/papers/20-calcite/p221-begoli.pdf

INTRODUCTION

As organizations have invested in data processing systems tailored towards their specific needs, two overarching problems have arisen:

• The developers of such specialized systems have encountered related problems, such as query optimization [4, 25] or the need to support query languages such as SQL and related extensions (e.g., streaming queries [26]) as well as language-integrated queries inspired by LINQ [33]. Without a unifying framework, having multiple engineers independently develop similar optimization logic and language support wastes engineering effort.

• Programmers using these specialized systems often have to integrate several of them together. An organization might rely on Elasticsearch, Apache Spark, and Druid. We need to build systems capable of supporting optimized queries across heterogeneous data sources [55].

一言一概之，就是在Apache Calcite出现之前，Query Optimization是苦活，而且大家还经常造轮子，但这样还没法做到通用🤣

Furthermore, Calcite enables cross-platform optimization by exposing a common interface to multiple systems. To be efficient, the optimizer needs to reason globally, e.g., make decisions across different systems about materialized view selection.

应该是最早提出来这个说法的吧（Apache DataFusion和LingoDB都是后来的事情了

Building a common framework does not come without challenges. In particular, the framework needs to be extensible and flexible enough to accommodate the different types of systems requiring integration.

要想让这样一个Query Optimization的框架具备灵活，可扩展的同时，还想要具备高性能，这可能有些难度😄

Flexible query optimizer. Each component of the optimizer is pluggable and extensible, ranging from rules to cost models. In addition**, Calcite includes support for multiple planning engines. Hence, the optimization can be broken down into phases handled by different optimization engines** depending on which one is best suited for the stage.

优化可以自选阶段运行，这个想法在后面的发展中基本得到了延续

Calcite is reliable, as its wide adoption over many years has led to exhaustive testing of the platform. Calcite also contains an extensive test suite validating all components of the system including query optimizer rules and integration with backend data sources.

不是，有充足测试样例也能写到论文里头？😅

Orca [45] is a modular query optimizer used in data management products such as Greenplum and HAWQ.

听Andy在课上说Orca也不错，但在后面由于收购问题停止开源了

此外还提到了Spark的Catalyst Optimzer，Garlic，FORWARD，BigDAWG，Myria等方案

ARCHITECTURE

Figure 1 outlines the main components of Calcite’s architecture. Calcite’s optimizer uses a tree of relational operators as its internal representation.

可以看到，是一套围绕Java展开的Data Processing Sytem

First, Calcite contains a query parser and validator that can translate a SQL query to a tree of relational operators.

这一套SQL方案借鉴了JavaCC + FreeMarker

QUERY ALGEBRA

For instance, it has become common for OLAP, decision making, and streaming applications to use window definitions to express complex analytic functions such as moving average of a quantity over a time period or number or rows.

似乎对SQL Windows函数有单独优化？

For example, consider joining a Products table held in MySQL to an Orders table held in Splunk (see Figure 2). Initially, the scan of Orders takes place in the splunk convention and the scan of Products is in the jdbc-mysql convention.

？？能跨数据库进行Join操作，Wow🤩(但需要怎么解决不同数据库读取时间的差异？)

ADAPTERS

Figure 3 depicts its components. Essentially, an adapter consists of a model, a schema, and a schema factory.

The model is a specification of the physical properties of the data source being accessed.

A schema is the definition of the data (format and layouts) found in the model.

……

The schema factory component acquires the metadata information from the model and generates a schema.

The data itself is physically accessed via tables.

感觉这个设计模式很Java😂

Calcite uses a physical trait known as the calling convention to identify relational operators which correspond to a specific database backend.

Calling Convention? 记录下

QUERY PROCESSING AND OPTIMIZATION

For an example of a rule with more complex effects, consider the following query:
1
SELECT products.name , COUNT(*)
2
FROM sales
3
JOIN products
4
USING (productId)
5
WHERE sales.discount IS NOT NULL
6
GROUP BY products.name
7
ORDER BY COUNT(*) DESC;
The query corresponds to the relational algebra expression presented in Figure 4a. Because the WHERE clause only applies to the sales table, we can move the filter before the join as in Figure 4b.

That is，下推优化🤔

Metadata providers.

Metadata is an important part of Calcite’s optimizer, and it serves two main purposes:

(i) guiding the planner towards the goal of reducing the cost of the overall query plan, and (ii) providing information to the rules while they are being applied.

Metadata是Query Optimize能运行的关键，无论是FIlter下推（Push Down），还是之前文章里提到的Cardinality Estimation都依赖于Metadata

单独设计一个Metadata Provider我认为是正确的

As the metadata providers are pluggable, they are compiled and instantiated at runtime using Janino [27], a Java lightweight compiler.

还能可插拔？😯

Janino项目地址：https://janino-compiler.github.io/janino/

Planner engines

提供Cost-based planner engine和exhaustive planner

这两者可以任意切换

Materialized views

既然能跨数据库读取，那Materizlize View想必也是可行的

EXTENDING CALCITE

Calcite还可以支持多种多样的数据

怎么感觉这一章是在水😂

Semi-structured Data

For example, Calcite contains an adapter for MongoDB [36], a document store which stores documents consisting of data roughly equivalent to JSON documents.

不奇怪吧，既然MongoDB能实现，那Calcite也实现技术上不是问题

Streaming

Calcite provides first-class support for streaming queries [26] based on a set of streaming-specific extensions to standard SQL, namely STREAM extensions, windowing extensions, implicit references to streams via window expressions in joins, and others.

流式处理则是Spark等一众大数据平台优化的方案

这一块真不熟😅

Geospatial Queries

Geospatial support is preliminary in Calcite, but is being implemented using Calcite’s relational algebra. The core of this implementation consists in adding a new GEOMETRY data type which encapsulates different geometric objects such as points, curves

An example query finds the country which contains the city of Amsterdam:
1
SELECT name FROM (
2
    SELECT name ,  ST_GeomFromText('POLYGON((4.82 52.43, 4.97 52.43, 4.97 52.33, 4.82 52.33, 4.82 52.43))') AS "Amsterdam",
3
    ST_GeomFromText(boundary) AS "Country"
4
    FROM country )
5
    WHERE ST_Contains("Country", "Amsterdam");

看起来像PG的那个GEO

Language-Integrated Query for Java

Though SQL remains the primary database language, many programmers favour language-integrated languages like LINQ

……

Calcite provides Language-Integrated Query for Java (or LINQ4J, in short) which closely follows the convention set forth by Microsoft’s LINQ for the .NET languages.

Calcite用了Linq4j读取Java或数据库数据

Linq建议大家去了解下，我在上面看到了类似Prisma的感觉，确实挺强的（应该是先有Linq，后有的Prisma）

INDUSTRY AND ACADEMIA ADOPTION

论文发表时支持的平台和Adapter的情况

FUTURE WORK

Though Calcite contains a performance testing module, it does not evaluate query execution. It would be useful to assess the performance of systems built with Calcite.

啊这😅这就是这篇文章没有BenchMark的原因？

Based on real-world experience, we believe that more ambitious goals are possible for integrated multiple systems: they should be superior to the sum of their parts.

All right, be calm😆

结论

一篇罕见的，居然没有benchmark的论文

文章篇幅不长，没有让人头昏脑胀的数据，看起来很舒服

一方面，这确实是一个很有意思的工作：一个大一统的SQL适配方案

但另一方面，Query Optimize确实没什么人去做🤔，这一块需要搞技术投入，但产出却十分有限

如果硬要说的话，就是Apache Calcite是否为了通用性，进行了很多性能上的妥协，会不会出现木桶效应？而且使用的还是火山模型，而非比较时兴的PushModel

如果是想系统学习的话，可以看看知乎上的这篇文章：Apache Calcite系列（一）：整体流程解析