SIGMOD文章阅读:Apache Calcite——A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources
发表于 | 更新于
Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD.
As organizations have invested in data processing systems tailored towards their specific needs, two overarching problems have arisen:
• The developers of such specialized systems have encountered related problems, such as query optimization [4, 25] or the need to support query languages such as SQL and related extensions (e.g., streaming queries [26]) as well as language-integrated queries inspired by LINQ [33]. Without a unifying framework, having multiple engineers independently develop similar optimization logic and language support wastes engineering effort.
• Programmers using these specialized systems often have to integrate several of them together. An organization might rely on Elasticsearch, Apache Spark, and Druid. We need to build systems capable of supporting optimized queries across heterogeneous data sources [55].
Furthermore, Calcite enables cross-platform optimization by exposing a common interface to multiple systems. To be efficient, the optimizer needs to reason globally, e.g., make decisions across different systems about materialized view selection.
应该是最早提出来这个说法的吧(Apache DataFusion和LingoDB都是后来的事情了
Building a common framework does not come without challenges. In particular, the framework needs to be extensible and flexible enough to accommodate the different types of systems requiring integration.
Flexible query optimizer. Each component of the optimizer is pluggable and extensible, ranging from rules to cost models. In addition**, Calcite includes support for multiple planning engines. Hence, the optimization can be broken down into phases handled by different optimization engines** depending on which one is best suited for the stage.
优化可以自选阶段运行,这个想法在后面的发展中基本得到了延续
Calcite is reliable, as its wide adoption over many years has led to exhaustive testing of the platform. Calcite also contains an extensive test suite validating all components of the system including query optimizer rules and integration with backend data sources.
不是,有充足测试样例也能写到论文里头?😅
RELATED WORK
Orca [45] is a modular query optimizer used in data management products such as Greenplum and HAWQ.
Figure 1 outlines the main components of Calcite’s architecture. Calcite’s optimizer uses a tree of relational operators as its internal representation.
可以看到,是一套围绕Java展开的Data Processing Sytem
First, Calcite contains a query parser and validator that can translate a SQL query to a tree of relational operators.
这一套SQL方案借鉴了JavaCC + FreeMarker
QUERY ALGEBRA
For instance, it has become common for OLAP, decision making, and streaming applications to use window definitions to express complex analytic functions such as moving average of a quantity over a time period or number or rows.
似乎对SQL Windows函数有单独优化?
For example, consider joining a Products table held in MySQL to an Orders table held in Splunk (see Figure 2). Initially, the scan of Orders takes place in the splunk convention and the scan of Products is in the jdbc-mysql convention.
??能跨数据库进行Join操作,Wow🤩(但需要怎么解决不同数据库读取时间的差异?)
ADAPTERS
Figure 3 depicts its components. Essentially, an adapter consists of a model, a schema, and a schema factory.
The model is a specification of the physical properties of the data source being accessed.
A schema is the definition of the data (format and layouts) found in the model.
……
The schema factory component acquires the metadata information from the model and generates a schema.
The data itself is physically accessed via tables.
感觉这个设计模式很Java😂
Calcite uses a physical trait known as the calling convention to identify relational operators which correspond to a specific database backend.
Calling Convention? 记录下
QUERY PROCESSING AND OPTIMIZATION
For an example of a rule with more complex effects, consider the following query:
1
SELECT products.name , COUNT(*)
2
FROM sales
3
JOIN products
4
USING (productId)
5
WHERE sales.discount IS NOT NULL
6
GROUP BY products.name
7
ORDER BYCOUNT(*) DESC;
The query corresponds to the relational algebra expression presented in Figure 4a. Because the WHERE clause only applies to the sales table, we can move the filter before the join as in Figure 4b.
That is,下推优化🤔
Metadata providers.
Metadata is an important part of Calcite’s optimizer, and it serves two main purposes:
(i) guiding the planner towards the goal of reducing the cost of the overall query plan, and (ii) providing information to the rules while they are being applied.
For example, Calcite contains an adapter for MongoDB [36], a document store which stores documents consisting of data roughly equivalent to JSON documents.
不奇怪吧,既然MongoDB能实现,那Calcite也实现技术上不是问题
Streaming
Calcite provides first-class support for streaming queries [26] based on a set of streaming-specific extensions to standard SQL, namely STREAM extensions, windowing extensions, implicit references to streams via window expressions in joins, and others.
流式处理则是Spark等一众大数据平台优化的方案
这一块真不熟😅
Geospatial Queries
Geospatial support is preliminary in Calcite, but is being implemented using Calcite’s relational algebra. The core of this implementation consists in adding a new GEOMETRY data type which encapsulates different geometric objects such as points, curves
An example query finds the country which contains the city of Amsterdam:
1
SELECT name FROM (
2
SELECT name , ST_GeomFromText('POLYGON((4.82 52.43, 4.97 52.43, 4.97 52.33, 4.82 52.33, 4.82 52.43))') AS "Amsterdam",
3
ST_GeomFromText(boundary) AS "Country"
4
FROM country )
5
WHERE ST_Contains("Country", "Amsterdam");
看起来像PG的那个GEO
Language-Integrated Query for Java
Though SQL remains the primary database language, many programmers favour language-integrated languages like LINQ
……
Calcite provides Language-Integrated Query for Java (or LINQ4J, in short) which closely follows the convention set forth by Microsoft’s LINQ for the .NET languages.
Though Calcite contains a performance testing module, it does not evaluate query execution. It would be useful to assess the performance of systems built with Calcite.
啊这😅这就是这篇文章没有BenchMark的原因?
Based on real-world experience, we believe that more ambitious goals are possible for integrated multiple systems: they should be superior to the sum of their parts.