Paper name: Designing an Open Framework for Query Optimization and Compilation

Address: https://dl.acm.org/doi/10.14778/3551793.3551801

Website:https://www.lingo-db.com/

Github project address: https://github.com/lingo-db/lingo-db

LingoDB is a database project based on the development of MLIR, its underlying storage using Apache Arrow implementation, the database of various types of Query into the Dialect (Dialect) of the different operators (Operator), through different layers of intermediate code conversion, complete the execution plan from the logical plan to the physical plan transformation.

image-20240726174502000

Technical Background

MLIR (Multi-Level Intermediate Representation) is a compiler infrastructure project initiated by Google, whose goal is to provide a unified framework for compilation optimisation of diverse hardware and high-level languages, and is currently one of the sub-projects of the LLVM project.

Its storage structure uses Apache Arrow, an open source project that uses a columnar memory format for efficient cross-platform and cross-language data processing.

Implementation details

LingoDB through the dialect (Dialect) layer by layer degradation (Lowering) make SQL, Pytorch code into executable code , the following will show the code conversion process:

image-20240726175352825

The first step is to convert SQL and PyTorch code into abstract logic algebra and tuple operations (relalg), database miscellaneous operations (db), and mathematical operations (atith) in the form of MLIR, where a query optimisation scheme can be developed to achieve quality optimisation of the MLIR code.

image-20240726174918903

With MLIR’s Rewrite mechanism, it is possible to achieve

  1. code simplification (Simplification) to merge redundant logic and eliminate invalid branches
  2. Unnesting, which converts nested queries (e.g., subqueries) into flat join operations.
  3. Pushdown, the complex conditions are split into smaller filter (filter) unit at the same time, adjust the order of operations, so that the filtering and projection (Project) as soon as possible, thus reducing the amount of data for subsequent processing
  4. Join Ordering, according to the MLIR analysis of the relationship between the table, set the dynamic planning or heuristic algorithms to generate the appropriate Join Ordering
  5. Physical Opt, choose the physical implementation of the operator (such as Hash Join, Sort-Merge Join)

image-20250707103736600

The second step implements the transformation of the logical algebra (lower-relalg) to the MLIR implementation of the abstract logical algebra to the concrete logical algebra (dsa)

The third step implements the transformation of miscellaneous database operations and mathematical operations (lower-db,lower-dsa), which are converted into a machine-friendly MLIR implementation (program control flow scf)

Finally, MLIR code will be convert to LLVM IR, with the help of LLVM’s JIT runtime framework to complete the JIT compilation, read the Apache Arrow file, and realise the data reading operation.

Experimental results

LingoDB has less code for query optimisation than DuckDB and NoisePage.

image-20250707100231436

And in terms of compilation time, the compilation time is higher than Hyper database due to the compilation mechanism of MLIR

image-20250707095900080

And on the TPC-H SF1 dataset, it runs faster than Hyper over DuckDB, especially on Q2, Q6, Q11

image-20250707100256820

Conclusion

The biggest advantage of this project is the integration with the MLIR ecosystem, which is conducive to the integration of AI and the database ecosystem, and the concept of ‘layer-by-layer descent’ in the optimisation, which makes the boundary of the query optimisation clearer and the implementation more controllable, and is also conducive to the implementation of UDF (User Define Function).

However, the use of MLIR undoubtedly raises the threshold of database development, and deep participation requires developers to span the database system and compiler technology (MLIR Dialect, LLVM) two major fields, and need to deal with the balance between development efficiency and maintainability.