Paper Reading: Compile-Time Analysis of Compiler Frameworks for Query Compilation
https://dl.acm.org/doi/10.1109/CGO57630.2024.10444856
The paper compares four kinds of JIT back-ends, GCC, LLVM, Cranelift, and a single-pass compiler, on database query compilation.
The paper's verdict is LLVM > single-pass > Cranelift (largely consistent with articles I have read before).
I. INTRODUCTION
Even among JavaScript engines the approaches differ: V8 "only pairs a bytecode interpreter with an optimizing compiler", while JavaScriptCore "implements three JIT compilation tiers on top of their interpreter".
Our main contributions are:
• A performance comparison and analysis of all back-ends supported by Umbra on both x86-64 and AArch64.
• A thorough and in-depth analysis of the compile-time performance of LLVM and Cranelift.
• An analysis and outline of possible improvements regarding compile-times for all compilation approaches.
• Our steps to adjust our LLVM-IR code to achieve faster compilation times.
TPC-DS is used as the benchmark dataset.
All AArch64 measurements were done on an Apple M1 system, only utilizing the 4 performance cores, equipped with 16 GiB memory, running Asahi Linux 6.2.0.
I had been wondering how they ran these experiments on Apple hardware; it turns out they really did replace macOS with Linux.
II. BACKGROUND: QUERY COMPILATION
To compile a query, it is initially parsed, analyzed, and transformed into a query plan, which is often a tree or DAG.
So query plans always end up as a tree or a DAG.
Pipelines can also have dependencies on the results of other pipelines. For example, in Fig. 1a the join operator ⋊⋉2 in the pipeline starting from R3 needs the materialized result of the pipeline from R2. This data-centric approach allows for a fairly straight-forward generation of imperative code.
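The data-centric idea above can be sketched as the kind of fused loop such a compiler would emit. A minimal Python illustration (the field names `x` and `k` and the filter predicate are made up for the example; a real engine generates this code rather than writing it by hand):

```python
def compiled_pipeline(r3, hash_table_r2):
    """Data-centric compilation fuses one pipeline's operators into a single
    tight loop: scan R3, apply the selection, probe the join hash table that
    the R2 pipeline must already have materialized."""
    out = []
    for tup in r3:                          # table scan at the pipeline's start
        if tup["x"] > 10:                   # selection, inlined into the loop
            for match in hash_table_r2.get(tup["k"], []):   # probe side of the join
                out.append((tup, match))
    return out
```

The point is that no operator boundaries survive into the generated code; tuples flow through the whole pipeline while staying in registers.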
This is where the famous morsel-driven parallelism comes in; a pity there is no figure for it.
scans at the bottom of a pipeline do not always iterate over all tuples, but only over a configurable subset [18] (“morsel-driven parallelism”). This way, multiple threads can process different parts of the base table in parallel.
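A rough sketch of morsel-driven scanning, assuming a simple lock-protected shared counter in place of the real engine's scheduling (the `morsel_size` value and the claim mechanism are illustrative, not Umbra's actual implementation):

```python
import threading

def parallel_scan(table, process_tuple, morsel_size=1024, num_threads=4):
    """Each worker repeatedly claims the next morsel (a fixed-size chunk of
    the base table) and processes it; there is no static partitioning, so
    fast threads simply claim more morsels."""
    next_morsel = 0
    lock = threading.Lock()

    def worker():
        nonlocal next_morsel
        while True:
            with lock:                          # a fetch-and-add in a real engine
                start = next_morsel
                next_morsel += morsel_size
            if start >= len(table):
                return
            for tup in table[start:start + morsel_size]:
                process_tuple(tup)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because morsels are claimed dynamically, load balancing falls out for free even when some morsels are more expensive than others.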
III. UMBRA SYSTEM ARCHITECTURE
Umbra not only employs morsel-driven parallelism, but also compiles each pipeline separately, enabling parallel compilation and more efficient execution scheduling.
Oh? That is a surprise: it seems both schemes are implemented.
As Umbra uses C++ exceptions for error handling, all functions with runtime calls must have associated DWARF unwind information.
LLVM has machinery for emitting DWARF, so Umbra presumably relies on that here.
Plus the classic German-style strings.
Umbra’s hash function uses CRC-32 if the target hardware has native support, and otherwise falls back to using 64×64 → 128-bit multiplication,
Using CRC-32 as the hash function, noted.
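A conceptual sketch of such a two-path hash, with `zlib.crc32` standing in for the hardware CRC-32 instruction (the polynomial and the folding constant below are assumptions for illustration, not Umbra's actual code):

```python
import zlib

MASK64 = (1 << 64) - 1

def hash_crc(key: int, seed: int = 0) -> int:
    # Stand-in for a hardware CRC32 instruction (e.g. SSE4.2 crc32q);
    # zlib.crc32 uses a different polynomial, so this only mirrors the idea.
    data = ((key ^ seed) & MASK64).to_bytes(8, "little")
    return zlib.crc32(data)

def hash_mul(key: int, seed: int = 0) -> int:
    # Fallback path: one 64x64 -> 128-bit multiplication, then fold the
    # high and low 64-bit halves together (multiplicative hashing).
    C = 0x9E3779B97F4A7C15                  # illustrative odd constant
    prod = ((key ^ seed) & MASK64) * C      # Python ints hold the full 128 bits
    return ((prod >> 64) ^ prod) & MASK64
```

On hardware with native CRC support the first path is a couple of instructions; the multiplication path is the portable fallback.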
Umbra IR has no complex or composite data types and only supports integer (8/16/32/64/128 bits), boolean, double-precision floating-point, pointer, and the data128 (e.g., for strings) types.
I had not noticed before that it provides no composite types.
One key motivation for a custom IR is the ability to also express complex operations as a single instruction … A prime example are arithmetic instructions with overflow detection
(a) trapping instructions
(b) check-and-branch instructions
(c) check instructions,
Umbra IR implements its own scheme for managing these overflow-check branches.
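The difference between the three styles can be illustrated with a toy checked 64-bit add. This is a conceptual sketch, not Umbra IR semantics:

```python
INT64_MIN, INT64_MAX = -(1 << 63), (1 << 63) - 1

def add_overflow_i64(a: int, b: int):
    """Style (c), a 'check' add: yields the wrapped result plus an overflow
    flag; styles (a)/(b) would instead trap or branch on that flag."""
    r = a + b
    overflowed = not (INT64_MIN <= r <= INT64_MAX)
    # Wrap to two's-complement 64-bit, as the machine instruction would.
    wrapped = ((r + (1 << 63)) & ((1 << 64) - 1)) - (1 << 63)
    return wrapped, overflowed

def checked_add(a, b):
    """Style (b), check-and-branch: jump to an error path on overflow."""
    r, of = add_overflow_i64(a, b)
    if of:
        raise OverflowError("signed 64-bit addition overflowed")
    return r
```

Expressing this as a single IR instruction is exactly the motivation quoted above: in plain LLVM-IR the same thing needs an intrinsic call plus extract/branch boilerplate.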
IV. GCC/C BACK-END
Umbra’s GCC/C back-end provides a very high execution performance, but also the longest compilation times.
This is the first time I have seen Umbra's GCC back-end.
The C code is written to a temporary file and an external compiler (typically GCC) is called to compile it into a shared library. The shared library is opened using dlopen and the compiled functions are accessed with dlsym. While this approach is conceptually similar to libgccjit
The mechanism? OK, makes sense.
However, due to restrictive licensing, using libgccjit in Umbra is not possible.
Ah, so that is the reason? 😂
Compile-Time: Due to the large overhead of generating and parsing C code, calling an external process, which in turn invokes the assembler and linker, and the I/O for the intermediate files, we did not further optimize compile times.
Hmm, "we did not further optimize compile times", fair enough.
The chart below was produced with GCC 12.2.0.
Although improvements on this architecture are certainly possible, such changes are unlikely to make the GCC/C back-end competitive to other compilation strategies.
Conclusion: the GCC JIT approach is not really worth improving 😅
V. LLVM BACK-END
Umbra’s LLVM back-end supports two operating modes: an optimized compilation mode intended for producing efficient code and a cheap compilation mode with focus on low compile times.
Cheap corresponds to -O0, optimized to -O2.
For optimized compilations, we configure the back-end to optimization level -O2 and use the SelectionDAG instruction selector;
for cheap compilations we use -O0 and FastISel.
They also use JITLink and the small-PIC code model.
If a full multiplication is required, we do not use the LLVM intrinsic but create a call to our hand-optimized 128-bit multiplication.
Compile-Time: We implemented several measures to reduce the compilation time for cheap builds using FastISel, which improved compilation performance by more than 50% compared to a previous implementation.
A more-than-50% improvement?!
The disadvantage of the Small-PIC model is that runtime calls now require two jumps in machine code (near jump to PLT, indirect jump from there). In practice, however, we could not measure any run-time performance differences.
OK, fine: "we could not measure any run-time performance differences." 😛
Second, instead of representing Umbra IR’s data128 as an LLVM struct {i64,i64}, we always represent it as two separate i64 values in LLVM-IR.
As noted earlier: Umbra IR has no composite types.
This change has several implications:
- The LLVM-IR code becomes shorter as no instructions for extracting/inserting elements need to be generated.
- As the flow of the actual values inside the struct becomes obvious, later analyses need to spend less time for handling opaque structures.
- Avoiding structures leads to substantially fewer fallbacks from FastISel to SelectionDAG: as FastISel only handles LLVM-IR values that fit in a single machine register, every occurrence of this struct type would trigger a fallback for the remainder of the current basic block.
- This change also improves compile times in the optimized build mode by ∼ 7%.
Note that Umbra IR’s int128 is still represented as native LLVM i128.
A scheme that trades IR readability for speed: splitting data128 into two i64 values improves compile times in the optimized mode by about 7%.
However, we also observed that destructing the LLVM module is fairly expensive, taking around 1% of the compilation time in the cheap mode.
1% counts as expensive? All right.
Instruction Selection: One expensive part of the LLVM back-end is the instruction selection phase, which transforms LLVM-IR into the lower-level Machine IR (MIR).
FastISel only supports common data types that fit in a single machine register and does not implement 128-bit integers or structures.
???
On the cases where FastISel falls back to SelectionDAG:
On the 103 TPC-DS benchmark queries on x86-64, we observed 3876 fallbacks after our optimizations (cf. Sec. V-A2), accounting for 36% of the instruction selection time. The two main reasons for fallbacks are unsupported intrinsics or function calls (2486) and 128-bit data types (1328). …… We note that we carefully adjusted our LLVM-IR code to avoid SelectionDAG fallbacks where easily possible.
Alright 😮
GlobalISel is only supported on a few architectures, including AArch64, where it already is the default back-end
GlobalISel is LLVM's newest instruction-selection framework, noted.
To compare GlobalISel in its cheap and optimized compilation setting with FastISel/SelectionDAG, we ran the TPC-DS benchmarks on AArch64.
The second-most expensive part of the compilation pipeline is register allocation …… First, the MIR is transformed out of SSA by replacing Φ-nodes with copy instructions. Then, two-address machine operations where one source operand is also the destination are rewritten accordingly.
Register allocation needs several sub-passes (SSA destruction, two-address rewriting, then the allocation itself), and this is where the time goes.
For cheap builds, LLVM uses the “fast” register allocator, which linearly iterates over all basic blocks and their instructions in reverse order and greedily assigns registers.
For optimized builds, the “greedy” allocator is used, which implements a graph-coloring-based greedy allocation scheme and afterwards performs additional optimizations to improve the assignment and spill placement.
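The "fast" allocator's reverse linear scan can be sketched like this. It is a heavily simplified toy that assumes SSA-style single definitions and ignores fixed registers, two-address constraints, and real spill code:

```python
def fast_regalloc(instrs, regs=("rax", "rcx", "rdx")):
    """Sketch of a 'fast'-style allocator: walk instructions in reverse, so
    a value's last use is seen first. Assign a free register at that last
    use and free it again when reaching the value's definition."""
    assignment, free, spilled = {}, list(regs), []
    for dst, srcs in reversed(instrs):      # each instr: (defined vreg, used vregs)
        for v in srcs:
            if v not in assignment:         # first sighting in reverse = last use
                if free:
                    assignment[v] = free.pop()
                else:
                    assignment[v] = "spill"
                    spilled.append(v)
        # The register dies at its definition (single definition assumed).
        if dst in assignment and assignment[dst] != "spill":
            free.append(assignment[dst])
    return assignment, spilled
```

One linear sweep, no live intervals, no coalescing: exactly the compile-time/code-quality trade-off the "fast" allocator makes.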
This part was also covered in the PLCT videos; apparently Umbra did not change much here.
There are three main reasons for the large time spent in this pass.
My understanding: 1. the code has to be transformed in memory over and over; 2. abstraction and virtual functions cost performance; 3. symbols and labels are string-based, so conversions take time.
“In summary, while smaller improvements in compile-time in LLVM appear to be possible, reducing the compile times by an order of magnitude would likely require a fundamentally different approach that puts less emphasis on flexibility and avoids frequent iteration over and rewriting of the program.”
Conclusion: improvements are possible, but only at a very high cost 😂
VI. CRANELIFT
I followed Cranelift for a while because it could let Rust run in JIT fashion, but Cranelift later pivoted toward WASM.
Cranelift [13] is a compiler back-end developed for fast compilation. Originally developed for the WebAssembly runtime Wasmtime [4], it also strives to be generally usable.
Cranelift aims for a simpler design: it only supports a small set of data types, in particular scalar integers with 8/16/32/64/128 bits
So Cranelift's feature set really is rather limited.
The Cranelift back-end translates Umbra IR to CIR in two passes, first setting up function metadata before translating them
What, Umbra has a Cranelift back-end too? 😛
The comparison figure from the paper:
Significant time is attributable to hash map lookups for mapping the Umbra IR value to the corresponding CIR value and translating Umbra IR’s getelementptr operations into integer arithmetic.
Most of the time goes to hash-map lookups and to lowering getelementptr into integer arithmetic.
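What "translating getelementptr operations into integer arithmetic" means can be sketched as follows (`lower_gep` is a hypothetical helper that emits a textual expression for illustration; a real back-end emits instructions, not strings):

```python
def lower_gep(base_expr: str, indices, elem_sizes):
    """Lower a getelementptr-style address computation into plain integer
    arithmetic: addr = base + sum(index_i * stride_i). Constant indices
    are folded into a single offset; runtime indices stay symbolic."""
    const_off, terms = 0, []
    for idx, size in zip(indices, elem_sizes):
        if isinstance(idx, int):
            const_off += idx * size             # fold constant index
        else:
            terms.append(f"{idx} * {size}")     # runtime value, e.g. "%i"
    parts = [base_expr] + terms + ([str(const_off)] if const_off else [])
    return " + ".join(parts)
```

Since CIR has no pointer-arithmetic convenience instruction comparable to LLVM's getelementptr, the back-end has to do this folding and emission by hand, which is part of the translation cost measured above.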
Surprisingly, the largest part of compilation is spent in the register allocator. This is in contrast to LLVM, which spends significantly less time in this phase and whose register allocator is 42% faster than Cranelift’s
VII. DIRECTEMIT BACK-END
The DirectEmit back-end [14] (formerly referred to as Flying Start) is highly tuned for low compile times and translates Umbra IR to machine code with a single analysis pass and a single code generation pass.
Flying Start? See my earlier write-up, "VLDB'23 reading: Bringing Compiling Databases to RISC Architectures"; it is a single-pass scheme developed at TUM.
It seems it has since been renamed to the DirectEmit back-end.
Figure 5 shows the results. The most expensive part of the analysis is the liveness analysis, taking around 75% of the analysis time.
Code generation looks expensive here, but with so few passes it does not actually take long.
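The liveness part of a single-pass design can be illustrated with a backward last-use scan. This is a toy version, far simpler than DirectEmit's actual analysis:

```python
def last_uses(instrs):
    """One backward pass over an instruction list: the first time a value is
    seen while scanning backwards is its last use, which is the key piece of
    information a single-pass register allocator needs in order to know when
    a register becomes free again."""
    seen, last = set(), {}
    for i in range(len(instrs) - 1, -1, -1):
        _dst, srcs = instrs[i]                  # (defined vreg, used vregs)
        for v in srcs:
            if v not in seen:
                seen.add(v)
                last[v] = i                     # last use of v is instruction i
    return last
```

Even this trivial version touches every operand of every instruction once, which hints at why liveness dominates the analysis time in a design where everything else is a single cheap sweep.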
Translating Umbra IR to machine code can be done as fast as generating the corresponding IR for Cranelift or LLVM.
Uh-huh 🤔
An AArch64 port was developed [25], but never merged due to the high maintenance effort.
[25] points to "VLDB'23 reading: Bringing Compiling Databases to RISC Architectures"; so they themselves admit the "high maintenance effort" 😂
VIII. EXECUTION PERFORMANCE
Time for the exciting benchmark numbers (just kidding).
For scale factor 10, DirectEmit is almost always the best choice, Cranelift just once. At scale factor 100, however, also LLVM-opt becomes beneficial for several queries despite the large compile time.
IX. DISCUSSION
Cranelift, a framework designed for fast JIT-compilation, compiles just 20–35% faster than LLVM, but comes at a substantial maintenance cost, particularly due to the inter-operation between C++ and Rust
They even bring up maintenance cost 😅 Honestly, pitting C++ against Rust here is the pot calling the kettle black (Rust is clearly the better of the two).
X. RELATED WORK
Frequent optimizations include multi-tiered compilations [2], [3], [28], [29]
The first is naturally the JavaScript engine family.
template-based code generation [29]–[31],
The second is Java.
and greedy register allocation [32]–[35]
And LLVM.
LLVM [12] provides a JIT compilation functionality, but is regularly avoided due to high compilation cost of the general back-end infrastructure.
This refers to WebKit abandoning its LLVM-based JIT tier and developing its own back-end instead.
Personal takeaways
This paper is an Umbra grab bag 😂 I really did not expect it to support this many back-ends.
A few months ago I was considering using Cranelift as the JIT engine for query compilation, but this paper settles it: the idea does not make much sense, since an LLVM-based JIT is still the best choice for query compilation.
If you want raw performance there is of course the single-pass approach, but since even they admit to a "high maintenance effort", it may be cool as an enthusiast technique while having little practical value.