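LingoDB Source Code: Compilation and Analysis

Brief Source Code Analysis

2025-03-22: The content of this article is out of date.

As of late February 2025, LingoDB can ship static Release builds.

The codebase is expected to become progressively more standardized.

Analysis written on 2024-08-02.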

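~~A recursive git clone cannot be pulled from within mainland China, so I had to download the pieces manually and assemble them myself~~ (Using Docker is much less hassle 😋; the tutorial is in Chapter 2)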

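vendored: provides the PostgreSQL (PG) parser plus the md5 and xxhash components

tools: miscellaneous items, including the Docker build process, the Python package build, MLIR infrastructure, the Source Map that eases debugging, and mlir-tools

I recommend starting with mlir-tools. The SQL->MLIR translation is interesting, so here is a snippet to begin with; ask an LLM about it when you have time.

Explain the following C++ code: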

#include "frontend/SQL/Parser.h"
#include "mlir/Dialect/SubOperator/SubOperatorDialect.h"
#include "mlir/Dialect/SubOperator/SubOperatorOps.h"
#include "runtime/Session.h"
// Translate one SQL statement into MLIR and print the resulting module.
void printMLIR(std::string sql, std::shared_ptr<runtime::Catalog> catalog) {
   mlir::MLIRContext context;
   // Register every dialect the translated query may use.
   mlir::DialectRegistry registry;
   registry.insert<mlir::BuiltinDialect>();
   registry.insert<mlir::relalg::RelAlgDialect>();
   registry.insert<mlir::subop::SubOperatorDialect>();
   registry.insert<mlir::tuples::TupleStreamDialect>();
   registry.insert<mlir::db::DBDialect>();
   registry.insert<mlir::func::FuncDialect>();
   registry.insert<mlir::arith::ArithDialect>();

   registry.insert<mlir::memref::MemRefDialect>();
   registry.insert<mlir::util::UtilDialect>();
   registry.insert<mlir::scf::SCFDialect>();
   registry.insert<mlir::LLVM::LLVMDialect>();
   context.appendDialectRegistry(registry);
   context.loadAllAvailableDialects();
   context.loadDialect<mlir::relalg::RelAlgDialect>();
   mlir::OpBuilder builder(&context);
   mlir::ModuleOp moduleOp = builder.create<mlir::ModuleOp>(builder.getUnknownLoc());
   frontend::sql::Parser translator(sql, *catalog, moduleOp);

   builder.setInsertionPointToStart(moduleOp.getBody());
   // The translated query is emitted into this block, which becomes the body of "main".
   auto* queryBlock = new mlir::Block;
   {
      mlir::OpBuilder::InsertionGuard guard(builder);
      builder.setInsertionPointToStart(queryBlock);
      auto val = translator.translate(builder);
      if (val.has_value()) {
         builder.create<mlir::subop::SetResultOp>(builder.getUnknownLoc(), 0, val.value());
      }
      builder.create<mlir::func::ReturnOp>(builder.getUnknownLoc());
   }
   mlir::func::FuncOp funcOp = builder.create<mlir::func::FuncOp>(builder.getUnknownLoc(), "main", builder.getFunctionType({}, {}));
   funcOp.getBody().push_back(queryBlock);

   mlir::OpPrintingFlags flags;
   flags.assumeVerified();
   moduleOp->print(llvm::outs(), flags);
}
int main(int argc, char** argv) {
   // argv[1]: SQL file; argv[2] (optional): database directory.
   std::string filename = std::string(argv[1]);
   auto catalog = runtime::Catalog::createEmpty();
   if (argc >= 3) {
      std::string dbDir = std::string(argv[2]);
      catalog = runtime::DBCatalog::create(catalog, dbDir, false);
   }
   // Read the whole SQL file into a buffer.
   std::ifstream istream{filename};
   std::stringstream buffer;
   buffer << istream.rdbuf();
   while (true) {
      std::stringstream query;
      std::string line;
      std::getline(buffer, line);
      // Collect lines until one ends with ';' or the input is exhausted.
      while (true) {
         if (!buffer.good()) {
            if (buffer.eof()) {
               query << line << std::endl;
            }
            break;
         }
         query << line << std::endl;
         if (!line.empty() && line.find(';') == line.size() - 1) {
            break;
         }
         std::getline(buffer, line);
      }
      printMLIR(query.str(),catalog);
      if (buffer.eof()) {
         //exit from repl loop
         break;
      }
   }
   return 0;
}
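
The statement-splitting loop in `main` above is easier to follow in isolation. The following is a minimal sketch, not LingoDB code: `splitStatements` is a hypothetical helper name, and the rule is simplified. It treats any line whose last character is `;` as the end of a statement, whereas the original requires the first `;` on the line to be its last character, and it skips an empty trailing statement instead of passing it on.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a SQL script into statements.
// A statement ends at a line whose last character is ';' or at end of input.
std::vector<std::string> splitStatements(const std::string& script) {
   std::vector<std::string> statements;
   std::istringstream input(script);
   std::string line;
   std::string current;
   while (std::getline(input, line)) {
      current += line;
      current += '\n';
      if (!line.empty() && line.back() == ';') {
         statements.push_back(current);
         current.clear();
      }
   }
   // Keep a trailing statement that has no terminating ';' line.
   if (!current.empty()) {
      statements.push_back(current);
   }
   return statements;
}
```

Each returned string could then be handed to a function such as `printMLIR`, one statement at a time.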

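test: lit (LLVM Integrated Tester) tests for the dialect implementations, with MLIR and SQLite input data

resources: SQL data and Apache Arrow data

llvm-project: the LLVM submodule

**TUM actually maintains its own modified version of LLVM!** Truly a research database 🥵 (judging by the commit dates, it should sit somewhere between LLVM 17 and LLVM 18)

lib: the project's implementation files

Conversion implements the lowering between the levels described in the paper

include: the project's header files

utility contains the Tracer implementation

frontend contains the headers for SQL parsing

execution covers the details of running SQL (implemented with Arrow Compute and accelerated with Intel oneAPI?)

runtime contains many implementation details

runtime/Catalog.h deals with file reading

runtime/Relation.h deals with the relations stored in files; reading Arrow data relies on the PyArrow library

What surprised me: the libraries under .venv/lib/python3.10/site-packages are used as well! (that is, the C++ libraries wrapped by PyArrow)

eval contains the implementation of related operations

parser.cpp

A very classic parser that uses enum tags to build the AST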

std::optional<mlir::Value> frontend::sql::Parser::translate(mlir::OpBuilder& builder) {
   if (result.tree && result.tree->length == 1) {
      auto* statement = static_cast<Node*>(result.tree->head->data.ptr_value);
      switch (statement->type) {
         case T_VariableSetStmt: {
            auto* variableSetStatement = reinterpret_cast<VariableSetStmt*>(statement);
            translateVariableSetStatement(builder, variableSetStatement);
            break;
         }
         case T_CreateStmt: {
            translateCreateStatement(builder, reinterpret_cast<CreateStmt*>(statement));
            break;
         }
         case T_CopyStmt: {
            auto* copyStatement = reinterpret_cast<CopyStmt*>(statement);
            translateCopyStatement(builder, copyStatement);
            break;
         }
         case T_SelectStmt: {
            parallelismAllowed = true;
            TranslationContext context;
            auto scope = context.createResolverScope();
            auto [tree, targetInfo] = translateSelectStmt(builder, reinterpret_cast<SelectStmt*>(statement), context, scope);
            //::mlir::Type result, ::mlir::Value rel, ::mlir::ArrayAttr attrs, ::mlir::ArrayAttr columns
            std::vector<mlir::Attribute> attrs;
            std::vector<mlir::Attribute> names;
            std::vector<mlir::Attribute> colMemberNames;
            std::vector<mlir::Attribute> colTypes;
            auto& memberManager = builder.getContext()->getLoadedDialect<mlir::subop::SubOperatorDialect>()->getMemberManager();
            for (auto x : targetInfo.namedResults) {
               if (x.first == "primaryKeyHashValue") continue;
               names.push_back(builder.getStringAttr(x.first));
               auto colMemberName = memberManager.getUniqueMember(x.first.empty() ? "unnamed" : x.first);
               auto columnType = x.second->type;
               attrs.push_back(attrManager.createRef(x.second));
               colTypes.push_back(mlir::TypeAttr::get(columnType));
               colMemberNames.push_back(builder.getStringAttr(colMemberName));
            }
            auto resultTableType = mlir::subop::ResultTableType::get(builder.getContext(), mlir::subop::StateMembersAttr::get(builder.getContext(), builder.getArrayAttr(colMemberNames), builder.getArrayAttr(colTypes)));
            return builder.create<mlir::relalg::MaterializeOp>(builder.getUnknownLoc(), resultTableType, tree, builder.getArrayAttr(attrs), builder.getArrayAttr(names));
         }
         case T_InsertStmt: {
            translateInsertStmt(builder, reinterpret_cast<InsertStmt*>(statement));
            break;
         }
         default:
            throw std::runtime_error("unsupported statement type");
      }
   }
   return {};
}
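
The dispatch pattern in `translate` (read the enum tag, then cast the generic node pointer to the concrete statement type) can be reduced to a small standalone sketch. Everything here is hypothetical and only modeled on the style of the code above: the tags `T_ExampleSelect`, `T_ExampleInsert`, `T_ExampleOther`, the structs, and the `describe` function are illustrative names, not part of LingoDB or the PostgreSQL parser.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <type_traits>

// Hypothetical tags; a real parser defines one tag per AST node kind.
enum NodeTag { T_ExampleSelect, T_ExampleInsert, T_ExampleOther };

// Generic header: every concrete node starts with it, so the tag can be
// read through a Node* before the concrete type is known.
struct Node { NodeTag type; };
struct ExampleSelect { Node base; const char* table; };
struct ExampleInsert { Node base; const char* table; int rowCount; };

// The cast back from Node* relies on the header being the first member
// of a standard-layout struct.
static_assert(std::is_standard_layout<ExampleSelect>::value, "layout");
static_assert(std::is_standard_layout<ExampleInsert>::value, "layout");

// Switch on the tag, then cast to the matching concrete node type.
std::string describe(const Node* statement) {
   switch (statement->type) {
      case T_ExampleSelect: {
         const auto* select = reinterpret_cast<const ExampleSelect*>(statement);
         return std::string("select from ") + select->table;
      }
      case T_ExampleInsert: {
         const auto* insert = reinterpret_cast<const ExampleInsert*>(statement);
         return "insert " + std::to_string(insert->rowCount) + " rows into " + insert->table;
      }
      default:
         throw std::runtime_error("unsupported statement type");
   }
}
```

As in the original, an unknown tag falls through to the `default` branch and raises an error instead of being silently ignored.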

Running the Docker Image

~~No way am I going to wrestle with the build environment!~~

If you have Docker, of course you use Docker 🤣

docker pull ghcr.io/lingo-db/lingo-db:latest
docker run -it --name lingo  ghcr.io/lingo-db/lingo-db /bin/bash

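The compiled binaries are located in build/lingodb-release by default.

It looks like TPC-DS was run?

There is a Lingodbllvm in the Python environment, which is why the directory under LingoProject is empty.

Below is a slightly modified version of the test from the official website (the data has to be copied in from outside).

The modified shell commands: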

echo "select * from studenten where name='Carnap'" > test.sql
./sql-to-mlir test.sql /repo/resources/data/uni/ > canonical.mlir
./mlir-db-opt --use-db /repo/resources/data/uni/ --relalg-query-opt canonical.mlir > optimized.mlir
./mlir-db-opt --lower-relalg-to-subop optimized.mlir > subop.mlir
./mlir-db-opt --lower-subop subop.mlir > hl-imperative.mlir
./mlir-db-opt --lower-db hl-imperative.mlir > ml-imperative.mlir
./mlir-db-opt --lower-dsa ml-imperative.mlir > ll-imperative.mlir

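Building a Debug Version

More than 8 GB of memory is recommended. Treat any error during compilation as an out-of-memory problem (don't ask, I have tried) 😅

You can also use the Docker image I packaged: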

docker pull ccr.ccs.tencentyun.com/mocusz/lingo-debug

After the build finishes, install GDB:

apt install gdb

Configure the VS Code launch.json file (for reference only; add what you need, mainly args and cwd):

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "(gdb) Launch sql-to-mlir",
            "type": "cppdbg",
            "request": "launch",
            "program": "${workspaceFolder}/build/lingodb-debug/sql-to-mlir",
            "args": ["./build/lingodb-debug/test.sql","./resources/data/uni/"],
            "stopAtEntry": false,
            "cwd": "${workspaceFolder}",
            "environment": [],
            "externalConsole": false,
            "MIMode": "gdb",
            "setupCommands": [
                {
                    "description": "Enable pretty-printing for gdb",
                    "text": "-enable-pretty-printing",
                    "ignoreFailures": true
                },
                {
                    "description": "Set Disassembly Flavor to Intel",
                    "text": "-gdb-set disassembly-flavor intel",
                    "ignoreFailures": true
                }
            ]
        }
    ]
}

The result is shown in the screenshot:

Analysis of LingoDB Commits

From April 10, 2024 to January 16, 2025

Jan 16, 2025

Recently the work has mostly been about CI.

Update Readme

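Provides a guide on building from source (https://www.lingo-db.com/docs/gettingstarted/install/#building-from-source)

It is outdated and not useful.

A contribution guide was added; maybe I could try sending them a PR?

They changed the debugging approach: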

https://www.lingo-db.com/docs/ForDevelopers/Debugging

Dec 20, 2024

GPU: fix/improve gpu backend, basic CUDA setup, and gpu properties

More detailed CUDA support

Dec 13, 2024

Github CI fully working again

The compiler was switched from GCC to Clang 19 (but it still relies on their self-modified LLVM fork)

Reduce patches on top of LLVM (to prepare for switching to built packages in LLVM-20)

Minor fixes to the project code

Fix Linter & CI

This exposes an image address, from which the latest image could perhaps be pulled:

It cannot be accessed from outside.

gitlab.db.in.tum.de:5005/lingo-db/lingo-db/lingodb-dev:latest

Bump LLVM/MLIR version

Major changes to the project code; the LLVM submodule now points here:

https://github.com/lingo-db/llvm-project/tree/patch-sept-2024

Backend: Make EnforceCABIPass work on whole modules

Pass refinement: some passes that previously had to be added with pm.addNestedPass are now added with pm.addPass

SubOp: Also enable Tracing for execution steps

Updates SubOp tracing; it involves the class rt::ExecutionStepTracing

Add new tool for analyzing snapshots

Improves the analysis tooling

Remove unmaintained cranelift backend

Removes Cranelift

Aug 14, 2024

Dependencies: update nlohmann::json to also allow for forward declarations

Updates the JSON library declarations

Improves the MLIR-to-JSON output

Aug 6, 2024

SubOp: rework Parallelize pass

Still tuning SubOp parallelization

Jul 31, 2024

sql.cpp: support environment variables for easier benchmarking

Makes it easier to measure SQL execution time

Jul 19, 2024

SubOp: Rework parallelization to work on ExecutionSteps instead of unstructured SubOp programs

Parallelization work?

ParallelizePass.cpp changed substantially

Jul 10, 2024

fix CI

A NumPy dependency was added, apparently for testing?

Are they about to reinvent the wheel (numpy-mlir)?

Jun 3, 2024

Lower Sub-operators based on (nested) execution groups

Prepare Lowering of Sub-Operators using InlineNestedMapPass and FinalizePass

Still reworking the sub-operators

Apr 13, 2024

Basic support for GPUs

Adds GPU support, mainly targeting NVIDIA CUDA

Bump LLVM

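A major LLVM upgrade with substantial changes (related to LLVM syntax)

The Dockerfile in this version surprisingly dropped ninja-build

It uses LingoDB's self-modified version; however, this modification was wiped out by the changes made in December 2024: https://github.com/lingo-db/llvm-project/commit/f4126023323ce4af4567d7a0fb8079e1424067a2

Apr 10, 2024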

Runtime: Refactor ArrowSchema

Runtime: small improvements for arrow table/column

You can see that a lot of Arrow wrapping was done: ArrowSchema.cpp

as well as modification work related to Apache Arrow

Python: add bindings for subop dialect

Python bindings for SubOp

and a series of changes to SubOp

SubOp: implement scans of local tables

SubOp: support params in create_thread_local

SubOp: replace tuples.GetParam at the end

DB/DSA/SubOp: refactor arrow types and column iteration/builder

What the hell?

Why would they assign a variable inside an if condition 😅

https://github.com/lingo-db/lingo-db/commit/5ed247efa12d4eadd13daaacea2fe616c7afd42e#diff-4207932e824eddb4909fa029b6143a080a587889e359be0d22fb1d7eb7bcc175R666

if (auto functionType = mlir::dyn_cast_or_null<mlir::FunctionType>(op.getType())) {
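
For context, declaring a variable inside the `if` condition is standard C++, and it is a common idiom in LLVM-based code: the cast yields a null value when the type does not match, so the condition tests the cast and names the result in one step. The sketch below illustrates the same idea with plain `dynamic_cast`, as an analogy only. The class names and `classify` are hypothetical and are not MLIR types.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for a small type hierarchy.
struct Type { virtual ~Type() = default; };
struct FunctionType : Type { int numInputs = 0; };
struct IntegerType : Type { int width = 32; };

std::string classify(const Type* type) {
   // The variable is declared in the condition, converted to bool, and is
   // visible only inside the if statement. A failed dynamic_cast (or a
   // null input) yields nullptr, so the branch is skipped.
   if (auto* functionType = dynamic_cast<const FunctionType*>(type)) {
      return "function with " + std::to_string(functionType->numInputs) + " inputs";
   }
   return "not a function";
}
```

The benefit is scoping: the casted pointer cannot be used by accident outside the branch where it is known to be valid.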