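Building and Analyzing the LingoDB Source
A Brief Look at the Source
2025.3.22: the content of this article is out of date
As of late February 2025, LingoDB ships statically linked releases
The codebase will presumably become more polished over time
Analysis written on 2024.8.2
From mainland China a recursive git clone does not complete, so the pieces have to be downloaded manually and assembled by hand (Docker is less hassle 😋; the walkthrough is in the second section)
vendored: provides the PG parser, md5, and xxhash components
tools: miscellaneous; includes the Docker build, the Python package build, MLIR infrastructure, a source map for easier debugging, and mlir-tools
I suggest starting with mlir-tools; the SQL -> MLIR part is interesting. I'll paste a chunk here and ask an LLM about it when I have time
Explain the following C++ code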
#include "frontend/SQL/Parser.h"
#include "mlir/Dialect/SubOperator/SubOperatorDialect.h"
#include "mlir/Dialect/SubOperator/SubOperatorOps.h"
#include "runtime/Session.h"

void printMLIR(std::string sql, std::shared_ptr<runtime::Catalog> catalog) {
    mlir::MLIRContext context;
    mlir::DialectRegistry registry;
    registry.insert<mlir::BuiltinDialect>();
    registry.insert<mlir::relalg::RelAlgDialect>();
    registry.insert<mlir::subop::SubOperatorDialect>();
    registry.insert<mlir::tuples::TupleStreamDialect>();
    registry.insert<mlir::db::DBDialect>();
    registry.insert<mlir::func::FuncDialect>();
    registry.insert<mlir::arith::ArithDialect>();
    registry.insert<mlir::memref::MemRefDialect>();
    registry.insert<mlir::util::UtilDialect>();
    registry.insert<mlir::scf::SCFDialect>();
    registry.insert<mlir::LLVM::LLVMDialect>();
    context.appendDialectRegistry(registry);
    context.loadAllAvailableDialects();
    context.loadDialect<mlir::relalg::RelAlgDialect>();
    mlir::OpBuilder builder(&context);
    mlir::ModuleOp moduleOp = builder.create<mlir::ModuleOp>(builder.getUnknownLoc());
    frontend::sql::Parser translator(sql, *catalog, moduleOp);
    builder.setInsertionPointToStart(moduleOp.getBody());
    auto* queryBlock = new mlir::Block;
    {
        mlir::OpBuilder::InsertionGuard guard(builder);
        builder.setInsertionPointToStart(queryBlock);
        auto val = translator.translate(builder);
        if (val.has_value()) {
            builder.create<mlir::subop::SetResultOp>(builder.getUnknownLoc(), 0, val.value());
        }
        builder.create<mlir::func::ReturnOp>(builder.getUnknownLoc());
    }
    mlir::func::FuncOp funcOp = builder.create<mlir::func::FuncOp>(builder.getUnknownLoc(), "main", builder.getFunctionType({}, {}));
    funcOp.getBody().push_back(queryBlock);
    mlir::OpPrintingFlags flags;
    flags.assumeVerified();
    moduleOp->print(llvm::outs(), flags);
}

int main(int argc, char** argv) {
    std::string filename = std::string(argv[1]);
    auto catalog = runtime::Catalog::createEmpty();
    if (argc >= 3) {
        std::string dbDir = std::string(argv[2]);
        catalog = runtime::DBCatalog::create(catalog, dbDir, false);
    }
    std::ifstream istream{filename};
    std::stringstream buffer;
    buffer << istream.rdbuf();
    while (true) {
        std::stringstream query;
        std::string line;
        std::getline(buffer, line);
        while (true) {
            if (!buffer.good()) {
                if (buffer.eof()) {
                    query << line << std::endl;
                }
                break;
            }
            query << line << std::endl;
            if (!line.empty() && line.find(';') == line.size() - 1) {
                break;
            }
            std::getline(buffer, line);
        }
        printMLIR(query.str(), catalog);
        if (buffer.eof()) {
            //exit from repl loop
            break;
        }
    }
    return 0;
}
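The read loop in main above cuts the input file into statements: it accumulates lines until one whose first ';' is also its last character, then hands the accumulated text to printMLIR. A minimal standalone sketch of that splitting rule, with no LingoDB or MLIR dependency, might look like the following. The function name splitStatements is hypothetical, and the end-of-file handling is simplified compared with the original loop.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a SQL script into statements.
// A statement ends at a line whose first ';' is also its last character,
// which is the same check the main() loop above uses.
std::vector<std::string> splitStatements(const std::string& script) {
    std::vector<std::string> statements;
    std::istringstream input(script);
    std::string line;
    std::string current;
    while (std::getline(input, line)) {
        current += line;
        current += '\n';
        if (!line.empty() && line.find(';') == line.size() - 1) {
            statements.push_back(current);
            current.clear();
        }
    }
    // Simplification: a trailing fragment without ';' is kept only if it
    // contains something other than whitespace.
    if (current.find_first_not_of(" \t\r\n") != std::string::npos) {
        statements.push_back(current);
    }
    return statements;
}
```

Because the check uses the first ';' on the line, a line such as "select 1; select 2;" does not terminate a statement under this rule, so each statement should end on its own line.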
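test: lit (LLVM Integrated Tester) tests for the dialect implementations, with MLIR and sqlite data as input
resources: SQL data and Apache Arrow-related data
llvm-project: the LLVM submodule
**TUM actually maintains its own modified LLVM!!!** It really is a research database 🥵 (judging by the commit dates, it sits somewhere between LLVM 17 and LLVM 18)
lib: the project's implementation source files
Conversion implements the lowering between the layers described in the paper
include: the project's header files
utility holds the Tracer-related files
frontend holds the SQL parsing headers
execution covers how SQL is actually run (implemented with Arrow Compute, and accelerated with Intel oneAPI?)
runtime contains many implementation details
runtime/Catalog.h deals with file reading
runtime/Relation.h deals with relations backed by files; reading Arrow data relies on the PyArrow libraries
What surprised me: the libraries under .venv/lib/python3.10/site-packages
are used as well! (that is, the C++ libraries that PyArrow wraps)
eval holds the implementations of the related operations
parser.cpp
A very classic parser that uses enum tags to build the AST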
std::optional<mlir::Value> frontend::sql::Parser::translate(mlir::OpBuilder& builder) {
    if (result.tree && result.tree->length == 1) {
        auto* statement = static_cast<Node*>(result.tree->head->data.ptr_value);
        switch (statement->type) {
            case T_VariableSetStmt: {
                auto* variableSetStatement = reinterpret_cast<VariableSetStmt*>(statement);
                translateVariableSetStatement(builder, variableSetStatement);
                break;
            }
            case T_CreateStmt: {
                translateCreateStatement(builder, reinterpret_cast<CreateStmt*>(statement));
                break;
            }
            case T_CopyStmt: {
                auto* copyStatement = reinterpret_cast<CopyStmt*>(statement);
                translateCopyStatement(builder, copyStatement);
                break;
            }
            case T_SelectStmt: {
                parallelismAllowed = true;
                TranslationContext context;
                auto scope = context.createResolverScope();
                auto [tree, targetInfo] = translateSelectStmt(builder, reinterpret_cast<SelectStmt*>(statement), context, scope);
                //::mlir::Type result, ::mlir::Value rel, ::mlir::ArrayAttr attrs, ::mlir::ArrayAttr columns
                std::vector<mlir::Attribute> attrs;
                std::vector<mlir::Attribute> names;
                std::vector<mlir::Attribute> colMemberNames;
                std::vector<mlir::Attribute> colTypes;
                auto& memberManager = builder.getContext()->getLoadedDialect<mlir::subop::SubOperatorDialect>()->getMemberManager();
                for (auto x : targetInfo.namedResults) {
                    if (x.first == "primaryKeyHashValue") continue;
                    names.push_back(builder.getStringAttr(x.first));
                    auto colMemberName = memberManager.getUniqueMember(x.first.empty() ? "unnamed" : x.first);
                    auto columnType = x.second->type;
                    attrs.push_back(attrManager.createRef(x.second));
                    colTypes.push_back(mlir::TypeAttr::get(columnType));
                    colMemberNames.push_back(builder.getStringAttr(colMemberName));
                }
                auto resultTableType = mlir::subop::ResultTableType::get(builder.getContext(), mlir::subop::StateMembersAttr::get(builder.getContext(), builder.getArrayAttr(colMemberNames), builder.getArrayAttr(colTypes)));
                return builder.create<mlir::relalg::MaterializeOp>(builder.getUnknownLoc(), resultTableType, tree, builder.getArrayAttr(attrs), builder.getArrayAttr(names));
            }
            case T_InsertStmt: {
                translateInsertStmt(builder, reinterpret_cast<InsertStmt*>(statement));
                break;
            }
            default:
                throw std::runtime_error("unsupported statement type");
        }
    }
    return {};
}
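The switch above follows the PostgreSQL parse-tree convention: every statement node carries a type tag, and the translator reads the tag through a generic Node pointer before casting to the concrete statement struct. A small self-contained sketch of that pattern is given below. The node structs, their fields, and the returned strings are simplified, hypothetical stand-ins and not the real PostgreSQL or LingoDB definitions. Here each concrete node embeds Node as its first member, so the reinterpret_cast is well defined for these standard-layout types.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical, simplified stand-ins for PostgreSQL-style parse nodes.
enum NodeTag { T_SelectStmt, T_InsertStmt, T_DropStmt };

struct Node {
    NodeTag type;
};

// Each concrete node embeds Node as its FIRST member, so a pointer to that
// member can be converted back to a pointer to the enclosing struct.
struct SelectStmt {
    Node base;
    const char* fromTable;
};
struct InsertStmt {
    Node base;
    const char* intoTable;
};

// Dispatch on the tag, then cast to the concrete statement type.
std::string translate(const Node* statement) {
    switch (statement->type) {
        case T_SelectStmt: {
            auto* select = reinterpret_cast<const SelectStmt*>(statement);
            return std::string("scan ") + select->fromTable;
        }
        case T_InsertStmt: {
            auto* insert = reinterpret_cast<const InsertStmt*>(statement);
            return std::string("insert into ") + insert->intoTable;
        }
        default:
            throw std::runtime_error("unsupported statement type");
    }
}
```

A caller passes the address of the embedded base member, for example translate(&stmt.base), which plays the role of the generic Node* taken from the parse list in the original code.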
Running the Docker Image
No way am I fiddling with a build environment!
If Docker is available, use Docker, obviously 🤣
docker pull ghcr.io/lingo-db/lingo-db:latest
docker run -it --name lingo ghcr.io/lingo-db/lingo-db /bin/bash
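The compiled binaries are in build/lingodb-release by default
It looks like TPC-DS was run?
There is a Lingodbllvm package in the Python environment, which is why LingoProject is empty
Tweak the test given on the official site (the data has to be copied in from outside)
The modified shell commands: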
echo "select * from studenten where name='Carnap'" > test.sql
./sql-to-mlir test.sql /repo/resources/data/uni/ > canonical.mlir
./mlir-db-opt --use-db /repo/resources/data/uni/ --relalg-query-opt canonical.mlir > optimized.mlir
./mlir-db-opt --lower-relalg-to-subop optimized.mlir > subop.mlir
./mlir-db-opt --lower-subop subop.mlir > hl-imperative.mlir
./mlir-db-opt --lower-db hl-imperative.mlir > ml-imperative.mlir
./mlir-db-opt --lower-dsa ml-imperative.mlir > ll-imperative.mlir
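Making a Debug Build
More than 8 GB of RAM is recommended; treat any error during compilation as running out of memory (don't ask, I've been there) 😅
You can also use the Docker image I packaged: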
docker pull ccr.ccs.tencentyun.com/mocusz/lingo-debug
After the build finishes, install GDB
apt install gdb
Configure VS Code's launch.json (for reference; add whatever else you need, mainly args and cwd)
{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "(gdb) Launch sql-to-mlir",
            "type": "cppdbg",
            "request": "launch",
            "program": "${workspaceFolder}/build/lingodb-debug/sql-to-mlir",
            "args": ["./build/lingodb-debug/test.sql", "./resources/data/uni/"],
            "stopAtEntry": false,
            "cwd": "${workspaceFolder}",
            "environment": [],
            "externalConsole": false,
            "MIMode": "gdb",
            "setupCommands": [
                {
                    "description": "Enable pretty-printing for gdb",
                    "text": "-enable-pretty-printing",
                    "ignoreFailures": true
                },
                {
                    "description": "Set Disassembly Flavor to Intel",
                    "text": "-gdb-set disassembly-flavor intel",
                    "ignoreFailures": true
                }
            ]
        }
    ]
}
The result looks like this:
LingoDB Commit History Analysis
From April 10, 2024 to January 16, 2025
Jan 16, 2025
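Lately the work has mostly been CI-related
Added a build-from-source guide (https://www.lingo-db.com/docs/gettingstarted/install/#building-from-source), but it is the old one and not useful
Added a contribution guide; maybe try sending them a PR?
They changed the debugging workflow: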
https://www.lingo-db.com/docs/ForDevelopers/Debugging
Dec 20, 2024
GPU: fix/improve gpu backend, basic CUDA setup, and gpu properties
More detailed CUDA support
Dec 13, 2024
Switched the compiler from GCC to Clang 19 (though it still relies on features from their own modified LLVM branch)
Reduce patches on top of LLVM (to prepare for switching to built packages in LLVM-20)
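Minor code changes
An image registry address is exposed, which might have been a way to pull the latest image, but it cannot be reached from outside: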
gitlab.db.in.tum.de:5005/lingo-db/lingo-db/lingodb-dev:latest
Major code changes, pointing here:
https://github.com/lingo-db/llvm-project/tree/patch-sept-2024
Backend: Make EnforceCABIPass work on whole modules
Pass refinement: some passes that previously had to be registered with pm.addNestedPass are now registered with pm.addPass
SubOp: Also enable Tracing for execution steps
Updates SubOp tracing, involving the rt::ExecutionStepTracing class
Add new tool for analyzing snapshots
Improves the analysis tooling
Remove unmaintained cranelift backend
Removes the Cranelift backend
Aug 14, 2024
Dependencies: update nlohmann::json to also allow for forward declarations
Updates the JSON library declarations
Improves the MLIR-to-JSON output
Aug 6, 2024
SubOp: rework Parallelize pass
Still tuning SubOp parallelization
Jul 31, 2024
sql.cpp: support environment variables for easier benchmarking
Makes it easier to measure SQL run time
Jul 19, 2024
SubOp: Rework parallelization to work on ExecutionSteps instead of unstructured SubOp programs
Parallel execution work?
ParallelizePass.cpp changed substantially
Jul 10, 2024
Added a NumPy dependency, apparently for testing?
Are they about to reinvent the wheel (numpy-mlir)?
Jun 3, 2024
Lower Sub-operators based on (nested) execution groups
Prepare Lowering of Sub-Operators using InlineNestedMapPass and FinalizePass
Still reworking sub-operators
Apr 13, 2024 :
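Added GPU support, mainly NVIDIA CUDA
Major LLVM upgrade with large changes (related to LLVM syntax)
The Dockerfile in this version actually drops ninja-build
Uses their own modified LLVM (the lingo-db fork); however, this modification was wiped out by the December 2024 changes: https://github.com/lingo-db/llvm-project/commit/f4126023323ce4af4567d7a0fb8079e1424067a2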
Apr 10, 2024:
Runtime: small improvements for arrow table/column
A lot of Arrow wrapping is visible here: ArrowSchema.cpp
along with other Apache Arrow-related changes
Python: add bindings for subop dialect
Python bindings for SubOp
Plus a series of SubOp changes:
SubOp: implement scans of local tables
SubOp: support params in create_thread_local
SubOp: replace tuples.GetParam at the end
DB/DSA/SubOp: refactor arrow types and column iteration/builder
What the hell?
Why on earth would they assign inside an if condition 😅
if (auto functionType = mlir::dyn_cast_or_null<mlir::FunctionType>(op.getType())) {
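For what it's worth, this form is a declaration inside the condition rather than a stray assignment, and it is a common idiom in LLVM/MLIR code: the variable exists only within the if statement, and the branch is taken only when the cast yields a non-null result. A minimal sketch of the same shape with plain dynamic_cast standing in for mlir::dyn_cast_or_null is shown below; the Type, FunctionType, and IntegerType classes here are hypothetical stand-ins, not MLIR's.

```cpp
#include <string>

// Hypothetical stand-ins for a type hierarchy; these are NOT MLIR classes.
struct Type {
    virtual ~Type() = default;
};
struct FunctionType : Type {
    int numInputs = 0;
};
struct IntegerType : Type {};

// The variable declared in the condition is scoped to the if statement,
// and the branch runs only when the cast result is non-null.
// dynamic_cast on a null pointer yields null, so null input is handled too.
std::string describe(const Type* type) {
    if (auto* fn = dynamic_cast<const FunctionType*>(type)) {
        return "function with " + std::to_string(fn->numInputs) + " inputs";
    }
    return "not a function";
}
```

The benefit is that the cast result cannot be used by accident outside the branch where it is known to be valid.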