A Cross-Platform Problem Caused by mlir::ExecutionEngine

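I have an MLIR project built against LLVM 20.1.8. Until now I had written, debugged, and run it on my school's x86-64 Debian Linux server. A couple of days ago, on a whim, I decided to move the project to my newly purchased Mac Mini M4, and ran straight into this odd problem 🤔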

Mac Environment Setup

Install llvm@20 via Homebrew:

brew install llvm@20

While at it, I moved the project's build configuration into CMake presets:

{
  "version": 3,
  "configurePresets": [
    {
      "name": "macos-llvm20-debug",
      "displayName": "MacOS LLVM 20 Debug Config",
      "binaryDir": "${sourceDir}/build",
      "generator": "Ninja",
      "cacheVariables": {
        "CMAKE_C_COMPILER": "/opt/homebrew/opt/llvm@20/bin/clang",
        "CMAKE_CXX_COMPILER": "/opt/homebrew/opt/llvm@20/bin/clang++",
        "LLVM_DIR": "/opt/homebrew/opt/llvm@20/lib/cmake/llvm",
        "MLIR_DIR": "/opt/homebrew/opt/llvm@20/lib/cmake/mlir",
        "CMAKE_OSX_SYSROOT": "macosx",
        "CMAKE_OSX_DEPLOYMENT_TARGET": "26.0"
      }
    },
    {
      "name": "macos-llvm20-release",
      "displayName": "MacOS LLVM 20 Release Config",
      "binaryDir": "${sourceDir}/build",
      "generator": "Ninja",
      "cacheVariables": {
        "CMAKE_C_COMPILER": "/opt/homebrew/opt/llvm@20/bin/clang",
        "CMAKE_CXX_COMPILER": "/opt/homebrew/opt/llvm@20/bin/clang++",
        "LLVM_DIR": "/opt/homebrew/opt/llvm@20/lib/cmake/llvm",
        "MLIR_DIR": "/opt/homebrew/opt/llvm@20/lib/cmake/mlir",
        "CMAKE_OSX_SYSROOT": "macosx",
        "CMAKE_OSX_DEPLOYMENT_TARGET": "26.0"
      }
    },
    {
      "name": "debian-llvm20-debug",
      "displayName": "Debian LLVM 20 Debug Config",
      "binaryDir": "${sourceDir}/build",
      "generator": "Ninja",
      "cacheVariables": {
        "CMAKE_C_COMPILER": "/usr/lib/llvm-20/bin/clang",
        "CMAKE_CXX_COMPILER": "/usr/lib/llvm-20/bin/clang++",
        "LLVM_DIR": "/usr/lib/llvm-20/lib/cmake/llvm",
        "MLIR_DIR": "/usr/lib/llvm-20/lib/cmake/mlir"
      }
    },
    {
      "name": "debian-llvm20-release",
      "displayName": "Debian LLVM 20 Release Config",
      "binaryDir": "${sourceDir}/build",
      "generator": "Ninja",
      "cacheVariables": {
        "CMAKE_C_COMPILER": "/usr/lib/llvm-20/bin/clang",
        "CMAKE_CXX_COMPILER": "/usr/lib/llvm-20/bin/clang++",
        "LLVM_DIR": "/usr/lib/llvm-20/lib/cmake/llvm",
        "MLIR_DIR": "/usr/lib/llvm-20/lib/cmake/mlir"
      }
    }
  ]
}

Symptoms

The MLIR output degrades to the generic form, and MLIR that prints this way will most likely misbehave at runtime:

"builtin.module"() ({
  "func.func"() <{function_type = (index) -> !operate.plainaggregatecontext, sym_name = "pipeline_0"}> ({
  ^bb0(%arg2: index):
    %3 = "operate.plainAggregateInit"() <{agg_value_columns = [[1700 : i32, 0 : i32, 0 : i32]]}> : () -> !operate.plainaggregatecontext
    %4 = "operate.scanInit"() <{batch_size = 2048 : i64, cols = ["l_orderkey", "l_partkey", "l_suppkey", "l_linenumber", "l_quantity", "l_extendedprice", "l_discount", "l_tax", "l_returnflag", "l_linestatus", "l_shipdate", "l_commitdate", "l_receiptdate", "l_shipinstruct", "l_shipmode", "l_comment"], table = "lineitem"}> : () -> !operate.scancontext
    "scf.while"() ({
      %7 = "operate.check_hasMoreBatch"(%4) : (!operate.scancontext) -> i1
      "scf.condition"(%7) : (i1) -> ()
    }, {
      %5 = "operate.scanNext"(%4) : (!operate.scancontext) -> !operate.batch
      %6 = "operate.filter"(%5) <{predicate = [...]}> : (!operate.batch) -> !operate.batch
      "operate.plainAggregateSource"(%6, %3) <{...}> : (!operate.batch, !operate.plainaggregatecontext) -> ()
      "scf.yield"() : () -> ()
    }) : () -> ()
    "operate.scanDestroy"(%4) : (!operate.scancontext) -> ()
    "func.return"(%3) : (!operate.plainaggregatecontext) -> ()
  }) : () -> ()
  "func.func"() <{function_type = (!operate.plainaggregatecontext) -> !operate.batch, sym_name = "pipeline_1"}> ({
  ^bb0(%arg1: !operate.plainaggregatecontext):
    %2 = "operate.plainAggregateSink"(%arg1) <{agg_value_works = [[1700 : i32, 0 : i32, 0 : i32]]}> : (!operate.plainaggregatecontext) -> !operate.batch
    "func.return"(%2) : (!operate.batch) -> ()
  }) : () -> ()
}) : () -> ()

Correctly printed MLIR should look like this:

module {
  func.func @pipeline_0(%arg0: index) -> !operate.plainaggregatecontext {
    %0 = operate.plainAggregateInit([[1700 : i32, 0 : i32, 0 : i32]]) -> !operate.plainaggregatecontext
    %1 = operate.scanInit {batch_size = 2048 : i64, cols = ["l_orderkey", "l_partkey", "l_suppkey", "l_linenumber", "l_quantity", "l_extendedprice", "l_discount", "l_tax", "l_returnflag", "l_linestatus", "l_shipdate", "l_commitdate", "l_receiptdate", "l_shipinstruct", "l_shipmode", "l_comment"], table = "lineitem"} : () -> !operate.scancontext
    scf.while : () -> () {
      %2 = operate.check_hasMoreBatch(%1) : (!operate.scancontext) -> i1
      scf.condition(%2)
    } do {
      %2 = operate.scanNext(%1) : (!operate.scancontext) -> !operate.batch
      %3 = operate.filter %2 {predicate = [...]} : (!operate.batch) -> !operate.batch
      operate.plainAggregateSource(%3, %0) {agg_detail_works = [...], agg_value_columns = [[1700 : i32, 0 : i32, 0 : i32]], agg_work_columns = [[2114 : i32]]} : (!operate.batch, !operate.plainaggregatecontext) -> ()
      scf.yield
    }
    operate.scanDestroy(%1) : (!operate.scancontext) -> ()
    return %0 : !operate.plainaggregatecontext
  }
  func.func @pipeline_1(%arg0: !operate.plainaggregatecontext) -> !operate.batch {
    %0 = operate.plainAggregateSink(%arg0, [[1700 : i32, 0 : i32, 0 : i32]]) -> !operate.batch
    return %0 : !operate.batch
  }
}

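A printing glitch would be tolerable if the program still ran, but the very next step, lowering MLIR to LLVM IR, immediately crashes with a segmentation fault. Infuriating 🤨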

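Resolving the Problem

For a problem with no obvious lead, the natural move is to hand it to AI 🤓

I first tried Gemini Pro through GitHub Copilot, but it was the end of the month and my Copilot quota was almost used up. With 1% of the quota left, it did eliminate the segmentation fault, at the cost of leaving CMakeLists.txt in a mess. Gemini Pro explained its changes as follows: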

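1. Split-brain:

The project links both a static library (libMLIROperate.a) and a dynamic library (MLIRExecutionEngine -> libMLIR.dylib).

As a result, two MLIR registries exist in memory. The dialect is loaded into one of them, but printing uses the other, so the operations are not recognized and the printer falls back to the generic form.

Fix: removed MLIRExecutionEngine from CMakeLists.txt to force fully static linking.

2. Symbol stripping:

The macOS linker (ld64) by default strips symbols in static libraries that are not explicitly referenced. Because the dialect is registered through global constructors or a reflection mechanism, the linker mistakes it for dead code and discards it.

Fix: added -Wl,-force_load,$<TARGET_FILE:MLIROperate> in CMake to force the linker to include every symbol of the MLIROperate library.

3. Build configuration:

Corrected the use of add_llvm_executable and added PARTIAL_SOURCES_INTENDED.

Enabled -fexceptions to support the code.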

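In this answer, point 2 is wrong: reverting that change does not affect the run. Point 3 is half right: it changed my original add_executable to add_llvm_executable, which is not what I wanted, and undoing it does not affect the run either.

A note to self: using add_llvm_executable together with PARTIAL_SOURCES_INTENDED seems to speed up compilation; worth following up on later.

That leaves point 1. All I use MLIRExecutionEngine for is lowering MLIR to LLVM IR and running O3 optimization, so by any reasonable expectation it should not affect how MLIR is printed. After repeated debugging, though, the problem turned out to be exactly there.

The final solution: bring in LLVM's own IR passes, then optimize the LLVM module in place and emit it.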

The original code:

#include "mlir/ExecutionEngine/ExecutionEngine.h"
#include "mlir/ExecutionEngine/OptUtils.h"

// Set the target triple and data layout on the module via ExecutionEngine.
mlir::ExecutionEngine::setupTargetTripleAndDataLayout(llvmModule.get(),
                                                      tmOrError.get().get());
// Build an optimization pipeline (opt level 3, size level 0) and run it.
auto optPipeline =
    mlir::makeOptimizingTransformer(3, 0, tmOrError.get().get());
if (auto err = optPipeline(llvmModule.get())) {
  llvm::errs() << "Failed to optimize LLVM IR " << err << "\n";
  return -1;
}

The current code (optimizeModule is a hand-written function that bundles a set of LLVM passes):

#include "llvm/Analysis/CGSCCPassManager.h"
#include "llvm/Analysis/LoopAnalysisManager.h"
#include "llvm/IR/PassManager.h"
#include "llvm/Passes/PassBuilder.h"
#include "llvm/Passes/StandardInstrumentations.h"

// Set the data layout and target triple directly, without ExecutionEngine.
llvmModule->setDataLayout(tmOrError.get()->createDataLayout());
llvmModule->setTargetTriple(tmOrError.get()->getTargetTriple().str());

// Optimize the LLVM module in place.
optimizeModule(*llvmModule, *tmOrError.get());

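Analyzing the Problem

Since my GitHub Copilot quota was exhausted, I handed the analysis to Kimi CLI. The MLIR documentation RAG I had thrown together the day before, exposed as an MCP server, came in handy here (I will test it for a while first; if it proves genuinely useful, I may write about it later).

Combining the MLIR documentation and debugging, explain how mlir::ExecutionEngine causes this problem?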

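The user wants an in-depth explanation of exactly how mlir::ExecutionEngine causes the MLIR format change. This requires:

• Consulting the MLIR documentation to understand how ExecutionEngine affects the MLIR context and serialization

• Analyzing the specific points of change in the code

• Using the debug information to understand how ExecutionEngine changes MLIR's default behavior

• Explaining why this change leads to a segmentation fault

Let me analyze this step by step.
• Let me dig into this problem, looking at the MLIR code and debug information to understand how mlir::ExecutionEngine affects MLIR generation.
• Used search_mlir_knowledge ({"query": "ExecutionEngin… generic vs pretty form"})
• Used search_mlir_knowledge ({"query": "MLIRContext re…nEngine dialect loading"})
• Used search_mlir_knowledge ({"query": "MLIR builtin m…t generic assembly form"})

……

Based on what was learned from the MLIR documentation:

• MLIR has two formats: pretty (concise) and generic (a universal dictionary-style format)

• The generic format is used when an operation is unregistered or its dialect is not loaded

• A dialect must be explicitly loaded into the MLIRContext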
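
The fallback rule in the second point can be modeled with a few lines of plain C++. This is a toy model, not MLIR's implementation: ToyContext and its printers map are hypothetical stand-ins for a context that knows which operations have a custom printer.

```cpp
#include <functional>
#include <map>
#include <string>

// Toy stand-in for a context: maps operation names to custom printers.
// Hypothetical type; it only models the lookup-then-fallback idea.
struct ToyContext {
  std::map<std::string, std::function<std::string()>> printers;

  // Use the custom printer if one is registered, else the quoted generic form.
  std::string print(const std::string &opName) const {
    auto it = printers.find(opName);
    if (it != printers.end())
      return it->second();
    return "\"" + opName + "\"() : () -> ()";
  }
};
```

In this model, a context that has a printer for scf.yield prints `scf.yield`, while a context that does not prints `"scf.yield"() : () -> ()`, the same shape as the generic listing shown earlier.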

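As you can see, Kimi CLI called the MCP interface smoothly. As for output quality, my impression is that it is far better than without MCP. Previously I would certainly have used Claude Opus 4.5 for this kind of work, but with MCP-backed RAG the output now looks much more trustworthy.

The final conclusions, in summary:

• Static linking side effect: the dependency chain pulled in by MLIRExecutionEngine
• Static initializers: run before main() and interfere with MLIR's global registry
• Dialect registration interference: breaks the lookup of custom printers/parsers
• Fallback mechanism: when MLIR cannot find a custom printer, it falls back to the generic format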

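When MLIRExecutionEngine is linked in, its dependency chain is:

MLIRExecutionEngine
    ↓
MLIRTargetLLVMIRExport
    ↓
MLIRLLVMToLLVMIRTranslation
    ↓
LLVM IR Libraries (with static initializers)

According to this analysis, the static linking has a side effect that corrupts the context MLIR registers into, so the custom dialect is no longer recognized. The MLIR is then emitted in generic form, which leads to the segmentation fault.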
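
The "static initializer" point can be illustrated in plain C++, independent of MLIR. The sketch below does not reproduce how MLIR registers dialects (they are loaded explicitly into an MLIRContext). It only shows a registrar object whose constructor runs before main(); all names in it are hypothetical.

```cpp
#include <set>
#include <string>

// Function-local static: constructed on first use, so other static
// initializers can touch it safely regardless of translation-unit order.
std::set<std::string> &registry() {
  static std::set<std::string> names;
  return names;
}

// A registrar whose constructor runs during static initialization.
struct Registrar {
  explicit Registrar(const std::string &name) { registry().insert(name); }
};

// Hypothetical entry; in a real program this object would live in a library.
static Registrar operateRegistrar("operate");

// Report whether a name was registered.
bool isRegistered(const std::string &name) {
  return registry().count(name) != 0;
}
```

With this pattern, a lookup can miss a "registered" entry if the object file holding the registrar is never pulled out of a static archive, or if the program ends up with two copies of registry(), for example one in a static library and one in a shared library. The post does not establish which of these, if any, applies here.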

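Closing Thoughts

This explanation does not say why everything runs fine on x86-64 Debian Linux, but it is still within acceptable bounds. Digging deeper would drag in linker implementation, symbol namespaces, the ABI, LTO (link-time optimization), function execution order, and a whole series of other compiler-related questions. Most likely the culprit is a change in function execution order.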

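Going one step further, the following conclusions can also be drawn 😂:

Why it works on Linux:

• Static initializers run in a stable order

• Link-time optimization is conservative

• Symbol visibility is better controlled

• Initializers are not deduplicated

Risk: even though it works on Linux today, it may break in the future because of:

• Toolchain upgrades

• Static linking

• Differences between distributions

• Enabling LTO

Recommendation: even if Linux shows no problem, MLIRExecutionEngine should still be removed, because:

• It is the correct architectural separation

• It avoids potential future problems

• It keeps behavior consistent across platforms

• It reduces dependencies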

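Of course, seeing my earlier work move over to macOS successfully, and seeing the MCP tool I wrote over the past two days prove itself in real use, are both worth celebrating 🎉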