At the beginning of this month I attended the LLVM Workshop at CGO. After giving my talk on the 2nd floor of the Sydney ICC, I wandered up to the 3rd floor (which was hosting CC, plus the PPoPP/HPCA workshops and tutorials) and happened to catch the MLIR Tutorial. I thought it was well done — it walks through implementing a simple tile abstraction in MLIR — but I wasn't feeling well that day and left early, and it was only in the past few days that I worked through the tutorial more or less end to end.


Project repository: Groverkss/mlir-tutor

Following the setup steps in the project's README.md gets everything up and running.

Notes on tutorial-opt

The project's opt tool lives at build/tutorial/tutorial-opt.

When producing output, you can pass --split-input-file so that the different snippets in a single input file are processed and printed separately.

If you want to run the Python programs, make sure the TUTORIAL_OPT environment variable is set correctly.

You can grep the help output to find the relevant passes (this display has never felt great to me — worth seeing whether it can be improved):

tutorial-opt --help | grep "tiny"

Bridging MLIR and Python

To me this is the most appealing part of the tutorial: it provides a very basic set of Python bindings for MLIR types and ops.

For example, here is the Ptr binding:

class Ptr:
    """Wrapper for !tiny.ptr SSA values."""

    _value: Value

    @staticmethod
    def _wrap(value: Value) -> "Ptr":
        p = Ptr()
        p._value = value
        return p

    @staticmethod
    def get_type() -> Type:
        """Get the !tiny.ptr type."""
        return Type.parse("!tiny.ptr")

    def load(self, offset: Index, num_elements: int) -> F16Vector:
        """Load vector<Nxf16> from pointer at offset."""
        vec_type = VectorType.get([num_elements], F16Type.get())
        op = Operation.create(
            "tiny.load",
            results=[vec_type],
            operands=[self._value, offset._value],
        )
        return F16Vector._wrap(op.result)

    def store(self, offset: Index, vec: F16Vector) -> None:
        """Store vector<Nxf16> to pointer at offset."""
        Operation.create(
            "tiny.store",
            results=[],
            operands=[vec._value, self._value, offset._value],
        )

compile_and_print converts a Python function to MLIR and prints the IR at each lowering stage:

def compile_and_print(fn):
    """Compile and print all lowering stages."""
    opt = TutorialOpt()
    with MLIRModule() as m:
        tiny_ir = m.build_func_verified(fn, _get_type_map(), opt)
        print("=== Tiny Dialect ===")
        print(tiny_ir)
        arith_ir = opt.run(tiny_ir, ["tiny-to-arith", "canonicalize", "cse"])
        print("=== After tiny-to-arith ===")
        print(arith_ir)
        llvm_ir = opt.run(tiny_ir, ["tiny-to-arith", "canonicalize", "cse",
                                    "tiny-to-llvm", "convert-to-llvm"])
        print("=== LLVM Dialect ===")
        print(llvm_ir)
        return llvm_ir

Similarly, the MLIR vector type is bound to NumPy arrays (a NumPy array is converted into an MLIR vector), and as the code shows, the arithmetic methods are bound along with it:

class F16Vector:
    """Wrapper for vector<Nxf16> SSA values."""

    _value: Value

    @staticmethod
    def _wrap(value: Value) -> "F16Vector":
        vec = F16Vector()
        vec._value = value
        return vec

    @staticmethod
    def constant(vals: list[float], size: int = None) -> "F16Vector":
        """Create a constant f16 vector via tiny.constant."""
        n = size or len(vals)
        vec_type = VectorType.get([n], F16Type.get())
        data = np.array(vals, dtype=np.float16)
        attr = DenseElementsAttr.get(data, type=vec_type)
        op = Operation.create(
            "tiny.constant",
            results=[vec_type],
            attributes={"value": attr},
        )
        return F16Vector._wrap(op.result)

    def _binop(self, other: "F16Vector", op_name: str) -> "F16Vector":
        op = Operation.create(
            f"tiny.{op_name}",
            results=[self._value.type],
            operands=[self._value, other._value],
        )
        return F16Vector._wrap(op.result)

    def __add__(self, other): return self._binop(other, "addf")
    def __sub__(self, other): return self._binop(other, "subf")
    def __mul__(self, other): return self._binop(other, "mulf")
    def __truediv__(self, other): return self._binop(other, "divf")

    def sum(self) -> "F16Vector":
        """Reduce to vector<1xf16> via tiny.sum."""
        result_type = VectorType.get([1], F16Type.get())
        op = Operation.create(
            "tiny.sum",
            results=[result_type],
            operands=[self._value],
        )
        return F16Vector._wrap(op.result)
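The pattern here — wrap an SSA value, overload Python's operators, and emit one op per operator — is easy to see in isolation. Below is a minimal MLIR-free sketch of the same idea; TraceValue and IR_TRACE are hypothetical names of mine, not part of the tutorial.

```python
# A minimal, MLIR-free sketch of the tutorial's wrapper pattern:
# each Python operator emits one "op" into a growing IR trace.
# TraceValue / IR_TRACE are hypothetical names, not from the tutorial.

IR_TRACE: list[str] = []
_counter = [0]

def _fresh() -> str:
    _counter[0] += 1
    return f"%{_counter[0]}"

class TraceValue:
    def __init__(self, name: str):
        self.name = name

    def _binop(self, other: "TraceValue", op_name: str) -> "TraceValue":
        result = TraceValue(_fresh())
        IR_TRACE.append(f"{result.name} = tiny.{op_name} {self.name}, {other.name}")
        return result

    def __add__(self, other): return self._binop(other, "addf")
    def __mul__(self, other): return self._binop(other, "mulf")

a, b = TraceValue("%a"), TraceValue("%b")
c = a * b + b  # records tiny.mulf, then tiny.addf
print("\n".join(IR_TRACE))
```

Ordinary Python evaluation order does the scheduling: by the time `a * b + b` finishes evaluating, the trace holds the ops in SSA order.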

On top of that, MLIRModule gets its own Python wrapper to suit the different usage scenarios:

class MLIRModule:
    """Context manager for building MLIR modules with unregistered dialects."""

    def __init__(self):
        self.ctx = None
        self.loc = None
        self.module = None

    def __enter__(self):
        self.ctx = Context()
        self.ctx.allow_unregistered_dialects = True
        self.ctx.__enter__()
        self.loc = Location.unknown()
        self.loc.__enter__()
        return self

    def __exit__(self, *args):
        self.loc.__exit__(*args)
        self.ctx.__exit__(*args)

    def build_func(self, fn: Callable, type_map: dict) -> Module:
        """Build MLIR module from a Python function.

        Args:
            fn: Function to compile (uses type annotations for args)
            type_map: Maps annotation types to (mlir_type, wrapper_class) tuples
        """
        sig = inspect.signature(fn)
        self.module = Module.create()
        with InsertionPoint(self.module.body):
            # Build input types from annotations
            input_types = []
            for param in sig.parameters.values():
                if param.annotation not in type_map:
                    raise ValueError(f"Unsupported type: {param.annotation}")
                mlir_type, _ = type_map[param.annotation]
                input_types.append(mlir_type() if callable(mlir_type) else mlir_type)
            # Create func.func
            func_op = func_d.FuncOp(fn.__name__, (input_types, []))
            with InsertionPoint(func_op.add_entry_block()):
                # Wrap block arguments in DSL types
                args = []
                for i, param in enumerate(sig.parameters.values()):
                    _, wrapper_cls = type_map[param.annotation]
                    args.append(wrapper_cls._wrap(func_op.arguments[i]))
                # Execute user function body
                fn(*args)
                # Add return
                func_d.return_([])
        return self.module

    def build_func_verified(self, fn: Callable, type_map: dict, opt: "TutorialOpt") -> str:
        """Build and verify module, return pretty-printed IR.

        Runs the generated IR through tutorial-opt to verify it's valid
        and get pretty-printed output using the dialect's assembly format.
        """
        self.build_func(fn, type_map)
        raw_ir = str(self.module)
        # Round-trip through tutorial-opt to verify and pretty-print
        return opt.run(raw_ir, [])

There are a few more implementation details worth noting:

  • tutorial-opt: the C++-built MLIR tool, which implements all the passes (tiny-to-arith, tiny-to-llvm, etc.)
  • TutorialOpt: a Python wrapper that invokes tutorial-opt through subprocess
  • run(): builds the command-line arguments, executes the tutorial-opt process, and returns the result
  • pass list: specifies which MLIR transformation passes to run
  • stdin/stdout: the IR is fed in via standard input and the transformed result is read from standard output
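A TutorialOpt-style wrapper can be sketched in a few lines. This is my own simplified model (SubprocessOpt is a hypothetical name, and a Python echo process stands in for the real tutorial-opt binary); it shows only the argv-building and the stdin/stdout plumbing.

```python
import subprocess
import sys

# Sketch of driving an external opt tool: build argv from the pass list,
# pipe IR through stdin/stdout. SubprocessOpt is a hypothetical name;
# the real tutorial wraps the tutorial-opt binary instead.

class SubprocessOpt:
    def __init__(self, binary: list[str]):
        self.binary = binary  # e.g. ["tutorial-opt"] in the real setup

    def run(self, ir: str, passes: list[str]) -> str:
        # Passes become flags like --tiny-to-arith on the command line.
        args = self.binary + [f"--{p}" for p in passes]
        proc = subprocess.run(args, input=ir, capture_output=True,
                              text=True, check=True)
        return proc.stdout

# Stand-in "opt" that just echoes stdin, so the sketch runs anywhere.
echo = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]
opt = SubprocessOpt(echo)
print(opt.run("func.func @f() { return }", []))
```

With `check=True`, a failing verifier in the child process surfaces as a Python exception, which is how IR errors propagate back to the DSL user.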

Incidentally, pulling in the complex MLIR dependency as a Python package was a first for me. The approach genuinely solves the MLIR-dependency problem, but at the cost of not being able to pin versions — by the time I tested, the mlir wheel for the original version was already gone (see the Issue).

Chapter 1

For the dialect, the passes need to be declared, and they can be defined in their own TableGen file:

//===- TinyPasses.td - Tiny dialect passes -----------------*- tablegen -*-===//
//
// Defines passes for the Tiny dialect.
//
// Reference: https://mlir.llvm.org/docs/PassManagement/#tablegen-specification
//
//===----------------------------------------------------------------------===//

#ifndef TINY_PASSES
#define TINY_PASSES

include "mlir/Pass/PassBase.td"

def TinyToArith : Pass<"tiny-to-arith"> {
  let summary = "Lower Tiny arithmetic operations to arith dialect.";
  let description = [{
    This pass lowers Tiny dialect arithmetic operations to equivalent operations
    in the arith dialect. Memory operations (load/store) and the ptr type are
    NOT converted by this pass - use --tiny-to-llvm for that.

    The lowering includes:
    - `tiny.constant` -> `arith.constant`
    - `tiny.addf/subf/mulf/divf` -> `arith.addf/subf/mulf/divf`
    - `tiny.addi/subi/muli/divi` -> `arith.addi/subi/muli/divsi`

    Example:
    ```mlir
    %0 = tiny.addf %a, %b : vector<4xf16>
    // Becomes:
    %0 = arith.addf %a, %b : vector<4xf16>
    ```
  }];
  let dependentDialects = [
    "mlir::arith::ArithDialect",
    "mlir::vector::VectorDialect"
  ];
}

def TinyToLLVM : Pass<"tiny-to-llvm"> {
  let summary = "Lower Tiny memory operations and ptr type to LLVM dialect.";
  let description = [{
    This pass lowers Tiny dialect memory operations and the ptr type to
    equivalent operations in the LLVM dialect.

    The lowering includes:
    - `!tiny.ptr` type -> `!llvm.ptr`
    - `tiny.load %ptr, %offset` -> GEP to compute byte address, then `llvm.load`
    - `tiny.store %val, %ptr, %offset` -> GEP to compute byte address, then `llvm.store`

    The offset in tiny.load/store is in f16 elements. The lowering converts this
    to a byte offset by using GEP with f16 element type:
    ```mlir
    %0 = tiny.load %ptr, %offset : vector<4xf16>
    // Becomes:
    %gep = llvm.getelementptr %ptr[%offset] : (!llvm.ptr, i64) -> !llvm.ptr, f16
    %0 = llvm.load %gep : !llvm.ptr -> vector<4xf16>
    ```

    Note: Run --tiny-to-arith first to convert arithmetic operations.
  }];
  let dependentDialects = [
    "mlir::LLVM::LLVMDialect"
  ];
}

#endif // TINY_PASSES

The type and op definitions are textbook-clean; the CPred built on isPowerOf2_64 validates the vector size:

// Constraint for vector<Nxf16> where N is a power of 2.
// Uses VectorOfRankAndType from CommonTypeConstraints.td (included via OpBase.td)
// with an additional power-of-2 size check.
def Tiny_VectorF16 : Type<
  And<[
    VectorOfRankAndType<[1], [F16]>.predicate,
    CPred<"::llvm::isPowerOf2_64("
          "::llvm::cast<::mlir::VectorType>($_self).getDimSize(0))">
  ]>,
  "vector of f16 with power-of-2 size",
  "::mlir::VectorType"
>;
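The predicate inside the CPred reduces to a one-line check: llvm::isPowerOf2_64(n) behaves like the classic bit trick, sketched here in Python.

```python
# What the CPred above checks, as plain Python: llvm::isPowerOf2_64(n)
# is equivalent to "n has exactly one set bit".

def is_power_of_2(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0

# The Tiny_VectorF16 constraint would accept these vector sizes...
assert all(is_power_of_2(n) for n in (1, 2, 4, 8, 16, 128))
# ...and reject these.
assert not any(is_power_of_2(n) for n in (0, 3, 6, 12, 100))
```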

Type definitions can also build on one another:

// Constraint for vector<1xf16> (result type for sum operation).
// Reuses Tiny_VectorF16 predicate and adds a size=1 constraint.
def Tiny_Vector1F16 : Type<
  And<[
    Tiny_VectorF16.predicate,
    CPred<"::llvm::cast<::mlir::VectorType>($_self).getDimSize(0) == 1">
  ]>,
  "vector<1xf16>",
  "::mlir::VectorType"
>;

ConstantOp accepts either a vector or an index:

def Tiny_ConstantOp : Tiny_Op<"constant", [Pure,
    AllTypesMatch<["value", "result"]>]> {
  let summary = "Creates a constant vector or index value.";
  let description = [{
    The `tiny.constant` operation creates a constant value which can be either
    a vector<Nxf16> or an index.

    Examples:
    ```mlir
    %0 = tiny.constant dense<[1.0, 2.0, 3.0, 4.0]> : vector<4xf16>
    %1 = tiny.constant 42 : index
    ```
  }];

  // TypedAttrInterface allows the type to be inferred from the attribute.
  let arguments = (ins TypedAttrInterface:$value);
  let results = (outs AnyTypeOf<[Tiny_VectorF16, Index]>:$result);

  // The attribute itself contains the type, so no need to print it separately.
  let assemblyFormat = "attr-dict $value";
}

Among the available op traits there is a SameOperandsAndResultType option:

class Tiny_IndexBinaryOp<string mnemonic, list<Trait> traits = []> :
    Tiny_Op<mnemonic, !listconcat([Pure, SameOperandsAndResultType], traits)> {
  let arguments = (ins Index:$lhs, Index:$rhs);
  let results = (outs Index:$result);
  let assemblyFormat = "$lhs `,` $rhs attr-dict";
}

The tiny dialect's arithmetic ops are lowered to the arith and vector dialects:

struct SumOpLowering : public OpRewritePattern<SumOp> {
  using OpRewritePattern<SumOp>::OpRewritePattern;

  LogicalResult matchAndRewrite(SumOp op,
                                PatternRewriter &rewriter) const override {
    Location loc = op.getLoc();
    Value input = op.getInput();
    // vector.reduction<add> returns a scalar f16.
    Value scalarSum = vector::ReductionOp::create(
        rewriter, loc, vector::CombiningKind::ADD, input);
    // Broadcast the scalar to vector<1xf16>.
    VectorType resultType = op.getResult().getType();
    rewriter.replaceOpWithNewOp<vector::BroadcastOp>(op, resultType, scalarSum);
    return success();
  }
};

The tiny dialect's pointer operations are lowered to the LLVM dialect:

class TinyToLLVMTypeConverter : public TypeConverter {
public:
  TinyToLLVMTypeConverter() {
    // Identity conversion for all types (fallback).
    addConversion([](Type type) { return type; });
    // Convert tiny.ptr to llvm.ptr.
    addConversion([](PtrType type) -> Type {
      return LLVM::LLVMPointerType::get(type.getContext());
    });
  }
};

struct StoreOpToLLVMLowering : public OpConversionPattern<StoreOp> {
  using OpConversionPattern<StoreOp>::OpConversionPattern;

  LogicalResult
  matchAndRewrite(StoreOp op, OpAdaptor adaptor,
                  ConversionPatternRewriter &rewriter) const override {
    Location loc = op.getLoc();
    // The adaptor provides the converted operands (ptr is now !llvm.ptr).
    Value value = adaptor.getValue();
    Value ptr = adaptor.getPtr();
    Value offset = adaptor.getOffset();
    // Convert index offset to i64 for GEP.
    Type i64Type = rewriter.getI64Type();
    Value offsetI64 =
        arith::IndexCastOp::create(rewriter, loc, i64Type, offset);
    // Create GEP with f16 element type to compute the address.
    Type f16Type = rewriter.getF16Type();
    Type llvmPtrType = LLVM::LLVMPointerType::get(getContext());
    Value gep = LLVM::GEPOp::create(rewriter, loc, llvmPtrType, f16Type, ptr,
                                    ValueRange{offsetI64});
    // Store the vector to the computed address.
    rewriter.replaceOpWithNewOp<LLVM::StoreOp>(op, value, gep);
    return success();
  }
};

The func dialect is marked as dynamically legal:

class TinyToLLVMPass : public impl::TinyToLLVMBase<TinyToLLVMPass> {
public:
  void runOnOperation() override {
    // Set up the type converter.
    TinyToLLVMTypeConverter typeConverter;

    // Set up the conversion target.
    ConversionTarget target(getContext());
    // Mark memory operations as illegal.
    target.addIllegalOp<LoadOp, StoreOp>();
    // Mark LLVM and arith dialects as legal.
    target.addLegalDialect<LLVM::LLVMDialect>();
    target.addLegalDialect<arith::ArithDialect>();
    // Mark func dialect operations as dynamically legal if their types are
    // converted.
    target.addDynamicallyLegalOp<func::FuncOp>([&](func::FuncOp op) {
      return typeConverter.isSignatureLegal(op.getFunctionType()) &&
             typeConverter.isLegal(&op.getBody());
    });
    target.addDynamicallyLegalOp<func::ReturnOp>([&](func::ReturnOp op) {
      return typeConverter.isLegal(op.getOperandTypes());
    });

    // Set up rewrite patterns.
    RewritePatternSet patterns(&getContext());
    // Add conversion patterns that use the type converter.
    patterns.add<LoadOpToLLVMLowering, StoreOpToLLVMLowering>(typeConverter,
                                                              &getContext());
    // Add function signature conversion patterns.
    populateFunctionOpInterfaceTypeConversionPattern<func::FuncOp>(
        patterns, typeConverter);
    populateReturnOpTypeConversionPattern(patterns, typeConverter);

    // Apply the conversion.
    if (failed(applyPartialConversion(getOperation(), target,
                                      std::move(patterns))))
      signalPassFailure();
  }
};

Chapter 2

This chapter is mainly about lowering Tiny to SCF; on the Python side, for loops have to be turned into SCF constructs.

Take accumulate as an example: the Python decorator turns a loop body into tiny_loop.accumulate and tiny_loop.yield ops:

def decorator(body_fn):
    # Get init value types and MLIR values
    init_values = [v._value for v in inits]
    init_types = [v._value.type for v in inits]
    result_types = init_types  # Results match init types

    # Create the accumulate op with one region
    op = Operation.create(
        "tiny_loop.accumulate",
        results=result_types,
        operands=[bound._value, step._value] + init_values,
        regions=1,  # One region for the body
    )

    # Set up the block with arguments: (index, *iter_args)
    region = op.regions[0]
    block_arg_types = [IndexType.get()] + init_types
    block = Block.create_at_start(region, block_arg_types)

    # Execute body with wrapped arguments
    with InsertionPoint(block):
        iv = Index._wrap(block.arguments[0])
        iter_args = [_wrap_value(block.arguments[i + 1], inits[i])
                     for i in range(len(inits))]
        # Call user's body function
        if inits:
            results = body_fn(iv, *iter_args)
            if not isinstance(results, (list, tuple)):
                results = [results]
            yield_values = [r._value for r in results]
        else:
            body_fn(iv)
            yield_values = []
        # Create tiny_loop.yield
        Operation.create("tiny_loop.yield", operands=yield_values)

    # Wrap and return results
    if result_types:
        return [_wrap_value(op.results[i], inits[i])
                for i in range(len(result_types))]
    return None

return decorator
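To make the semantics concrete, here is a pure-Python model of what an accumulate loop computes. It executes directly instead of emitting tiny_loop.accumulate/tiny_loop.yield ops; this accumulate helper is my own sketch, not the tutorial's API.

```python
# Pure-Python model of the accumulate construct's semantics: an
# scf.for-style loop carrying iter_args, driven by a decorated body.
# The helper is a hypothetical sketch, not the tutorial's API.

def accumulate(bound: int, step: int, *inits):
    def decorator(body_fn):
        iter_args = list(inits)
        for iv in range(0, bound, step):
            results = body_fn(iv, *iter_args)
            if not isinstance(results, (list, tuple)):
                results = [results]
            iter_args = list(results)  # carried into the next iteration
        return iter_args if len(iter_args) > 1 else iter_args[0]
    return decorator

# Sum 0 + 1 + ... + 9 with a single carried accumulator.
total = accumulate(10, 1, 0)(lambda iv, acc: acc + iv)
print(total)  # 45
```

The body's return values become the next iteration's iter_args, exactly the role tiny_loop.yield plays in the emitted IR.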

matmul.py demonstrates matrix multiplication.

It is a vectorized matrix multiplication, showing how to use Chapter 2's loop construct (accumulate) to write a high-performance matmul.

Algorithm: C[M,N] = A[M,K] * B[K,N]^T

Key points:

  • B is stored transposed, so both A and B are contiguous along the K dimension (convenient for vectorized loads)
  • the vector size is 16 (16 f16 elements are loaded at a time)
  • three nested loops: over M, over N, and over K
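The layout trick can be modeled in plain Python (a scalar sketch of mine, ignoring the 16-wide f16 vectorization): B is stored as Bt with shape [N, K], so the innermost K loop walks both operands contiguously.

```python
# Plain-Python model of the kernel's data layout: B is stored transposed
# (shape [N, K]), so both A[i] and Bt[j] are walked contiguously along K.
# The real tutorial kernel vectorizes this inner loop 16 elements at a time.

def matmul_bt(A, Bt, M, N, K):
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):          # M dimension
        for j in range(N):      # N dimension
            acc = 0.0
            for k in range(K):  # K dimension: contiguous in both A and Bt
                acc += A[i][k] * Bt[j][k]
            C[i][j] = acc
    return C

A  = [[1.0, 2.0], [3.0, 4.0]]     # 2x2
Bt = [[5.0, 7.0], [6.0, 8.0]]     # B transposed: B = [[5, 6], [7, 8]]
print(matmul_bt(A, Bt, 2, 2, 2))  # [[19.0, 22.0], [43.0, 50.0]]
```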

Chapter 2 has a standalone tiny_loop dialect, implementing accumulate (corresponding to scf.for) and yield (corresponding to scf.yield).

Chapter 3

Chapter 3 builds a tile abstraction along the lines of Triton and TileIR.

The corresponding dialect is tiny_tile; the lowering uses the GPU dialect (output only — nothing is actually executed).

For the Tile type, assemblyFormat cannot express the required parsing, so matching parse and print methods have to be written by hand:

Type TileType::parse(AsmParser &parser) {
  if (parser.parseLess())
    return Type();

  // Parse "HxW" as a dimension list. MLIR's lexer treats "64x128" as a single
  // dimension list token, so we must use parseDimensionList.
  SmallVector<int64_t, 2> dims;
  if (parser.parseDimensionList(dims, /*allowDynamic=*/false,
                                /*withTrailingX=*/false))
    return Type();

  // We expect exactly 2 dimensions for a 2D tile.
  if (dims.size() != 2) {
    parser.emitError(parser.getCurrentLocation())
        << "expected 2 dimensions for tile, got " << dims.size();
    return Type();
  }

  // Parse required comma and layout attribute.
  if (parser.parseComma())
    return Type();
  LayoutAttr layout;
  if (parser.parseAttribute(layout))
    return Type();
  if (parser.parseGreater())
    return Type();

  return TileType::get(parser.getContext(), dims[0], dims[1], layout);
}

/// Print a tile type: `<` HxW `,` layout `>`
void TileType::print(AsmPrinter &printer) const {
  printer << "<" << getHeight() << "x" << getWidth() << ", " << getLayout()
          << ">";
}
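The same parse logic is easy to mirror in Python. A rough sketch (parse_tile is a hypothetical helper of mine; the C++ above is the real implementation):

```python
import re

# Sketch of the same parse logic: split "<HxW, layout>" into two
# dimensions plus a layout-attribute string, mirroring TileType::parse.
# parse_tile is a hypothetical helper, not from the tutorial.

def parse_tile(text: str):
    m = re.fullmatch(r"<(\d+)x(\d+),\s*(.+)>", text)
    if m is None:
        raise ValueError("expected <HxW, layout>")
    return int(m.group(1)), int(m.group(2)), m.group(3)

print(parse_tile("<64x128, #tiny_tile.layout<thread = [1, 32], vector_size = 8>>"))
```

Note why the C++ needs parseDimensionList: "64x128" arrives from MLIR's lexer as one token, whereas the regex here can split it character by character.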

For tiny_tile.splat, the layout plans the matrix and the threads at the same time:

#tiny_tile.layout<thread = [1, 32], vector_size = 8>

The layout's parameter definition — note that the parameters are plain int64_t (in MLIR this becomes an attribute):

let parameters = (ins
  ArrayRefParameter<"int64_t">:$thread,
  "int64_t":$vectorSize
);

The tile size is computed jointly from thread and vector_size.

Example 1: thread = [4, 16], vector_size = 4

Thread grid: 4 rows × 16 columns = 64 threads. Tile size: 4 × (16×4) = 4×64 = 256 elements.

(A 2D thread layout maps nicely onto GPU warps.)
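The rule from the example can be written down directly (tile_shape is a hypothetical helper of mine):

```python
# Tile shape implied by a layout, per the rule
# tile = thread[0] x (thread[1] * vector_size).
# tile_shape is a hypothetical helper, not from the tutorial.

def tile_shape(thread: list[int], vector_size: int) -> tuple[int, int]:
    return thread[0], thread[1] * vector_size

h, w = tile_shape([4, 16], 4)
print(h, w, h * w)  # 4 64 256 -> 64 threads covering 256 elements
```

The earlier layout `thread = [1, 32], vector_size = 8` gives a 1×256 tile by the same rule.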

tiny_tile can be lowered to tiny and tiny_loop, with the thread handling lowered to the GPU dialect.

The lowerings should produce:

  • LoadOp -> gpu.thread_id + tiny.load (compute per-thread offset from layout)

  • StoreOp -> gpu.thread_id + tiny.store (compute per-thread offset from layout)

  • SumOp -> tiny.sum + vector.extract + gpu.subgroup_reduce + vector.broadcast

The rationale for representing threads in two dimensions:

  • thread[0]: number of threads along Y (vertical) = the range of thread_y
  • thread[1]: number of threads along X (horizontal) = the range of thread_x
  • vector_size: the vector width each thread processes
  • actual tile size: thread[0] × (thread[1] × vector_size)
  • mapping order: row-major — X (columns) fills up first, then Y (rows) advances
  • benefit: exploits GPU hardware characteristics for better cache locality and execution efficiency
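The row-major mapping can be sketched as follows (thread_offset is my own illustration of the per-thread offset arithmetic the lowerings perform, with a fixed row stride standing in for the runtime stride):

```python
# Row-major mapping from a flat thread id to its (y, x) position and the
# starting element each thread owns, under thread = [Ty, Tx] and vector_size V.
# thread_offset is a hypothetical sketch of the lowerings' offset arithmetic.

def thread_offset(tid: int, thread: list[int], vector_size: int,
                  row_stride: int) -> int:
    ty, tx = thread
    y, x = divmod(tid, tx)  # X fills first, then Y advances (row-major)
    assert y < ty, "thread id outside the grid"
    return y * row_stride + x * vector_size

# thread = [4, 16], vector_size = 4: one tile row holds 16 * 4 = 64 elements.
offsets = [thread_offset(t, [4, 16], 4, 64) for t in range(64)]
print(offsets[:4], offsets[16])  # threads 0..3 -> 0,4,8,12; thread 16 starts row 1
```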

Chapter 3's gpu_dot_product.py uses tiny_tile to compute a dot product.

It is a GPU-parallel dot product implementation, using the tile-based DSL for SPMD (Single Program Multiple Data) computation across multiple GPU blocks.

This was the first time I'd seen an op definition whose assembly contains a bare keyword (it is actually passed in as an EnumAttr):

- tiny_tile.elementwise add -> tiny.addf

- tiny_tile.elementwise sub -> tiny.subf

- tiny_tile.elementwise mul -> tiny.mulf

- tiny_tile.elementwise div -> tiny.divf

def TinyTile_EWKind_Add : I32EnumAttrCase<"add", 0>;
def TinyTile_EWKind_Sub : I32EnumAttrCase<"sub", 1>;
def TinyTile_EWKind_Mul : I32EnumAttrCase<"mul", 2>;
def TinyTile_EWKind_Div : I32EnumAttrCase<"div", 3>;

def TinyTile_ElementwiseKind : I32EnumAttr<
    "ElementwiseKind",
    "Elementwise operation kind",
    [TinyTile_EWKind_Add, TinyTile_EWKind_Sub,
     TinyTile_EWKind_Mul, TinyTile_EWKind_Div]> {
  let cppNamespace = "::mlir::tiny_tile";
  let genSpecializedAttr = 0;
}

def TinyTile_ElementwiseKindAttr :
    EnumAttr<TinyTile_Dialect, TinyTile_ElementwiseKind, "ew_kind">;

getKind() decides which direction the lowering takes:

LogicalResult ElementwiseOp::convertToSIMT(RewriterBase &rewriter,
                                           ValueRange simtOperands) {
  Value lhs = simtOperands[0];
  Value rhs = simtOperands[1];
  switch (getKind()) {
  case ElementwiseKind::add:
    rewriter.replaceOpWithNewOp<tiny::AddFOp>(*this, lhs, rhs);
    break;
  case ElementwiseKind::sub:
    rewriter.replaceOpWithNewOp<tiny::SubFOp>(*this, lhs, rhs);
    break;
  case ElementwiseKind::mul:
    rewriter.replaceOpWithNewOp<tiny::MulFOp>(*this, lhs, rhs);
    break;
  case ElementwiseKind::div:
    rewriter.replaceOpWithNewOp<tiny::DivFOp>(*this, lhs, rhs);
    break;
  }
  return success();
}

tutorial/ch3-gpu-tile-dsl/TinyTileDialect.cpp implements convertToSIMT for many types of ops; the comments in that file are worth a close read.

  • SplatOp: creates a per-thread constant vector; input (tile): a scalar value; output (vector): tiny.constant dense<…> : vector
  • LoadOp: computes the per-thread address offset; input: ptr, row, col, stride; output: %thread_id + address arithmetic → tiny.load
  • StoreOp: computes the per-thread address offset; input: value, ptr, row, col, stride; output: %thread_id + address arithmetic → tiny.store
  • SumOp: two-level reduction (thread-local + cross-thread); input: vector; output: tiny.sum + extract + subgroup_reduce + broadcast

The challenge exercise is to implement a tiny_tile matmul.

Since I'm not very familiar with the GPU dialect, I don't feel I fully understood the conversions in this chapter.

Chapter 4

This chapter is mainly about the Linalg dialect and the Transform dialect, used to achieve tile-level operator fusion.

The docs mention that the Transform dialect takes inspiration from Halide IR ("Halide like") — a claim I hadn't heard before (and if you press an AI on it, you'll find Halide is an ancestor of both TVM and MLIR).

Halide is very interesting because it separates the algorithm from the schedule.

  • arithmetic, math functions (sin, exp): algorithm (compute logic)
  • dependencies between pixels (e.g. a box blur): algorithm (compute logic)
  • loop order (row-by-row, or tiled): schedule
  • whether to use multithreading (parallel): schedule
  • whether to vectorize: schedule
  • size and placement of temporary buffers: schedule

There are three core reasons for separating the algorithm from the schedule:

  1. Mathematical correctness: the algorithm part is easy to verify. As long as the math is right, the result should in theory be identical no matter how the schedule changes (tiling, parallelism).
  2. Modularity: one algorithm can have many schedules. For example, one schedule for a phone CPU and another for a high-end GPU, without changing a single line of the algorithm code.
  3. Operator fusion: because the algorithm is a pure functional description, the compiler can clearly see the dependencies between functions and decide automatically whether to fuse two steps into one computation, without the programmer manually restructuring loops.
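The separation can be shown in miniature (an illustrative sketch of mine, not Halide code): the "algorithm" stays fixed while two "schedules" traverse the data differently and agree on the result.

```python
# Algorithm vs. schedule, in miniature: the "algorithm" (square then sum)
# is fixed, while two "schedules" (naive vs. blocked loop order) produce
# the same result. Illustrative sketch only, not Halide code.

def algorithm(x: float) -> float:
    return x * x  # the math: pure and schedule-free

def schedule_naive(data: list[float]) -> float:
    return sum(algorithm(v) for v in data)

def schedule_blocked(data: list[float], block: int = 4) -> float:
    total = 0.0
    for start in range(0, len(data), block):  # tiled loop: a schedule choice
        for v in data[start:start + block]:
            total += algorithm(v)
    return total

data = [float(i) for i in range(10)]
assert schedule_naive(data) == schedule_blocked(data)  # same math, any schedule
print(schedule_naive(data))  # 285.0
```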

This part is essentially a simplified version of the official MLIR tutorial; if it still doesn't click, go read the Transform Dialect material on the MLIR website.