目前DuckDB两个版本的Sort均是由Laurens Kuiper(Software Developer at DuckDB Labs and Ph.D. Student in the Database Architectures group at CWI.)写的

因此DuckDB的Sort分为两个文件夹

src/common目录下,有sortsorting文件夹

v1.4.0的排序方法位于sorting目录下(具体方法则位于third_party/ska_sortthird_party/verge_sortthird_party/pdq_sort下面),对应的状态管理是

class SortLocalSinkState;
class SortGlobalSinkState;
class SortLocalSourceState;
class SortGlobalSourceState;

而v1.4.0之前的排序方法位于sort目录下,对应的状态管理是

struct GlobalSortState
struct LocalSortState

调研主要以TPC-H Q1为主,结合Debug版本的编译与GDB,观察程序的实际执行路径

SELECT
l_returnflag,
l_linestatus,
sum(l_quantity) AS sum_qty,
sum(l_extendedprice) AS sum_base_price,
sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
avg(l_quantity) AS avg_qty,
avg(l_extendedprice) AS avg_price,
avg(l_discount) AS avg_disc,
count(*) AS count_order
FROM
lineitem
WHERE
l_shipdate <= CAST('1998-09-02' AS date)
GROUP BY
l_returnflag,
l_linestatus
ORDER BY
l_returnflag,
l_linestatus;

最后结果的Aggregate分成4组,结果应该是下面情况

image-20250920221110138

PhysicalPlanGenerator::CreatePlan中,会将逻辑计划中的OrderBy转为物理计划,我们GDB从这里开始切入

case LogicalOperatorType::LOGICAL_ORDER_BY:
return CreatePlan(op.Cast<LogicalOrder>());

前置知识

Key与Payload

Key为排序键(参与排序的数据),Payload则是不参与排序的数据

  1. 分离存储:key数据和payload数据分别存储在不同的集合中
  2. 指针关联:通过在排序键中存储payload的指针来建立关联
  3. 类型特化:针对不同的排序键类型(固定16字节、24字节、32字节、变长32字节等)进行优化
  4. 延迟绑定:在数据追加时立即建立key-payload关联,为后续排序做准备

如果是存在多个列需要Sort,就会把多列的数据组合为一行数据,这样排序只需要做一次(就是以行为单位做Sort)

DuckDB v1.4.0的做法

定义了新的枚举类SortKetyType

enum class SortKeyType : uint8_t {
INVALID = 0,
//! Without payload
NO_PAYLOAD_FIXED_8 = 1,
NO_PAYLOAD_FIXED_16 = 2,
NO_PAYLOAD_FIXED_24 = 3,
NO_PAYLOAD_FIXED_32 = 4,
NO_PAYLOAD_VARIABLE_32 = 5,
//! With payload (requires row pointer in key)
PAYLOAD_FIXED_16 = 6,
PAYLOAD_FIXED_24 = 7,
PAYLOAD_FIXED_32 = 8,
PAYLOAD_VARIABLE_32 = 9,
};

sort.cpp

Sort的运行函数如下:

对于orders,这是一个std::vector

运行Q1时orders.size()为2,与order by的列数一致

如果进一步检查其中的expressions,可以发现是l_returnflagl_linestatus

orders.GetOrderModifieer()获取ASCDESC

order.expression->return_type获取列的类型

将以上信息传入create_children,作为构成Key组合键的信息

通过binder.BindScalarFunction(DEFAULT_SCHEMA, "create_sort_key", std::move(create_children), error)生成reate_sort_key——对于Q1,进入LogicalTypeId::BIGINT分支

同时紧跟在这段代码后面,生成decode_sort_key

decode_sort_key->return_type = LogicalType::STRUCT(std::move(decode_child_list));

根据projection_map生成payload_layout(这是一种行存形式tupleDataLayout)和填充output_projection_columns

vector<LogicalType> payload_types;
for (idx_t output_col_idx = 0; output_col_idx < projection_map.size(); output_col_idx++) {
const auto &input_col_idx = projection_map[output_col_idx];
const auto it = input_column_to_key.find(input_col_idx);
if (it != input_column_to_key.end()) {
// Projected column also appears as a key, just reference it
output_projection_columns.push_back({false, it->second, output_col_idx});
} else {
// Projected column does not appear as a key, add to payload layout
output_projection_columns.push_back({true, payload_types.size(), output_col_idx});
payload_types.push_back(input_types[input_col_idx]);
input_projection_map.push_back(input_col_idx);
}
}
payload_layout->Initialize(payload_types, TupleDataValidityType::CAN_HAVE_NULL_VALUES);

然后用std::sort对于输出用的project_column进行排序,将Key的那部分移动到vector的开头

std::sort(output_projection_columns.begin(), output_projection_columns.end(),
[](const SortProjectionColumn &lhs, const SortProjectionColumn &rhs) {
if (lhs.is_payload == rhs.is_payload) {
return lhs.layout_col_idx < rhs.layout_col_idx;
}
return lhs.is_payload < rhs.is_payload;
});

完整代码如下

Sort::Sort(ClientContext &context, const vector<BoundOrderByNode> &orders, const vector<LogicalType> &input_types,
vector<idx_t> projection_map, bool is_index_sort_p)
: key_layout(make_shared_ptr<TupleDataLayout>()), payload_layout(make_shared_ptr<TupleDataLayout>()),
is_index_sort(is_index_sort_p) {
// Convert orders to a single "create_sort_key" expression (and corresponding "decode_sort_key")
FunctionBinder binder(context);
vector<unique_ptr<Expression>> create_children;
vector<unique_ptr<Expression>> decode_children;
child_list_t<LogicalType> decode_child_list;
for (idx_t col_idx = 0; col_idx < orders.size(); col_idx++) {
const auto &order = orders[col_idx];
// Create: for each column we have two arguments: 1. the column, 2. sort specifier
create_children.emplace_back(order.expression->Copy());
create_children.emplace_back(make_uniq<BoundConstantExpression>(Value(order.GetOrderModifier())));
// Avoid having unnamed structs fields (otherwise we get a parser exception while binding)
const auto col_name = StringUtil::Format("c%llu", col_idx);
auto col_type = order.expression->return_type;
decode_child_list.emplace_back(col_name, col_type);
col_type = TypeVisitor::VisitReplace(col_type, [](const LogicalType &type) {
if (type.id() != LogicalTypeId::STRUCT) {
return type;
}
child_list_t<LogicalType> internal_child_list;
for (const auto &child : StructType::GetChildTypes(type)) {
internal_child_list.emplace_back(StringUtil::Format("c%llu", internal_child_list.size()), child.second);
}
return LogicalType::STRUCT(std::move(internal_child_list));
});
// Decode: for each column we have two arguments: 1. col name + type, 2. sort specifier
decode_children.emplace_back(make_uniq<BoundConstantExpression>(Value(col_name + " " + col_type.ToString())));
decode_children.emplace_back(make_uniq<BoundConstantExpression>(order.GetOrderModifier()));
}
ErrorData error;
create_sort_key = binder.BindScalarFunction(DEFAULT_SCHEMA, "create_sort_key", std::move(create_children), error);
if (!create_sort_key) {
throw InternalException("Unable to bind create_sort_key in Sort::Sort");
}
switch (create_sort_key->return_type.id()) {
case LogicalTypeId::BIGINT:
decode_children.insert(decode_children.begin(),
make_uniq<BoundReferenceExpression>(LogicalType::BIGINT, static_cast<storage_t>(0)));
break;
default:
D_ASSERT(create_sort_key->return_type.id() == LogicalTypeId::BLOB);
decode_children.insert(decode_children.begin(),
make_uniq<BoundReferenceExpression>(LogicalType::BLOB, static_cast<storage_t>(0)));
}
decode_sort_key = binder.BindScalarFunction(DecodeSortKeyFun::GetFunction(), std::move(decode_children));
if (!decode_sort_key) {
throw InternalException("Unable to bind decode_sort_key in Sort::Sort");
}
// A bit hacky, but this way we make sure that the output does contain the unnamed structs again
decode_sort_key->return_type = LogicalType::STRUCT(std::move(decode_child_list));
// For convenience, we fill the projection map if it is empty
if (projection_map.empty()) {
projection_map.reserve(input_types.size());
for (idx_t col_idx = 0; col_idx < input_types.size(); col_idx++) {
projection_map.push_back(col_idx);
}
}
// We need to output this many columns, reserve
output_projection_columns.reserve(projection_map.size());
// Create mapping from input column to key (so we won't duplicate columns in key/payload)
unordered_map<idx_t, idx_t> input_column_to_key;
for (idx_t key_idx = 0; key_idx < orders.size(); key_idx++) {
const auto &key_order_expr = *orders[key_idx].expression;
if (key_order_expr.GetExpressionClass() == ExpressionClass::BOUND_REF) {
input_column_to_key.emplace(key_order_expr.Cast<BoundReferenceExpression>().index, key_idx);
}
}
// Construct payload layout (excluding columns that also appear as key)
vector<LogicalType> payload_types;
for (idx_t output_col_idx = 0; output_col_idx < projection_map.size(); output_col_idx++) {
const auto &input_col_idx = projection_map[output_col_idx];
const auto it = input_column_to_key.find(input_col_idx);
if (it != input_column_to_key.end()) {
// Projected column also appears as a key, just reference it
output_projection_columns.push_back({false, it->second, output_col_idx});
} else {
// Projected column does not appear as a key, add to payload layout
output_projection_columns.push_back({true, payload_types.size(), output_col_idx});
payload_types.push_back(input_types[input_col_idx]);
input_projection_map.push_back(input_col_idx);
}
}
payload_layout->Initialize(payload_types, TupleDataValidityType::CAN_HAVE_NULL_VALUES);
// Sort the output projection columns so we're gathering the columns in order
std::sort(output_projection_columns.begin(), output_projection_columns.end(),
[](const SortProjectionColumn &lhs, const SortProjectionColumn &rhs) {
if (lhs.is_payload == rhs.is_payload) {
return lhs.layout_col_idx < rhs.layout_col_idx;
}
return lhs.is_payload < rhs.is_payload;
});
// Finally, initialize the key layout (now that we know whether we have a payload)
key_layout->Initialize(orders, create_sort_key->return_type, !payload_types.empty());
}

最后单独生成key_layout

key_layout->Initialize(orders, create_sort_key->return_type, !payload_types.empty());

tuple_data_layout.cpp

根据参与排序键的属性,生成sort_width

对于Q1,这里的sort_width的最终结果为4(2+2),GetTypeIdSize(physical_type)得到的是1

if (TypeIsConstantSize(physical_type)) {
// NULL byte + fixed-width type
sort_width += 1 + GetTypeIdSize(physical_type);
} else if (logical_type == LogicalType::VARCHAR && order.stats &&
StringStats::HasMaxStringLength(*order.stats)) {
// NULL byte + maximum string length + string delimiter
sort_width += 1 + StringStats::MaxStringLength(*order.stats) + 1;
} else {
// We don't know how long the key will be
sort_width = DConstants::INVALID_INDEX;
break;
}

根据row_width选择SortKetyType,对于Q1,因为LogicalTypeLogicalTypeId::BIGINT,所以初始值为8,因为有Payload所以再加8,最后sort_width 是16,由于有Payload,选择SortKeyType::PAYLOAD_FIXED_16

idx_t temp_row_width = type.id() == LogicalTypeId::BIGINT ? 8 : sort_width;
if (sort_width != DConstants::INVALID_INDEX && has_payload) {
temp_row_width += 8;
}
if (temp_row_width <= 8) {
D_ASSERT(!has_payload);
row_width = 8;
sort_key_type = SortKeyType::NO_PAYLOAD_FIXED_8;
} else if (temp_row_width <= 16) {
row_width = 16;
sort_key_type = has_payload ? SortKeyType::PAYLOAD_FIXED_16 : SortKeyType::NO_PAYLOAD_FIXED_16;
} else if (temp_row_width <= 24) {
row_width = 24;
sort_key_type = has_payload ? SortKeyType::PAYLOAD_FIXED_24 : SortKeyType::NO_PAYLOAD_FIXED_24;
} else if (temp_row_width <= 32) {
row_width = 32;
sort_key_type = has_payload ? SortKeyType::PAYLOAD_FIXED_32 : SortKeyType::NO_PAYLOAD_FIXED_32;
} else {
row_width = 32;
sort_key_type = has_payload ? SortKeyType::PAYLOAD_VARIABLE_32 : SortKeyType::NO_PAYLOAD_VARIABLE_32;
// Variable-size sort key, also set these properties
all_constant = false;
heap_size_offset = has_payload ? SortKey<SortKeyType::PAYLOAD_VARIABLE_32>::HEAP_SIZE_OFFSET
: SortKey<SortKeyType::NO_PAYLOAD_VARIABLE_32>::HEAP_SIZE_OFFSET;
}

sorted_run.cpp

在Sort算子的Sink阶段(可以理解为算子的执行与输出)如果有Payload,则需要给Payload设置相对应的Pointer

void SortedRun::Sink(DataChunk &key, DataChunk &payload) {
D_ASSERT(!finalized);
key_data->Append(key_append_state, key);
if (payload_data) {
D_ASSERT(key.size() == payload.size());
payload_data->Append(payload_append_state, payload);
SetPayloadPointer(key_append_state.chunk_state.row_locations, payload_append_state.chunk_state.row_locations,
key.size(), key_data->GetLayout().GetSortKeyType());
}
}

通过SetPayloadPointer设置相对应的指针,根据sort_key_type找到相对应的模板生成(对于Q1,这里的count为4)

static void SetPayloadPointer(Vector &key_locations, Vector &payload_locations, const idx_t count,
const SortKeyType &sort_key_type) {
switch (sort_key_type) {
case SortKeyType::PAYLOAD_FIXED_16:
return TemplatedSetPayloadPointer<SortKeyType::PAYLOAD_FIXED_16>(key_locations, payload_locations, count);
case SortKeyType::PAYLOAD_FIXED_24:
return TemplatedSetPayloadPointer<SortKeyType::PAYLOAD_FIXED_24>(key_locations, payload_locations, count);
case SortKeyType::PAYLOAD_FIXED_32:
return TemplatedSetPayloadPointer<SortKeyType::PAYLOAD_FIXED_32>(key_locations, payload_locations, count);
case SortKeyType::PAYLOAD_VARIABLE_32:
return TemplatedSetPayloadPointer<SortKeyType::PAYLOAD_VARIABLE_32>(key_locations, payload_locations, count);
default:
throw NotImplementedException("SetPayloadPointer for %s", EnumUtil::ToString(sort_key_type));
}
}

可以看到模板设置的是指针

template <SortKeyType SORT_KEY_TYPE>
static void TemplatedSetPayloadPointer(Vector &key_locations, Vector &payload_locations, const idx_t count) {
using SORT_KEY = SortKey<SORT_KEY_TYPE>;
const auto key_locations_ptr = FlatVector::GetData<SORT_KEY *>(key_locations);
const auto payload_locations_ptr = FlatVector::GetData<data_ptr_t>(payload_locations);
for (idx_t i = 0; i < count; i++) {
key_locations_ptr[i]->SetPayload(payload_locations_ptr[i]);
}
}

SortSwitch判断KeyType类型,从模板转入对应Sort方法

static void SortSwitch(const TupleDataCollection &key_data, bool is_index_sort) {
const auto sort_key_type = key_data.GetLayout().GetSortKeyType();
switch (sort_key_type) {
case SortKeyType::NO_PAYLOAD_FIXED_8:
return TemplatedSort<SortKeyType::NO_PAYLOAD_FIXED_8>(key_data, is_index_sort);
case SortKeyType::NO_PAYLOAD_FIXED_16:
return TemplatedSort<SortKeyType::NO_PAYLOAD_FIXED_16>(key_data, is_index_sort);
case SortKeyType::NO_PAYLOAD_FIXED_24:
return TemplatedSort<SortKeyType::NO_PAYLOAD_FIXED_24>(key_data, is_index_sort);
case SortKeyType::NO_PAYLOAD_FIXED_32:
return TemplatedSort<SortKeyType::NO_PAYLOAD_FIXED_32>(key_data, is_index_sort);
case SortKeyType::NO_PAYLOAD_VARIABLE_32:
return TemplatedSort<SortKeyType::NO_PAYLOAD_VARIABLE_32>(key_data, is_index_sort);
case SortKeyType::PAYLOAD_FIXED_16:
return TemplatedSort<SortKeyType::PAYLOAD_FIXED_16>(key_data, is_index_sort);
case SortKeyType::PAYLOAD_FIXED_24:
return TemplatedSort<SortKeyType::PAYLOAD_FIXED_24>(key_data, is_index_sort);
case SortKeyType::PAYLOAD_FIXED_32:
return TemplatedSort<SortKeyType::PAYLOAD_FIXED_32>(key_data, is_index_sort);
case SortKeyType::PAYLOAD_VARIABLE_32:
return TemplatedSort<SortKeyType::PAYLOAD_VARIABLE_32>(key_data, is_index_sort);
default:
throw NotImplementedException("TemplatedSort for %s", EnumUtil::ToString(sort_key_type));
}
}

TemplatedSort当中,则默认使用vergesort,fallback时会选择ska_sort

const auto ska_sort_width = MinValue<idx_t>(layout.GetSortWidth(), sizeof(uint64_t));

对于Q1,这里的ska_sort_width为4,为计算前面的sort_width所得到的

完整代码如下:

auto begin = BLOCK_ITERATOR(state, 0);
auto end = BLOCK_ITERATOR(state, key_data.Count());
const auto requires_next_sort =
is_index_sort ? false : !SORT_KEY::CONSTANT_SIZE || SORT_KEY::INLINE_LENGTH != sizeof(uint64_t);
const auto ska_sort_width = MinValue<idx_t>(layout.GetSortWidth(), sizeof(uint64_t));
const auto &sort_skippable_bytes = layout.GetSortSkippableBytes();
auto ska_extract_key =
SkaExtractKey<SORT_KEY>(requires_next_sort, ska_sort_width, sort_skippable_bytes, context.interrupted);
const auto fallback = [ska_extract_key](const BLOCK_ITERATOR &fb_begin, const BLOCK_ITERATOR &fb_end) {
duckdb_ska_sort::ska_sort(fb_begin, fb_end, ska_extract_key);
};
duckdb_vergesort::vergesort(begin, end, std::less<SORT_KEY>(), fallback);

vergesort.h

对于verge_sort小于128个元素的数组,使用回退算法

而Q1需要排序的Aggregate分组只有4个(小于阈值24),所以直接fallback到ska_sort

if (dist < 128) {
// Vergesort is inefficient for small collections
fallback(first, last);
return;
}

ska_sort.hpp

ska_sort内部则是inplace_radix_sort

template<typename It, typename ExtractKey>
static void ska_sort(It begin, It end, ExtractKey && extract_key)
{
detail::inplace_radix_sort<128, 1024>(begin, end, extract_key);
}

对于Q1会掉入StdSortFallbacks,这里的fallback被设置成为了pdqsort_branchless,同时配置好comp函数

template<typename It, typename ExtractKey>
inline void StdSortFallback(It begin, It end, ExtractKey & extract_key)
{
// LNK note that we use the full comparison (not just extracted key) here
static const auto comp = [&](const typename std::remove_reference<decltype(*begin)>::type & l, const typename std::remove_reference<decltype(*begin)>::type & r){ return l < r; };
static const auto fallback = [&](const It &fb_begin, const It &fb_end) {
duckdb_pdqsort::pdqsort_branchless(fb_begin, fb_end, comp);
};
duckdb_vergesort::vergesort(begin, end, comp, fallback);
}

调用堆栈如下图

image-20250920205541241

pdq_sort.h

Q1需要排序的Aggregate分组只有4个(小于阈值24),进入pdqsort_detail::pdqsort_loop

template<class Iter, class Compare>
inline void pdqsort_branchless(Iter begin, Iter end, Compare comp) {
if (begin == end) return;
pdqsort_detail::pdqsort_loop<Iter, Compare, true>(
begin, end, comp, pdqsort_detail::log2(end - begin));
}

由于再次小于阈值24,所以进入insertion_sort

if (size < insertion_sort_threshold) {
if (leftmost) insertion_sort(begin, end, comp);
else unguarded_insertion_sort(begin, end, comp);
return;
}

C++

template<class Iter, class Compare>
inline void insertion_sort(Iter begin, Iter end, Compare comp) {
typedef typename std::iterator_traits<Iter>::value_type T;
if (begin == end) return;
for (Iter cur = begin + 1; cur != end; ++cur) {
Iter sift = cur;
Iter sift_1 = cur - 1;
// Compare first so we can avoid 2 moves for an element already positioned correctly.
if (comp(*sift, *sift_1)) {
T tmp = PDQSORT_PREFER_MOVE(*sift);
do { *sift-- = PDQSORT_PREFER_MOVE(*sift_1); }
while (sift != begin && comp(tmp, *--sift_1));
*sift = PDQSORT_PREFER_MOVE(tmp);
}
}
}

DuckDB v1.4.0之前

我调试使用的是v1.3.2版本

状态管理默认的进入入口会是LocalSortState::SortInMemory()

对于这个版本,Sort会直接调用src/common/sort/radix_sort.cpp中的RadixSort

有意思的地方在于,虽然l_returnflag和l_linestatus是varchar,但contains_string这里值为false——实际调试显示l_returnflag和varchar被优化为Uint8

以及,虽然函数的名称为RadixSort,但也会根据排序组数选择不同算法

同样因为只有4组,小于阈值24,所以还是InsertationSort——但这个应该是Laurens Kuiper参考原版的PDQSort自己写的

if (count <= SortConstants::INSERTION_SORT_THRESHOLD) {
return InsertionSort(dataptr, nullptr, count, col_offset, sort_layout.entry_size, sorting_size, 0, false);
}

InsertationSort内部

inline void InsertionSort(const data_ptr_t orig_ptr, const data_ptr_t temp_ptr, const idx_t &count,
const idx_t &col_offset, const idx_t &row_width, const idx_t &total_comp_width,
const idx_t &offset, bool swap) {
const data_ptr_t source_ptr = swap ? temp_ptr : orig_ptr;
const data_ptr_t target_ptr = swap ? orig_ptr : temp_ptr;
if (count > 1) {
const idx_t total_offset = col_offset + offset;
auto temp_val = make_unsafe_uniq_array_uninitialized<data_t>(row_width);
const data_ptr_t val = temp_val.get();
const auto comp_width = total_comp_width - offset;
for (idx_t i = 1; i < count; i++) {
FastMemcpy(val, source_ptr + i * row_width, row_width);
idx_t j = i;
while (j > 0 &&
FastMemcmp(source_ptr + (j - 1) * row_width + total_offset, val + total_offset, comp_width) > 0) {
FastMemcpy(source_ptr + j * row_width, source_ptr + (j - 1) * row_width, row_width);
j--;
}
FastMemcpy(source_ptr + j * row_width, val, row_width);
}
}
if (swap) {
memcpy(target_ptr, source_ptr, count * row_width);
}
}

DuckDB版RadixSort完整代码如下

void RadixSort(BufferManager &buffer_manager, const data_ptr_t &dataptr, const idx_t &count, const idx_t &col_offset,
const idx_t &sorting_size, const SortLayout &sort_layout, bool contains_string) {
if (contains_string) {
auto begin = duckdb_pdqsort::PDQIterator(dataptr, sort_layout.entry_size);
auto end = begin + count;
duckdb_pdqsort::PDQConstants constants(sort_layout.entry_size, col_offset, sorting_size, *end);
return duckdb_pdqsort::pdqsort_branchless(begin, begin + count, constants);
}
if (count <= SortConstants::INSERTION_SORT_THRESHOLD) {
return InsertionSort(dataptr, nullptr, count, col_offset, sort_layout.entry_size, sorting_size, 0, false);
}
if (sorting_size <= SortConstants::MSD_RADIX_SORT_SIZE_THRESHOLD) {
return RadixSortLSD(buffer_manager, dataptr, count, col_offset, sort_layout.entry_size, sorting_size);
}
const auto block_size = buffer_manager.GetBlockSize();
auto temp_block =
buffer_manager.Allocate(MemoryTag::ORDER_BY, MaxValue(count * sort_layout.entry_size, block_size));
auto pre_allocated_array =
make_unsafe_uniq_array_uninitialized<idx_t>(sorting_size * SortConstants::MSD_RADIX_LOCATIONS);
RadixSortMSD(dataptr, temp_block.Ptr(), count, col_offset, sort_layout.entry_size, sorting_size, 0,
pre_allocated_array.get(), false);
}

性能对比参照

New Sorting Implementation

这是DuckDB关于他们最新Sort的Pull Request解释,该版本用于最新的DuckDB V1.4.0中的order by——他们在PR中说明将会替换原有排序方式,并且在测试中有2倍以上的性能提升

TableColumn Type(s)Rows [Millions]Current [s]New [s]Speedup [x]
Ascending1 UBIGINT100.1100.0333.333
Ascending1 UBIGINT1000.9120.1815.038
Ascending1 UBIGINT100015.3021.47510.374
Descending1 UBIGINT100.1210.0343.558
Descending1 UBIGINT1000.9080.2074.386
Descending1 UBIGINT100015.7891.7129.222
Random1 UBIGINT100.1200.0941.276
Random1 UBIGINT1001.0280.5871.751
Random1 UBIGINT100017.5546.4932.703
TPC-H SF1 l_comment1 VARCHAR~60.8480.2962.864
TPC-H SF 10 l_comment1 VARCHAR~608.4653.0902.739
TPC-H SF 100 l_comment1 VARCHAR~600300+35.1878.525+
TPC-H SF 1 lineitem by l_shipdate15 Mixed~60.3280.1891.735
TPC-H SF 10 lineitem by l_shipdate15 Mixed~603.3531.5202.205
TPC-H SF 100 lineitem by l_shipdate15 Mixed~600273.98280.9193.385

参考资料

第一个是DuckDB的官方博客,2和3是知乎上的分析,4是关于DuckDB Sort方法的论文(可以看到核心是PDQSort和RadixSort)——这些分析针对的是v1.4.0之前的情况

预计2025年9-10月DuckDB社区会有Blog说明v1.4.0的Sort的情况

Fastest Table Sort in the West – Redesigning DuckDB’s Sort

DuckDB的变长Sort实现

DuckDB Sort代码阅读和分析

ICDE2023-sorting.pdf

2021年Laurens Kuiper做的Sort重构:Rework physical ORDER BY

AI辅助解释

TupleDataLayout

TupleDataLayout 类详细解释

TupleDataLayout 是 DuckDB 中用于管理行数据布局的核心类,它定义了如何在内存中组织和存储元组(行)数据。

核心功能

1. 数据组织结构

TupleDataLayout 管理行数据的内存布局,包含以下组件:

  • 有效性标志区域 (flag_width): 存储NULL值标记
  • 数据区域 (data_width): 存储实际列数据
  • 聚合状态区域 (aggr_width): 存储聚合函数状态
  • 排序键区域 (sort_width): 存储排序相关数据

2. 主要特性

类型管理

vector<LogicalType> types; // 列类型
Aggregates aggregates; // 聚合函数
  • 支持多种数据类型的列
  • 管理聚合函数对象
  • 提供类型安全的访问接口

嵌套结构支持

unique_ptr<unordered_map<idx_t, TupleDataLayout>> struct_layouts;
  • 支持结构体类型的递归布局
  • 每个结构体列都有自己的 TupleDataLayout

内存优化

bool all_constant; // 是否所有列都是定长
vector<idx_t> variable_columns; // 变长列索引
idx_t heap_size_offset; // 堆大小偏移
  • 区分定长和变长列以优化内存使用
  • 支持堆内存管理用于变长数据

初始化方法

1. 完整初始化

void Initialize(vector<LogicalType> types_p, Aggregates aggregates_p,
TupleDataValidityType validity_type,
TupleDataNestednessType nestedness_type);

2. 仅类型初始化

void Initialize(vector<LogicalType> types, TupleDataValidityType validity_type,
TupleDataNestednessType nestedness_type);

3. 仅聚合初始化

void Initialize(Aggregates aggregates_p);

4. 排序布局初始化

void Initialize(const vector<BoundOrderByNode> &orders, const LogicalType &type,
bool has_payload);

访问器方法

基本信息获取

  • ColumnCount(): 返回列数量
  • AggregateCount(): 返回聚合函数数量
  • GetTypes(): 获取列类型列表
  • GetAggregates(): 获取聚合函数列表

布局信息

  • GetRowWidth(): 整行宽度(包含填充)
  • GetDataOffset(): 数据区域起始偏移
  • GetDataWidth(): 数据区域宽度
  • GetAggrOffset(): 聚合区域起始偏移
  • GetAggrWidth(): 聚合区域宽度
  • GetOffsets(): 各列在行中的偏移

特殊属性检查

  • AllConstant(): 检查是否所有列都是定长
  • AllValid(): 检查是否所有列都不含NULL值
  • HasDestructor(): 检查聚合函数是否有析构器
  • IsSortKeyLayout(): 检查是否为排序键布局

排序支持

SortKeyType sort_key_type;
idx_t sort_width;
vector<idx_t> sort_skippable_bytes;
  • 支持不同类型的排序键
  • 提供排序优化,包括可跳过的字节
  • GetSortWidth()GetSortSkippableBytes() 用于排序性能优化

内存管理特性

有效性管理

TupleDataValidityType validity_type;

支持不同的NULL值处理策略:

  • 完全不允许NULL值
  • 支持NULL值标记

资源清理

vector<idx_t> aggr_destructor_idxs;
  • 跟踪需要析构的聚合函数索引
  • 确保资源正确释放

使用场景

  1. 查询执行: 在各种算子中管理行数据布局
  2. 聚合操作: 存储和管理聚合函数状态
  3. 排序操作: 优化排序键的内存布局
  4. 数据交换: 在不同算子间传递结构化数据
  5. 内存优化: 最小化内存占用和缓存未命中

TupleDataLayout 是 DuckDB 高性能查询执行的关键组件,通过精确控制数据在内存中的布局来提升查询性能。

结语

对于Q1而言,DuckDB使用的Sort的本质用的还是插入排序InsertationSort

其速度提升主要来源于流程的优化——正如PR里所说的从ROwDataCollection转为了使用延迟指针(lazy pointer)TupleDataCollection,减轻数据溢出,降低合并成本