这篇文章来自威斯康星麦迪逊的团队,应该也是最早提出PAX存储格式的论文。PAX的全称是Partition Attribute Across,关于PAX存储格式,Andy在上CMU-15445的时候有提到过,如果只是单纯想了解下PAX,那看这门课就够了(PAX相比较于NSM与DSM,是一种适用于HTAP的方案)

image-20250405085243430

另外,OceanBase的MiniOB也实现了一个PAX存储模式的方案,感兴趣想深入的可以看看

image-20250405090151457

论文地址:https://www.vldb.org/conf/2001/P169.pdf

Introduction

The communication between the CPU and the secondary storage (I/O) has been traditionally recognized as the major database performance bottleneck.

这应该也是数据库长期面对的问题——但内存的速度提升与价格降低缓解了这个问题

When running commercial database systems on a modern processor, data requests that miss in the cache hierarchy (i.e., requests for data that are not found in any of the caches and are transferred from main memory) are a key memory bottleneck

缓存无法命中是数据库慢的主要原因

Implementing PAX on a DBMS that uses NSM requires only changes to the page-level data manipulation code.

听起来很Nice,但似乎没人这么做?😂

Research in computer architecture, compilers, and database systems has focused on optimizing data placement for cache performance.

时至今日好像依旧如此?😂

Figure 1 depicts an NSM page after inserting four records of a relation R with three attributes: SSN, name, and age. Each record has a record header (RH) containing a null bitmap, offsets to the variable-length values, and other implementation-specific information.

image-20250405095322197

回顾了下NSM的历史,并且明确其在一部分情况下会造成缓存未命中

image-20250405100040529

回顾了下DSM的历史,这里把列成为sub-relation

Unfortunately, DSM’s performance deteriorates significantly for queries that involve multiple attributes from each participating relation.

这个情况在现在应该不是问题了:Parquet,ORC这些都会增加缓存应对这种情况

PAX

image-20250405100847196

PAX partitions each page into n minipages. It then stores values of the first attribute in the first minipage, values of the second attribute in the second minipage, and so on. The page header at the beginning of each page contains pointers to the beginning of each minipage.

image-20250405101228301

• Fixed-length attribute values are stored in F-minipages. At the end of each F-minipage there is a presence bit vector with one entry per record that denotes null values for nullable attributes.

• Variable-length attribute values are stored in Vminipages. V-minipages are slotted, with pointers to the end of each value. Null values are denoted by null pointers.

定长的与不定长的分开存储

Table 1 summarizes the characteristics of NSM, DSM, and PAX, demonstrating the tradeoff between interrecord spatial locality and record reconstruction cost.

image-20250405101834129

这里补充下Inter-record spatial locality和Low record reconstruction cost

前者是不同记录在物理存储上的间距,影响批量读取效率

后者是还原完整记录的成本,影响还原单条记录效率

System Implementation

In the current implementation, TPC-H tables stored using PAX need 8% less space than those stored using NSM.

要比TPC-H Table的空间占用,跟NSM比PAX肯定有优势,但肯定比不过DSM

Data Manipulation Algorithms

BULK-LOADING AND INSERTIONS

In the case of variable-length attributes, it initially uses a hint (either from the DBMS or the average length from the first few records) as an indication of the average value size, and then uses feedback from the actual size of inserted values to adjust the average value size.

相当于使用元数据对变长的类型进行调整

UPDATES

Updates on variable-length attributes may stretch or shrink the record; page reorganizations may be needed to accomodate the change and the slot table must be updated. …… If the new value is longer than the space available in the V-minipage, the V-minipage borrows space from a neighboring minipage

没太理解,似乎Update需要先读取出来在写回去?

DELETIONS

The NSM deletion algorithm uses the slot array to mark deleted records, and the free space can be filled upon future insertions.

使用Bitmap标记删除数据,并适当重组小型内容(reorganizes minipage contents)以减少碎片化

Query Operators

A scan operator that supports sargable predicates [28] was implemented on top of Shore.

支持Scan下推

The adaptive dynamic hash join algorithm [23], which is also used in DB2 [20], was implemented on top of Shore.

似乎用了和DB2差不多的Join方案

Analysis of the impact of data placement

Dell 6400 PII Xeon/MT system running Windows NT 4.0. This computer features a Pentium II Xeon processor running at 400MHz, 512MB of main memory, and a 100MHz system bus. The processor has split 16-KB first-level (L1) data and instruction caches and a unified 512-KB second-level (L2) cache.

好老的机子😂现在看起来有些不可思议

PAX primarily targets optimizing data cache behavior, and does not affect I/O performance in any way

这里的参数类型没有告知硬盘——在2001年还没有SSD

The graph shows that(Figure 6), while the performance of the NSM and PAX schemes are relatively insensitive to the changes, DSM’s performance is very sensitive to the number of attributes used in the query.

image-20250405105823403

NSM vs. PAX Impact on Cache Behavior

从数据来看效果不错,提高了内存的命中率,降低了CPU使用时间

image-20250405110555250

NSM/PAX Sensitivity Analysis

image-20250405111904214

In these experiments, DSM’s performance is about a factor of nine slower than NSM and PAX. As the number of attributes involved in the query increases, DSM must join the corresponding number of sub-relations.

这部分内容图中没画出来,也就是说,随着查询数量增加,DSM方式会慢上9倍,且主要原因是无法利用上缓存

Evaluation Using DSS Workloads

跑TPC-H,DSS这里是指Decision Support

We conducted experiments on the system described in Section 5.1, using a 128-MB buffer pool and a 64-MB hash join heap.

哎,这Buffer好小😂

image-20250405111954922

Figure 10 compares the elapsed time required to load a 100-MB, 200-MB, and 500-MB TPC-H database with each of the three storage organizations.

也就是SF0.1,SF0.2与SF0.5?emmm😅

Bulk-loading

没调整算法前与NSM相比有2%-10%的性能损失

image-20250405114636880

As an alternative, we implemented a smarter algorithm that reorganizes minipages only if the free space on the page is more than 5%

修改了reogranize minipage的算法后,性能损失可以下降到0.8%

image-20250405114838604

Queries

可以看到性能都有10%左右的提升,但似乎数据量越大性能提升越不明显(我猜应该是能使用的Cache占比变小了)

The speedups obtained, however, are not constant across the experiments due to a combination of differing amounts of I/O and interactions between the hardware and the algorithms being used.

image-20250405115131844

Since query 14 accesses fewer attributes and requires less computation than query 12, PAX outperforms NSM by only 6-32% when running this query

Q14相较于Q12由于访问的属性较少,所以性能提升并不明显

Updates

这我感觉不太好评价,Updates的性能会随着Attribute的增多而下降,而保底能达到10%的提升

image-20250405120052202

When executing updates PAX is always faster than NSM, providing a speedup of 10-16%. The speedup is a function of the fraction of the record accessed as well as the selectivity.

受益于Select加速,Update也能从中获得性能提升

总结

这篇论文发表于2001年,但在2025年的当下,主流的数据库大部分都是DSM,而非PAX,我猜原因可能是随着硬件性能的提升(SSD与DDR5内存),人们更看重存粹的OLAP性能,导致PAX叫好不叫座

而这篇文章更多的强调缓存Buffer对数据的影响。如果Buffer足够大,DSM的性能是否能与PAX持平?

而现代的存储,或多或少由也有收到PAX的影响(比如Apache Parquet),所以很难评价PAX这个方案到底是好还是不好

威斯康星麦迪逊还有另外一篇有关PAX的论文,如果后续有时间会考虑看下