Work from SFU on bridging the gap between databases and dataframes

Data usually lives in a database, but data scientists usually work with dataframes.

Note: the PDF downloaded from VLDB seems to have issues; copy-pasting from it produces garbled characters.

Paper: https://dl.acm.org/doi/10.14778/3551793.3551847

GitHub repo: https://github.com/sfu-db/connector-x

Introduction

Thus, the first step in most data science applications is to load data from the DBMS. Unfortunately, this data loading process is not only notoriously slow but also consumes inordinate amounts of client memory, which easily leads to out-of-memory errors or performance degradation.

With the small datasets used in class this is no big deal, but with large data volumes it really is "notoriously slow". "Out-of-memory", though, is a bit much 😆

bridging the gap between databases and dataframes is of great interest to both academia and industry

True, and hard to argue with.

This paper describes ConnectorX, a fast and memory-efficient data loading library that supports many DBMSs (e.g., PostgreSQL, SQLite, MySQL, SQLServer, Oracle) to client dataframes (e.g., Pandas, PyArrow, Modin, Dask, and Polars).

This looks good 👍
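For reference, a minimal usage sketch of the connectorx Python API based on its README; the connection string, table name, and query below are placeholders, not taken from the paper:

```python
# Minimal sketch of loading a query result with ConnectorX.
# The DSN and query are placeholders; see the GitHub README for exact options.
import connectorx as cx

conn = "postgresql://user:password@localhost:5432/tpch"  # hypothetical connection string
query = "SELECT * FROM lineitem"                          # hypothetical query

# Load the result directly into a Pandas dataframe; other destinations
# such as "arrow" or "polars" are selected via return_type as well.
df = cx.read_sql(conn, query, return_type="pandas")
```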

[Figure: loading time and memory usage comparison from the paper]

As the chart shows, it really does cut both the loading time and the memory used.

The paper raises four questions, which get answered later on:

First, where are the actual data loading bottlenecks?

Second, how do we both reduce the runtime and memory, while also making the system extensible to new DBMSs?

Third, are widely used client-side query partitioning schemes good enough?

Fourth, does a new data loading library matter?

The paper points out that the bottlenecks come from both the server side and the client side.

For the second question, they designed a DSL.

In Section 4, we design a succinct domain-specific language (DSL) for mapping DBMS result representations into dataframes—this reduces the lines of code by 1-2 orders of magnitude as compared to not using our DSL.

Designing a whole DSL just for this? 😯 Let's see what they say later.

It is also integrated into popular open source projects such as Polars [17], Modin [61] and DataPrep [14].

At the time of writing, ConnectorX has 2.2k GitHub stars, so people clearly recognize this work.

Server side

Arrow Flight (zero-copy conversion to Arrow IPC)

ETL tools from the various cloud vendors

Client side

Pandas, Modin, Dask, and Spark each have their own file-reading solutions

Integrating DBMS with Data Science

Embedded analytical system DuckDB [64] is developed to avoid the bottlenecks of result set serialization and value-based APIs by making DBMS and analytic tools in the same address space.

DuckDB is also a nice option (there is no real transfer problem between the server and client sides, since they share the same process).
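As a small aside, here is roughly what "same address space" looks like in practice with DuckDB's Python API; the database file and table name are made up:

```python
# DuckDB runs inside the Python process, so the query result can be handed
# to Pandas without going through a network wire protocol.
import duckdb

con = duckdb.connect("analytics.db")                    # hypothetical database file
df = con.execute("SELECT * FROM students").df()         # fetch result as a Pandas dataframe
```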

AN ANATOMY OF PANDAS.READ_SQL

Where Does the Time Go?

From the client’s perspective, the overall process has three major steps:

(1) Execution + Transfer: Server executes the query and sends the result back to the client through network in bytes following a specific wire protocol.

(2) Deserialization: Client parses the received bytes and returns the query result in the form of Python objects.

(3) Conversion to dataframe: Client converts the Python objects into NumPy [46] arrays and constructs the final dataframe.

These are the client-side steps where the latency can come from.
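For reference, the baseline being dissected here is just a plain pandas.read_sql call; the connection string and query are placeholders:

```python
# The baseline the paper profiles: one pandas.read_sql call that executes the
# query on the server, deserializes the wire bytes into Python objects, and
# then converts those objects into a dataframe.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/tpch")  # placeholder DSN
df = pd.read_sql("SELECT * FROM lineitem", engine)
```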

[Figure: breakdown of where the time goes in pandas.read_sql]

A surprising finding is that the majority of the time is actually spent on the client side rather than on the query execution

Where Does the Memory Go?

Next, we inspect the memory footprint of running read_sql and show the results in Table 1. Raw Bytes, Python Objects, and Dataframe represent the size of the bytes the client received, the intermediate Python objects, and the final dataframe, respectively.

[Table 1: memory footprint of pandas.read_sql (Raw Bytes, Python Objects, Dataframe)]

Interesting. I knew the data would blow up going from Raw Bytes to a dataframe, but 2-3x, with a peak of 8-10x, is more than I expected.

In Python, every object contains a header structure that maintains information like reference count and object’s type in addition to the data itself. This will add some overhead on the size of the data.

Got it. The paper walks through the various ways Python objects inflate the data.

Specifically, Pandas.read_sql keeps three copies of the entire data in memory, which are stored in three different formats: Raw Bytes, Python Objects, and Dataframe.

Whoa, Pandas.read_sql actually keeps all three copies of the data at the same time 😫 Noted.
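If you want to eyeball this blow-up yourself, one rough approach (not the paper's instrumentation) is to compare the process RSS around the call with the size of the resulting dataframe:

```python
# Rough sanity check of read_sql's memory overhead: process RSS growth vs.
# the dataframe's own size. This is an approximation, not the paper's method.
import os

import pandas as pd
import psutil
from sqlalchemy import create_engine

proc = psutil.Process(os.getpid())
engine = create_engine("postgresql://user:password@localhost:5432/tpch")  # placeholder DSN

rss_before = proc.memory_info().rss
df = pd.read_sql("SELECT * FROM lineitem", engine)
rss_after = proc.memory_info().rss

print(f"RSS grew by {(rss_after - rss_before) / 2**20:.0f} MiB")
print(f"dataframe itself is {df.memory_usage(deep=True).sum() / 2**20:.0f} MiB")
```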

How Much Can Chunking Help?

Interesting. Data scientists hitting this kind of problem usually reach for chunking.

We see that chunking is indeed very effective in reducing memory usage because it does not hold all the intermediate results in memory. … However, it has little help in improving the running time of read_sql

Chunking helps with the memory problem, but it really doesn't reduce the time, and it requires writing extra code.
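The chunking workaround the paper benchmarks looks roughly like this in Pandas (the chunk size is arbitrary and the query is a placeholder):

```python
# Chunked variant of read_sql: memory stays bounded because only one chunk of
# intermediate objects is alive at a time, but total running time barely
# improves, and the user has to write the concatenation logic themselves.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/tpch")  # placeholder DSN

chunks = pd.read_sql("SELECT * FROM lineitem", engine, chunksize=100_000)
df = pd.concat(chunks, ignore_index=True)
```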

CONNECTORX

How to Speed Up?

[Figure 4: overall workflow of ConnectorX]

Learning from chunking, ConnectorX adopts a streaming workflow, where the client loads and processes a small batch of data at a time.

Streaming, batch-at-a-time processing.

Figure 4 illustrates the overall workflow, which consists of two phases: Preparation Phase (steps 1-3) and Execution Phase (steps 4-6).

Preparation phase and execution phase. Got it.

ConnectorX automatically sets the range by issuing query SELECT MIN(ID), MAX(ID) FROM Students. Then, ConnectorX equally partitions the range into 3 splits and generates three subqueries

q1: SELECT … FROM Students WHERE ID < 1,000,000

q2: SELECT … FROM Students WHERE ID ∈ [1,000,000, 2,000,000)

q3: SELECT … FROM Students WHERE ID ≥ 2,000,000

The SQL statement is split into three subqueries that run in parallel, as sketched below.
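The range splitting can be sketched in a few lines. This is a simplified Python rendering of the idea, not ConnectorX's actual Rust code; in the real library the same effect is obtained by passing the partition options (partition_on, partition_num) to cx.read_sql.

```python
# Simplified sketch of client-side range partitioning: probe MIN/MAX of the
# partition column, cut the range into equal slices, and emit one subquery per
# slice. ConnectorX does this internally (in Rust) during its Preparation
# Phase; this is only an illustration.
def make_partition_queries(table: str, col: str, lo: int, hi: int, n: int) -> list[str]:
    if n <= 1:
        return [f"SELECT * FROM {table}"]
    step = (hi - lo + n - 1) // n  # ceiling division so the slices cover [lo, hi]
    queries = []
    for i in range(n):
        start, end = lo + i * step, lo + (i + 1) * step
        if i == 0:
            queries.append(f"SELECT * FROM {table} WHERE {col} < {end}")
        elif i == n - 1:
            queries.append(f"SELECT * FROM {table} WHERE {col} >= {start}")
        else:
            queries.append(f"SELECT * FROM {table} WHERE {col} >= {start} AND {col} < {end}")
    return queries

# With MIN(ID)=0 and MAX(ID)=3,000,000 split into 3 partitions, this
# reproduces q1-q3 above.
print(make_partition_queries("Students", "ID", 0, 3_000_000, 3))
```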

However, the buffers that the string objects point to have to be allocated on-the-fly after knowing the actual length of each value.

String handling from the database really is a problem.

Moreover, constructing a string object is not thread-safe in Python. It needs to acquire the Python Global Interpreter Lock (GIL), which could slow down the whole process when the degree of parallelism is large

Oh? Python has this issue? Noted.

To alleviate this overhead, ConnectorX constructs a batch of strings at a time while acquiring the GIL instead of allocating each string object separately.

The overhead is alleviated by constructing a batch of strings at a time under the GIL.

We choose Rust since it provides efficient performance and guarantees memory safety. In addition, there is a variety of high-performance client drivers for different databases in Rust that ConnectorX can directly build on

Right, ConnectorX is written in Rust, which is great.

How to Extend?

Overall Architecture. ConnectorX consists of three main modules: Source (e.g. PostgreSQL, MySQL), Destination (e.g. Pandas, PyArrow) and a bridge module Type Mapper in the middle to define how to convert the physical representation for the data from Source to Destination.

[Figure: ConnectorX architecture (Source, Type Mapper, Destination)]

This Type Mapper… let's see what they say about it later.

A Type Mapper module consists of a set of rules that specify how to convert data from a specific Source type to a specific Destination type. During runtime, each subquery will be handled by a single thread.

The subquery here isn't spelled out; my guess is that each one is a slice of the bulk data that flows from Source to Destination first.

To mitigate the aforementioned issues, ConnectorX defines a domain-specific language (DSL) to help the developers define the type mappings, leveraging the modern macro support in Rust

Tsk, so they defined a DSL to handle the complex type conversions.

Each line consists of three parts: logical type and corresponding physical type of Source, the matched logical type and physical type of Destination, and the conversion implementation choice including auto, none, and option.

Why does this look so much like MLIR's Value system 😂

[Figure: example type-mapping rules written in the DSL]

This simple DSL makes the relation of type mapping intuitive and easy to maintain. We found it has helped shorten code related to type mapping by 97% (from 37k to 1k lines of code).

Looks good; it delivers what a DSL should (I'm curious how the DSL's macro templates are implemented, though).
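To make the idea concrete without transcribing the Rust macro syntax, here is a toy Python analogue of the information each DSL line carries, per the description above; none of these class, field, or type names are ConnectorX's actual API, and the specific type pairs are illustrative:

```python
# Toy analogue of a Type Mapper rule: source logical/physical type, destination
# logical/physical type, and the conversion choice (auto / none / option).
# The real DSL is a Rust macro that generates the conversion code; this sketch
# only shows what one rule expresses.
from dataclasses import dataclass

@dataclass(frozen=True)
class MappingRule:
    src_logical: str    # e.g. a Postgres type such as "INT4" (illustrative)
    src_physical: str   # in-memory representation on the Source side, e.g. "i32"
    dst_logical: str    # e.g. a Pandas dtype such as "Int64" (illustrative)
    dst_physical: str   # in-memory representation on the Destination side
    conversion: str     # "auto", "none", or "option" (nullable)

RULES = [
    MappingRule("INT4", "i32", "Int64", "i64", "auto"),
    MappingRule("FLOAT8", "f64", "float64", "f64", "none"),
    MappingRule("VARCHAR", "&str", "object", "String", "option"),
]
```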

QUERY PARTITIONING

Client-Side Query Partitioning

Figure 7 shows the network utilization of ConnectorX by varying the number of partitions. It is clear that No Partition cannot saturate the network bandwidth at all.

[Figure 7: network utilization of ConnectorX under varying numbers of partitions]

Client-side query partitioning is widely used, but it can cause data inconsistency and puts extra burden on the user.

Server-Side Result Partitioning

To conclude, server-side partitioning allows the client to fully leverage the network and computation resources

They ran experiments on the PostgreSQL server side; server-side partitioning basically avoids these problems.

EVALUATION

They use TPC-H (scale factor 10) and a dataset called DDoS.

Details: https://github.com/sfu-db/connectorx-bench

[Figures: end-to-end benchmark results on TPC-H and DDoS]

Nothing controversial here: both the design and the actual results are great 🤩

The paper also has an ablation study; I'm already satisfied with the results above, so I won't include it here.

Thoughts

A reminder for database developers: a newly designed database should make conversion to dataframes easy and efficient. I think this is also part of why DuckDB has become so popular.

The Type Mapper gives me a Prisma vibe, but on closer look it resembles MLIR's Value system (MLIR itself is really an excellent framework for DSL design).