BPF-DB: JIT Implementation of Kernel Bypass for Database Operations
Address: https://dl.acm.org/doi/10.1145/3725272
Andy's talk video: P99 CONF 2024 | The Next Chapter in the Sordid Love/Hate Relationship Btwn DBs & OSes by Andy Pavlo
This article, published in SIGMOD 2025, comes from Andy Pavlo's team at CMU and describes a project that implements an embedded KV database through JIT-generated eBPF code.
Technical Background
Kernel bypass is a high-performance data-processing technique aimed at I/O-intensive workloads, with representative implementations including DPDK, RDMA, and eBPF. These approaches significantly improve data-forwarding efficiency by avoiding the lengthy processing path of the traditional kernel network stack. In kernel-mediated transfers, interrupt handling, memory copying, context switching, loss of cache locality, and memory-management overhead can become serious performance bottlenecks under high concurrency. Kernel-bypass schemes avoid these costs by letting the application interact with the hardware more directly.
eBPF (extended Berkeley Packet Filter) is a virtual machine for running user-defined programs safely and efficiently inside the operating-system kernel, used mainly for network packet processing and kernel monitoring. Because eBPF programs execute in kernel mode, they must first pass a strict in-kernel verifier that ensures the program cannot crash the kernel or cause other safety issues: the verifier checks every execution path for, among other things, unbounded loops and illegal memory accesses. Only after verification is the eBPF bytecode loaded into the eBPF virtual machine to run.
Implementation details
The core of implementing a KV database with eBPF is the eBPF Maps structure, which handles data exchange between user space and kernel space.
The Linux kernel limits an eBPF map key to at most 512 bytes and a value to at most 4096 bytes (4 KB). So although BPF-DB implements a KV database, value sizes are bounded: BPF-DB itself caps values at 1 KB.
The core functionality of BPF-DB is implemented in kernel space, which reduces the performance cost of crossing between kernel space and user space.
And thanks to its kernel-bypass design, BPF-DB can also communicate directly from within the Linux kernel's network stack, reducing network-communication latency.
Experimental results
BPF-DB implements a transaction mechanism based on MVCC. Its cost shows up significantly only at the p99 tail (tail latency: 99% of requests are fast, only 1% are slow), due in part to version-chain scanning overhead, garbage-collection overhead, and increased lock contention.
BPF-DB shows good scalability: throughput increases logarithmically as the number of threads grows.
With the mechanisms above in place, BPF-DB achieves a substantial performance improvement over Redis.
Conclusion
Andy has been persistent on the subject of kernel bypass, both in his CMU 15-721 course and in his external presentations, repeatedly emphasising that databases should not depend on operating-system mechanisms but should manage memory and disk storage themselves. Accordingly, the CMU database group has been researching the use of kernel bypass in databases, and this paper does offer good ideas for building databases inside the Linux kernel.
The downside is that eBPF imposes many storage restrictions: programs cannot request extra stack memory, cannot pass arbitrary references or pointers, do not support recursive data structures, and cannot allocate memory dynamically. Andy wants to implement SQL in eBPF, but that is likely harder than it sounds.