Umbra: A New Database Solution for JIT Design
Paper name: Umbra: A Disk-Based System with In-Memory Performance
Address: https://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf
Website: https://umbra-db.com/
This paper was published at CIDR 2020 by the team at TUM that previously built HyPer, which was sold to Tableau, before going on to develop Umbra.
Technical Background
Umbra is based on HyPer, a closed-source in-memory HTAP database. HyPer was designed as a pure in-memory system, whereas Umbra is designed to be disk-based with in-memory performance, a goal made practical by the growing popularity of SSDs and their falling prices.
Implementation details
Umbra's buffer manager handles pages of different sizes: the smallest size class is 64 KiB, and each subsequent class doubles the page size of the previous one.
Variable-length pages complicate buffer-pool memory management. Umbra solves this by leaning on the operating system's virtual-to-physical address mapping: it reserves virtual address space for each buffer frame, uses pread to read a disk page into the frame (letting the OS allocate the backing physical memory, which may be non-contiguous), writes the frame back to the disk file with pwrite on eviction, and then calls madvise with the MADV_DONTNEED flag so the OS can reclaim the physical memory while the virtual mapping remains reserved.
To serialise pages to disk, Umbra uses pointer swizzling: references between pages are stored as page identifiers on disk and converted into direct memory addresses once the target page is loaded (and back again on eviction). This avoids the latching conflicts that come with routing every page access through a global hash table.
For string storage, Umbra proposes what Andy Pavlo of CMU calls the ‘German-style string’. Each string is represented by a 16-byte header: the first 4 bytes store the string length, and the remaining 12 bytes are encoded in one of two ways depending on that length.
- For short strings of at most 12 bytes, the data is stored directly in the last 12 bytes of the header.
- For longer strings, the first 4 of those 12 bytes hold the string’s first four characters, allowing Umbra to short-circuit many string comparisons, and the remaining 8 bytes store a pointer to the data or an offset to a known location.
As for the execution engine, Umbra uses JIT compilation to turn the execution plan, pipeline by pipeline and module by module, into efficient machine code, which makes execution easier to interrupt under high I/O pressure and easier to parallelise across the physical plan.
Take the paper’s TPC-H example query: Umbra executes it with two pipelines. The first pipeline scans the supplier table and performs the group-by; the second scans the data of each group and prints the query output. In Umbra these pipelines are further broken down into steps, each of which can run single-threaded or multi-threaded.
The generated machine code also comes in several variants. Umbra uses an adaptive compilation scheme that weighs compilation time against execution time: it can generate LLVM IR or Umbra’s custom IR, and if interpreting the custom IR would be faster than compiling, it interprets directly; otherwise it compiles first and then executes.
Experimental results
The paper evaluates two benchmarks, TPC-H and JOB, each on 10 GB of data, running every query five times and keeping the fastest repetition (i.e., fully warmed up). Compared to HyPer, Umbra performs significantly better, mainly thanks to adaptive IR compilation: its geometric-mean speedup is 3.0× on JOB and 1.8× on TPC-H. On these queries, HyPer actually spends far more time on query compilation than on query execution, by a factor of up to 29×.
Looking at pure execution time alone, Umbra’s results stay within roughly 30% of HyPer’s on JOB and 10% on TPC-H.
Conclusion
This article is very informative, with innovations in memory management, transaction concurrency, data serialisation and deserialisation, and finally the execution engine.
For the most critical part, the execution engine, the proposed adaptive compilation is very interesting: for queries over small amounts of data, direct interpretation is faster than compiling and then executing, and even when compilation is warranted it can be either a fast unoptimised compile or a slower, better-optimised full compile. Navigating this trade-off is a major challenge, and the resulting increase in software complexity cannot be ignored.