VLDB19-Parsing Gigabytes of JSON per Second论文阅读

simdjson库的理论实现论文

仓库地址:simdjson/simdjson

论文下载地址(Arxiv):Parsing Gigabytes of JSON per Second

选取的版本是V7,更新于Tue, 23 Jul 2024 21:56:05 UTC

摘抄节选

1 Introduction

We can use a quarter or fewer instructions than a state-of-the-art reference parser like RapidJSON. Unlike other validating parsers, our software (simdjson) makes extensive use of Single Instruction, Multiple Data (SIMD) instructions.

据我所知,simdjson确实是当前处理json的最快方案,从2019至今(2024)一直如此。让我好奇的地方在于他们如何利用好SIMD的特性进行工作的。

JSON has four primitive types or atoms (string, number, Boolean, null) that can be embedded within composed types (arrays and objects).

JSON 有四种原始类型或原子对象(字符串、数字、布尔值、空),可嵌入到组成类型(数组和对象)中。对于做JS开发的人来说应该不用强调🤗

To access the data contained in a JSON document from software, it is typical to transform the JSON text into a tree-like logical representation, akin to the righthand-side of Fig. 1, an operation we call JSON parsing.

这确实一直没注意到:解析器会把JSON转为树状逻辑表示(程序员得到即使字典Dict或数组Array)

image-20241222101045163

Parsing large JSON documents is a common task. Palkar et al. state that big-data applications can spend 80–90% of their time parsing JSON documents [25]. Boncz et al. identified the acceleration of JSON parsing as a topic of interest for speeding up database processing [2].

作为SIMD加速的背景介绍

For example, starting with the Haswell microarchitecture (2013), Intel and AMD processors support the AVX2 instruction set and 256-bit vector registers.

这么看来,AVX2和AVX512距离现在也就不过10年光景

Hence, on recent x64 processors, we can compare two strings of 32 characters in a single instruction.

使用SIMD可以比较32个Char的字符的字符串是否相同

In our experience, SIMD instructions are most likely to be beneficial in a branchless setting

SIMD在无分支环境下效果最佳

To our knowledge, publicly available JSON validating parsers make little use of SIMD instructions. Due to its complexity, the full JSON parsing problem may not appear immediately amenable to vectorization.

这可是大实话😂😅看你论文不就是想看看你们怎么做的么

One of our core results is that SIMD instructions combined with minimal branching can lead to new speed records for JSON parsing—often processing gigabytes of data per second on a single core.

在最小分支的情况下,每秒处理G级别的Json数据

– We detect quoted strings, using solely arithmetic and logical operations and a fixed number of instructions per input bytes, while omitting escaped quotes (§ 3.1.1).

我们检测带引号的字符串,只使用算术和逻辑操作以及每个输入字节的固定数量的指令,同时省略转义的引号(§3.1.1)。

– We differentiate between sets of code-point values using vectorized classification thus avoiding the burden of doing N comparisons to recognize that a value is part of a set of size N (§ 3.1.2).

我们使用向量化分类对码位值集进行区分,从而避免了进行 N 次比较来识别一个值是大小为 N 的集合的一部分的负担(§3.1.2)。

– We validate UTF-8 strings using solely SIMD instructions (§ 3.1.5).

我们使用 SIMD 指令验证 UTF-8 字符串(§3.1.3)。

Mark下😍SIMD可用于字符串比较,实现集合判定(基于二进制位码)——这是不是也是一种索引?

而关于Vectorized Validate UTF-8 strings,指的是用SIMD指令验证字符是否符合UTF-8规范,从而避免混入ASCII,GB2312等编码方式

A common strategy to accelerate JSON parsing in the literature is to parse selectively

确实只能选择性解析:SIMD需要对齐才能使用,而实际情况当中,想要对齐数据还是有难度的

Bonetta and Brantner use speculative just-in-time (JIT) compilation and selective data access to speed up JSON processing [3]

Mark下,JIT也能加速JSON处理

Li et al. present their fast parser, Mison which can jump directly to a queried field without parsing intermediate content [17]

Mison也是使用SIMD加速JSON解析的项目,但5年过去,simdjson这个库的知名度是远高于mison的

相关工作还包括XML解析,CSV解析,详情可见论文

3 Parser Architecture and Implementation

解析器分为两个部分

第 1 阶段,验证字符编码并标识所有 JSON 节点的起始位置(例如,数字、字符串、null、true、false、数组、对象),SIMD在其中的作用是对字节进行处理或对位集(位数组)进行操作

第 2 阶段,处理所有节点和结构字符。根据节点的起始字符来区分它们。当遇到引号(‘"‘)时,解析一个字符串;当找到一个数字或连字符时,解析一个数字;当找到字母 ‘t‘,’f‘,’n’ 时,寻找值 truefalsenull

这里应该就用到了向量化UTF-8校验了

4 Experiments

硬件: Intel Skylake

软件: C++17(Clang, MVSC, GCC)

image-20241222110726814

5 Conclusion and Future Work

Though the application of SIMD instructions for parsing is not novel [5], our results suggest that they are underutilized in popular JSON parsers.

“underutilized in popular JSON parsers.” 啧啧😎

Base64 data can be decoded quickly using SIMD instructions [21]

Really?也可以Mark下

结语

先看到这里,我想要的信息已经拿到了。能被Facebook的Velox使用已经证明是成功的

参考资料

每秒解析 GB 级别的 JSON 文件