Parsing Gigabytes of JSON per Second (1902.08318v7)

Published 22 Feb 2019 in cs.DB and cs.PF

Abstract: JavaScript Object Notation or JSON is a ubiquitous data exchange format on the Web. Ingesting JSON documents can become a performance bottleneck due to the sheer volume of data. We are thus motivated to make JSON parsing as fast as possible. Despite the maturity of the problem of JSON parsing, we show that substantial speedups are possible. We present the first standard-compliant JSON parser to process gigabytes of data per second on a single core, using commodity processors. We can use a quarter or fewer instructions than a state-of-the-art reference parser like RapidJSON. Unlike other validating parsers, our software (simdjson) makes extensive use of Single Instruction, Multiple Data (SIMD) instructions. To ensure reproducibility, simdjson is freely available as open-source software under a liberal license.

Summary

The paper presents simdjson, a novel JSON parser that leverages SIMD instructions to achieve gigabyte-per-second parsing speeds.
It employs a two-stage architecture to efficiently identify structural characters and parse data with reduced CPU instructions.
Empirical analysis shows simdjson outperforms competitors like RapidJSON by significantly cutting CPU cycles and processing time.

Parsing Gigabytes of JSON per Second: An Expert's Analysis

The paper "Parsing Gigabytes of JSON per Second" by Geoff Langdale and Daniel Lemire introduces simdjson, a JSON parser that relies heavily on Single Instruction, Multiple Data (SIMD) instructions to process gigabytes of data per second on a single core of a commodity processor. The paper walks through the design decisions and obstacles involved in making JSON parsing fast, and reports measurements in which the resulting parser outperforms contemporary parsers such as RapidJSON.

Introduction

JSON (JavaScript Object Notation) is a ubiquitous format for data interchange on the web, supported by a wide range of systems and languages. The sheer volume of JSON being ingested can make parsing a performance bottleneck, which motivates faster parsers. Although JSON parsing is a mature problem, the authors argue that substantial speedups are still possible. Their work introduces simdjson, which they present as the first standard-compliant JSON parser to process gigabytes of data per second on a single core of a commodity processor.

Architecture and Key Innovations

simdjson parses a document in two stages. The first stage locates the structural characters (braces, brackets, colons, and commas) together with pseudo-structural characters, which are not structural in the JSON grammar but mark where a value begins, such as the first character of a number or of a true, false, or null literal outside a string. The second stage walks over the positions found by the first stage to complete parsing:

Stage 1 – Structural and Pseudo-Structural Identification:
- SIMD instructions are employed to quickly identify structural characters and perform UTF-8 validation.
- Key operations include vectorized classification and branchless processing to determine quoted substrings, which helps skip unnecessary characters.
- Structural and pseudo-structural elements are identified using techniques like vectorized table lookups and efficient bitwise operations.
Stage 2 – Data Parsing:
- This stage processes identified segments, performing actual data parsing including numbers, strings, and validation of JSON structures (arrays, objects, etc.).
- The parsing process efficiently handles various data types and ensures the validity of the parsed data.
- Performance tweaks such as vectorized number parsing and efficient string normalization techniques are utilized.
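
The Stage 1 ideas above can be sketched with plain integers standing in for 64-bit SIMD-derived bitmasks. This is a minimal illustration, not the simdjson implementation: the function names (`prefix_xor`, `char_mask`, `escaped_mask`, `stage1_index`) are hypothetical, the per-character loops stand in for vectorized comparisons, escape handling is done with a simple scalar loop, and the input is assumed to fit in one 64-byte block. The key step is the prefix XOR over the quote bitmask, which marks everything between an opening and a closing quote without a per-character branch; on x64 the same result can be obtained with a carry-less multiplication by an all-ones word.

```python
MASK64 = (1 << 64) - 1  # emulate one 64-bit machine word


def prefix_xor(bits):
    """Bit i of the result is the XOR of input bits 0..i (64-bit wide)."""
    shift = 1
    while shift < 64:
        bits ^= (bits << shift) & MASK64
        shift *= 2
    return bits


def char_mask(block, chars):
    """Bitmask with bit i set when block[i] is one of the bytes in `chars`."""
    mask = 0
    for i, b in enumerate(block):
        if b in chars:
            mask |= 1 << i
    return mask


def escaped_mask(block):
    """Bitmask of characters that directly follow an unescaped backslash."""
    mask, pending = 0, False
    for i, b in enumerate(block):
        if pending:
            mask |= 1 << i
            pending = False
        elif b == 0x5C:  # backslash
            pending = True
    return mask


def bit_positions(mask):
    """Indexes of the set bits, lowest first."""
    out = []
    while mask:
        low = mask & -mask          # isolate the lowest set bit
        out.append(low.bit_length() - 1)
        mask ^= low                 # clear it
    return out


def stage1_index(block):
    """Positions of structural, opening-quote, and atom-start characters."""
    assert len(block) <= 64, "sketch handles a single 64-byte block"
    valid = (1 << len(block)) - 1
    quotes = char_mask(block, b'"') & ~escaped_mask(block)
    in_string = prefix_xor(quotes)  # opening quote through last string byte
    outside = ~in_string & valid
    structural = char_mask(block, b"{}[]:,") & outside
    whitespace = char_mask(block, b" \t\n\r") & outside
    opening_quotes = quotes & in_string
    # A character "follows" a structural or whitespace character; the block
    # start is treated as if it followed whitespace (the "| 1").
    follows = ((structural | whitespace) << 1) | 1
    atom_start = follows & outside & ~structural & ~whitespace & ~quotes
    return bit_positions(structural | opening_quotes | atom_start)


doc = b'{"a\\"b": [1, true]}'
print(stage1_index(doc))  # [0, 1, 7, 9, 10, 11, 13, 17, 18]
```

In this example the escaped quote at index 4 does not end the string, and the digit `1` and the `t` of `true` are reported as value starts even though they are not structural characters. A second stage would then visit only these positions instead of scanning every byte.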

Empirical Analysis

The paper presents a detailed performance analysis comparing simdjson against RapidJSON and sajson, two prominent alternatives. The experiments show that simdjson executes significantly fewer instructions per byte of input, which translates into faster parsing. Key findings include:

Comparison of Parsing Speed: simdjson consistently outperforms the other parsers, exceeding 2 GB/s in many cases on Skylake processors. Throughput stays high on large files, with no significant penalty when the input exceeds the cache size.
Instruction Efficiency: According to the abstract, simdjson can use a quarter or fewer of the instructions required by a reference parser such as RapidJSON. The authors attribute this to the extensive use of SIMD instructions and branchless processing.
Stages Breakdown: Detailed analysis shows the distribution of CPU cycles across different stages and tasks, highlighting the efficiency and optimization at each step.
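
The relationship between instruction counts, cycles, and throughput in these findings is simple arithmetic, sketched below. The helper names and every number in the example are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from the paper.

```python
def cycles_per_byte(instructions_per_byte, instructions_per_cycle):
    """Cycles spent per input byte, given instruction count and achieved IPC."""
    return instructions_per_byte / instructions_per_cycle


def throughput_gb_per_s(clock_hz, cycles_per_input_byte):
    """Single-core parsing throughput in GB/s (1 GB = 1e9 bytes)."""
    return clock_hz / cycles_per_input_byte / 1e9


# Hypothetical parser A: 12 instructions per byte at 3 instructions per cycle.
a = cycles_per_byte(12.0, 3.0)
# Hypothetical parser B: a quarter of the instructions, with a lower IPC.
b = cycles_per_byte(3.0, 2.0)

clock = 3.0e9  # assumed 3 GHz core
print(throughput_gb_per_s(clock, a))  # 0.75
print(throughput_gb_per_s(clock, b))  # 2.0
```

The example also shows why instruction counts alone do not determine speed: cutting instructions by a factor of four yields less than a fourfold speedup here because the assumed instructions-per-cycle rate differs between the two parsers.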

Theoretical and Practical Implications

The use of SIMD instructions to speed up JSON parsing has both theoretical and practical implications:

Theoretical Implications:
- The successful application of SIMD instructions to a text-based format can encourage their adoption in other text parsing tasks, influencing research in high-performance computing and data processing.
- The approach can drive further exploration into branchless processing and vectorized classification as powerful tools in parsing and data processing.
Practical Implications:
- For industries relying on real-time data processing and large-scale data ingestion, simdjson offers a substantial performance boost, cutting processing times and resource usage.
- The open-source nature of simdjson allows for its integration into various applications and systems, fostering widespread adoption and possibly becoming a benchmark in JSON parsing.
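
To make the vectorized classification mentioned under the theoretical implications more concrete, the sketch below classifies bytes with two 16-entry tables indexed by the low and high nibble of each byte, combined with a bitwise AND. On x64 each table lookup can be done for many bytes at once with a byte-shuffle instruction; here an ordinary Python list lookup stands in for it. The table values and the names `LOW_NIBBLE`, `HIGH_NIBBLE`, and `classify` are constructed for this illustration and should not be read as the library's exact code.

```python
# Class bits: 1 = comma, 2 = colon, 4 = bracket or brace,
#             8 = tab / line feed / carriage return, 16 = space.
LOW_NIBBLE = [16, 0, 0, 0, 0, 0, 0, 0, 0, 8, 10, 4, 1, 12, 0, 0]
HIGH_NIBBLE = [8, 0, 17, 2, 0, 4, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0]

STRUCTURAL_BITS = 0x07   # comma, colon, brackets and braces
WHITESPACE_BITS = 0x18   # the four JSON whitespace characters


def classify(byte):
    """Class bits for one byte: nonzero only when both nibble tables agree."""
    return LOW_NIBBLE[byte & 0x0F] & HIGH_NIBBLE[byte >> 4]


def classify_block(block):
    """Return (structural, whitespace) bitmasks for a block of bytes."""
    structural = whitespace = 0
    for i, b in enumerate(block):
        c = classify(b)
        if c & STRUCTURAL_BITS:
            structural |= 1 << i
        if c & WHITESPACE_BITS:
            whitespace |= 1 << i
    return structural, whitespace


structural, whitespace = classify_block(b'{"a": [1, 2]}')
print(bin(structural), bin(whitespace))
```

Each byte costs the same two lookups and one AND regardless of its value, so there is no data-dependent branch to mispredict. Note that this classification alone does not know about strings; structural characters inside quoted strings still have to be masked off separately.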

Future Directions in AI

The advancements in parsing efficiency can influence future developments in AI and machine learning:

AI Training Data Ingestion: Improved data parsing speeds can streamline the ingestion process for training large-scale AI models, enhancing efficiency in AI development cycles.
Real-time Data Processing: Enhanced parsing techniques can support real-time decision-making systems and data-driven applications, contributing to advancements in AI-driven technologies and services.

Conclusion

The efficacy of simdjson in leveraging SIMD instructions and other optimization techniques positions it as a leading example in JSON parsing. The meticulous breakdown of the parser’s architecture, backed by empirical performance measures, illustrates how substantial performance gains are achieved over traditional methods. As data volumes continue to grow, tools like simdjson will be crucial in maintaining performance and efficiency in data-centric applications. Future work can explore extending these techniques to other formats and further enhancing parsing through emerging hardware capabilities.

GitHub

GitHub - simdjson/simdjson: Parsing gigabytes of JSON per second : used by Facebook/Meta Velox, the Node.js runtime, ClickHouse, WatermelonDB, Apache Doris, Milvus, StarRocks (20,801 stars)
