On-Demand JSON: A Better Way to Parse Documents? (2312.17149v3)

Published 28 Dec 2023 in cs.DB and cs.PF

Abstract: JSON is a popular standard for data interchange on the Internet. Ingesting JSON documents can be a performance bottleneck. A popular parsing strategy consists in converting the input text into a tree-based data structure -- sometimes called a Document Object Model or DOM. We designed and implemented a novel JSON parsing interface -- called On-Demand -- that appears to the programmer like a conventional DOM-based approach. However, the underlying implementation is a pointer iterating through the content, only materializing the results (objects, arrays, strings, numbers) lazily. On recent commodity processors, an implementation of our approach provides superior performance in multiple benchmarks. To ensure reproducibility, our work is freely available as open source software. Several systems use On-Demand: e.g., Apache Doris, the Node.js JavaScript runtime, Milvus, and Velox.

Summary

The paper presents a novel JSON parsing method that leverages lazy evaluation and indexing to optimize performance without materializing entire documents.
It employs modern CPU features like SIMD to achieve processing speeds up to 8.0 GiB/s with just 3.1 instructions per byte on average.
The approach combines low memory overhead with DOM-like usability, making it ideal for high-throughput and flexible data parsing applications.

On-Demand JSON: A Better Way to Parse Documents?

Introduction

The paper "On-Demand JSON: A Better Way to Parse Documents?" by John Keiser and Daniel Lemire introduces an innovative JSON parsing approach designed to mitigate performance bottlenecks associated with traditional JSON parsing methods. The authors present a novel strategy called On-Demand, which offers a significant departure from the commonly used Document Object Model (DOM) and streaming-based strategies.

Traditional Approaches to JSON Parsing

Traditional JSON parsing strategies include:

DOM-Based Parsing: Converts the entire JSON document into a tree-like in-memory data structure, facilitating easy navigation and manipulation but at the cost of increased memory usage and unnecessary data materialization.
Streaming-Based Parsing (e.g., SAX): Processes the document sequentially, triggering events as components are encountered. This method is memory efficient and suitable for extracting subsets of the data but tends to be complex for general-purpose tasks due to manual state management and potential inefficiencies.
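The contrast between the two styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample document and its field names (statuses, id, user) are made up, and Python's standard library has no SAX-style JSON parser, so the object_pairs_hook callback of json.loads stands in for event-driven parsing. It is not a true streaming parser, but it shows the manual state management that callback-based code tends to require.

```python
import json

# Hypothetical sample document; the field names are assumptions for illustration.
TEXT = '{"statuses": [{"id": 1, "user": {"id": 10}}, {"id": 2, "user": {"id": 11}}]}'


def tweet_ids_dom(text):
    """DOM style: materialize the whole tree, then navigate it."""
    tree = json.loads(text)
    return [tweet["id"] for tweet in tree["statuses"]]


def tweet_ids_callbacks(text):
    """Callback style: a handler runs once per object as the decoder finishes it."""
    ids = []

    def on_object(pairs):
        fields = dict(pairs)
        # Manual state management: tweets and user objects both carry an "id",
        # so the handler must tell them apart itself (here, by the "user" key).
        if "user" in fields and "id" in fields:
            ids.append(fields["id"])
        return None  # discard the object instead of keeping it in a tree

    json.loads(text, object_pairs_hook=on_object)
    return ids
```

Both functions return the same tweet IDs for the sample, but the second has to encode knowledge of the document layout inside the handler, which is the kind of complexity the summary attributes to streaming interfaces.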

On-Demand Parsing Approach

The On-Demand JSON parser, as described, offers a hybrid solution intended to combine the high performance typically associated with streaming parsers and the ease of use of DOM-like interfaces. Its key features include:

Lazy Evaluation: JSON components are parsed only when needed. For example, an array is presented as an iterator over its elements, and objects are accessed via key-value pairs on-the-fly, without pre-materializing the entire structure.
Flexible Data Access: On-Demand supports versatile type casting, allowing a number to be parsed as an integer, a floating-point value, or a string as required. When the schema is not known in advance, the type() method lets the programmer discover the type of a value at run time.
Index-Driven Parsing: An index is created during an initial pass over the document, denoting the positions of JSON nodes and structural characters. This indexed approach facilitates rapid navigation and parsing of the document's components without pre-building the entire tree.
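The interplay of indexing and lazy evaluation can be sketched as follows. This is a toy, scalar Python model written for this summary, not the authors' implementation: the function names (build_structural_index, find_field, iter_array) are hypothetical, the index is built one character at a time rather than with SIMD, the input is assumed to be well-formed JSON, and the materialization of a selected value is delegated to json.loads.

```python
import json


def build_structural_index(text):
    """First pass: record positions of { } [ ] : , and of each opening quote."""
    index, in_string, escaped = [], False, False
    for pos, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            index.append(pos)
        elif ch in "{}[]:,":
            index.append(pos)
    return index


def _value_end(text, index, j):
    """Return k such that index[k] is the ',' '}' or ']' that ends the value
    whose index entries (if it has any) start at j. Nested content is skipped
    by counting brackets, without parsing it."""
    depth = 0
    while j < len(index):
        c = text[index[j]]
        if c in "{[":
            depth += 1
        elif c in "}]":
            if depth == 0:
                return j
            depth -= 1
        elif c == "," and depth == 0:
            return j
        j += 1
    raise ValueError("truncated JSON document")


def find_field(text, index, key):
    """Look up `key` in the top-level object, materializing only its value."""
    if not index or text[index[0]] != "{":
        raise ValueError("expected a top-level JSON object")
    i = 1
    while i < len(index) and text[index[i]] != "}":
        colon = index[i + 1]  # index[i] is the key's opening quote
        end = _value_end(text, index, i + 2)
        if json.loads(text[index[i]:colon]) == key:
            return json.loads(text[colon + 1:index[end]])
        i = end + 1 if text[index[end]] == "," else end
    raise KeyError(key)


def iter_array(text, index):
    """Present a top-level array as a lazy iterator over its elements."""
    if not index or text[index[0]] != "[":
        raise ValueError("expected a top-level JSON array")
    start, i = index[0] + 1, 1
    if text[index[1]] == "]" and text[start:index[1]].strip() == "":
        return  # empty array
    while True:
        end = _value_end(text, index, i)
        yield json.loads(text[start:index[end]])
        if text[index[end]] == "]":
            return
        start, i = index[end] + 1, end + 1
```

With these helpers, find_field(doc, build_structural_index(doc), "retweets") decodes the keys it passes and the one requested value, while the values of other fields are skipped using the index alone. Likewise, calling next() on iter_array materializes only the first element of the array.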

Implementation and Performance

The implementation leverages modern CPU features, such as SIMD instructions, to accelerate JSON parsing. The authors benchmarked their On-Demand implementation against several state-of-the-art parsers including yyjson, RapidJSON, and JSON for Modern C++. The following tasks were used in benchmarking:

json2msgpack: Conversion of JSON to MessagePack format, covering the complete document.
partial tweets: Extracting specific fields from each tweet in a Twitter dataset.
distinct user: Gathering unique user IDs from tweets and retweets.
find tweet: Locating a tweet based on its ID.
top tweet: Identifying the most retweeted tweet.
kostya benchmark: Processing JSON objects containing triples of floating-point values.
large random dataset: Similar to kostya but with larger synthetic data.
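To make the tweet-oriented tasks concrete, here is a minimal sketch of what three of them compute, written against Python's standard json module (a DOM-style parser). The sample data and the field names (statuses, id, text, retweet_count, user, retweeted_status) are assumptions modeled on the Twitter API format; the summary does not spell out the exact fields the benchmarks read.

```python
import json

# Hypothetical miniature dataset; field names are assumptions for illustration.
SAMPLE = json.loads("""
{"statuses": [
  {"id": 1, "text": "a", "retweet_count": 3, "user": {"id": 10}},
  {"id": 2, "text": "b", "retweet_count": 9, "user": {"id": 11},
   "retweeted_status": {"id": 3, "text": "c", "retweet_count": 9, "user": {"id": 12}}},
  {"id": 4, "text": "d", "retweet_count": 0, "user": {"id": 10}}
]}
""")


def distinct_user_ids(doc):
    """distinct user: unique user IDs from tweets and from the tweets they retweet."""
    ids = set()
    for tweet in doc["statuses"]:
        ids.add(tweet["user"]["id"])
        retweet = tweet.get("retweeted_status")
        if retweet is not None:
            ids.add(retweet["user"]["id"])
    return sorted(ids)


def find_tweet(doc, tweet_id):
    """find tweet: return the text of the tweet with the given ID, or None."""
    for tweet in doc["statuses"]:
        if tweet["id"] == tweet_id:
            return tweet["text"]
    return None


def top_tweet(doc):
    """top tweet: the top-level tweet with the highest retweet count."""
    return max(doc["statuses"], key=lambda t: t["retweet_count"])
```

Note that find_tweet can stop as soon as it sees the matching ID, yet json.loads has already parsed and materialized every tweet by then. Tasks of this shape are where a lazy interface has the most to gain.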

Findings

The On-Demand parser consistently outperformed other parsers across multiple benchmarks:

Processing Speed: It achieved a speed of over 3.3 GiB/s on average, peaking at 8.0 GiB/s for specific tasks like find tweet.
Instruction Efficiency: On-Demand showed a low instruction count per byte, averaging 3.1 instructions per byte, demonstrating its computational efficiency.
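As a rough aid for reading these figures, the helpers below convert a throughput in GiB/s into a parsing time and an instructions-per-byte figure into a total instruction count. They are back-of-the-envelope arithmetic only; the function names are hypothetical, and the reported numbers come from specific benchmarks and hardware, so they will not transfer exactly to other inputs or machines.

```python
GIB = 2**30  # bytes in one gibibyte


def seconds_to_parse(num_bytes, gib_per_second):
    """Time needed to process num_bytes at a sustained throughput."""
    return num_bytes / (gib_per_second * GIB)


def instructions_for(num_bytes, instructions_per_byte):
    """Total instructions retired at a given instructions-per-byte rate."""
    return num_bytes * instructions_per_byte


# A 1 GiB document at the reported 8.0 GiB/s peak takes 0.125 s.
peak_seconds = seconds_to_parse(GIB, 8.0)
# At the reported 3.3 GiB/s average, the same document takes about 0.3 s.
average_seconds = seconds_to_parse(GIB, 3.3)
```

At 3.1 instructions per byte, instructions_for(GIB, 3.1) gives roughly 3.3 billion instructions for a 1 GiB input.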

Implications and Future Work

The findings suggest that On-Demand provides a powerful alternative to traditional JSON parsers, especially where performance and memory efficiency are critical. In practice, this can benefit applications that process JSON at high throughput, such as web servers and database systems. In addition, its ease of use, comparable to that of DOM-based approaches, makes it suitable for general-purpose use.

Future directions for this research could include:

Language Porting: Implementing On-Demand parsers in other popular programming languages such as Java, Rust, and Go to broaden its applicability.
Extended Indexing: Exploring richer indexing techniques that can capture more detailed schema information, potentially enhancing performance further.
Heterogeneous Computing: Investigating the use of GPUs and FPGAs for the indexing phase, leveraging their parallel processing capabilities to accelerate the initial pass over the JSON document.

Conclusion

The On-Demand JSON parsing interface represents a significant step forward in the field of data interchange formats, offering a balanced approach that combines high performance and usability. The paper sets a foundation for future exploration into more sophisticated and flexible JSON parsing techniques. This paper's contributions are particularly relevant for developers and researchers focusing on optimized data parsing in performance-critical applications.

GitHub

GitHub - simdjson/simdjson: Parsing gigabytes of JSON per second : used by Facebook/Meta Velox, the Node.js runtime, WatermelonDB, Apache Doris, Milvus, StarRocks (20,802 stars)
