
On-Demand JSON: A Better Way to Parse Documents? (2312.17149v3)

Published 28 Dec 2023 in cs.DB and cs.PF

Abstract: JSON is a popular standard for data interchange on the Internet. Ingesting JSON documents can be a performance bottleneck. A popular parsing strategy consists in converting the input text into a tree-based data structure -- sometimes called a Document Object Model or DOM. We designed and implemented a novel JSON parsing interface -- called On-Demand -- that appears to the programmer like a conventional DOM-based approach. However, the underlying implementation is a pointer iterating through the content, only materializing the results (objects, arrays, strings, numbers) lazily. On recent commodity processors, an implementation of our approach provides superior performance in multiple benchmarks. To ensure reproducibility, our work is freely available as open source software. Several systems use On-Demand: e.g., Apache Doris, the Node.js JavaScript runtime, Milvus, and Velox.


Summary

  • The paper presents a novel JSON parsing method that leverages lazy evaluation and indexing to optimize performance without materializing entire documents.
  • It employs modern CPU features like SIMD to achieve processing speeds up to 8.0 GiB/s with just 3.1 instructions per byte on average.
  • The approach combines low memory overhead with DOM-like usability, making it ideal for high-throughput and flexible data parsing applications.

Introduction

The paper "On-Demand JSON: A Better Way to Parse Documents?" by John Keiser and Daniel Lemire introduces a JSON parsing approach designed to reduce the performance bottleneck of ingesting JSON documents. The approach, called On-Demand, departs from both the commonly used Document Object Model (DOM) strategy and streaming-based strategies.

Traditional Approaches to JSON Parsing

Traditional JSON parsing strategies include:

  1. DOM-Based Parsing: Converts the entire JSON document into a tree-like in-memory data structure, facilitating easy navigation and manipulation but at the cost of increased memory usage and unnecessary data materialization.
  2. Streaming-Based Parsing (e.g., SAX): Processes the document sequentially, triggering events as components are encountered. This method is memory efficient and suitable for extracting subsets of the data but tends to be complex for general-purpose tasks due to manual state management and potential inefficiencies.
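
The contrast between the two styles can be sketched in Python. This is only an illustration under stated assumptions: the `statuses`, `id`, and `text` field names are made up for the example, and the standard library's `object_pairs_hook` callback is used as a stand-in for an event-driven parser. It is not a true SAX parser, since the decoder still builds each object's fields before invoking the callback.

```python
import json

# Hypothetical document; the field names are assumptions for illustration.
DOC = '{"statuses": [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]}'


def find_text_dom(doc, wanted_id):
    """DOM style: build the whole tree first, then navigate it."""
    tree = json.loads(doc)  # every object, array, and string is materialized
    for status in tree["statuses"]:
        if status["id"] == wanted_id:
            return status["text"]
    return None


def find_text_callbacks(doc, wanted_id):
    """Callback style: react to each object as the decoder completes it."""
    found = []  # state that the handler has to manage by hand

    def on_object(pairs):
        fields = dict(pairs)
        if fields.get("id") == wanted_id and "text" in fields:
            found.append(fields["text"])
        return None  # drop the object instead of keeping it in a tree

    json.loads(doc, object_pairs_hook=on_object)
    return found[0] if found else None
```

Both functions return the same answer for `wanted_id=2`, but the callback version needs explicit state (`found`) and gives up easy navigation, which is the usability cost the list above attributes to streaming parsers.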

On-Demand Parsing Approach

The On-Demand JSON parser, as described, offers a hybrid solution intended to combine the high performance typically associated with streaming parsers and the ease of use of DOM-like interfaces. Its key features include:

  • Lazy Evaluation: JSON components are parsed only when needed. For example, an array is presented as an iterator over its elements, and objects are accessed via key-value pairs on-the-fly, without pre-materializing the entire structure.
  • Flexible Data Access: On-Demand supports versatile type casting, allowing numbers to be parsed as integers, floating points, or strings as required. Its type() method lets a program inspect the type of a value at run time, so documents whose schema is not known in advance can still be processed.
  • Index-Driven Parsing: An index is created during an initial pass over the document, denoting the positions of JSON nodes and structural characters. This indexed approach facilitates rapid navigation and parsing of the document's components without pre-building the entire tree.
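
The combination of an index and lazy materialization can be sketched as follows. This is a minimal scalar illustration, not the authors' implementation (which relies on SIMD instructions): `build_index` and `find_field` are hypothetical names, keys containing escape sequences are not handled, and the input is assumed to be valid JSON.

```python
import json

STRUCTURAL = set("{}[]:,")


def build_index(doc):
    """First pass: offsets of structural characters and of value starts."""
    index = []
    in_string = False
    escaped = False
    expect_scalar = True  # a scalar may begin after a structural character
    for pos, ch in enumerate(doc):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue  # characters inside strings are never indexed
        if ch == '"':
            in_string = True
            index.append(pos)
            expect_scalar = False
        elif ch in STRUCTURAL:
            index.append(pos)
            expect_scalar = True
        elif not ch.isspace() and expect_scalar:
            index.append(pos)  # first character of a number, true, false, null
            expect_scalar = False
    return index


def find_field(doc, index, key):
    """Walk the index and parse only the value of `key` in the top-level object."""
    depth = 0
    quoted = '"' + key + '"'  # assumption: the key contains no escapes
    for i, pos in enumerate(index):
        ch = doc[pos]
        if ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
        elif ch == '"' and depth == 1 and i + 2 < len(index):
            is_key = doc[index[i + 1]] == ":"
            if is_key and doc.startswith(quoted, pos):
                # Materialize just this one value, on request.
                value, _ = json.JSONDecoder().raw_decode(doc, index[i + 2])
                return value
    raise KeyError(key)
```

For example, with `doc = '{"user": {"id": 7, "name": "a{b"}, "id": 42, "tags": [1, 2]}'`, calling `find_field(doc, build_index(doc), "id")` skips over the nested `user` object using only the index and decodes the single top-level value. Fields that are never requested are never converted into Python objects.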

Implementation and Performance

The implementation leverages modern CPU features, such as SIMD instructions, to accelerate JSON parsing. The authors benchmarked their On-Demand implementation against several state-of-the-art parsers including yyjson, RapidJSON, and JSON for Modern C++. The following tasks were used in benchmarking:

  1. json2msgpack: Conversion of JSON to MessagePack format, covering the complete document.
  2. partial tweets: Extracting specific fields from each tweet in a Twitter dataset.
  3. distinct user: Gathering unique user IDs from tweets and retweets.
  4. find tweet: Locating a tweet based on its ID.
  5. top tweet: Identifying the most retweeted tweet.
  6. kostya benchmark: Processing JSON objects containing triples of floating-point values.
  7. large random dataset: Similar to kostya but with larger synthetic data.
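
To make the tweet-oriented tasks concrete, here is a sketch of three of them written in the conventional DOM style with Python's standard `json` module. The field names (`statuses`, `id`, `user`, `retweeted_status`, `retweet_count`) are assumptions modeled on the usual Twitter API layout, and the paper's exact task definitions may differ in detail.

```python
import json

# Tiny stand-in for the Twitter dataset; structure and values are assumed.
SAMPLE = json.dumps({"statuses": [
    {"id": 1, "retweet_count": 3, "user": {"id": 10}},
    {"id": 2, "retweet_count": 9, "user": {"id": 11},
     "retweeted_status": {"id": 0, "retweet_count": 9, "user": {"id": 10}}},
]})


def distinct_users(doc):
    """distinct user: unique user IDs from tweets and retweets."""
    ids = set()
    for tweet in json.loads(doc)["statuses"]:
        ids.add(tweet["user"]["id"])
        if "retweeted_status" in tweet:
            ids.add(tweet["retweeted_status"]["user"]["id"])
    return sorted(ids)


def find_tweet(doc, tweet_id):
    """find tweet: locate a tweet by its ID."""
    for tweet in json.loads(doc)["statuses"]:
        if tweet["id"] == tweet_id:
            return tweet
    return None


def top_tweet(doc):
    """top tweet: the tweet with the highest retweet count."""
    return max(json.loads(doc)["statuses"], key=lambda t: t["retweet_count"])
```

Each function touches only a few fields, yet `json.loads` materializes the whole document every time. Tasks of this shape are where a lazy interface has the most room to save work.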

Findings

The On-Demand parser consistently outperformed other parsers across multiple benchmarks:

  • Processing Speed: It achieved a speed of over 3.3 GiB/s on average, peaking at 8.0 GiB/s for specific tasks like find tweet.
  • Instruction Efficiency: On-Demand showed a low instruction count per byte, averaging 3.1 instructions per byte, demonstrating its computational efficiency.
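
To put these figures in more familiar units, the short sketch below converts GiB/s to GB/s and estimates the time and instruction count for a document of a given size. The helper names are hypothetical, and the estimates simply scale the averages quoted above; they are not measurements.

```python
GIB = 2**30  # bytes in one gibibyte


def gib_to_gb_per_s(gib_per_s):
    """Convert a throughput in GiB/s to GB/s (10**9 bytes per second)."""
    return gib_per_s * GIB / 1e9


def seconds_to_parse(size_bytes, gib_per_s):
    """Time to process size_bytes at a sustained throughput in GiB/s."""
    return size_bytes / (gib_per_s * GIB)


def instructions_needed(size_bytes, instructions_per_byte):
    """Total instructions retired at a given instructions-per-byte rate."""
    return size_bytes * instructions_per_byte


print(gib_to_gb_per_s(3.3))         # about 3.54 GB/s
print(gib_to_gb_per_s(8.0))         # about 8.59 GB/s
print(seconds_to_parse(GIB, 3.3))   # about 0.30 s for a 1 GiB document
print(instructions_needed(GIB, 3.1))  # about 3.3 billion instructions
```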

Implications and Future Work

The findings suggest that On-Demand provides a powerful alternative to traditional JSON parsers, especially where performance and memory efficiency are critical. In practice, this can benefit applications with high-throughput JSON processing requirements, such as web servers and database systems. Its ease of use, comparable to that of DOM-based approaches, also makes it suitable for general-purpose use.

Future directions for this research could include:

  • Language Porting: Implementing On-Demand parsers in other popular programming languages such as Java, Rust, and Go to broaden its applicability.
  • Extended Indexing: Exploring richer indexing techniques that can capture more detailed schema information, potentially enhancing performance further.
  • Heterogeneous Computing: Investigating the use of GPUs and FPGAs for the indexing phase, leveraging their parallel processing capabilities to accelerate the initial pass over the JSON document.

Conclusion

The On-Demand JSON parsing interface is a notable step forward for JSON parsing, combining high performance with usability. The paper lays a foundation for further work on flexible, high-performance JSON parsing, and its contributions are particularly relevant to developers and researchers working on performance-critical applications.
