On-Demand JSON: A Better Way to Parse Documents? (2312.17149v3)
Abstract: JSON is a popular standard for data interchange on the Internet, and ingesting JSON documents can be a performance bottleneck. A popular parsing strategy consists of converting the input text into a tree-based data structure, sometimes called a Document Object Model (DOM). We designed and implemented a novel JSON parsing interface, called On-Demand, that appears to the programmer like a conventional DOM-based approach. However, the underlying implementation is a pointer iterating through the content, materializing the results (objects, arrays, strings, numbers) only lazily. On recent commodity processors, an implementation of our approach provides superior performance in multiple benchmarks. To ensure reproducibility, our work is freely available as open source software. Several systems use On-Demand, e.g., Apache Doris, the Node.js JavaScript runtime, Milvus, and Velox.
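To make the idea of a DOM-like interface backed by a lazily advancing pointer more concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the names `LazyValue`, `_skip_value`, and `_skip_string` are hypothetical, the input is assumed to be valid JSON, keys are compared without decoding escapes, and each lookup rescans its object from the start rather than following any particular iteration discipline. The only point it illustrates is that fields the programmer never asks for are skipped over as raw text, and a value is converted into a Python object only when `get()` is called.

```python
import json

_decoder = json.JSONDecoder()
_WS = " \t\r\n"


def _skip_ws(text, pos):
    """Return the index of the first non-whitespace character at or after pos."""
    while pos < len(text) and text[pos] in _WS:
        pos += 1
    return pos


def _skip_string(text, pos):
    """text[pos] is an opening quote; return the index just past the closing quote."""
    pos += 1
    while text[pos] != '"':
        pos += 2 if text[pos] == "\\" else 1  # step over escaped characters
    return pos + 1


def _skip_value(text, pos):
    """Advance past one JSON value without building any Python object."""
    pos = _skip_ws(text, pos)
    ch = text[pos]
    if ch == '"':
        return _skip_string(text, pos)
    if ch in "{[":
        depth = 0
        while True:
            ch = text[pos]
            if ch == '"':
                pos = _skip_string(text, pos)
                continue
            if ch in "{[":
                depth += 1
            elif ch in "}]":
                depth -= 1
                if depth == 0:
                    return pos + 1
            pos += 1
    # Number, true, false, or null: run until the next delimiter.
    while pos < len(text) and text[pos] not in ",}]" + _WS:
        pos += 1
    return pos


class LazyValue:
    """A cursor into the raw JSON text (hypothetical name, illustration only)."""

    def __init__(self, text, pos=0):
        self.text = text
        self.pos = _skip_ws(text, pos)

    def __getitem__(self, key):
        """Scan an object for `key`, skipping the values of all other fields."""
        text = self.text
        if text[self.pos] != "{":
            raise TypeError("value under the cursor is not an object")
        pos = _skip_ws(text, self.pos + 1)
        while text[pos] != "}":
            end = _skip_string(text, pos)
            name = text[pos + 1:end - 1]       # raw key, escapes not decoded
            pos = _skip_ws(text, end)          # now at ':'
            pos = _skip_ws(text, pos + 1)      # now at the start of the value
            if name == key:
                return LazyValue(text, pos)    # nothing has been parsed yet
            pos = _skip_ws(text, _skip_value(text, pos))
            if text[pos] == ",":
                pos = _skip_ws(text, pos + 1)
        raise KeyError(key)

    def get(self):
        """Materialize the value under the cursor as a Python object."""
        value, _ = _decoder.raw_decode(self.text, self.pos)
        return value


doc = '{"user": {"id": 7, "tags": ["a", "b{"]}, "count": 42, "name": "x"}'
count = LazyValue(doc)["count"].get()       # "user" is skipped, never parsed
user_id = LazyValue(doc)["user"]["id"].get()
```

In this sketch the access `LazyValue(doc)["count"]` reads like a lookup on a parsed tree, yet the nested `user` object is only scanned for its closing brace and never turned into a dictionary. A production parser would also need to validate the input, which this toy does not attempt.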