Efficient JSONL Substructure Search
- JSONL substructure search is a technique that efficiently identifies complex, nested JSON fragments by mapping query trees to JSON object trees.
- It leverages advanced declarative pattern languages like JPQ and optimized data structures such as jXBW and the eXtended Burrows-Wheeler Transform to achieve scalable, low-latency queries.
- Empirical results demonstrate significant speedups and memory efficiency improvements, making the approach ideal for foundation model prompt engineering and large-scale data retrieval.
JSONL substructure search is the task of efficiently identifying instances of complex, nested JSON fragments (subtrees) that match a given query pattern within large JSON Lines (JSONL) datasets. This problem connects to pattern-based querying, tree-structured data representation, and scalable indexing for high-throughput workloads such as those in foundation model prompt construction and information retrieval. Recent research focuses on both the expressive power of declarative pattern languages and advanced data structures that enable scalable, low-latency search beyond naïve tree traversal.
1. Formal Problem Statement and Data Model
In JSONL substructure search, each line of a JSONL file is parsed as an independent JSON object $J_i$. Each object induces a rooted, labeled tree $T_i = (V_i, E_i, \ell_i)$, where nodes are fields, array values, or leaves, and $\ell_i$ maps each node to a label in some finite alphabet $\Sigma$ (object keys, integer indices, or atomic values) (Tabei, 18 Aug 2025).
A query $Q$ is a JSON fragment, also formalized as a tree $T_Q = (V_Q, E_Q, \ell_Q)$. The substructure matching problem is: for each $T_i$, does there exist an injective map $\phi: V_Q \to V_i$ such that
- $\ell_Q(u) = \ell_i(\phi(u))$ for every node $u \in V_Q$
- $\phi(u)$ is the parent of $\phi(v)$ in $T_i$ for every edge $(u, v) \in E_Q$
- For arrays, the sibling order is preserved; for objects, ordering is ignored.
The goal is to return all $i$ such that $Q$ is a substructure of $T_i$ under such a mapping (Tabei, 18 Aug 2025). In the JPQ pattern-query model, the document data model is a simplified JSON-like hierarchy (JHM), supporting atoms, arrays, and objects with unique keys within each object (Li et al., 2015).
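The matching condition can be made concrete with a brute-force checker. The following is a minimal sketch, not the method of either cited paper: the function names are hypothetical, and it assumes that "sibling order is preserved" means a query array must match an order-preserving subsequence of the data array.

```python
import json

def matches_at(query, node):
    """True if `query` embeds into `node` with both roots aligned."""
    if isinstance(query, dict):
        # Objects: every query key must exist; key order is ignored.
        return isinstance(node, dict) and all(
            k in node and matches_at(v, node[k]) for k, v in query.items()
        )
    if isinstance(query, list):
        # Arrays (assumption): query elements must appear in the same
        # relative order, matched greedily as a subsequence.
        if not isinstance(node, list):
            return False
        pos = 0
        for q in query:
            while pos < len(node) and not matches_at(q, node[pos]):
                pos += 1
            if pos == len(node):
                return False
            pos += 1
        return True
    # Atoms: same type and same value (so True does not match 1).
    return type(query) is type(node) and query == node

def contains(query, node):
    """True if `query` matches at `node` or anywhere below it."""
    if matches_at(query, node):
        return True
    if isinstance(node, dict):
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        children = ()
    return any(contains(query, child) for child in children)

def search_jsonl(lines, query):
    """Return 0-based line numbers whose object contains `query`."""
    return [
        i for i, line in enumerate(lines)
        if line.strip() and contains(query, json.loads(line))
    ]

lines = [
    '{"user": {"name": "ann", "tags": ["a", "b", "c"]}}',
    '{"user": {"name": "bob", "tags": ["c", "a"]}}',
    '{"items": [{"user": {"name": "ann"}}]}',
]
hits_user = search_jsonl(lines, {"user": {"name": "ann"}})
hits_tags = search_jsonl(lines, {"tags": ["a", "c"]})
```

Every query rescans every object, which is the baseline that indexed approaches aim to avoid.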
2. Pattern Languages for Substructure Extraction
Declarative pattern languages such as JPQ formalize queries as compositional, tree-like patterns that go well beyond static JSONPath or simple projection (Li et al., 2015). Extraction patterns in JPQ are of two main types:
- Key-value patterns: Match keys and descend into values, supporting variables, string predicates, wildcards, conjunction, and disjunction.
- Value patterns: Match values, atomic variables, wildcards, object and array structures, pairing (conjunctive matching), option patterns, and recursive enumeration (child and descendant).
Fundamental grammar rules allow for nested pattern matching and binding of variables to deeply nested fields, so a single pattern can extract multiple interdependent fields in one shot.
JPQ’s logical environment semantics provides a basis for precise reasoning about pattern matching, including environment extraction, conjunctive and disjunctive composition, and enumeration rules. Resulting matches can be declaratively reshaped and filtered using constraining and transformation clauses, with a term-rewriting system dictating output shape (Li et al., 2015).
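To illustrate what binding variables to nested fields and returning an environment can look like, here is a toy matcher. It is only a sketch and does not use JPQ syntax or reproduce JPQ semantics: `Var`, `ANY`, and `match` are hypothetical names, patterns are plain Python dictionaries, and the rule that a repeated variable must bind to equal values is an assumption of this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    """A pattern variable that captures whatever value it is matched against."""
    name: str

ANY = object()  # wildcard: matches any value without binding anything

def match(pattern, value, env=None):
    """Return a dict of variable bindings, or None if the pattern fails."""
    env = dict(env or {})
    if pattern is ANY:
        return env
    if isinstance(pattern, Var):
        # Assumption: a variable seen twice must bind to equal values.
        if pattern.name in env and env[pattern.name] != value:
            return None
        env[pattern.name] = value
        return env
    if isinstance(pattern, dict):
        # Object pattern: conjunction of key-value sub-patterns.
        if not isinstance(value, dict):
            return None
        for key, sub in pattern.items():
            if key not in value:
                return None
            env = match(sub, value[key], env)
            if env is None:
                return None
        return env
    # Literal pattern: must equal the value.
    return env if pattern == value else None

doc = {"order": {"id": 7, "status": "open",
                 "customer": {"name": "ann", "city": "Oslo"}}}
pattern = {"order": {"id": Var("oid"),
                     "customer": {"name": Var("who"), "city": ANY}}}
bindings = match(pattern, doc)
```

The single pattern pulls the order id and the customer name out of different nesting levels in one pass, which is the kind of interdependent extraction the section describes.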
3. Data Structures and Indexing for Scalable Search
Naïve substructure search visits each tree $T_i$ in the corpus, traverses every node, and attempts to match the query $Q$ at that node via subtree isomorphism, so the cost of a single query grows with the total number of nodes in the corpus (Tabei, 18 Aug 2025). This is computationally prohibitive for large-scale corpora.
The jXBW approach addresses this by merging all corpus trees into a single merged tree (MT), akin to a trie of all root-to-node label sequences. Leaves of MT store the sets of input object indices that share a given path. Both the size of the merged tree and the cost of the divide-and-conquer merging scheme are bounded in terms of the total number of nodes across all trees (Tabei, 18 Aug 2025).
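The merged tree can be pictured as an ordinary trie over label paths. The sketch below is a simplified illustration under stated assumptions: it inserts objects one at a time rather than using the divide-and-conquer merge, it uses pointer-based nodes, and the class and function names (`MTNode`, `insert`, `build_merged_tree`) and the label encoding are hypothetical.

```python
import json

class MTNode:
    """One node of the merged tree: child links plus tree IDs (at leaves)."""
    def __init__(self):
        self.children = {}   # label -> MTNode
        self.ids = set()     # IDs of input objects that end at this leaf

def insert(node, value, tree_id):
    """Merge one JSON value into the trie rooted at `node`."""
    if isinstance(value, dict) and value:
        items = [(("key", k), v) for k, v in value.items()]
    elif isinstance(value, list) and value:
        items = [(("idx", i), v) for i, v in enumerate(value)]
    else:
        # Atoms (and empty containers) become leaves that record the tree ID.
        leaf = node.children.setdefault(("atom", json.dumps(value)), MTNode())
        leaf.ids.add(tree_id)
        return
    for label, child in items:
        insert(node.children.setdefault(label, MTNode()), child, tree_id)

def build_merged_tree(lines):
    root = MTNode()
    for tree_id, line in enumerate(lines):
        insert(root, json.loads(line), tree_id)
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())

lines = [
    '{"a": {"b": 1}, "c": [1, 2]}',
    '{"a": {"b": 1}, "c": [1, 3]}',
]
mt = build_merged_tree(lines)
shared_leaf = mt.children[("key", "a")].children[("key", "b")].children[("atom", "1")]
```

In this example the two inputs have nine nodes each, but the merged tree has ten nodes in total because shared paths are stored once.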
jXBW then applies the eXtended Burrows-Wheeler Transform (XBW) to MT, producing synchronized arrays:
- A label array: labels per node, wavelet-matrix indexed
- A last-sibling bit vector: marks rightmost siblings
- A leaf bit vector: marks leaves
- A tree-ID array: compacted sets of tree IDs for leaves
These structures enable succinct indexing and efficient navigation, including primitive queries (children, parent, subtree, ranked child retrieval) and subpath search, with each underlying rank/select step running in constant time or in time logarithmic in the alphabet size.
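The synchronized arrays can be illustrated with a naive construction of the XBW ordering: nodes are sorted stably by the label string on their upward path (parent to root), and the label, last-sibling, and leaf flags are read off in that order. This is a sketch only. It uses plain Python lists instead of wavelet matrices and succinct bit vectors, omits the tree-ID array, and the function name `xbw_arrays` and the `(label, children)` tuple encoding are assumptions of the example.

```python
def xbw_arrays(tree):
    """Build label / last-sibling / leaf arrays in XBW order.

    `tree` is a nested tuple (label, [children]); labels are strings.
    """
    rows = []

    def visit(node, upward, is_last):
        label, children = node
        # `upward` holds the labels from the parent up to the root.
        rows.append((upward, label, is_last, not children))
        for i, child in enumerate(children):
            visit(child, (label,) + upward, i == len(children) - 1)

    visit(tree, (), True)
    # Python's sort is stable, so nodes with equal upward paths keep
    # their preorder (left-to-right sibling) order.
    rows.sort(key=lambda row: row[0])
    labels = [row[1] for row in rows]
    last = [row[2] for row in rows]
    leaf = [row[3] for row in rows]
    return labels, last, leaf

tree = ("r", [("a", [("x", []), ("y", [])]),
              ("b", [("x", [])])])
labels, last, leaf = xbw_arrays(tree)
```

Because siblings share the same upward path, the children of any node occupy one contiguous block of the arrays, and the last-sibling flags mark where each block ends. That property is what lets rank/select operations stand in for pointers.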
4. Substructure Search Algorithms
In the jXBW framework, substructure search is executed in three steps:
- Path decomposition: Extract all root-to-leaf label sequences in the query tree $Q$.
- Subpath search: Using jXBW, locate the corresponding ranges in MT for each query path via efficient wavelet and rank/select operations.
- Common ancestor and tree ID collection: Intersect ancestors for all query paths to identify candidate roots for $Q$; for each such root, verify the structure match (especially for arrays and objects). Retrieve the set of input tree IDs containing substructure matches (Tabei, 18 Aug 2025).
Total runtime depends on the number of root-to-leaf paths in $Q$, the average depth, the average branching factor, the total result count, and the number of candidate root positions.
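The first two steps can be sketched with ordinary Python containers. This is an illustration under stated assumptions, not the jXBW algorithm: a dictionary keyed by path suffixes stands in for the succinct subpath search, array elements are labeled by position, the common-ancestor verification step is left out, and all function names are hypothetical. Without that last step the result is only a candidate set.

```python
import json

def label_paths(value, prefix=()):
    """Yield every root-to-leaf label sequence of a JSON value."""
    if isinstance(value, dict) and value:
        for key, child in value.items():
            yield from label_paths(child, prefix + (("key", key),))
    elif isinstance(value, list) and value:
        for i, child in enumerate(value):
            yield from label_paths(child, prefix + (("idx", i),))
    else:
        yield prefix + (("atom", json.dumps(value)),)

def build_suffix_index(lines):
    """Map each downward path that ends at a leaf to the IDs containing it."""
    index = {}
    for tree_id, line in enumerate(lines):
        for path in label_paths(json.loads(line)):
            for start in range(len(path)):
                index.setdefault(path[start:], set()).add(tree_id)
    return index

def candidate_ids(index, query):
    """Intersect the ID sets of all query paths (no structural verification)."""
    sets = [index.get(path, set()) for path in label_paths(query)]
    return set.intersection(*sets) if sets else set()

lines = [
    '{"user": {"name": "ann", "city": "Oslo"}}',
    '{"a": {"user": {"name": "ann"}}, "b": {"user": {"city": "Oslo"}}}',
    '{"user": {"name": "bob", "city": "Oslo"}}',
]
index = build_suffix_index(lines)
candidates = candidate_ids(index, {"user": {"name": "ann", "city": "Oslo"}})
```

Here the candidate set is `{0, 1}`, but object 1 is a false positive: both query paths occur in it, yet under different `user` nodes. Intersecting ancestors and verifying the structure at each candidate root, as in the third step, is what removes such cases.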
5. Expressiveness, Filtering, and Reshaping
Pattern-based approaches, exemplified by JPQ, enable complex extraction and transformation in a single declarative query. Features include:
- Interdependent field extraction via tree-patterns
- Option patterns for heterogeneity
- Array and nested object matching
- Term-rewriting for customizable output data structure
- Advanced filtering: boolean predicates, quantified conditions, parallel and hierarchical (with par,with) filtering operators
This enables concise formulation of queries that are inexpressible or cumbersome using path-based schemes such as JSONPath. For example, extracting and grouping records, or applying nested filters, can be achieved without resorting to post-processing or host language loops (Li et al., 2015). However, richer pattern matching can require significant search, especially when extensive branching or recursive descent (//p) is involved.
6. Empirical Performance and Scalability
Experimental results with jXBW indicate strong performance benefits over prior techniques. On datasets ranging from tens of thousands to several million JSON objects, jXBW achieves 16×–4,700× speedups over tree-based baselines and even larger improvements relative to XML-based methods (Saxon/XQuery). For the largest dataset (osm_tokyo, with 6.6M objects), jXBW responds in 0.026 ms (mean), while pointer-based merged trees require over 73 ms (Tabei, 18 Aug 2025).
Memory usage for jXBW is competitive, using 15–30% less memory than pointer-based trees and scaling well as corpus size grows. Construction times are similar to pointer-based and succinct trees, with one-off cost amortized by repeated queries.
Summary Table: Query Latency (ms) across Methods
| Dataset | jXBW | Pointer Tree | Succinct Tree | Saxon (XML) |
|---|---|---|---|---|
| movies | 0.016 | 0.265 | 2.054 | 3.958 |
| electric_vehicle | 0.038 | 1.278 | 10.264 | 40.224 |
| mta_paratransit | 0.034 | 5.619 | 48.716 | 167.929 |
| osm_tokyo | 0.026 | 73.127 | 373.371 | 1954.94 |
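The per-dataset speedup factors implied by the table can be recomputed directly from its latencies. The snippet below only restates the table values and divides each baseline by the jXBW latency; the variable and function names are illustrative.

```python
# Mean query latency in milliseconds, copied from the table above.
latency_ms = {
    "movies": {"jXBW": 0.016, "Pointer Tree": 0.265,
               "Succinct Tree": 2.054, "Saxon (XML)": 3.958},
    "electric_vehicle": {"jXBW": 0.038, "Pointer Tree": 1.278,
                         "Succinct Tree": 10.264, "Saxon (XML)": 40.224},
    "mta_paratransit": {"jXBW": 0.034, "Pointer Tree": 5.619,
                        "Succinct Tree": 48.716, "Saxon (XML)": 167.929},
    "osm_tokyo": {"jXBW": 0.026, "Pointer Tree": 73.127,
                  "Succinct Tree": 373.371, "Saxon (XML)": 1954.94},
}

def speedups(row):
    """Return baseline latency divided by jXBW latency for one dataset."""
    base = row["jXBW"]
    return {method: ms / base for method, ms in row.items() if method != "jXBW"}

for dataset, row in latency_ms.items():
    factors = ", ".join(f"{m}: {s:,.0f}x" for m, s in speedups(row).items())
    print(f"{dataset}: {factors}")
```

For movies, the pointer-tree ratio is about 16.6; for osm_tokyo it is roughly 2,800, and the Saxon ratio is roughly 75,000.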
7. Practical Considerations, Extensions, and Impact
For foundation model applications, including prompt engineering with large context corpora, rapid substructure search enables real-time retrieval of targeted few-shot examples or knowledge slices, avoiding costly dataset scans (Tabei, 18 Aug 2025). Implementation guidance emphasizes symbol table optimization (minimizing the alphabet size), parallel tree merging, and wavelet matrix-based rank/select. Array-heavy queries benefit from cache-optimized child access.
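One way to read the symbol table advice: a wavelet matrix keeps one bit-vector level per bit of symbol width, so mapping labels to dense integer codes keeps the number of levels small. The sketch below shows such a mapping. The function names and the first-seen code assignment are assumptions of this example, not details taken from the jXBW implementation.

```python
def build_symbol_table(label_sequences):
    """Assign dense integer codes 0..sigma-1 to labels in first-seen order."""
    table = {}
    for sequence in label_sequences:
        for label in sequence:
            table.setdefault(label, len(table))
    return table

def bits_per_symbol(sigma):
    """Bits needed to encode sigma distinct symbols (at least one bit)."""
    return max(1, (sigma - 1).bit_length())

paths = [
    ("user", "name", "ann"),
    ("user", "city", "Oslo"),
    ("user", "name", "bob"),
]
table = build_symbol_table(paths)
encoded = [[table[label] for label in path] for path in paths]
width = bits_per_symbol(len(table))
```

With six distinct labels, three bits per symbol suffice, so a wavelet matrix over the encoded sequence would need three levels.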
Extensions proposed for the jXBW framework include approximate substructure search (edit-distance bounds), conversion to related formats (Avro, MessagePack), distributed indexing for scaling to >100 GB, and LLM-guided query expansion (using embeddings to suggest related subtrees for recall enhancement).
Pattern-based querying languages such as JPQ offer powerful declarative extraction and restructuring, raising the abstraction level of substructure search but requiring advanced pattern matching and term rewriting in the query processor (Li et al., 2015).
The demonstrated improvements in query latency and memory efficiency, along with better support for complex queries, establish JSONL substructure search as both a critical enabler and a moving target for scalable data engineering and downstream machine learning workflows.