SoftMatcha 2: Ultra-Fast Soft Pattern Matching
- SoftMatcha 2 is an ultra-fast algorithm enabling soft pattern matching in trillion-scale corpora with sub-millisecond exact lookup and sub-0.3s soft search.
- It employs a two-level, disk-aware suffix array and dynamic, corpus-aware pruning to efficiently generate semantic variants with minimal lookup overhead.
- Empirical evaluations demonstrate up to 475× performance improvement over previous methods, supporting large-scale multilingual text analyses.
SoftMatcha 2 is an ultra-fast, semantically relaxed pattern matching algorithm designed for interactive search over trillion-scale natural language corpora. It achieves sub-millisecond exact lookup and sub-0.3s soft search, supporting semantic variations such as substitutions, insertions, and deletions. The algorithm addresses the combinatorial challenges posed by semantic relaxation through disk-aware indexing and dynamic, corpus-aware pruning, enabling deployment across multi-terabyte text datasets and multilingual scenarios (Yoneda et al., 11 Feb 2026).
1. Problem Formulation and Objectives
SoftMatcha 2 addresses the "soft" pattern-matching problem over a massive tokenized corpus $\mathcal{C}$ containing $N$ tokens. For a query sequence $q = q_1 \cdots q_m$ and a specified hit count $k$, the goal is to return up to $k$ patterns occurring in $\mathcal{C}$ whose "soft" similarity to $q$ exceeds a tunable threshold $\alpha$.
"Softness" refers to allowing:
- Substitutions: Each aligned token pair $(q_i, t_i)$ is scored via the cosine similarity of its token embeddings, $s_i = \cos(\mathbf{e}_{q_i}, \mathbf{e}_{t_i})$; the per-token scores are aggregated with a smooth-min operator $\operatorname{softmin}_\tau(s_1, \dots, s_m)$ with temperature $\tau$, which approaches the hard minimum as $\tau \to 0$.
- Insertions and deletions: Extra (inserted or deleted) tokens incur a multiplicative penalty factor that decays with the whitened norm of the token's embedding and is calibrated so that a moderate edit costs roughly a factor of $1/e$ in similarity; a minimal scoring sketch follows below.
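The scoring rule can be summarized in a short sketch. This is a minimal illustration, not the released implementation: the softmin parametrization, the penalty weight `lam`, and the example embeddings are assumptions.

```python
import numpy as np

def smooth_min(scores, tau=0.1):
    """One common smooth-min form: -tau * log(mean(exp(-s / tau)))."""
    s = np.asarray(scores, dtype=float)
    return -tau * np.log(np.mean(np.exp(-s / tau)))

def soft_similarity(query_vecs, cand_vecs, extra_vecs=(), tau=0.1, lam=1.0):
    """Illustrative soft score: smooth-min over per-token cosine similarities,
    multiplied by an exponential penalty for each inserted/deleted token
    (extra_vecs holds the whitened embeddings of those extra tokens)."""
    cos = [
        float(q @ c / (np.linalg.norm(q) * np.linalg.norm(c)))
        for q, c in zip(query_vecs, cand_vecs)
    ]
    score = smooth_min(cos, tau)
    for v in extra_vecs:  # each extra token costs exp(-lam * ||v||)
        score *= np.exp(-lam * np.linalg.norm(v))
    return float(score)
```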
Performance targets include:
- Exact lookup: under 0.4 ms (p95) on a 1.4T-token corpus
- Soft search: top-20 hits in under 0.3 s (p95) at trillion scale
2. Methodology: Indexing and Pruning
2.1 Disk-Aware Suffix Arrays
The corpus is indexed by a lexicographically sorted suffix array $\mathrm{SA}$, where each entry $\mathrm{SA}[i]$ references the corpus suffix $\mathcal{C}[\mathrm{SA}[i]{:}]$. Direct binary search on $\mathrm{SA}$ is impractical due to the large number of random disk reads at SSD latency. To address this, SoftMatcha 2 employs a two-level index:
- Upper level (in RAM): a sparse array holding every $B$-th entry of $\mathrm{SA}$, for a block size $B$ of up to $256$
- Lower level (on disk): the complete array $\mathrm{SA}$, compressed with run-length encoding
Lookup for a query sequence proceeds by binary search over the in-RAM sparse array to identify the target block, followed by exactly one random SSD page read to fetch that block from the on-disk array. The cost is $O(\log N)$ comparisons with a single random disk access, and compression shrinks the index from roughly 60 TB to 21 TB for a 1.4T-token corpus.
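The two-level lookup can be sketched as follows; this is a simplified illustration that assumes fixed-width on-disk entries and ignores the run-length compression, with `sparse_keys` standing in for the in-RAM sample of every $B$-th suffix key.

```python
import bisect
import struct

ENTRY_BYTES = 8   # assumed width of one on-disk suffix-array entry
B = 256           # assumed number of entries per on-disk block (one SSD page)

def lookup_block(query_key, sparse_keys, sa_file):
    """Two-level lookup sketch: binary search in RAM, then one block read from disk.

    sparse_keys[j] is the sort key (e.g., leading tokens of the suffix) at
    position j * B of the full suffix array stored in sa_file.
    """
    # 1) In-RAM binary search over the sparse sample to locate the target block.
    j = max(bisect.bisect_right(sparse_keys, query_key) - 1, 0)

    # 2) Exactly one random SSD read: fetch the j-th block of the full array.
    sa_file.seek(j * B * ENTRY_BYTES)
    raw = sa_file.read(B * ENTRY_BYTES)
    offsets = struct.unpack(f"<{len(raw) // ENTRY_BYTES}q", raw)

    # 3) The final comparison against corpus suffixes at these offsets is
    #    finished in memory (elided here).
    return offsets
```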
2.2 Dynamic, Corpus-Aware Pruning
Exhaustive enumeration of all edit-distance variants for "soft" matches would incur a cost exponential in the query length. SoftMatcha 2 instead interleaves pruning with enumeration:
- At each query prefix length $i$, only continuations that keep the soft similarity above the threshold $\alpha$ are generated.
- Iterative filtering: Immediate corpus filtering eliminates dead candidate branches.
- k-gram caching: Existence of frequent bi/trigrams is precomputed for in-memory lookup, avoiding unnecessary disk operations.
- Last-bits pruning: For rare prefixes (≤50 corpus occurrences), a direct sweep is performed rather than extending semantically and suffix-searching each possible continuation.
Together, these strategies sharply reduce the mean number of exact lookups per query token relative to unpruned enumeration (see the Pruning Effectiveness table in Section 4).
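The interplay of enumeration and pruning can be illustrated with a prefix-extension loop. The helpers `neighbors`, `kgram_exists`, `count_occurrences`, and `sweep_rare` are hypothetical stand-ins for the paper's components (embedding-neighbor generation, the cached k-gram table, suffix-array counting, and the last-bits direct sweep), and the similarity update is simplified to a running minimum.

```python
def soft_search(query_tokens, alpha, neighbors, kgram_exists,
                count_occurrences, sweep_rare, last_bits_threshold=50):
    """Sketch of dynamic, corpus-aware pruning interleaved with enumeration."""
    candidates = [((), 1.0)]   # (variant prefix, similarity so far)
    hits = []
    for q in query_tokens:
        extended = []
        for prefix, sim in candidates:
            for tok, tok_sim in neighbors(q):       # semantic substitutes of q
                new_sim = min(sim, tok_sim)         # simplified similarity update
                if new_sim < alpha:
                    continue                        # similarity pruning
                seq = prefix + (tok,)
                if len(seq) <= 3 and not kgram_exists(seq):
                    continue                        # cached k-gram check, no disk I/O
                n = count_occurrences(seq)          # one suffix-array lookup
                if n == 0:
                    continue                        # iterative corpus filtering
                if n <= last_bits_threshold:
                    # Last-bits pruning: the prefix is rare, so scan its few
                    # occurrences directly instead of extending it further.
                    hits.extend(sweep_rare(seq, query_tokens, new_sim))
                else:
                    extended.append((seq, new_sim))
        candidates = extended
    hits.extend(candidates)
    return hits
```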
3. Theoretical Guarantees
Let $V_i$ denote the set of all length-$i$ semantic variants of the query with sufficient similarity, and $A_i \subseteq V_i$ the subset that actually occurs in $\mathcal{C}$. Define $G_i$ as the set of corpus $i$-grams.
- Hypothesis 1: bounds how many of the variants in $V_i$ appear among the corpus $i$-grams $G_i$.
- Hypothesis 2: asserts the existence of a constant bounding the surviving candidate sets $A_i$ uniformly over prefix lengths $i$.
Under these, two core results are established:
- No exponential explosion (Theorem 3.1): the number of exact lookups does not grow with the query length for fixed search parameters.
- Sub-linear scaling in corpus size (Theorem 3.2): if the $n$-gram frequencies in $\mathcal{C}$ follow a Zipfian distribution with exponent $s$, the expected number of lookups grows sub-linearly in the corpus size $N$.
This demonstrates that pruning exploits natural language's heavy-tailed distributions to avoid the naive blowup.
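As an informal illustration of this point (not a reproduction of the paper's analysis), the following simulation draws synthetic corpus $n$-grams from a Zipfian distribution and counts how many members of a fixed candidate set actually appear as the corpus grows; the vocabulary size, exponent, and candidate-set size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, s = 1_000_000, 1.1                  # assumed n-gram vocabulary size and Zipf exponent
probs = np.arange(1, V + 1, dtype=float) ** (-s)
probs /= probs.sum()

candidates = rng.choice(V, size=5_000, replace=False)   # a fixed set of semantic variants

for N in (10**5, 10**6, 10**7):
    corpus = rng.choice(V, size=N, p=probs)              # synthetic corpus n-gram draws
    surviving = np.intersect1d(candidates, np.unique(corpus))
    # The number of surviving candidates grows far slower than N: most variants
    # never occur, so corpus filtering keeps the search frontier small.
    print(f"corpus n-grams: {N:>9,}  surviving candidates: {len(surviving)}")
```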
4. Empirical Evaluation
4.1 Experimental Setup
Experiments are conducted on AWS i4i.32xlarge instances (128 vCPUs, 1 TB RAM, 30 TB NVMe SSD) over:
- FineWeb-Edu: 1.375 T tokens (EN)
- C4-ja: 169 B (JA)
- C4-zh: 38.3 B (ZH)
- Additional languages: DE, FR, IT, RU (~1 B tokens each)
Tokenization and embeddings use Moses + GloVe (300 K vocabulary) for English and ICU + fastText (524 K vocabulary) for the other languages.
4.2 Performance Benchmarks
Exact Lookup Latency (p95):
| Corpus Size | infini-gram | SoftMatcha 2 |
|---|---|---|
| 273 B | 151.8 ms | 0.32 ms |
| 1.4 T | 11.05 ms | 0.34 ms |
SoftMatcha 2 outperforms infini-gram by roughly 475× on the 273 B-token corpus and roughly 33× at 1.4 T tokens.
Soft Search Latency (p95):
| Corpus | SoftMatcha 1 | SoftMatcha 2 |
|---|---|---|
| 50 B EN | 0.15 s | 0.09 s |
| 500 B EN | 0.94 s | 0.16 s |
| 1.4 T EN | – (timeout) | 0.28 s |
On Japanese and Chinese corpora, SoftMatcha 2 likewise completes soft search at these scales; SoftMatcha 1 fails above 50 B tokens.
Index Build Time and Disk Usage
| Index Method | Build Time | Index Size | Text Size |
|---|---|---|---|
| SoftMatcha 2 | 53.8 h | 21.6 TB | 6.7 TB |
| infini-gram | 61.2 h | 9.9 TB | – |
SoftMatcha 2 handles multilingual corpora at scales where SoftMatcha 1 and infini-gram mini time out (above 50 B tokens).
Pruning Effectiveness
| Pruning Strategy | Lookups per Token |
|---|---|
| No Pruning | 5.17 |
| Iterative Only | ~3.2 |
| + k-gram | ~2.7 |
| + last-bits | 2.27 |
Benchmark Contamination Detection
SoftMatcha 2 flagged 36 additional contaminated benchmark samples (1.4%) beyond exact matches, with 81% manual precision: 18 semantic contaminations, 11 template leakages, and 7 false positives. Semantically relaxed queries identified cases (e.g., deletions, synonyms, template instantiations) invisible to exact match.
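Such screening can be scripted as a loop over benchmark items. The `soft_search` callable below is a hypothetical stand-in for whatever interface the released code exposes, not its actual API.

```python
def flag_contamination(benchmark_items, soft_search, alpha=0.8, top_k=20):
    """Hypothetical screening loop: flag benchmark items whose text has a
    near-duplicate in the pretraining corpus under soft matching."""
    flagged = []
    for item in benchmark_items:
        hits = soft_search(item["text"], threshold=alpha, k=top_k)
        if hits:  # any soft hit (substitution, insertion, or deletion variant)
            flagged.append({"id": item["id"], "matches": hits})
    return flagged
```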
5. Practical Deployment and Applications
- Maximum Query Length: queries beyond a fixed cap are truncated or rejected.
- Block Size: chosen to match the SSD page size.
- k-gram Tables: the top 100 K unigrams and trigrams with sum-of-rank ≤10 K are cached in memory.
- last-bits Threshold: 50 corpus occurrences.
- Subword Support: Subword tokenizers (e.g., LLaMA-2) can be used, with custom handling for rare or case-sensitive splits; median latency remains at the millisecond level on a 10 B-token English corpus. (These settings are summarized in the configuration sketch below.)
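For reference, the parameters above can be collected into a single configuration sketch; values not specified in the text are left as placeholders rather than guessed.

```python
# Illustrative deployment configuration, not the released defaults.
DEPLOYMENT_CONFIG = {
    "max_query_length": None,        # unspecified here; longer queries are truncated or rejected
    "block_size_entries": None,      # chosen to match the SSD page size
    "unigram_cache_top_k": 100_000,  # top 100 K unigrams cached in memory
    "trigram_rank_sum_max": 10_000,  # trigrams with sum-of-rank <= 10 K cached in memory
    "last_bits_threshold": 50,       # direct sweep for prefixes with <= 50 occurrences
}
```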
A web-based demo provides an interactive search interface over prebuilt indices in seven languages (https://softmatcha.github.io/v2/), and the code is released as open source (https://github.com/softmatcha/softmatcha2).
SoftMatcha 2's detection of benchmark contamination extends beyond exact duplication, capturing semantic and templatic leakages otherwise undetectable.
6. Limitations and Prospects
Current limitations include:
- Semantic variation is restricted to single-word substitutions; multiword paraphrases (e.g., "U.S." ↔ "United States") are not detected.
- Embedding-based similarity may miss rare or low-frequency synonyms.
- The memory and disk footprint scales linearly with corpus size.
Potential extensions involve exploring compositional token embeddings, supporting wildcard and multiword pattern operators, tightening theoretical bounds for pruning, and automated multiword pattern generation (Yoneda et al., 11 Feb 2026).