
SoftMatcha 2: Ultra-Fast Soft Pattern Matching

Updated 16 February 2026
  • SoftMatcha 2 is an ultra-fast algorithm enabling soft pattern matching in trillion-scale corpora with sub-millisecond exact lookup and sub-0.3s soft search.
  • It employs a two-level, disk-aware suffix array and dynamic, corpus-aware pruning to efficiently generate semantic variants with minimal lookup overhead.
  • Empirical evaluations demonstrate up to 475× performance improvement over previous methods, supporting large-scale multilingual text analyses.

SoftMatcha 2 is an ultra-fast, semantically relaxed pattern matching algorithm designed for interactive search over trillion-scale natural language corpora. It achieves sub-millisecond exact lookup and sub-0.3s soft search, supporting semantic variations such as substitutions, insertions, and deletions. The algorithm addresses the combinatorial challenges posed by semantic relaxation through disk-aware indexing and dynamic, corpus-aware pruning, enabling deployment across multi-terabyte text datasets and multilingual scenarios (Yoneda et al., 11 Feb 2026).

1. Problem Formulation and Objectives

SoftMatcha 2 addresses the "soft" pattern-matching problem on massive tokenized corpora, denoted as $C$ with $n = |C|$ tokens. For a query sequence $q = (q_1, \ldots, q_m)$ and a specified hit count $K$, the goal is to return up to $K$ patterns $p = (p_1, \ldots, p_\ell)$ occurring in $C$ such that the "soft" similarity $\operatorname{Sim}(q,p) \geq \alpha$ for a threshold $\alpha$ that may be tuned to fit $K$.

"Softness" refers to allowing:

  • Substitutions: Each aligned query/pattern token pair $(q_i, p_i)$ is scored via the cosine similarity of their token embeddings; the per-position scores $s_i$ are aggregated using a smooth-min operator with temperature $\tau$:

$$\operatorname{Sim}(q, p) = -\tau \log \frac{1}{m} \sum_{i=1}^{m} \exp(-s_i / \tau)$$

where $s_i = \cos(\mathbf{e}(q_i), \mathbf{e}(p_i))$ and $\mathbf{e}(\cdot)$ denotes the token embedding.

  • Insertions and deletions: Extra (inserted or deleted) tokens incur a penalty factor that scales with the whitened norm of the token's embedding, calibrated so that moderate edits cost only a modest amount of similarity.
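The substitution scoring can be sketched as follows. This is a minimal illustration under stated assumptions: it uses a smooth-min of per-position cosine similarities with temperature `tau` for substitutions only, and does not reproduce the paper's exact constants or its insertion/deletion penalty.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def smooth_min(scores, tau=0.1):
    """Soft minimum: approaches min(scores) as tau -> 0 and the
    mean of the scores as tau grows large."""
    m = len(scores)
    return -tau * math.log(sum(math.exp(-s / tau) for s in scores) / m)

def soft_similarity(query_embs, pattern_embs, tau=0.1):
    """Score an aligned (substitution-only) query/pattern pair by
    aggregating per-position cosine similarities with a smooth-min."""
    scores = [cosine(q, p) for q, p in zip(query_embs, pattern_embs)]
    return smooth_min(scores, tau)
```

Because the smooth-min is dominated by the worst-matching position, a single dissimilar substitution drags the whole score down, which is what allows a global threshold $\alpha$ to prune variants early.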

Performance targets include:

  • Exact lookup: ≤ 0.4 ms (p95) on a 1.4T-token corpus
  • Soft search: Top-20 hits in ≤ 0.3 s (p95) at trillion scale

2. Methodology: Indexing and Pruning

2.1 Disk-Aware Suffix Arrays

The corpus $C$ is indexed as a lexicographically sorted suffix array $SA$, where each entry points to a suffix $C[i{:}]$ of the token stream. Direct binary search on $SA$ is impractical due to the large number of random disk reads at SSD latency. To address this, SoftMatcha 2 employs a two-level index:

  • Upper level (in RAM): A sparse array sampling every $b$-th entry of $SA$, with the block size chosen to match the SSD page size
  • Lower level (on disk): The complete array $SA$ with run-length encoding compression

Lookup for a query $q$ is performed by a binary search in the in-RAM sparse array to identify the target block, followed by exactly one random SSD page read for that block of the on-disk array. The cost is $O(\log n)$ time with a single random disk access; compression shrinks the index, e.g., from 60 TB to 21 TB for a 1.4T-token corpus.
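The two-level lookup can be sketched in miniature. This toy version keeps everything in memory, simulates the on-disk level as a plain Python list, and makes simplifying assumptions (a fixed sampling stride, candidates gathered from the adjacent blocks only), so it is illustrative rather than a faithful implementation.

```python
import bisect

def build_suffix_array(tokens):
    """Full suffix array: start positions sorted by the suffix they begin."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def build_sparse_index(tokens, sa, block=4):
    """Upper level kept in RAM: a fixed-width token prefix of every
    `block`-th suffix (a real index stores these compactly)."""
    return [tokens[sa[j]:sa[j] + 8] for j in range(0, len(sa), block)]

def lookup(tokens, sa, sparse, query, block=4):
    """Two-level lookup: binary-search the sparse array to find the
    target block, then scan only that neighborhood of the full
    (conceptually on-disk) array -- one "page read" worth of entries."""
    q = list(query)
    j = bisect.bisect_left(sparse, q)          # upper level, in RAM
    lo = max(0, (j - 1) * block)
    hi = min(len(sa), (j + 1) * block)
    return [sa[k] for k in range(lo, hi)       # lower level, single block
            if tokens[sa[k]:sa[k] + len(q)] == q]

# Tiny demo corpus.
tokens = "the cat sat on the mat".split()
sa = build_suffix_array(tokens)
sparse = build_sparse_index(tokens, sa)
```

The sparse array shrinks RAM usage by the sampling factor while guaranteeing that the exact answer lies within one known block of the full array.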

2.2 Dynamic, Corpus-Aware Pruning

Exhaustive enumeration of all edit-distance variants for "soft" matches would incur complexity exponential in the query length $m$. SoftMatcha 2 instead interleaves pruning with enumeration:

  • Similarity pruning: At each query prefix length $i$, only continuations whose partial similarity still clears the threshold $\alpha$ are generated.
  • Iterative filtering: Immediate corpus filtering eliminates dead candidate branches.
  • k-gram caching: Existence of frequent bi/trigrams is precomputed for in-memory lookup, avoiding unnecessary disk operations.
  • Last-bits pruning: For rare prefixes (≤50 corpus occurrences), a direct sweep over the few occurrences is performed rather than semantically extending the prefix and suffix-searching each possible continuation.

Together, these strategies bring the mean number of exact lookups down from 5.17 to 2.27 per query token.
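A toy rendition of the interleaved similarity-plus-corpus pruning; `neighbors` and `occurs_in_corpus` are hypothetical stand-ins (a real deployment consults embeddings and the suffix-array index), and a hard `min` replaces the smooth-min for brevity.

```python
def soft_expand(query, neighbors, occurs_in_corpus, alpha=0.8):
    """Enumerate semantic variants of `query` one token at a time,
    pruning any partial variant that falls below the similarity
    floor or no longer occurs in the corpus."""
    candidates = [((), 1.0)]          # (partial pattern, running similarity)
    for token in query:
        nxt = []
        for pattern, sim in candidates:
            for alt, score in neighbors(token):   # substitution candidates
                new_sim = min(sim, score)         # hard-min stand-in for smooth-min
                if new_sim < alpha:
                    continue                      # similarity pruning
                extended = pattern + (alt,)
                if not occurs_in_corpus(extended):
                    continue                      # iterative corpus filtering
                nxt.append((extended, new_sim))
        candidates = nxt
    return candidates

# Tiny demo: the corpus's n-gram prefixes, precomputed as a set.
corpus_prefixes = {("a",), ("A",), ("a", "b"), ("A", "b")}
hits = soft_expand(
    ["a", "b"],
    neighbors=lambda t: [(t, 1.0), (t.upper(), 0.85)],
    occurs_in_corpus=lambda p: p in corpus_prefixes,
)
```

Because dead branches are discarded as soon as either check fails, the frontier stays proportional to what the corpus actually contains rather than to all combinatorial variants.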

3. Theoretical Guarantees

Let $\mathcal{V}_i$ be the set of all length-$i$ semantic variants of the query prefix with sufficient similarity, and $\mathcal{S}_i \subseteq \mathcal{V}_i$ the subset that actually occurs in $C$. Define $\mathcal{G}_i$ as the set of corpus $i$-grams, so that $\mathcal{S}_i = \mathcal{V}_i \cap \mathcal{G}_i$.

  • Hypothesis 1: The number of corpus-occurring variants stays bounded, i.e., $|\mathcal{S}_i| = O(1)$ as $i$ grows.
  • Hypothesis 2: There exists a constant $c$ such that each variant in $\mathcal{S}_i$ admits at most $c$ corpus-occurring continuations in $\mathcal{S}_{i+1}$.

Under these, two core results are established:

  • No exponential explosion (Theorem 3.1):

$$\#\mathrm{lookups\ per\ token} = O(1)$$

The number of lookups per query token does not grow with the query length $m$ for a fixed threshold $\alpha$ and corpus.

  • Sub-linear scaling in corpus size (Theorem 3.2):

If the $k$-gram frequency distribution in $C$ is Zipfian with exponent $s > 1$, then

$$\#\mathrm{lookups} = O\!\left(n^{1/s}\right)$$

This demonstrates that pruning exploits natural language's heavy-tailed distributions to avoid the naive exponential blowup.
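The heavy-tail argument can be visualized with a small simulation; the parameters here (vocabulary 100 K, Zipf exponent s = 2.5) are illustrative assumptions, not the paper's. The sketch checks that the number of distinct bigrams, which bounds the candidates soft search must probe, grows far more slowly than the corpus itself.

```python
import random

def zipf_sample(n, vocab=100_000, s=2.5, seed=0):
    """Draw n token ids with Zipfian rank frequencies p(r) ~ r**(-s)."""
    rng = random.Random(seed)
    weights = [r ** -s for r in range(1, vocab + 1)]
    return rng.choices(range(vocab), weights=weights, k=n)

def distinct_bigrams(tokens):
    """Distinct 2-grams actually occurring: this set, not the corpus
    size, bounds the variant lookups that survive corpus filtering."""
    return len({tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)})

small = distinct_bigrams(zipf_sample(10_000))
large = distinct_bigrams(zipf_sample(40_000))
# Quadrupling the corpus multiplies distinct bigrams by well under 4.
```

The same concentration effect is what lets the k-gram cache and iterative filtering keep the candidate frontier small on real text.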

4. Empirical Evaluation

4.1 Experimental Setup

Experiments are conducted on AWS i4i.32xlarge instances (128 vCPUs, 1 TB RAM, 30 TB NVMe SSD) over:

  • FineWeb-Edu: 1.375 T tokens (EN)
  • C4-ja: 169 B (JA)
  • C4-zh: 38.3 B (ZH)
  • Additional languages: DE, FR, IT, RU (~1 B tokens each)

Tokenizations employ Moses+GloVe-300k (EN) and ICU+fastText-524k (others).

4.2 Performance Benchmarks

Exact Lookup Latency (p95):

  Corpus Size   infini-gram   SoftMatcha 2
  273 B         151.8 ms      0.32 ms
  1.4 T         11.05 ms      0.34 ms

SoftMatcha 2 outperforms infini-gram by roughly 33×–475×.

Soft Search Latency (p95):

  Corpus      SoftMatcha 1   SoftMatcha 2
  50 B EN     0.15 s         0.09 s
  500 B EN    0.94 s         0.16 s
  1.4 T EN    – (timeout)    0.28 s

On Japanese and Chinese corpora, SoftMatcha 2 maintains comparably low soft-search latency; SoftMatcha 1 fails above 50 B tokens.

Index Build Time and Disk Usage:

  Index Method   Build Time   Index Size   Text Size
  SoftMatcha 2   53.8 h       21.6 TB      6.7 TB
  infini-gram    61.2 h       9.9 TB       –

SoftMatcha 2 handles multilingually indexed corpora where SoftMatcha 1 and infini-gram mini time out above 50 B tokens.

Pruning Effectiveness:

  Pruning Strategy   Lookups per Token (mean)
  No pruning         5.17
  Iterative only     ~3.2
  + k-gram           ~2.7
  + last-bits        2.27

Benchmark Contamination Detection

SoftMatcha 2 flagged 36 additional candidate contaminated benchmark samples (1.4%) beyond exact matches, with 81% manual precision (29/36): 18 semantic contaminations, 11 template leakages, and 7 false positives. Semantically relaxed queries identified cases (e.g., deletions, synonyms, template instantiations) invisible to exact match.
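The reported precision follows directly from the manual labels; a trivial sketch of the bookkeeping (the label names are illustrative, not the paper's):

```python
from collections import Counter

def contamination_precision(labels):
    """Precision of soft-match contamination flags given manual labels.
    A flag counts as a true hit unless labeled a false positive."""
    counts = Counter(labels)
    true_hits = sum(n for label, n in counts.items() if label != "false_positive")
    return true_hits / sum(counts.values())

# The paper's manual audit: 18 semantic + 11 template + 7 false positives.
labels = ["semantic"] * 18 + ["template"] * 11 + ["false_positive"] * 7
```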

5. Practical Deployment and Applications

  • Maximum query length: capped; longer queries are truncated or rejected.
  • Block size: matched to the SSD page size.
  • k-gram tables: The top 100 K unigrams and the trigrams with sum-of-ranks ≤ 10 K are cached in memory.
  • Last-bits threshold: 50 corpus occurrences.
  • Subword support: Subword tokenizers (e.g., LLaMA-2's) can be used, with custom handling for rare or case-sensitive splits; median latency stays in the low-millisecond range on a 10 B-token English corpus.

A web-based demo supports seven languages via prebuilt indices and provides an interactive search interface (https://softmatcha.github.io/v2/) and open-source code (https://github.com/softmatcha/softmatcha2).

SoftMatcha 2's detection of benchmark contamination extends beyond exact duplication, capturing semantic and templatic leakages otherwise undetectable.

6. Limitations and Prospects

Current limitations include:

  • Semantic variation is restricted to single-word substitutions; multiword paraphrases (e.g., "U.S." ↔ "United States") are not detected.
  • Embedding-based similarity may miss rare or low-frequency synonyms.
  • The memory and disk footprint scales linearly with corpus size.

Potential extensions involve exploring compositional token embeddings, supporting wildcard and multiword pattern operators, tightening theoretical bounds for pruning, and automated multiword pattern generation (Yoneda et al., 11 Feb 2026).
