
SoftMatcha 2: Ultra-Fast Soft Pattern Matching

Updated 16 February 2026
  • SoftMatcha 2 is an ultra-fast algorithm enabling soft pattern matching in trillion-scale corpora with sub-millisecond exact lookup and sub-0.3s soft search.
  • It employs a two-level, disk-aware suffix array and dynamic, corpus-aware pruning to efficiently generate semantic variants with minimal lookup overhead.
  • Empirical evaluations demonstrate up to 475× performance improvement over previous methods, supporting large-scale multilingual text analyses.

SoftMatcha 2 is an ultra-fast, semantically relaxed pattern matching algorithm designed for interactive search over trillion-scale natural language corpora. It achieves sub-millisecond exact lookup and sub-0.3s soft search, supporting semantic variations such as substitutions, insertions, and deletions. The algorithm addresses the combinatorial challenges posed by semantic relaxation through disk-aware indexing and dynamic, corpus-aware pruning, enabling deployment across multi-terabyte text datasets and multilingual scenarios (Yoneda et al., 11 Feb 2026).

1. Problem Formulation and Objectives

SoftMatcha 2 addresses the "soft" pattern-matching problem on massive tokenized corpora, denoted as $C$ with $n$ tokens. For a query sequence $q = (q_1, \ldots, q_m)$ and a specified hit count $K$, the goal is to return up to $K$ patterns $p = (p_1, \ldots, p_\ell)$ occurring in $C$ such that the "soft" similarity satisfies $\operatorname{Sim}(q,p) \geq \alpha$ for a threshold $\alpha$ that may be tuned to yield $K$ hits.

"Softness" refers to allowing:

  • Substitutions: Each pair $q_i \leftrightarrow p_j$ is scored via the cosine similarity of their token embeddings and aggregated with a smooth-min operator with temperature $\tau$:

$$\operatorname{Sim}(q,p) = 1 - \log_\tau\left(\sum_{i=1}^m \left(\tau^{1-c_i} - 1\right) + 1\right)$$

where $c_i = \cos(\mathrm{embed}(q_i), \mathrm{embed}(p_i))$.

  • Insertions and deletions: Extra (inserted or deleted) tokens incur a penalty factor $\exp(-v/\lambda)$, where $v$ is the whitened norm of the token embedding and $\lambda \approx m\,v_{50}/\log e$ is calibrated so that moderate edits cost roughly $1/e$ in similarity.
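
A minimal sketch of how the substitution score and the insertion/deletion penalty are computed, assuming precomputed unit-norm token embeddings; the function names, the NumPy implementation, and the temperature value are illustrative, not the paper's code:

```python
import numpy as np

def soft_similarity(q_vecs, p_vecs, tau=0.1):
    """Substitution-only soft similarity, following the formula above.

    q_vecs, p_vecs: (m, d) arrays of unit-norm token embeddings for the query
    and an aligned candidate pattern; tau is the smooth-min temperature
    (0.1 is an illustrative value, not taken from the source).
    """
    # c_i = cos(embed(q_i), embed(p_i)); unit-norm vectors reduce this to a dot product.
    c = np.sum(q_vecs * p_vecs, axis=1)
    # Sim(q, p) = 1 - log_tau( sum_i (tau^(1 - c_i) - 1) + 1 )
    inner = np.sum(tau ** (1.0 - c) - 1.0) + 1.0
    # Guard against a numerically non-positive argument for very dissimilar tokens
    # (sketch only; the boundary behaviour is not specified here).
    inner = max(inner, 1e-12)
    return 1.0 - np.log(inner) / np.log(tau)

def indel_penalty(whitened_norm, lam):
    """Multiplicative penalty exp(-v / lambda) for one inserted or deleted token,
    where v is the whitened norm of the token's embedding and lam is the
    calibration constant (roughly m * v_50 / log e per the text)."""
    return np.exp(-whitened_norm / lam)
```

Identical sequences score exactly 1, since each term $\tau^{1-c_i} - 1$ vanishes when $c_i = 1$; any imperfect substitution lowers the score, and each insertion or deletion multiplies in an additional penalty factor.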

Performance targets include:

  • Exact lookup: < 0.4 ms (p95) on a 1.4T-token corpus
  • Soft search: top-20 hits in < 0.3 s (p95) at trillion scale

2. Methodology: Indexing and Pruning

2.1 Disk-Aware Suffix Arrays

The corpus $C$ is indexed as a lexicographically sorted array $X[0..n-L]$, where each entry is a slice $C[i..i+L-1]$ with $L \geq 12$. Direct binary search on $X$ is impractical because of the large number of random disk reads incurred at SSD latency. To address this, SoftMatcha 2 employs a two-level index:

  • Upper level (in RAM): Sparse array $Y[j] = X[jB]$ for block size $B \approx 128$–$256$
  • Lower level (on disk): Complete array $X$ with run-length encoding compression

Lookup for a sequence $s$ is performed by a binary search in $Y$ to identify the target block, followed by exactly one random SSD page read of that block in $X$. The cost is $O(L \log n)$ time with a single random disk access; run-length compression reduces the index from roughly 60 TB to 21 TB for a 1.4T-token corpus.
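
A minimal sketch of the two-level lookup under stated assumptions: the sparse upper level is an in-RAM Python list, and `read_block` is a stand-in for one random SSD page read of the dense on-disk array; the actual on-disk layout and compression are not modeled here.

```python
import bisect

B = 128  # entries per on-disk block (the block size B from the text)

def two_level_lookup(query, upper, read_block):
    """Exact lookup with the two-level suffix-array index.

    upper:      in-RAM sparse array with upper[j] == X[j * B], i.e. the first
                key of every on-disk block, kept in sorted order.
    read_block: callable j -> list of (key, corpus_position) pairs for block j,
                standing in for exactly one random SSD page read of X.
    Returns corpus positions whose indexed L-gram starts with `query`.
    """
    # 1. Binary search the in-RAM level for the block that could contain `query`.
    j = max(bisect.bisect_right(upper, query) - 1, 0)
    # 2. One random disk read fetches that block of the dense array X.
    block = read_block(j)
    # 3. Scan within the block for entries sharing the query prefix.
    #    (A production index would also consult the following block when a run
    #    of matches straddles a block boundary.)
    return [pos for key, pos in block if key.startswith(query)]
```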

2.2 Dynamic, Corpus-Aware Pruning

Exhaustive enumeration of all edit-distance variants for "soft" matches would incur exponential complexity $O(|V|^k)$. SoftMatcha 2 instead interleaves pruning with enumeration:

  • At each query prefix length $i$, only continuations $w'$ of a partial match $w$ that keep the similarity $\operatorname{Sim}(q_{1..i}, w \cdot w') \geq \alpha$ are generated.
  • Iterative filtering: Each candidate extension is immediately checked against the corpus, eliminating dead branches before further expansion.
  • k-gram caching: Existence of frequent bi/trigrams is precomputed for in-memory lookup, avoiding unnecessary disk operations.
  • Last-bits pruning: For rare prefixes (≤50 corpus occurrences), a direct sweep is performed rather than extending semantically and suffix-searching each possible continuation.

This reduces the mean number of exact lookups per query token from roughly 5 to roughly 2.3.
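
The sketch below illustrates the interleaved enumerate-and-filter loop for substitutions only, under stated assumptions: `neighbors` yields embedding neighbours of a query token with their cosine similarities, `occurs_in_corpus` stands in for an exact index lookup or the in-memory k-gram cache, and the temperature value is illustrative. Insertions, deletions, and the last-bits sweep are omitted; only the pruning structure mirrors the description above.

```python
import numpy as np

def smooth_min_similarity(cosines, tau=0.1):
    """Aggregate per-token cosines with the smooth-min formula from Section 1
    (tau is illustrative); appending a token can only lower the score."""
    inner = np.sum(tau ** (1.0 - np.asarray(cosines)) - 1.0) + 1.0
    return 1.0 - np.log(max(inner, 1e-12)) / np.log(tau)

def soft_candidates(query, alpha, neighbors, occurs_in_corpus):
    """Enumerate corpus-attested soft variants of `query`, pruning dead branches.

    query:            list of query tokens q_1..q_m
    neighbors:        query token -> iterable of (candidate_token, cosine) pairs
    occurs_in_corpus: tuple of tokens -> bool, an exact prefix lookup
    """
    partials = [((), ())]  # (candidate tokens so far, their cosines so far)
    for q_tok in query:
        extended = []
        for toks, cosines in partials:
            for cand, cos_sim in neighbors(q_tok):
                new_cos = cosines + (cos_sim,)
                # Similarity pruning: stop extending once the prefix similarity
                # falls below alpha (it can only decrease as tokens are added).
                if smooth_min_similarity(new_cos) < alpha:
                    continue
                cand_toks = toks + (cand,)
                # Iterative corpus filtering: one exact lookup per extension;
                # branches with no corpus occurrence are never expanded further.
                if occurs_in_corpus(cand_toks):
                    extended.append((cand_toks, new_cos))
        partials = extended
        if not partials:
            return []  # every branch died; no soft match at this threshold
    return [toks for toks, _ in partials]
```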

3. Theoretical Guarantees

Let $S_i$ be the set of all length-$i$ semantic variants with sufficient similarity, and $L_i$ the subset that actually occurs in $C$. Define $\Sigma_i$ as the set of corpus $i$-grams.

  • Hypothesis 1: $\frac{|L_i|}{|S_i|} \approx \frac{|\Sigma_i|}{|V|^i}$
  • Hypothesis 2: There exists $r < 1$ such that $\frac{|S_i|}{|V|^i} = O(r^i)$

Under these, two core results are established:

  • No exponential explosion (Theorem 3.1):

$$\mathbb{E}[\text{Total lookups}] = O(1) \ \text{in}\ m$$

The number of lookups does not grow with query length $m$ for fixed $\alpha$ and $r$.

  • Sub-linear scaling in corpus size (Theorem 3.2):

If the $i$-gram frequency distribution in $C$ is Zipfian with exponent $s > 1$,

$$\mathbb{E}[\text{Total lookups}] = O(|C|^{1/s})$$

This demonstrates that pruning exploits natural language's heavy-tailed distributions to avoid the naive $O(|V|^m)$ blowup.
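
As a numeric illustration (the Zipf exponent is assumed for the example, not reported in the source): with $s = 2$, the expected number of lookups grows only with the square root of corpus size,

$$\mathbb{E}[\text{Total lookups}] \propto |C|^{1/2}, \qquad \frac{(14\,|C_0|)^{1/2}}{|C_0|^{1/2}} = \sqrt{14} \approx 3.7,$$

so growing a corpus from 100 B to 1.4 T tokens (a $14\times$ increase) would raise the expected lookup count by only about $3.7\times$, rather than $14\times$ as for a linear scan.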

4. Empirical Evaluation

4.1 Experimental Setup

Experiments are conducted on AWS i4i.32xlarge instances (128 vCPUs, 1 TB RAM, 30 TB NVMe SSD) over:

  • FineWeb-Edu: 1.375 T tokens (EN)
  • C4-ja: 169 B (JA)
  • C4-zh: 38.3 B (ZH)
  • Additional languages: DE, FR, IT, RU (~1 B tokens each)

Tokenizations employ Moses+GloVe-300k (EN) and ICU+fastText-524k (others).

4.2 Performance Benchmarks

Exact Lookup Latency (p95):

| Corpus Size | infini-gram | SoftMatcha 2 |
|-------------|-------------|--------------|
| 273 B       | 151.8 ms    | 0.32 ms      |
| 1.4 T       | 11.05 ms    | 0.34 ms      |

SoftMatcha 2 outperforms infini-gram by $33\times$–$475\times$.

Soft Search Latency (p95):

| Corpus    | SoftMatcha 1 | SoftMatcha 2 |
|-----------|--------------|--------------|
| 50 B EN   | 0.15 s       | 0.09 s       |
| 500 B EN  | 0.94 s       | 0.16 s       |
| 1.4 T EN  | – (timeout)  | 0.28 s       |

On Japanese and Chinese, SoftMatcha 2 achieves p95 latency $\leq 0.40$ s; SoftMatcha 1 fails above 50 B tokens.

Index Build Time and Disk Usage

| Index Method | Build Time | Index Size | Text Size |
|--------------|------------|------------|-----------|
| SoftMatcha 2 | 53.8 h     | 21.6 TB    | 6.7 TB    |
| infini-gram  | 61.2 h     | 9.9 TB     | –         |

SoftMatcha 2 handles multilingual corpora at scales where SoftMatcha 1 and infini-gram mini time out (above 50 B tokens).

Pruning Effectiveness

| Pruning Strategy | Lookups per Token ($\times m$) |
|------------------|--------------------------------|
| No pruning       | 5.17                           |
| Iterative only   | ~3.2                           |
| + k-gram         | ~2.7                           |
| + last-bits      | 2.27                           |

Benchmark Contamination Detection

SoftMatcha 2 flagged 36 additional contaminated benchmark samples (1.4%) beyond exact matches, with 81% manual precision: 18 semantic contaminations, 11 template leakages, and 7 false positives. Semantically relaxed queries identified cases (e.g., deletions, synonyms, template instantiations) invisible to exact match.
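
A hypothetical sketch of how such a contamination sweep might be driven, assuming a `soft_search` function (its name and signature are placeholders, not the released API) that returns corpus hits above a similarity threshold:

```python
def flag_contamination(benchmark_samples, soft_search, alpha=0.9, k=20):
    """Flag benchmark samples whose token sequences have soft matches in the
    pretraining corpus, catching deletions, synonyms, and template instantiations
    that exact matching misses. `soft_search` is a hypothetical stand-in."""
    flagged = []
    for sample_id, tokens in benchmark_samples:
        hits = soft_search(tokens, alpha=alpha, k=k)
        if hits:  # any sufficiently similar corpus occurrence counts as suspect
            flagged.append((sample_id, hits))
    return flagged  # candidates for manual precision review
```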

5. Practical Deployment and Applications

  • Maximum Query Length: $L = 12$; longer queries are truncated or rejected.
  • Block Size: $B = 128$, corresponding to the SSD page size.
  • k-gram Tables: The top 100 K unigrams and trigrams with sum-of-rank ≤ 10 K are cached in memory.
  • last-bits Threshold: 50 corpus occurrences.
  • Subword Support: Subword tokenizers (e.g., LLaMA-2) can be used, with custom handling for rare or case-sensitive splits; median latency is $\approx 58$ ms on a 10B-token English corpus.
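
For illustration, these parameters can be grouped into a single configuration object; the field names below are hypothetical and only the values follow the text.

```python
from dataclasses import dataclass

@dataclass
class SoftMatcha2Config:
    # Hypothetical grouping of the deployment parameters listed above.
    max_query_length: int = 12            # L: longer queries truncated or rejected
    block_size: int = 128                 # B: one SSD page per on-disk block
    unigram_cache_size: int = 100_000     # top unigrams kept in the k-gram cache
    trigram_rank_sum_limit: int = 10_000  # trigrams with sum-of-rank <= 10 K cached
    last_bits_threshold: int = 50         # direct sweep below this prefix frequency
```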

A web-based demo supports seven languages via prebuilt indices and provides an interactive search interface (https://softmatcha.github.io/v2/); the code is open source (https://github.com/softmatcha/softmatcha2).

SoftMatcha 2's detection of benchmark contamination extends beyond exact duplication, capturing semantic and templatic leakages otherwise undetectable.

6. Limitations and Prospects

Current limitations include:

  • Semantic variation is restricted to single-word substitutions; multiword paraphrases (e.g., "U.S." $\leftrightarrow$ "United States") are not detected.
  • Embedding-based similarity may miss rare or low-frequency synonyms.
  • The memory and disk footprint scales linearly with corpus size.

Potential extensions involve exploring compositional token embeddings, supporting wildcard and multiword pattern operators, tightening theoretical bounds for pruning, and automated multiword pattern generation (Yoneda et al., 11 Feb 2026).
