SoftMatcha 2: Ultra-Fast Soft Pattern Matching
- SoftMatcha 2 is an ultra-fast algorithm enabling soft pattern matching in trillion-scale corpora with sub-millisecond exact lookup and sub-0.3s soft search.
- It employs a two-level, disk-aware suffix array and dynamic, corpus-aware pruning to efficiently generate semantic variants with minimal lookup overhead.
- Empirical evaluations demonstrate up to 475× performance improvement over previous methods, supporting large-scale multilingual text analyses.
SoftMatcha 2 is an ultra-fast, semantically relaxed pattern matching algorithm designed for interactive search over trillion-scale natural language corpora. It achieves sub-millisecond exact lookup and sub-0.3s soft search, supporting semantic variations such as substitutions, insertions, and deletions. The algorithm addresses the combinatorial challenges posed by semantic relaxation through disk-aware indexing and dynamic, corpus-aware pruning, enabling deployment across multi-terabyte text datasets and multilingual scenarios (Yoneda et al., 11 Feb 2026).
1. Problem Formulation and Objectives
SoftMatcha 2 addresses the "soft" pattern-matching problem over a massive tokenized corpus $\mathcal{C}$ containing $N$ tokens. For a query sequence $q = q_1 \cdots q_m$ and a specified hit count $k$, the goal is to return up to $k$ patterns occurring in $\mathcal{C}$ whose "soft" similarity to $q$ exceeds a tunable threshold $\alpha$.
"Softness" refers to allowing:
- Substitutions: Each aligned token pair $(q_i, t_i)$ is scored via the cosine similarity of its token embeddings, $s_i = \cos(\mathbf{e}_{q_i}, \mathbf{e}_{t_i})$; the per-token scores are aggregated with a smooth-min operator $\operatorname{softmin}_\tau(s_1, \dots, s_m)$ with temperature $\tau$, which approaches the hard minimum as $\tau \to 0$.
- Insertions and deletions: Extra (inserted or deleted) tokens incur a multiplicative penalty factor that decays with the whitened norm of the token's embedding and is calibrated so that a moderate edit costs roughly a factor of $1/e$ in similarity; a minimal scoring sketch follows below.
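The scoring rule can be summarized in a short sketch. This is a minimal illustration, not the released implementation: the softmin parametrization, the penalty weight `lam`, and the example embeddings are assumptions.

```python
import numpy as np

def smooth_min(scores, tau=0.1):
    """One common smooth-min form: -tau * log(mean(exp(-s / tau)))."""
    s = np.asarray(scores, dtype=float)
    return -tau * np.log(np.mean(np.exp(-s / tau)))

def soft_similarity(query_vecs, cand_vecs, extra_vecs=(), tau=0.1, lam=1.0):
    """Illustrative soft score: smooth-min over per-token cosine similarities,
    multiplied by an exponential penalty for each inserted/deleted token
    (extra_vecs holds the whitened embeddings of those extra tokens)."""
    cos = [
        float(q @ c / (np.linalg.norm(q) * np.linalg.norm(c)))
        for q, c in zip(query_vecs, cand_vecs)
    ]
    score = smooth_min(cos, tau)
    for v in extra_vecs:  # each extra token costs exp(-lam * ||v||)
        score *= np.exp(-lam * np.linalg.norm(v))
    return float(score)
```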
Performance targets include:
- Exact lookup: under 0.4 ms (p95) on a 1.4T-token corpus
- Soft search: top-20 hits in under 0.3 s (p95) at trillion scale
2. Methodology: Indexing and Pruning
2.1 Disk-Aware Suffix Arrays
The corpus is indexed by a lexicographically sorted suffix array $\mathrm{SA}$, where each entry $\mathrm{SA}[i]$ references the corpus suffix $\mathcal{C}[\mathrm{SA}[i]{:}]$. Direct binary search on $\mathrm{SA}$ is impractical due to the large number of random disk reads at SSD latency. To address this, SoftMatcha 2 employs a two-level index:
- Upper level (in RAM): a sparse array holding every $B$-th entry of $\mathrm{SA}$, for a block size $B$ of up to $256$
- Lower level (on disk): the complete array $\mathrm{SA}$, compressed with run-length encoding
Lookup for a query sequence proceeds by binary search over the in-RAM sparse array to identify the target block, followed by exactly one random SSD page read to fetch that block from the on-disk array. The cost is $O(\log N)$ comparisons with a single random disk access, and compression shrinks the index from roughly 60 TB to 21 TB for a 1.4T-token corpus.
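The two-level lookup can be sketched as follows; this is a simplified illustration that assumes fixed-width on-disk entries and ignores the run-length compression, with `sparse_keys` standing in for the in-RAM sample of every $B$-th suffix key.

```python
import bisect
import struct

ENTRY_BYTES = 8   # assumed width of one on-disk suffix-array entry
B = 256           # assumed number of entries per on-disk block (one SSD page)

def lookup_block(query_key, sparse_keys, sa_file):
    """Two-level lookup sketch: binary search in RAM, then one block read from disk.

    sparse_keys[j] is the sort key (e.g., leading tokens of the suffix) at
    position j * B of the full suffix array stored in sa_file.
    """
    # 1) In-RAM binary search over the sparse sample to locate the target block.
    j = max(bisect.bisect_right(sparse_keys, query_key) - 1, 0)

    # 2) Exactly one random SSD read: fetch the j-th block of the full array.
    sa_file.seek(j * B * ENTRY_BYTES)
    raw = sa_file.read(B * ENTRY_BYTES)
    offsets = struct.unpack(f"<{len(raw) // ENTRY_BYTES}q", raw)

    # 3) The final comparison against corpus suffixes at these offsets is
    #    finished in memory (elided here).
    return offsets
```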
2.2 Dynamic, Corpus-Aware Pruning
Exhaustive enumeration of all edit-distance variants for "soft" matches would incur a cost exponential in the query length. SoftMatcha 2 instead interleaves pruning with enumeration:
- At each query prefix length $i$, only continuations that keep the soft similarity above the threshold $\alpha$ are generated.
- Iterative filtering: Immediate corpus filtering eliminates dead candidate branches.
- k-gram caching: Existence of frequent bi/trigrams is precomputed for in-memory lookup, avoiding unnecessary disk operations.
- Last-bits pruning: For rare prefixes (≤50 corpus occurrences), a direct sweep is performed rather than extending semantically and suffix-searching each possible continuation.
Together, these strategies sharply reduce the mean number of exact lookups per query token relative to unpruned enumeration (see the Pruning Effectiveness table in Section 4).
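The interplay of enumeration and pruning can be illustrated with a prefix-extension loop. The helpers `neighbors`, `kgram_exists`, `count_occurrences`, and `sweep_rare` are hypothetical stand-ins for the paper's components (embedding-neighbor generation, the cached k-gram table, suffix-array counting, and the last-bits direct sweep), and the similarity update is simplified to a running minimum.

```python
def soft_search(query_tokens, alpha, neighbors, kgram_exists,
                count_occurrences, sweep_rare, last_bits_threshold=50):
    """Sketch of dynamic, corpus-aware pruning interleaved with enumeration."""
    candidates = [((), 1.0)]   # (variant prefix, similarity so far)
    hits = []
    for q in query_tokens:
        extended = []
        for prefix, sim in candidates:
            for tok, tok_sim in neighbors(q):       # semantic substitutes of q
                new_sim = min(sim, tok_sim)         # simplified similarity update
                if new_sim < alpha:
                    continue                        # similarity pruning
                seq = prefix + (tok,)
                if len(seq) <= 3 and not kgram_exists(seq):
                    continue                        # cached k-gram check, no disk I/O
                n = count_occurrences(seq)          # one suffix-array lookup
                if n == 0:
                    continue                        # iterative corpus filtering
                if n <= last_bits_threshold:
                    # Last-bits pruning: the prefix is rare, so scan its few
                    # occurrences directly instead of extending it further.
                    hits.extend(sweep_rare(seq, query_tokens, new_sim))
                else:
                    extended.append((seq, new_sim))
        candidates = extended
    hits.extend(candidates)
    return hits
```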
3. Theoretical Guarantees
Let $V_i$ denote the set of all length-$i$ semantic variants of the query with sufficient similarity, and $A_i \subseteq V_i$ the subset that actually occurs in $\mathcal{C}$. Define $G_i$ as the set of corpus $i$-grams.
- Hypothesis 1: bounds how many of the variants in $V_i$ appear among the corpus $i$-grams $G_i$.
- Hypothesis 2: asserts the existence of a constant bounding the surviving candidate sets $A_i$ uniformly over prefix lengths $i$.
Under these, two core results are established:
- No exponential explosion (Theorem 3.1): the number of exact lookups does not grow with the query length for fixed search parameters.
- Sub-linear scaling in corpus size (Theorem 3.2): if the $n$-gram frequencies in $\mathcal{C}$ follow a Zipfian distribution with exponent $s$, the expected number of lookups grows sub-linearly in the corpus size $N$.
This demonstrates that pruning exploits natural language's heavy-tailed distributions to avoid the naive blowup.
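As an informal illustration of this point (not a reproduction of the paper's analysis), the following simulation draws synthetic corpus $n$-grams from a Zipfian distribution and counts how many members of a fixed candidate set actually appear as the corpus grows; the vocabulary size, exponent, and candidate-set size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, s = 1_000_000, 1.1                  # assumed n-gram vocabulary size and Zipf exponent
probs = np.arange(1, V + 1, dtype=float) ** (-s)
probs /= probs.sum()

candidates = rng.choice(V, size=5_000, replace=False)   # a fixed set of semantic variants

for N in (10**5, 10**6, 10**7):
    corpus = rng.choice(V, size=N, p=probs)              # synthetic corpus n-gram draws
    surviving = np.intersect1d(candidates, np.unique(corpus))
    # The number of surviving candidates grows far slower than N: most variants
    # never occur, so corpus filtering keeps the search frontier small.
    print(f"corpus n-grams: {N:>9,}  surviving candidates: {len(surviving)}")
```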
4. Empirical Evaluation
4.1 Experimental Setup
Experiments are conducted on AWS i4i.32xlarge instances (128 vCPUs, 1 TB RAM, 30 TB NVMe SSD) over:
- FineWeb-Edu: 1.375 T tokens (EN)
- C4-ja: 169 B (JA)
- C4-zh: 38.3 B (ZH)
- Additional languages: DE, FR, IT, RU (~1 B tokens each)
Tokenization and embeddings use Moses + GloVe (300 K vocabulary) for English and ICU + fastText (524 K vocabulary) for the other languages.
4.2 Performance Benchmarks
Exact Lookup Latency (p95):
| Corpus Size | infini-gram | SoftMatcha 2 |
|---|---|---|
| 273 B | 151.8 ms | 0.32 ms |
| 1.4 T | 11.05 ms | 0.34 ms |
SoftMatcha 2 outperforms infini-gram by roughly 475× on the 273 B-token corpus and roughly 33× at 1.4 T tokens.
Soft Search Latency (p95):
| Corpus | SoftMatcha 1 | SoftMatcha 2 |
|---|---|---|
| 50 B EN | 0.15 s | 0.09 s |
| 500 B EN | 0.94 s | 0.16 s |
| 1.4 T EN | – (timeout) | 0.28 s |
On Japanese and Chinese corpora, SoftMatcha 2 likewise completes soft search at these scales; SoftMatcha 1 fails above 50 B tokens.
Index Build Time and Disk Usage
| Index Method | Build Time | Index Size | Text Size |
|---|---|---|---|
| SoftMatcha 2 | 53.8 h | 21.6 TB | 6.7 TB |
| infini-gram | 61.2 h | 9.9 TB | – |
SoftMatcha 2 handles multilingual corpora at scales where SoftMatcha 1 and infini-gram mini time out (above 50 B tokens).
Pruning Effectiveness
| Pruning Strategy | Lookups per Token |
|---|---|
| No Pruning | 5.17 |
| Iterative Only | ~3.2 |
| + k-gram | ~2.7 |
| + last-bits | 2.27 |
Benchmark Contamination Detection
SoftMatcha 2 flagged 36 additional contaminated benchmark samples (1.4%) beyond exact matches, with 81% manual precision: 18 semantic contaminations, 11 template leakages, and 7 false positives. Semantically relaxed queries identified cases (e.g., deletions, synonyms, template instantiations) invisible to exact match.
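Such screening can be scripted as a loop over benchmark items. The `soft_search` callable below is a hypothetical stand-in for whatever interface the released code exposes, not its actual API.

```python
def flag_contamination(benchmark_items, soft_search, alpha=0.8, top_k=20):
    """Hypothetical screening loop: flag benchmark items whose text has a
    near-duplicate in the pretraining corpus under soft matching."""
    flagged = []
    for item in benchmark_items:
        hits = soft_search(item["text"], threshold=alpha, k=top_k)
        if hits:  # any soft hit (substitution, insertion, or deletion variant)
            flagged.append({"id": item["id"], "matches": hits})
    return flagged
```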
5. Practical Deployment and Applications
- Maximum Query Length: queries beyond a fixed cap are truncated or rejected.
- Block Size: chosen to match the SSD page size.
- k-gram Tables: the top 100 K unigrams and trigrams with sum-of-rank ≤10 K are cached in memory.
- last-bits Threshold: 50 corpus occurrences.
- Subword Support: Subword tokenizers (e.g., LLaMA-2) can be used, with custom handling for rare or case-sensitive splits; median latency remains at the millisecond level on a 10 B-token English corpus. (These settings are summarized in the configuration sketch below.)
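For reference, the parameters above can be collected into a single configuration sketch; values not specified in the text are left as placeholders rather than guessed.

```python
# Illustrative deployment configuration, not the released defaults.
DEPLOYMENT_CONFIG = {
    "max_query_length": None,        # unspecified here; longer queries are truncated or rejected
    "block_size_entries": None,      # chosen to match the SSD page size
    "unigram_cache_top_k": 100_000,  # top 100 K unigrams cached in memory
    "trigram_rank_sum_max": 10_000,  # trigrams with sum-of-rank <= 10 K cached in memory
    "last_bits_threshold": 50,       # direct sweep for prefixes with <= 50 occurrences
}
```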
A web-based demo provides an interactive search interface over prebuilt indices in seven languages (https://softmatcha.github.io/v2/), and the code is released as open source (https://github.com/softmatcha/softmatcha2).
SoftMatcha 2's detection of benchmark contamination extends beyond exact duplication, capturing semantic and templatic leakages otherwise undetectable.
6. Limitations and Prospects
Current limitations include:
- Semantic variation is restricted to single-word substitutions; multiword paraphrases (e.g., "U.S." ↔ "United States") are not detected.
- Embedding-based similarity may miss rare or low-frequency synonyms.
- The memory and disk footprint scales linearly with corpus size.
Potential extensions involve exploring compositional token embeddings, supporting wildcard and multiword pattern operators, tightening theoretical bounds for pruning, and automated multiword pattern generation (Yoneda et al., 11 Feb 2026).