SoftMatcha: A Soft and Fast Pattern Matcher for Billion-Scale Corpus Searches (2503.03703v1)

Published 5 Mar 2025 in cs.CL

Abstract: Researchers and practitioners in natural language processing and computational linguistics frequently observe and analyze the real language usage in large-scale corpora. For that purpose, they often employ off-the-shelf pattern-matching tools, such as grep, and keyword-in-context concordancers, which are widely used in corpus linguistics for gathering examples. Nonetheless, these existing techniques rely on surface-level string matching, and thus they suffer from the major limitation of not being able to handle orthographic variations and paraphrasing -- notable and common phenomena in any natural language. In addition, existing continuous approaches such as dense vector search tend to be overly coarse, often retrieving texts that are unrelated but share similar topics. Given these challenges, we propose a novel algorithm that achieves "soft" (or semantic) yet efficient pattern matching by relaxing a surface-level matching with word embeddings. Our algorithm is highly scalable with respect to the size of the corpus text utilizing inverted indexes. We have prepared an efficient implementation, and we provide an accessible web tool. Our experiments demonstrate that the proposed method (i) can execute searches on billion-scale corpora in less than a second, which is comparable in speed to surface-level string matching and dense vector search; (ii) can extract harmful instances that semantically match queries from a large set of English and Japanese Wikipedia articles; and (iii) can be effectively applied to corpus-linguistic analyses of Latin, a language with highly diverse inflections.

Summary

  • The paper introduces SoftMatcha, an efficient algorithm for "soft" (semantic) pattern matching in billion-scale text corpora.
  • SoftMatcha leverages word embeddings for semantic similarity and inverted indexes for fast corpus search, blending the benefits of traditional and continuous methods.
  • Evaluations show that SoftMatcha searches billion-word corpora in under a second, finds patterns that exact string matching would miss, and supports diverse linguistic analyses.

This paper introduces a new algorithm called SoftMatcha for efficiently finding patterns in large collections of text, known as corpora. This is particularly useful in fields like natural language processing and corpus linguistics, where researchers analyze real language use.

Currently, tools like grep and KWIC (keyword-in-context) concordancers rely on exact string matching. This means they can't easily handle variations in spelling, word forms, or phrasing, which are common in natural language. Continuous approaches, like dense vector search, are often too broad, retrieving unrelated texts that happen to share similar topics.

SoftMatcha addresses these limitations by using word embeddings to achieve "soft" or semantic pattern matching. Word embeddings are numerical representations of words that capture their meaning. This allows the algorithm to find matches even when the exact words in the query don't appear in the text, but words with similar meanings do. The algorithm uses inverted indexes to quickly search through large amounts of text.

Here's how SoftMatcha works:

  1. Preprocessing: The corpus text is preprocessed to build an inverted index. An inverted index maps each unique word in the corpus to its positions within the text. For example, an index might record that the word "example" appears at positions 10, 50, and 120 in the corpus.
  2. Matching step 1 (softening the pattern): The search pattern (the phrase you are looking for) is "softened" using word embeddings. For each word in the pattern, the algorithm identifies other vocabulary words whose embeddings (i.e., meanings) are similar; the degree of similarity is controlled by a threshold. For example, if you are searching for "the jazz musician", the algorithm might identify "a" and "this" as similar to "the"; "blues" and "funk" as similar to "jazz"; and "singer" and "pianist" as similar to "musician".
  3. Matching step 2 (finding soft matches):
    • 2-1. Building the soft inverted index: Using the inverted index of the corpus, the algorithm retrieves the positions of all the words that are similar to each word in the search pattern. For "the jazz musician", it retrieves the positions of "the", "a", "this", "jazz", "blues", "funk", "musician", "singer", and "pianist".
    • 2-2. Finding the soft matches: The algorithm then looks for places in the text where words from the softened pattern appear in the correct sequence. For example, if "a" appears at position 7, "jazz" at position 8, and "pianist" at position 9, that span is a soft match for "the jazz musician". (A simplified code sketch of these steps follows this list.)
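To make these steps concrete, here is a minimal Python sketch of the whole procedure, assuming word embeddings are given as a dict of L2-normalized NumPy vectors (so a dot product equals cosine similarity). It is an illustration, not the authors' implementation, and the names (build_inverted_index, soften, soft_match, embeddings) are made up for this example.

```python
# Minimal sketch of SoftMatcha-style soft pattern matching (illustrative, not the paper's code).
# Assumes `embeddings` maps each vocabulary word to an L2-normalized NumPy vector,
# so a dot product between two vectors equals their cosine similarity.
from collections import defaultdict
import numpy as np

def build_inverted_index(tokens):
    """Map each unique word to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok].append(pos)
    return index

def soften(word, embeddings, threshold=0.6):
    """Return the vocabulary words whose cosine similarity to `word` meets the threshold."""
    if word not in embeddings:
        return {word}  # fall back to exact matching for out-of-vocabulary words
    e = embeddings[word]
    return {w for w, e2 in embeddings.items() if float(np.dot(e, e2)) >= threshold}

def soft_match(pattern, tokens, embeddings, threshold=0.6):
    """Return start positions where the softened pattern occurs as a contiguous sequence."""
    index = build_inverted_index(tokens)
    # Soft inverted index: for each pattern slot, the positions of every sufficiently similar word.
    slots = []
    for word in pattern:
        positions = set()
        for similar in soften(word, embeddings, threshold):
            positions.update(index.get(similar, []))
        slots.append(positions)
    # Position p starts a soft match if slot i is satisfied at position p + i for every i.
    return sorted(p for p in slots[0] if all(p + i in slots[i] for i in range(1, len(pattern))))
```

With a toy corpus such as tokens = "a blues pianist played".split() and embeddings in which "a" is close to "the", "blues" to "jazz", and "pianist" to "musician", soft_match(["the", "jazz", "musician"], tokens, embeddings) would return [0].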

The algorithm's efficiency comes from performing the "soft" comparison of word embeddings only on the vocabulary (the set of unique words in the corpus), which is typically much smaller than the entire corpus. The algorithm then uses the inverted index to quickly find the positions of these similar words in the corpus.
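As one way to see why the vocabulary-level comparison is cheap, here is a hedged NumPy sketch (the names are illustrative, not from the paper): if the embeddings are stacked into a row-normalized matrix E of shape (vocabulary size, embedding dimension), softening an entire pattern reduces to a single matrix product whose cost is independent of the corpus size.

```python
import numpy as np

def soften_pattern(pattern_ids, E, threshold=0.6):
    """For each pattern word id, return the vocabulary ids of sufficiently similar words.

    E: (vocab_size, dim) array with L2-normalized rows, so E[pattern_ids] @ E.T
    yields cosine similarities between each pattern word and every vocabulary word.
    """
    sims = E[np.asarray(pattern_ids)] @ E.T  # shape: (pattern_length, vocab_size)
    return [np.nonzero(row >= threshold)[0] for row in sims]
```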

The algorithm was tested on large English and Japanese text collections. The tests showed that SoftMatcha can search billions of words in less than a second, which is as fast as exact string matching and dense vector search methods. The algorithm was also effective at finding harmful content and analyzing linguistic patterns in Latin, a language with complex word forms.

Here are some of the key benefits of SoftMatcha:

  • It can find matches even when the exact words in the query don't appear in the text, which makes it more flexible than traditional pattern-matching tools.
  • It is efficient and can search billions of words in less than a second.
  • It can be used for a variety of tasks, including finding harmful content and analyzing linguistic patterns.

Here are some of the technical details of the algorithm:

  • The soft equivalence between words is defined using the cosine similarity of their word embeddings. Cosine similarity measures the cosine of the angle between two vectors: a value of 1 means the vectors point in the same direction (maximal similarity), while a value of 0 means they are orthogonal (no similarity). The formula is:

    \cos(e, e') = \frac{e \cdot e'}{\|e\| \|e'\|}

    where

    • e and e' are the word embeddings of the two words,
    • e · e' is the dot product of the two embeddings, calculated by multiplying corresponding components of the vectors and summing the results,
    • ‖e‖ and ‖e'‖ are the magnitudes (lengths) of the embeddings, each calculated as the square root of the sum of the squares of the vector's components. (A toy numeric check of this formula appears after this list.)
  • The algorithm uses an inverted index to quickly find the positions of words in the corpus.
  • The time complexity of the algorithm is O(n × L + K), where n is the length of the search pattern, L is the size of the vocabulary, and K is the total size of the soft inverted index (the number of candidate positions for matches).
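As a toy numeric check of the cosine-similarity definition above (made-up three-dimensional vectors, purely illustrative):

```python
import numpy as np

def cosine(e1, e2):
    """Cosine similarity: dot product divided by the product of the magnitudes."""
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))

# Toy vectors: "jazz" and "blues" point in similar directions, so their
# similarity is close to 1; an unrelated vector scores near 0.
jazz  = np.array([0.9, 0.4, 0.1])
blues = np.array([0.8, 0.5, 0.2])
other = np.array([-0.1, 0.05, 0.9])
print(cosine(jazz, blues))   # ~0.98
print(cosine(jazz, other))   # ~0.02
```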

The researchers also provide a web demo of SoftMatcha that you can use to try out the algorithm yourself.

In summary, SoftMatcha is a new and efficient algorithm for soft pattern matching that can be used for a variety of tasks in natural language processing and corpus linguistics.