- The paper introduces SoftMatcha, an efficient algorithm for "soft" (semantic) pattern matching in billion-scale text corpora.
- SoftMatcha leverages word embeddings for semantic similarity and inverted indexes for fast corpus search, blending the benefits of traditional and continuous methods.
- Evaluations show that SoftMatcha searches billion-word corpora in under a second, finds patterns beyond exact strings, and applies to diverse linguistic tasks.
This paper introduces a new algorithm called SoftMatcha for efficiently finding patterns in large collections of text, known as corpora. This is particularly useful in fields like natural language processing and corpus linguistics, where researchers analyze real language use.
Currently, tools like grep and KWIC (keyword-in-context) concordancers rely on exact string matching. This means they can't easily handle variations in spelling, word forms, or phrasing, which are common in natural language. Continuous approaches, like dense vector search, are often too broad, retrieving unrelated texts that happen to share similar topics.
SoftMatcha addresses these limitations by using word embeddings to achieve "soft" or semantic pattern matching. Word embeddings are numerical representations of words that capture their meaning. This allows the algorithm to find matches even when the exact words in the query don't appear in the text, but words with similar meanings do. The algorithm uses inverted indexes to quickly search through large amounts of text.
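To make "similar meanings" concrete, here is a minimal sketch of how closeness between word embeddings can be measured. It assumes cosine similarity, a common choice that this summary does not confirm as the paper's exact measure. The three-dimensional vectors are invented for illustration; real embeddings have hundreds of dimensions.

```python
import math

# Toy 3-dimensional embeddings, invented for illustration only.
TOY_EMBEDDINGS = {
    "musician": [0.9, 0.1, 0.0],
    "pianist": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_close = cosine_similarity(TOY_EMBEDDINGS["musician"], TOY_EMBEDDINGS["pianist"])
sim_far = cosine_similarity(TOY_EMBEDDINGS["musician"], TOY_EMBEDDINGS["banana"])
print(f"musician vs pianist: {sim_close:.3f}")
print(f"musician vs banana:  {sim_far:.3f}")
```

With these toy vectors, "pianist" scores far higher against "musician" than "banana" does, which is the property a soft matcher relies on.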
Here's how SoftMatcha works:
- Preprocessing: The corpus text is preprocessed to create an inverted index. An inverted index maps each unique word in the corpus to its locations within the text. For example, an index might show that the word "example" appears at positions 10, 50, and 120 in the corpus.
- Matching Step 1: Softening the pattern: The search pattern (the phrase you're looking for) is "softened" using word embeddings. For each word in the pattern, the algorithm identifies other words in the vocabulary that have similar embeddings (i.e., similar meanings). A similarity threshold controls how close a word's embedding must be to count as similar. For example, if you're searching for "the jazz musician", the algorithm might identify "a" and "this" as similar to "the"; "blues" and "funk" as similar to "jazz"; and "singer" and "pianist" as similar to "musician".
- Matching Step 2: Finding soft matches:
  - 2-1. Getting the soft inverted index: Using the inverted index of the corpus, the algorithm retrieves the positions of all the words that are similar to each word in the search pattern. So, for "the jazz musician", it retrieves the positions of "the", "a", "this", "jazz", "blues", "funk", "musician", "singer", and "pianist".
  - 2-2. Finding the soft matches: The algorithm then looks for places in the text where the words from the softened pattern appear in the correct sequence. For example, if "a" appears at position 7, "jazz" appears at position 8, and "pianist" appears at position 9, this would be considered a soft match for "the jazz musician".
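
The preprocessing and matching steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy corpus, and the hand-written `SIMILAR` table (which stands in for embedding-based softening) are all hypothetical.

```python
from collections import defaultdict


def build_inverted_index(tokens):
    """Preprocessing: map each unique word to the positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(tokens):
        index[word].append(position)
    return dict(index)


def soften_pattern(pattern, similar_words):
    """Step 1: expand each pattern word into the set of words accepted in its slot."""
    return [{word} | set(similar_words.get(word, ())) for word in pattern]


def soft_positions(softened, index):
    """Step 2-1: for each slot, take the union of positions of all accepted words."""
    return [
        {pos for word in slot for pos in index.get(word, ())}
        for slot in softened
    ]


def find_soft_matches(slot_positions):
    """Step 2-2: keep start positions where slot i is matched at position start + i."""
    if not slot_positions:
        return []
    candidates = set(slot_positions[0])
    for offset, positions in enumerate(slot_positions[1:], start=1):
        # Shift each slot's positions back so they line up with the start position.
        candidates &= {pos - offset for pos in positions}
    return sorted(candidates)


# Toy corpus and a hypothetical similarity table (assumed, not computed from embeddings).
corpus = "she met a jazz pianist and the blues singer near this funk club".split()
SIMILAR = {
    "the": ["a", "this"],
    "jazz": ["blues", "funk"],
    "musician": ["singer", "pianist"],
}

index = build_inverted_index(corpus)
softened = soften_pattern(["the", "jazz", "musician"], SIMILAR)
matches = find_soft_matches(soft_positions(softened, index))
for start in matches:
    print(start, " ".join(corpus[start:start + 3]))
```

On this toy corpus the sketch reports matches at positions 2 ("a jazz pianist") and 6 ("the blues singer"). The phrase "this funk club" is rejected because "club" is not among the words accepted for the "musician" slot.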
The algorithm's efficiency comes from performing the "soft" comparison of word embeddings only on the vocabulary (the set of unique words in the corpus), which is typically much smaller than the entire corpus. The algorithm then uses the inverted index to quickly find the positions of these similar words in the corpus.
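The following sketch illustrates why comparing embeddings over the vocabulary is cheap. It assumes cosine similarity and a fixed threshold of 0.9. The two-dimensional vectors, the `soften_word` helper, and the repeated toy corpus are invented for illustration and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soften_word(word, vocab_embeddings, threshold):
    """Return every vocabulary word whose similarity to `word` meets the threshold."""
    query = vocab_embeddings[word]
    return sorted(
        other
        for other, vector in vocab_embeddings.items()
        if cosine_similarity(query, vector) >= threshold
    )


# Toy 2-dimensional embeddings, one per unique word (invented values).
VOCAB_EMBEDDINGS = {
    "jazz": [1.0, 0.1],
    "blues": [0.9, 0.2],
    "funk": [0.8, 0.3],
    "club": [0.1, 1.0],
}

# A corpus with many tokens but only four unique words.
corpus = ("jazz club blues club funk club " * 1000).split()

similar = soften_word("jazz", VOCAB_EMBEDDINGS, threshold=0.9)
comparisons = len(VOCAB_EMBEDDINGS)  # one similarity computation per vocabulary entry
print(similar)
print(f"{comparisons} comparisons for a corpus of {len(corpus)} tokens")
```

Here softening "jazz" takes four similarity computations regardless of corpus length, whereas comparing against every token would take 6,000. The gap widens as the corpus grows, because vocabulary size grows much more slowly than token count.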
The algorithm was tested on large English and Japanese text collections. The tests showed that SoftMatcha can search billions of words in less than a second, a speed comparable to that of exact string matching and dense vector search methods. The algorithm was also effective at finding harmful content and analyzing linguistic patterns in Latin, a language with complex word forms.
Here are some of the key benefits of SoftMatcha:
- It can find matches even when the exact words in the query don't appear in the text, which makes it more flexible than traditional pattern-matching tools.
- It is efficient and can search billions of words in less than a second.
- It can be used for a variety of tasks, including finding harmful content and analyzing linguistic patterns.
The researchers also provide a web demo of SoftMatcha that you can use to try out the algorithm yourself.
In summary, SoftMatcha is a new and efficient algorithm for soft pattern matching that can be used for a variety of tasks in natural language processing and corpus linguistics.