IDF-Weighted Keyword Overlap
- IDF-weighted keyword overlap is a method for quantifying text similarity by weighting token overlaps with inverse document frequency, emphasizing rare, informative terms.
- It utilizes consistent-weighted sampling to extend traditional Jaccard similarity, enabling efficient, unbiased estimation over all text substrings.
- The MONO framework optimizes index construction and query retrieval, achieving significant speed and storage improvements for near-duplicate detection and semantic search.
IDF-weighted keyword overlap refers to the quantification and approximation of semantic similarity between texts by weighting n-gram or token overlap according to the inverse document frequency (IDF) of each token. This approach generalizes the classic unweighted Jaccard similarity, integrating term frequency-inverse document frequency (TF-IDF) schemes prominent in information retrieval, and lies at the core of advanced text alignment and near-duplicate detection frameworks. Algorithms that accurately and efficiently estimate this measure for all substrings of large corpora enable robust detection of semantic similarity, down-weighting spurious matches on frequent vocabulary and emphasizing rare, information-carrying terms (Zhang et al., 30 Aug 2025).
1. Definition of IDF-Weighted Jaccard Similarity
IDF-weighted Jaccard similarity extends the classical set-based Jaccard metric by accounting for both token frequency and global discriminativity. Given token multisets $A$ and $B$, each token $t$ is assigned an inverse document frequency

$$\mathrm{idf}(t) = \ln\frac{N}{n_t},$$

where $N$ is the corpus size and $n_t$ the number of documents containing $t$. For a segment $S$,

$$w_S(t) = \mathrm{tf}_S(t)\cdot\mathrm{idf}(t),$$

with $\mathrm{tf}_S(t)$ denoting the term frequency of $t$ in $S$. The IDF-weighted Jaccard similarity is then

$$J_w(A,B) = \frac{\sum_t \min\big(w_A(t),\, w_B(t)\big)}{\sum_t \max\big(w_A(t),\, w_B(t)\big)}.$$
This metric smoothly discounts ubiquitous tokens and accentuates rare, content-discriminative vocabulary. Because IDF weights are corpus-level constants, they can be incorporated into precomputed token weightings per segment (Zhang et al., 30 Aug 2025).
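The definitions above can be sketched directly. A minimal Python illustration, assuming the unsmoothed $\ln(N/n_t)$ variant of IDF (the source does not fix a particular IDF formula):

```python
import math
from collections import Counter

def idf_weights(corpus):
    """idf(t) = ln(N / n_t): N documents total, n_t documents containing t."""
    N = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(doc))  # document frequency counts each doc once
    return {t: math.log(N / n_t) for t, n_t in df.items()}

def weighted_jaccard(a, b, idf):
    """J_w = sum_t min(w_A, w_B) / sum_t max(w_A, w_B), w_S(t) = tf_S(t) * idf(t)."""
    wa = {t: tf * idf.get(t, 0.0) for t, tf in Counter(a).items()}
    wb = {t: tf * idf.get(t, 0.0) for t, tf in Counter(b).items()}
    vocab = set(wa) | set(wb)
    num = sum(min(wa.get(t, 0.0), wb.get(t, 0.0)) for t in vocab)
    den = sum(max(wa.get(t, 0.0), wb.get(t, 0.0)) for t in vocab)
    return num / den if den else 0.0
```

Note that a token appearing in every document gets $\mathrm{idf}=0$ and contributes nothing to either sum, which is exactly the intended discounting of ubiquitous vocabulary.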
2. Consistent-Weighted Sampling (CWS) for Efficient Estimation
Classic MinHash sketches estimate unweighted Jaccard similarity and are insufficient for the weighted case. To extend to $J_w$, improved consistent weighted sampling (CWS) [Ioffe '10] is used. For each token $t$, CWS associates three shared random variables: $r_t \sim \mathrm{Gamma}(2,1)$, $c_t \sim \mathrm{Gamma}(2,1)$, and $\beta_t \sim \mathrm{Uniform}(0,1)$. For a segment $S$ and each $t$ with $w_S(t) > 0$,

$$t_t = \left\lfloor \frac{\ln w_S(t)}{r_t} + \beta_t \right\rfloor, \qquad y_t = \exp\big(r_t(t_t - \beta_t)\big), \qquad a_t = \frac{c_t}{y_t \exp(r_t)}.$$

The sketch is taken as the pair $(t^*, t_{t^*})$ where $t^* = \arg\min_t a_t$. With $K$ independent hash sketches,

$$\Pr\big[\mathrm{sk}(A) = \mathrm{sk}(B)\big] = J_w(A,B), \qquad \hat J_w(A,B) = \frac{1}{K}\sum_{i=1}^{K} \mathbf{1}\big[\mathrm{sk}_i(A) = \mathrm{sk}_i(B)\big].$$

Thus, the CWS-based sketch enables unbiased, sublinear estimation of IDF-weighted overlap in $O(K)$ time per nonzero token (Zhang et al., 30 Aug 2025).
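A compact illustration of Ioffe-style CWS in Python. The per-(token, hash) shared randomness is simulated here by seeding a local RNG, which is an implementation convenience not specified by the source; what matters is that every segment sees the same $(r_t, c_t, \beta_t)$ for a given token and hash index:

```python
import math
import random

def cws_sample(weights, hash_id, seed=0):
    """One improved-CWS sample (Ioffe 2010): the (token, t_token) pair
    attaining the minimal a_token over tokens with positive weight."""
    best_key, best_a = None, float("inf")
    for tok, w in weights.items():
        if w <= 0:
            continue
        # Shared randomness: same (r, c, beta) for this (token, hash) everywhere.
        rng = random.Random(hash((seed, hash_id, tok)))
        r = rng.gammavariate(2, 1)
        c = rng.gammavariate(2, 1)
        beta = rng.uniform(0, 1)
        t = math.floor(math.log(w) / r + beta)
        y = math.exp(r * (t - beta))
        a = c / (y * math.exp(r))
        if a < best_a:
            best_a, best_key = a, (tok, t)
    return best_key

def cws_sketch(weights, k, seed=0):
    """K independent samples; each collides with probability J_w."""
    return [cws_sample(weights, i, seed) for i in range(k)]

def estimate_jw(sk_a, sk_b):
    """Unbiased estimator: fraction of matching sketch coordinates."""
    return sum(x == y for x, y in zip(sk_a, sk_b)) / len(sk_a)
```

This loops over tokens per hash for clarity; production implementations batch the $K$ hashes and reuse precomputed randomness tables.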
3. Indexing and Querying All Substrings: MONO Framework
Indexing all substrings under weighted Jaccard is intractable by exhaustive representation, since a text of length $n$ has $\Theta(n^2)$ substrings. The MONO framework efficiently builds and queries such indices using "compact windows": axis-aligned rectangles of substring intervals $(i, j)$ that share identical CWS hash values.
Active keys are constructed for each token $t$ by processing the interval pairs $(i, j)$, taking into account the number of occurrences of $t$ in the corresponding substring. Only pairs where the minimal hash is achieved among the first occurrences are retained, so the number of active keys remains near-linear in expectation, growing with the maximal token frequency $f_{\max}$.
Compact windows are generated by sorting the active keys and maintaining a 2D skyline to form non-overlapping rectangular partitions. For each of the $K$ hashes, the sets of windows are maintained in inverted indices mapping hash values to window tuples (Zhang et al., 30 Aug 2025).
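For intuition about what compact windows compress, consider the naive baseline that sketches every substring individually, i.e., the $\Theta(n^2)$ enumeration MONO avoids. Here `sketch_fn` is a placeholder for a $K$-value CWS sketch of a token span:

```python
from collections import defaultdict

def naive_substring_index(tokens, sketch_fn, k):
    """Naive baseline: one K-value sketch per substring tokens[i:j].
    MONO avoids this O(n^2) enumeration by grouping the (i, j) pairs
    that share a hash value into rectangular 'compact windows'."""
    index = [defaultdict(list) for _ in range(k)]  # one inverted map per hash
    n = len(tokens)
    for i in range(n):
        for j in range(i + 1, n + 1):
            for h, key in enumerate(sketch_fn(tokens[i:j])):
                index[h][key].append((i, j))  # a window degenerated to a point
    return index
```

Even this toy version makes the rectangle structure visible: on the tokens `["b", "a", "c"]` with a one-value `min` sketch, the substrings sharing key `"a"` are exactly the pairs $\{0,1\} \times \{2,3\}$, i.e., one compact window instead of four stored entries.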
4. Optimal Index Construction and Theoretical Bounds
The MONO partitioning algorithm produces an asymptotically optimal number of compact windows per text in expectation. Any hash-based method must produce at least as many distinct windows in the worst case—demonstrated via coupon-collector arguments on adversarial texts consisting of repeated tokens arranged in blocks—to guarantee sensitivity to all distinct weighted min-hash segmentations (Zhang et al., 30 Aug 2025).
This optimality ensures subquadratic space and index-construction costs, a substantial improvement over previous approaches requiring quadratic cost or worse.
5. Threshold Queries and Retrieval
The threshold-query mechanism proceeds in three steps:
- Compute the $K$-sketch of the query $q$ using CWS.
- Retrieve the inverted lists for the corresponding hash values $h_1, \dots, h_K$.
- Run a two-dimensional plane-sweep (IntervalScan) algorithm over the compact windows to identify all substrings covered by at least $\tau K$ lists—i.e., substrings whose estimated similarity exceeds the threshold $\tau$.
Each matching substring is reported with polylogarithmic overhead in the text length, plus the output size, ensuring fast query execution suitable for large-scale duplicate detection and semantic search (Zhang et al., 30 Aug 2025).
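The counting step can be illustrated with a simple hash-table variant (the actual MONO query runs a plane-sweep over window rectangles; this sketch votes over degenerate point spans, as produced by the naive index above):

```python
from collections import Counter

def threshold_query(query_sketch, index, tau):
    """Report spans whose estimated weighted-Jaccard similarity to the
    query is at least tau: a span must appear in >= tau * K of the K
    inverted lists addressed by the query's K sketch values."""
    k = len(query_sketch)
    votes = Counter()
    for h, key in enumerate(query_sketch):
        for span in index[h].get(key, []):
            votes[span] += 1  # one vote per agreeing hash
    return [span for span, v in votes.items() if v / k >= tau]
```

The fraction of agreeing hashes is exactly the unbiased CWS estimate of $J_w$, so the vote threshold $\tau K$ implements the similarity threshold $\tau$ directly.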
6. Empirical Results and Comparative Performance
On real-world corpora (PAN, OpenWebText), MONO achieves up to 26× faster index construction than the AllAlign MultiSet-MinHash method, yields indices up to 30% smaller, and improves query latency by up to 3×. The practical values of the maximal per-text term frequency $f_{\max}$ are typically small, making the extra factor minor in operational settings (Zhang et al., 30 Aug 2025).
| Approach | Index Construction | Index Size | Query Latency |
|---|---|---|---|
| MONO (CWS + windows) | up to 26× faster | 30% smaller | up to 3× faster |
| MultiSet-MinHash (prev) | — | — | — |
7. Context and Applications
IDF-weighted keyword overlap provides a principled, efficiently computable method for near-duplicate alignment and similarity search under realistic IR-weighted metrics. By combining TF-IDF weighting, consistent-weighted sampling sketches, and an optimal partitioning and index scheme, the MONO framework establishes the first guaranteed subquadratic solution for substring alignment in large corpora under weighted Jaccard similarity (Zhang et al., 30 Aug 2025). A plausible implication is broader applicability to tasks in document forensics, plagiarism detection, and semantic clustering, where approximate set similarity under importance-weighted scoring is critical.