
IDF-Weighted Keyword Overlap

Updated 12 December 2025
  • IDF-weighted keyword overlap is a method for quantifying text similarity by weighting token overlaps with inverse document frequency, emphasizing rare, informative terms.
  • It utilizes consistent-weighted sampling to extend traditional Jaccard similarity, enabling efficient, unbiased estimation over all text substrings.
  • The MONO framework optimizes index construction and query retrieval, achieving significant speed and storage improvements for near-duplicate detection and semantic search.

IDF-weighted keyword overlap refers to the quantification and approximation of semantic similarity between texts by weighting n-gram or token overlap according to the inverse document frequency (IDF) of each token. This approach generalizes the classic unweighted Jaccard similarity, integrating term frequency-inverse document frequency (TF-IDF) schemes prominent in information retrieval, and lies at the core of advanced text alignment and near-duplicate detection frameworks. Algorithms that accurately and efficiently estimate this measure for all substrings of large corpora enable robust detection of semantic similarity, down-weighting spurious matches on frequent vocabulary and emphasizing rare, information-carrying terms (Zhang et al., 30 Aug 2025).

1. Definition of IDF-Weighted Jaccard Similarity

IDF-weighted Jaccard similarity extends the classical set-based Jaccard metric by accounting for both token frequency and global discriminative power. Given token sets $A$ and $B$, each token $t$ is assigned an inverse document frequency

$$\operatorname{idf}(t) = \log\left(\frac{N}{N_t}\right)$$

where $N$ is the corpus size and $N_t$ is the number of documents containing $t$. For a segment $A$,

$$w_t^A = \operatorname{tf}(t, A) \cdot \operatorname{idf}(t)$$

with $\operatorname{tf}(t, A)$ denoting the term frequency of $t$ in $A$. The IDF-weighted Jaccard similarity is then

$$J_w(A, B) = \frac{\sum_{t\in A\cap B}\min(w_t^A, w_t^B)}{\sum_{t\in A\cup B} \max(w_t^A, w_t^B)}$$

This metric smoothly discounts ubiquitous tokens and accentuates rare, content-discriminative vocabulary. Because IDF weights are corpus-level constants, they can be incorporated into precomputed token weightings per segment (Zhang et al., 30 Aug 2025).
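As a concrete illustration, the following is a minimal Python sketch of the two definitions above; the helper names and the toy corpus are illustrative, not from the paper.

```python
import math
from collections import Counter

def idf_weights(corpus):
    """Per-token IDF over a list of tokenized documents: idf(t) = log(N / N_t)."""
    N = len(corpus)
    doc_freq = Counter(t for doc in corpus for t in set(doc))
    return {t: math.log(N / df) for t, df in doc_freq.items()}

def weighted_jaccard(a, b, idf):
    """IDF-weighted Jaccard: sum of min TF-IDF weights over sum of max weights."""
    wa = {t: tf * idf.get(t, 0.0) for t, tf in Counter(a).items()}
    wb = {t: tf * idf.get(t, 0.0) for t, tf in Counter(b).items()}
    tokens = set(wa) | set(wb)
    num = sum(min(wa.get(t, 0.0), wb.get(t, 0.0)) for t in tokens)
    den = sum(max(wa.get(t, 0.0), wb.get(t, 0.0)) for t in tokens)
    return num / den if den > 0 else 0.0

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
idf = idf_weights(corpus)
print(weighted_jaccard(corpus[0], corpus[2], idf))  # "the" gets idf 0, cat dominates
```

Note how the ubiquitous token "the" appears in all three toy documents and so receives weight zero, contributing nothing to either the numerator or the denominator.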

2. Consistent-Weighted Sampling (CWS) for Efficient Estimation

Classic min-hash sketches estimate unweighted Jaccard similarity and are insufficient for the weighted case. To extend to $J_w$, improved consistent-weighted sampling (CWS) [Ioffe '10] is used. For each token $t$, CWS associates three shared random variables: $r_t \sim \operatorname{Gamma}(2,1)$, $c_t \sim \operatorname{Gamma}(2,1)$, and $\beta_t \sim \operatorname{Uniform}(0,1)$. For segment $A$ and weight $w_t^A$,

$$y_t = \exp\left\{ r_t \left( \left\lfloor \frac{\log w_t^A}{r_t} + \beta_t \right\rfloor - \beta_t \right)\right\}, \qquad a_t = \frac{c_t}{y_t \cdot \exp(r_t)}$$

The sketch $h(A)$ is taken as the triple $(t^*, y^*, a^*)$ where $a^* = \min_t a_t$. With $k$ independent hash sketches,

$$\mathbb{P}[\, h_i(A) = h_i(B) \,] = J_w(A, B)$$

$$\widehat{J}_w(A, B) = \frac{1}{k} \sum_{i=1}^k \mathbf{1}\{\, h_i(A) = h_i(B) \,\}$$

Thus, the CWS-based sketch enables unbiased, sublinear estimation of IDF-weighted overlap in $O(1)$ time per nonzero token (Zhang et al., 30 Aug 2025).
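A minimal Python sketch of ICWS follows, assuming TF-IDF weights as in Section 1. The shared randomness is simulated with a fixed seed; `dim_index`, the function names, and the collision key `(token, floor-level)` are implementation choices for illustration, not the paper's code.

```python
import numpy as np

def icws_sketch(weights, k, dim_index, seed=0):
    """Improved consistent weighted sampling (Ioffe, 2010).
    `weights`: token -> positive TF-IDF weight for one segment.
    `dim_index`: token -> column id, shared across all segments so every
    segment draws the same (r, c, beta) per token and hash function."""
    rng = np.random.default_rng(seed)
    D = len(dim_index)
    r = rng.gamma(2.0, 1.0, size=(k, D))       # r_t ~ Gamma(2, 1)
    c = rng.gamma(2.0, 1.0, size=(k, D))       # c_t ~ Gamma(2, 1)
    beta = rng.uniform(0.0, 1.0, size=(k, D))  # beta_t ~ Uniform(0, 1)
    sketch = []
    for i in range(k):
        best, best_a = None, float("inf")
        for t, w in weights.items():
            if w <= 0:
                continue  # zero-weight tokens (e.g., idf = 0) carry no mass
            j = dim_index[t]
            lvl = np.floor(np.log(w) / r[i, j] + beta[i, j])
            y = np.exp(r[i, j] * (lvl - beta[i, j]))
            a = c[i, j] / (y * np.exp(r[i, j]))
            if a < best_a:                     # argmin_t a_t
                best, best_a = (t, lvl), a
        sketch.append(best)
    return sketch

def estimate_jw(sk_a, sk_b):
    """Unbiased estimator: fraction of hash positions whose samples collide."""
    return sum(x == y for x, y in zip(sk_a, sk_b)) / len(sk_a)
```

Collisions are checked on the pair (token, discretized level), which determines $y^*$; two segments then collide on hash $i$ with probability $J_w(A, B)$.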

3. Indexing and Querying All Substrings: MONO Framework

Indexing all $O(n^2)$ substrings under weighted Jaccard is intractable by exhaustive representation. The MONO framework efficiently builds and queries such indices using "compact windows": axis-aligned rectangles $[a..b] \times [c..d]$ of substring intervals $T[a..b]$ through $T[c..d]$ sharing identical CWS hash values.

Active keys are constructed for each token $t$ by processing all position pairs $(p, q)$ with $T[p] = T[q] = t$, where $x$ denotes the number of occurrences of $t$ in $T[p..q]$. Only pairs $(p, q)$ for which the minimal hash is achieved among the first $x$ occurrences are retained. The number of active keys is $O(n + n\log f)$ in expectation, where $f$ is the maximal token frequency.

Compact windows are generated by sorting active keys and maintaining a 2D skyline to form non-overlapping rectangular partitions. For $k$ hashes, $k$ sets of windows are maintained in inverted indices, mapping hash values to tuples $(\text{T-ID}, a, b, c, d)$ (Zhang et al., 30 Aug 2025).
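The following is a minimal Python sketch of the inverted-index layer only (the active-key and skyline construction are omitted); the container layout, the constant `K`, and the function names are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

K = 8  # number of hash functions (illustrative choice)

# One inverted index per hash function: CWS hash value -> compact windows.
# A window (text_id, a, b, c, d) asserts that every substring T[i..j]
# with a <= i <= b and c <= j <= d attains that hash value.
index = [defaultdict(list) for _ in range(K)]

def add_window(hash_id, value, text_id, a, b, c, d):
    index[hash_id][value].append((text_id, a, b, c, d))

def windows_covering(hash_id, value, text_id, i, j):
    """All windows stored under `value` that cover substring T[i..j]."""
    return [w for w in index[hash_id].get(value, [])
            if w[0] == text_id and w[1] <= i <= w[2] and w[3] <= j <= w[4]]
```

The key property is that one stored rectangle stands in for a whole family of substrings, which is what keeps the index subquadratic.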

4. Optimal Index Construction and Theoretical Bounds

The MONO partitioning algorithm produces an expected $O(n + n\log f)$ windows per text, which is theoretically tight: any hash-based method must produce at least $\Omega(n + n\log f)$ distinct windows in the worst case, as demonstrated via coupon-collector arguments on adversarial texts consisting of repeated tokens in blocks of length $f$, in order to guarantee sensitivity to all distinct weighted min-hash segmentations (Zhang et al., 30 Aug 2025).

This optimality ensures subquadratic space and index-construction costs, a substantial improvement over previous approaches requiring $\Omega(nk)$ or worse.
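To make the gap concrete (with illustrative numbers, not figures from the paper): for a text of $n = 10^5$ tokens with maximal local frequency $f = 16$, the index holds on the order of

$$n + n\ln f \approx 10^5\,(1 + 2.77) \approx 3.8 \times 10^5 \ \text{windows}, \quad\text{versus}\quad \binom{n+1}{2} \approx 5 \times 10^9 \ \text{substrings enumerated exhaustively.}$$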

5. Threshold Queries and Retrieval

The threshold-query mechanism proceeds in three steps:

  1. Compute the $k$-sketch of query $Q$ using CWS.
  2. Retrieve the $k$ inverted lists for the corresponding hash values $v_1, \dots, v_k$.
  3. Run a two-dimensional plane-sweep (IntervalScan) algorithm over the compact windows to identify all substrings $(i, j)$ covered by at least $\lceil k\theta \rceil$ lists, i.e., substrings whose estimated similarity exceeds threshold $\theta$.

Each matching substring $T[i..j]$ is reported with polylogarithmic overhead in $n$, plus the output size, ensuring fast query execution suitable for large-scale duplicate detection and semantic search (Zhang et al., 30 Aug 2025).
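Continuing the illustrative index structure from Section 3, a naive stand-in for the threshold query is sketched below. It materializes every substring a window covers, whereas the actual IntervalScan plane sweep avoids this, so treat it as a specification of the counting logic only.

```python
import math
from collections import Counter

def threshold_query(query_sketch, index, text_id, theta):
    """Report substrings (i, j) of `text_id` covered by at least
    ceil(k * theta) of the k inverted lists (naive counting version)."""
    k = len(query_sketch)
    votes = Counter()
    for hash_id, value in enumerate(query_sketch):
        for (tid, a, b, c, d) in index[hash_id].get(value, []):
            if tid != text_id:
                continue
            for i in range(a, b + 1):       # enumerate every cell of the
                for j in range(c, d + 1):   # window (naive; IntervalScan
                    votes[(i, j)] += 1      # sweeps windows instead)
    need = math.ceil(k * theta)
    return [span for span, v in votes.items() if v >= need]
```

The plane sweep achieves the same vote counts by processing window boundaries in sorted order, which is what yields the polylogarithmic-plus-output query cost stated above.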

6. Empirical Results and Comparative Performance

On real-world corpora (PAN, OpenWebText), MONO achieves up to $26\times$ faster index construction than the AllAlign MultiSet-MinHash method, yields indices up to $30\%$ smaller, and improves query latency by up to $3\times$. The practical $f$ values (maximum local term frequency per text) are typically small, making the extra $n\log f$ factor minor in operational settings (Zhang et al., 30 Aug 2025).

| Approach | Index Construction | Index Size | Query Latency |
| --- | --- | --- | --- |
| MONO (CWS + compact windows) | up to 26× faster | up to 30% smaller | up to 3× faster |
| MultiSet-MinHash (previous) | baseline | baseline | baseline |

7. Context and Applications

IDF-weighted keyword overlap provides a principled, efficiently computable method for near-duplicate alignment and similarity search under realistic IR-weighted metrics. By combining TF-IDF weighting, consistent-weighted sampling sketches, and an optimal partitioning and index scheme, the MONO framework establishes the first guaranteed subquadratic solution for substring alignment in large corpora under weighted Jaccard similarity (Zhang et al., 30 Aug 2025). A plausible implication is broader applicability to tasks in document forensics, plagiarism detection, and semantic clustering, where approximate set similarity under importance-weighted scoring is critical.

