
IDF-Weighted Keyword Overlap

Updated 12 December 2025
  • IDF-weighted keyword overlap is a method for quantifying text similarity by weighting token overlaps with inverse document frequency, emphasizing rare, informative terms.
  • It utilizes consistent-weighted sampling to extend traditional Jaccard similarity, enabling efficient, unbiased estimation over all text substrings.
  • The MONO framework optimizes index construction and query retrieval, achieving significant speed and storage improvements for near-duplicate detection and semantic search.

IDF-weighted keyword overlap refers to the quantification and approximation of semantic similarity between texts by weighting n-gram or token overlap according to the inverse document frequency (IDF) of each token. This approach generalizes the classic unweighted Jaccard similarity, integrating term frequency-inverse document frequency (TF-IDF) schemes prominent in information retrieval, and lies at the core of advanced text alignment and near-duplicate detection frameworks. Algorithms that accurately and efficiently estimate this measure for all substrings of large corpora enable robust detection of semantic similarity, down-weighting spurious matches on frequent vocabulary and emphasizing rare, information-carrying terms (Zhang et al., 30 Aug 2025).

1. Definition of IDF-Weighted Jaccard Similarity

IDF-weighted Jaccard similarity extends the classical set-based Jaccard metric by accounting for both token frequency and global discriminative power. Given token sets $A$ and $B$, each token $t$ is assigned an inverse document frequency

$$\operatorname{idf}(t) = \log\left(\frac{N}{N_t}\right)$$

where $N$ is the corpus size and $N_t$ is the number of documents containing $t$. For a segment $A$,

$$w_t^A = \operatorname{tf}(t, A) \cdot \operatorname{idf}(t)$$

with $\operatorname{tf}(t, A)$ denoting the term frequency of $t$ in $A$. The IDF-weighted Jaccard similarity is then

$$J_w(A, B) = \frac{\sum_{t\in A\cap B}\min(w_t^A, w_t^B)}{\sum_{t\in A\cup B} \max(w_t^A, w_t^B)}$$

This metric smoothly discounts ubiquitous tokens and accentuates rare, content-discriminative vocabulary. Because IDF weights are corpus-level constants, they can be incorporated into precomputed token weightings per segment (Zhang et al., 30 Aug 2025).
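As a concrete illustration, the following is a minimal Python sketch of the two definitions above; the helper names and the toy corpus are illustrative, not from the paper.

```python
import math
from collections import Counter

def idf_weights(corpus):
    """Per-token IDF over a list of tokenized documents: idf(t) = log(N / N_t)."""
    N = len(corpus)
    doc_freq = Counter(t for doc in corpus for t in set(doc))
    return {t: math.log(N / df) for t, df in doc_freq.items()}

def weighted_jaccard(a, b, idf):
    """IDF-weighted Jaccard: sum of min TF-IDF weights over sum of max weights."""
    wa = {t: tf * idf.get(t, 0.0) for t, tf in Counter(a).items()}
    wb = {t: tf * idf.get(t, 0.0) for t, tf in Counter(b).items()}
    tokens = set(wa) | set(wb)
    num = sum(min(wa.get(t, 0.0), wb.get(t, 0.0)) for t in tokens)
    den = sum(max(wa.get(t, 0.0), wb.get(t, 0.0)) for t in tokens)
    return num / den if den > 0 else 0.0

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
idf = idf_weights(corpus)
print(weighted_jaccard(corpus[0], corpus[2], idf))  # "the" gets idf 0, cat dominates
```

Note how the ubiquitous token "the" appears in all three toy documents and so receives weight zero, contributing nothing to either the numerator or the denominator.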

2. Consistent-Weighted Sampling (CWS) for Efficient Estimation

Classic min-hash sketches estimate unweighted Jaccard similarity and are insufficient for the weighted case. To extend to $J_w$, improved consistent-weighted sampling (CWS) [Ioffe '10] is used. For each token $t$, CWS associates three shared random variables: $r_t \sim \operatorname{Gamma}(2,1)$, $c_t \sim \operatorname{Gamma}(2,1)$, and $\beta_t \sim \operatorname{Uniform}(0,1)$. For segment $A$ and weight $w_t^A$,

$$y_t = \exp\left\{ r_t \left( \left\lfloor \frac{\log w_t^A}{r_t} + \beta_t \right\rfloor - \beta_t \right)\right\}, \qquad a_t = \frac{c_t}{y_t \cdot \exp(r_t)}$$

The sketch $h(A)$ is taken as the triple $(t^*, y^*, a^*)$ where $a^* = \min_t a_t$. With $k$ independent hash sketches,

$$\mathbb{P}[\, h_i(A) = h_i(B) \,] = J_w(A, B)$$

$$\widehat{J}_w(A, B) = \frac{1}{k} \sum_{i=1}^k \mathbf{1}\{\, h_i(A) = h_i(B) \,\}$$

Thus, the CWS-based sketch enables unbiased, sublinear estimation of IDF-weighted overlap in $O(1)$ time per nonzero token (Zhang et al., 30 Aug 2025).
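A minimal Python sketch of ICWS follows, assuming TF-IDF weights as in Section 1. The shared randomness is simulated with a fixed seed; `dim_index`, the function names, and the collision key `(token, floor-level)` are implementation choices for illustration, not the paper's code.

```python
import numpy as np

def icws_sketch(weights, k, dim_index, seed=0):
    """Improved consistent weighted sampling (Ioffe, 2010).
    `weights`: token -> positive TF-IDF weight for one segment.
    `dim_index`: token -> column id, shared across all segments so every
    segment draws the same (r, c, beta) per token and hash function."""
    rng = np.random.default_rng(seed)
    D = len(dim_index)
    r = rng.gamma(2.0, 1.0, size=(k, D))       # r_t ~ Gamma(2, 1)
    c = rng.gamma(2.0, 1.0, size=(k, D))       # c_t ~ Gamma(2, 1)
    beta = rng.uniform(0.0, 1.0, size=(k, D))  # beta_t ~ Uniform(0, 1)
    sketch = []
    for i in range(k):
        best, best_a = None, float("inf")
        for t, w in weights.items():
            if w <= 0:
                continue  # zero-weight tokens (e.g., idf = 0) carry no mass
            j = dim_index[t]
            lvl = np.floor(np.log(w) / r[i, j] + beta[i, j])
            y = np.exp(r[i, j] * (lvl - beta[i, j]))
            a = c[i, j] / (y * np.exp(r[i, j]))
            if a < best_a:                     # argmin_t a_t
                best, best_a = (t, lvl), a
        sketch.append(best)
    return sketch

def estimate_jw(sk_a, sk_b):
    """Unbiased estimator: fraction of hash positions whose samples collide."""
    return sum(x == y for x, y in zip(sk_a, sk_b)) / len(sk_a)
```

Collisions are checked on the pair (token, discretized level), which determines $y^*$; two segments then collide on hash $i$ with probability $J_w(A, B)$.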

3. Indexing and Querying All Substrings: MONO Framework

Indexing all $O(n^2)$ substrings under weighted Jaccard is intractable by exhaustive representation. The MONO framework efficiently builds and queries such indices using "compact windows": axis-aligned rectangles $[a..b] \times [c..d]$ of substring intervals $T[a..b]$ through $T[c..d]$ sharing identical CWS hash values.

Active keys are constructed for each token $t$ by processing all position pairs $(p, q)$ with $T[p] = T[q] = t$, where $x$ denotes the number of occurrences of $t$ in $T[p..q]$. Only pairs $(p, q)$ for which the minimal hash is achieved among the first $x$ occurrences are retained. The number of active keys is $O(n + n\log f)$ in expectation, where $f$ is the maximal token frequency.

Compact windows are generated by sorting active keys and maintaining a 2D skyline to form non-overlapping rectangular partitions. For $k$ hashes, $k$ sets of windows are maintained in inverted indices, mapping hash values to tuples $(\text{T-ID}, a, b, c, d)$ (Zhang et al., 30 Aug 2025).
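The following is a minimal Python sketch of the inverted-index layer only (the active-key and skyline construction are omitted); the container layout, the constant `K`, and the function names are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

K = 8  # number of hash functions (illustrative choice)

# One inverted index per hash function: CWS hash value -> compact windows.
# A window (text_id, a, b, c, d) asserts that every substring T[i..j]
# with a <= i <= b and c <= j <= d attains that hash value.
index = [defaultdict(list) for _ in range(K)]

def add_window(hash_id, value, text_id, a, b, c, d):
    index[hash_id][value].append((text_id, a, b, c, d))

def windows_covering(hash_id, value, text_id, i, j):
    """All windows stored under `value` that cover substring T[i..j]."""
    return [w for w in index[hash_id].get(value, [])
            if w[0] == text_id and w[1] <= i <= w[2] and w[3] <= j <= w[4]]
```

The key property is that one stored rectangle stands in for a whole family of substrings, which is what keeps the index subquadratic.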

4. Optimal Index Construction and Theoretical Bounds

The MONO partitioning algorithm produces an expected $O(n + n\log f)$ windows per text, which is theoretically tight: any hash-based method must produce at least $\Omega(n + n\log f)$ distinct windows in the worst case, as demonstrated via coupon-collector arguments on adversarial texts consisting of repeated tokens in blocks of length $f$, in order to guarantee sensitivity to all distinct weighted min-hash segmentations (Zhang et al., 30 Aug 2025).

This optimality ensures subquadratic space and index-construction costs, a substantial improvement over previous approaches requiring $\Omega(nk)$ or worse.
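To make the gap concrete (with illustrative numbers, not figures from the paper): for a text of $n = 10^5$ tokens with maximal local frequency $f = 16$, the index holds on the order of

$$n + n\ln f \approx 10^5\,(1 + 2.77) \approx 3.8 \times 10^5 \ \text{windows}, \quad\text{versus}\quad \binom{n+1}{2} \approx 5 \times 10^9 \ \text{substrings enumerated exhaustively.}$$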

5. Threshold Queries and Retrieval

The threshold-query mechanism proceeds in three steps:

  1. Compute the $k$-sketch of query $Q$ using CWS.
  2. Retrieve the $k$ inverted lists for the corresponding hash values $v_1, \dots, v_k$.
  3. Run a two-dimensional plane-sweep (IntervalScan) algorithm over the compact windows to identify all substrings $(i, j)$ covered by at least $\lceil k\theta \rceil$ lists, i.e., substrings whose estimated similarity exceeds threshold $\theta$.

Each matching substring $T[i..j]$ is reported with polylogarithmic overhead in $n$, plus the output size, ensuring fast query execution suitable for large-scale duplicate detection and semantic search (Zhang et al., 30 Aug 2025).
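Continuing the illustrative index structure from Section 3, a naive stand-in for the threshold query is sketched below. It materializes every substring a window covers, whereas the actual IntervalScan plane sweep avoids this, so treat it as a specification of the counting logic only.

```python
import math
from collections import Counter

def threshold_query(query_sketch, index, text_id, theta):
    """Report substrings (i, j) of `text_id` covered by at least
    ceil(k * theta) of the k inverted lists (naive counting version)."""
    k = len(query_sketch)
    votes = Counter()
    for hash_id, value in enumerate(query_sketch):
        for (tid, a, b, c, d) in index[hash_id].get(value, []):
            if tid != text_id:
                continue
            for i in range(a, b + 1):       # enumerate every cell of the
                for j in range(c, d + 1):   # window (naive; IntervalScan
                    votes[(i, j)] += 1      # sweeps windows instead)
    need = math.ceil(k * theta)
    return [span for span, v in votes.items() if v >= need]
```

The plane sweep achieves the same vote counts by processing window boundaries in sorted order, which is what yields the polylogarithmic-plus-output query cost stated above.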

6. Empirical Results and Comparative Performance

On real-world corpora (PAN, OpenWebText), MONO achieves up to $26\times$ faster index construction than the AllAlign MultiSet-MinHash method, yields indices up to $30\%$ smaller, and improves query latency by up to $3\times$. The practical $f$ values (maximum local term frequency per text) are typically small, making the extra $n\log f$ factor minor in operational settings (Zhang et al., 30 Aug 2025).

| Approach | Index Construction | Index Size | Query Latency |
| --- | --- | --- | --- |
| MONO (CWS + compact windows) | up to 26× faster | up to 30% smaller | up to 3× faster |
| MultiSet-MinHash (previous) | baseline | baseline | baseline |

7. Context and Applications

IDF-weighted keyword overlap provides a principled, efficiently computable method for near-duplicate alignment and similarity search under realistic IR-weighted metrics. By combining TF-IDF weighting, consistent-weighted sampling sketches, and an optimal partitioning and index scheme, the MONO framework establishes the first guaranteed subquadratic solution for substring alignment in large corpora under weighted Jaccard similarity (Zhang et al., 30 Aug 2025). A plausible implication is broader applicability to tasks in document forensics, plagiarism detection, and semantic clustering, where approximate set similarity under importance-weighted scoring is critical.

