Prediction-Driven Top-K Jaccard Coefficient

Updated 3 January 2026
  • The paper introduces a dynamic refinement of Jaccard similarity by predicting optimal Top-K neighbors using neural models and a Sparse Differential Transformer.
  • It employs a supervised prediction task to adaptively determine neighborhood sizes, which significantly enhances robustness and discriminative power in clustering.
  • Empirical results on large-scale datasets demonstrate improved face clustering performance and generalization across multiple domains.

The prediction-driven Top-K Jaccard similarity coefficient is a dynamic, data-adaptive refinement of the traditional Jaccard approach for measuring pairwise relationships in face clustering graphs. Central to this methodology is the replacement of a static, globally fixed neighbor count with an individually predicted optimal Top-K for each node, determined through a supervised prediction task powered by neural models and further stabilized via a Sparse Differential Transformer (SDT). The resulting framework achieves increased discriminative power, improved robustness to noise, and state-of-the-art clustering performance on multiple large-scale datasets (Zhang et al., 27 Dec 2025).

1. Mathematical Foundations of the Top-K Jaccard Similarity

The classical Jaccard similarity for two sets $A$ and $B$ is given by:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}

In the context of face clustering, the sets $N_i$ and $N_j$ typically correspond to the $K$ nearest neighbors of nodes $i$ and $j$ in the embedding space, measured, for instance, by cosine similarity:

J(N_i, N_j) = \frac{|N_i \cap N_j|}{|N_i \cup N_j|}.

In the prediction-driven Top-K extension, the neighbor count $K$ for each node is not fixed but predicted. If $\mathrm{Top}\text{-}K_i$ is the predicted number for node $i$ (rounded to $\hat{k}_i$), then

N_i^{\hat k_i} = \{\text{the first } \hat k_i \text{ neighbors of } i\}

N_j^{\hat k_i} = \{\text{the first } \hat k_i \text{ neighbors of } j\}

with the intersection $\mathcal M_{ij}^{\hat k_i} = N_i^{\hat k_i} \cap N_j^{\hat k_i}$.

The prediction-driven Top-K Jaccard edge probability is then:

\widetilde p_{ij} = \frac{\sum_{h\in \mathcal M_{ij}^{\hat k_i}} (\hat p_{ih} + \hat p_{hj})}{\sum_{h\in N_i^{\hat k_i}} \hat p_{ih} + \sum_{h\in N_j^{\hat k_i}} \hat p_{hj}}

where $\hat p_{uv}$ is a normalized pairwise similarity, obtained via a distance-to-probability transformation:

a_{ij} = \frac{f_i \cdot f_j}{\|f_i\| \|f_j\|}, \quad d_{ij} = 2 - 2a_{ij}

p_{ij} = \frac{1}{1 + \exp(\delta d_{ij} + \epsilon)} \quad (\delta = 7.5,\ \epsilon = -5)

\hat p_{ij} = \frac{p_{ij}}{\sum_{k\in N_i} p_{ik}}

This approach increases the reliability of similarity measurements by focusing on a purified, node-specific neighborhood (Zhang et al., 27 Dec 2025).
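
As a concrete illustration, the following NumPy sketch computes the distance-to-probability transform and the prediction-driven Top-K Jaccard edge probability from the formulas above. The dictionary-style lookups p_hat_i[h], p_hat_j[h] and the ranked neighbor lists are illustrative assumptions, not an interface defined in the paper.

import numpy as np

def pairwise_probability(f_i, f_j, delta=7.5, epsilon=-5.0):
    # cosine similarity -> distance -> sigmoid probability, per the formulas above
    a_ij = float(np.dot(f_i, f_j) / (np.linalg.norm(f_i) * np.linalg.norm(f_j)))
    d_ij = 2.0 - 2.0 * a_ij
    return 1.0 / (1.0 + np.exp(delta * d_ij + epsilon))

def topk_jaccard_edge(p_hat_i, p_hat_j, N_i, N_j, k_hat_i):
    # p_hat_i[h], p_hat_j[h]: normalized similarities for pairs (i, h) and (h, j)
    # N_i, N_j: neighbor lists ranked by cosine similarity
    Ni_k = set(N_i[:k_hat_i])
    Nj_k = set(N_j[:k_hat_i])
    shared = Ni_k & Nj_k  # purified intersection of the two Top-K neighborhoods
    num = sum(p_hat_i[h] + p_hat_j[h] for h in shared)
    den = sum(p_hat_i[h] for h in Ni_k) + sum(p_hat_j[h] for h in Nj_k)
    return num / den if den > 0 else 0.0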

2. Data-Driven Prediction of Optimal Top-K

Rather than applying a fixed neighbor threshold, the optimal neighborhood size for each node is formalized as a supervised prediction problem. For node $i$, the model considers the top-$K$ candidate neighbors (ranked by cosine similarity) and predicts a score $s_{i \to j} \in [0,1]$ for each candidate $j$, approximating the likelihood that $j$ shares the same identity as $i$.

During inference, the predicted Top-K is obtained by thresholding:

\mathrm{Top}\text{-}K_i = \max \{\, j \mid s_{i\to j} \geq \eta \,\}

where $\eta$ is typically set to 0.90.
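
As a small illustration (assuming candidate scores are stored in ranked order, with a 1-based rank index), the thresholding step can be sketched as:

def predicted_top_k(scores, eta=0.90):
    # scores[j-1] holds the predicted score for the j-th ranked candidate of node i
    qualifying = [j for j, s in enumerate(scores, start=1) if s >= eta]
    return max(qualifying) if qualifying else 0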

The input to the predictor consists of the embedding sequence $X_i = [f_i; f_{i,1}; f_{i,2}; \cdots; f_{i,K}]$ and corresponding positional or relative-distance encodings. These are projected into queries, keys, and values for Transformer-based processing.

Training employs a binary cross-entropy loss:

\mathcal{L} = -\frac{1}{N}\sum_{i}\sum_{j=1}^K \left[ y_{ij}\log s_{i \to j} + (1 - y_{ij})\log(1 - s_{i \to j}) \right]

with $y_{ij}$ denoting the ground-truth same-identity indicator. Training samples primarily target a window around the true decision boundary to maintain class balance (Zhang et al., 27 Dec 2025).
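
A hedged PyTorch sketch of this objective follows; the (N, K) tensor shapes and the explicit sum-over-candidates, mean-over-nodes reduction are assumptions about how the loss is batched.

import torch
import torch.nn.functional as F

def candidate_bce_loss(scores, labels):
    # scores: (N, K) predicted candidate scores; labels: (N, K) same-identity indicators
    per_pair = F.binary_cross_entropy(scores, labels.float(), reduction='none')
    return per_pair.sum(dim=1).mean()  # sum over K candidates, average over N nodes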

3. Transformer-Based Predictors and the Sparse Differential Transformer (SDT)

The initial prediction framework employs a vanilla Transformer architecture (three layers, eight self-attention heads, hidden size 1024) with standard self-attention. However, the vanilla Transformer tends to overvalue irrelevant relationships, introducing noise into the neighborhood prediction.

Differential Self-Attention

The attention computation is refined by splitting queries and keys, and applying a subtractive update:

F_{\rm DIFF} = \left[ S(Q_1 K_1^\top/\sqrt{d}) - \lambda S(Q_2 K_2^\top/\sqrt{d}) \right] V

where $S(\cdot)$ denotes the softmax and $\lambda$ is learnable.
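
A minimal single-head PyTorch sketch of this subtractive update is shown below; the split into (Q1, K1) and (Q2, K2) is assumed to happen upstream, and lambda is treated here as a plain scalar rather than a learnable parameter.

import torch

def differential_attention(Q1, K1, Q2, K2, V, lam):
    # Q1, K1, Q2, K2: (batch, seq, d); V: (batch, seq, d_v); lam: scalar
    d = Q1.size(-1)
    A1 = torch.softmax(Q1 @ K1.transpose(-2, -1) / d ** 0.5, dim=-1)
    A2 = torch.softmax(Q2 @ K2.transpose(-2, -1) / d ** 0.5, dim=-1)
    return (A1 - lam * A2) @ V  # subtraction damps attention mass on irrelevant keys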

Sparse Differential Transformer (SDT)

Sparsity is imposed by masking all but the top-K entries per row:

F_{\rm SDT} = \left[ S(M(Q_1 K_1^\top/\sqrt{d})) - \lambda S(M(Q_2 K_2^\top/\sqrt{d})) \right] V

with $M(\cdot)$ denoting the Top-K mask. This structure masks weak key–query pairs, focusing the model on the most informative local relationships and improving resilience to noise.
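
A sketch of the same computation with a row-wise Top-K mask applied to the attention logits before the softmax; masked entries are sent to -inf so they receive zero weight after normalization. The mask width k and tensor shapes are assumptions.

import torch

def topk_mask(logits, k):
    # keep only each row's k largest logits; mask the rest to -inf
    kth = logits.topk(k, dim=-1).values[..., -1:]
    return logits.masked_fill(logits < kth, float('-inf'))

def sparse_differential_attention(Q1, K1, Q2, K2, V, lam, k):
    d = Q1.size(-1)
    A1 = torch.softmax(topk_mask(Q1 @ K1.transpose(-2, -1) / d ** 0.5, k), dim=-1)
    A2 = torch.softmax(topk_mask(Q2 @ K2.transpose(-2, -1) / d ** 0.5, k), dim=-1)
    return (A1 - lam * A2) @ V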

Mixture-of-Experts SDT

MoE-SDT introduces a mixture over masks $M_{K-u}$, $M_K$, $M_{K+u}$ (with $u = 5$) and combines their outputs using learnable weights $(\alpha, \beta, \gamma)$:

\begin{aligned}
F_{\rm MoE\text{-}SDT} =\;& \alpha \left[ S(M_{K-u}(Q_1 K_1^\top)) - \lambda S(M_{K-u}(Q_2 K_2^\top)) \right] V \\
+\;& \beta \left[ S(M_K(Q_1 K_1^\top)) - \lambda S(M_K(Q_2 K_2^\top)) \right] V \\
+\;& \gamma \left[ S(M_{K+u}(Q_1 K_1^\top)) - \lambda S(M_{K+u}(Q_2 K_2^\top)) \right] V
\end{aligned}

This construct further mitigates prediction errors near the Top-K boundary (Zhang et al., 27 Dec 2025).
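
Under the same assumptions, the mixture can be sketched by reusing the sparse_differential_attention function from the previous sketch with three mask widths; the fixed mixing weights below stand in for the learnable (alpha, beta, gamma).

def moe_sdt(Q1, K1, Q2, K2, V, lam, K, u=5, alpha=1.0/3, beta=1.0/3, gamma=1.0/3):
    # assumes sparse_differential_attention from the sketch above is in scope
    out_narrow = sparse_differential_attention(Q1, K1, Q2, K2, V, lam, K - u)
    out_mid    = sparse_differential_attention(Q1, K1, Q2, K2, V, lam, K)
    out_wide   = sparse_differential_attention(Q1, K1, Q2, K2, V, lam, K + u)
    return alpha * out_narrow + beta * out_mid + gamma * out_wide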

4. Integration Into Clustering Workflow

The integration of the prediction-driven Top-K Jaccard coefficient into face clustering follows these algorithmic steps:

  1. Extract face embeddings $\{f_i\}$.
  2. For each node $i$, determine the top-$K$ candidate neighbors via cosine similarity.
  3. Process $[f_i; f_{i,1}; \ldots; f_{i,K}]$ with the SDT predictor, yielding scores $s_{i\to j}$.
  4. Compute $\mathrm{Top}\text{-}K_i = \max\{j \mid s_{i\to j} \geq \eta\}$.
  5. Round to $\hat{k}_i$ (e.g., the nearest multiple of 10, capped at $K$).
  6. Build refined neighborhoods $N_i^{\hat k_i}$ and compute $\widetilde p_{ij}$.
  7. Threshold $\widetilde p_{ij}$ to construct a sparse adjacency matrix.
  8. Apply the Map Equation codec for community detection.

Pseudocode for the procedure is:

for i in 1..N:
  neighbors = topK_by_cosine(f_i, all_f, K)        # candidates ranked by cosine similarity
  scores    = SDT_predictor([f_i; neighbors])      # s_{i->j} for each candidate
  TopK_i    = max{ j | scores[j] >= η }            # predicted Top-K
  k̂_i       = round_to_10(TopK_i) if TopK_i < K else K
  refined   = neighbors[:k̂_i]                      # purified neighborhood
  for j in refined:
    compute p̂_ij via the sigmoid distance transform and normalization above
    accumulate p̂_ij into the numerator/denominator of p̃_ij
  build edges where p̃_ij > τ_edge
clusters = MapEquation(adjacency)
(Zhang et al., 27 Dec 2025)

5. Empirical Results and Robustness

Evaluation on several benchmarks demonstrates the efficacy of the prediction-driven Top-K Jaccard approach.

MS-Celeb-1M Clustering:

Method         F_P (584K)   F_B (584K)   F_P (5.21M)   F_B (5.21M)
FC-ESER        95.28        93.85        89.40         88.80
Diff-Cluster   95.46        94.14        90.08         89.14

The sigmoid distance transform yields +0.18% $F_P$ and +0.16% $F_B$ over the exponential transform; adding Top-K filtering yields +0.35% $F_P$ and +0.46% $F_B$.

Transformer Variant Ablations:

Architecture                 F_P     F_B
Vanilla Transformer          94.25   92.73
Vanilla + Top-K Mask         94.78   93.25
Differential (no mask)       95.05   93.59
Differential + Top-K (SDT)   95.34   93.93
MoE-SDT                      95.46   94.14

Experiments with noisy similarity matrices (up to 40% simulated noise) show that the SDT maintains its pairwise F-score, while the vanilla Transformer degrades significantly.

Generalization evaluations confirm improvements across MSMT17 (person re-ID) and DeepFashion, as well as gains from substituting SDT into contemporary clustering pipelines (Zhang et al., 27 Dec 2025).

6. Impact and Significance

The prediction-driven Top-K Jaccard similarity coefficient constitutes an advancement in clustering methodology by dynamically adapting local connectivity based on learned data relationships rather than static heuristics. By defining neighborhood boundaries as a supervised prediction problem and employing SDT for robust selection, the approach produces neighborhood sets with enhanced purity, resulting in more reliable graph-based similarity estimation.

Empirical evidence establishes increases in overall clustering accuracy, robustness to random noise in similarity calculations, and improved generalization to domains beyond face clustering. A plausible implication is that prediction-driven Top-K filtering can be fruitfully adapted to related graph construction problems in metric learning and instance-level retrieval beyond the specific setting investigated.

In summary, this method advances the state of the art in graph-based face clustering with potential for transferability across adjacent domains (Zhang et al., 27 Dec 2025).
