Prediction-Driven Top-K Jaccard Coefficient

Updated 3 January 2026
  • The paper introduces a dynamic refinement of Jaccard similarity by predicting optimal Top-K neighbors using neural models and a Sparse Differential Transformer.
  • It employs a supervised prediction task to adaptively determine neighborhood sizes, which significantly enhances robustness and discriminative power in clustering.
  • Empirical results on large-scale datasets demonstrate improved face clustering performance and generalization across multiple domains.

The prediction-driven Top-K Jaccard similarity coefficient is a dynamic, data-adaptive refinement of the traditional Jaccard approach for measuring pairwise relationships in face clustering graphs. Central to this methodology is the replacement of a static, globally fixed neighbor count with an individually predicted optimal Top-K for each node, determined through a supervised prediction task powered by neural models and further stabilized via a Sparse Differential Transformer (SDT). The resulting framework achieves increased discriminative power, improved robustness to noise, and state-of-the-art clustering performance on multiple large-scale datasets (Zhang et al., 27 Dec 2025).

1. Mathematical Foundations of the Top-K Jaccard Similarity

The classical Jaccard similarity for two sets $A$ and $B$ is given by:

J(A, B) = \frac{|A \cap B|}{|A \cup B|}

In the context of face clustering, the sets $N_i$ and $N_j$ typically correspond to the $K$ nearest neighbors of nodes $i$ and $j$ in the embedding space, measured, for instance, by cosine similarity:

J(N_i, N_j) = \frac{|N_i \cap N_j|}{|N_i \cup N_j|}.

In the prediction-driven Top-K extension, the neighbor count $K$ for each node is not fixed but predicted. If $\mathrm{Top}\text{-}K_i$ is the predicted number for node $i$ (rounded to $\hat{k}_i$), then

N_i^{\hat k_i} = \{\text{the first } \hat k_i \text{ neighbors of } i\}

N_j^{\hat k_i} = \{\text{the first } \hat k_i \text{ neighbors of } j\}

with the intersection $\mathcal M_{ij}^{\hat k_i} = N_i^{\hat k_i} \cap N_j^{\hat k_i}$.

The prediction-driven Top-K Jaccard edge probability is then:

\widetilde p_{ij} = \frac{\sum_{h\in \mathcal M_{ij}^{\hat k_i}} (\hat p_{ih} + \hat p_{hj})}{\sum_{h\in N_i^{\hat k_i}} \hat p_{ih} + \sum_{h\in N_j^{\hat k_i}} \hat p_{hj}}

where $\hat p_{uv}$ is a normalized pairwise similarity, obtained via a distance-to-probability transformation:

a_{ij} = \frac{f_i \cdot f_j}{\|f_i\| \|f_j\|}, \quad d_{ij} = 2 - 2a_{ij}

p_{ij} = \frac{1}{1 + \exp(\delta d_{ij} + \epsilon)} \quad (\delta = 7.5,\ \epsilon = -5)

\hat p_{ij} = \frac{p_{ij}}{\sum_{k\in N_i} p_{ik}}

This approach increases the reliability of similarity measurements by focusing on a purified, node-specific neighborhood (Zhang et al., 27 Dec 2025).
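
As a concrete illustration, the following NumPy sketch computes the distance-to-probability transform and the prediction-driven Top-K Jaccard edge probability from the formulas above. The dictionary-style lookups p_hat_i[h], p_hat_j[h] and the ranked neighbor lists are illustrative assumptions, not an interface defined in the paper.

import numpy as np

def pairwise_probability(f_i, f_j, delta=7.5, epsilon=-5.0):
    # cosine similarity -> distance -> sigmoid probability, per the formulas above
    a_ij = float(np.dot(f_i, f_j) / (np.linalg.norm(f_i) * np.linalg.norm(f_j)))
    d_ij = 2.0 - 2.0 * a_ij
    return 1.0 / (1.0 + np.exp(delta * d_ij + epsilon))

def topk_jaccard_edge(p_hat_i, p_hat_j, N_i, N_j, k_hat_i):
    # p_hat_i[h], p_hat_j[h]: normalized similarities for pairs (i, h) and (h, j)
    # N_i, N_j: neighbor lists ranked by cosine similarity
    Ni_k = set(N_i[:k_hat_i])
    Nj_k = set(N_j[:k_hat_i])
    shared = Ni_k & Nj_k  # purified intersection of the two Top-K neighborhoods
    num = sum(p_hat_i[h] + p_hat_j[h] for h in shared)
    den = sum(p_hat_i[h] for h in Ni_k) + sum(p_hat_j[h] for h in Nj_k)
    return num / den if den > 0 else 0.0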

2. Data-Driven Prediction of Optimal Top-K

Rather than applying a fixed neighbor threshold, the optimal neighborhood size for each node is formalized as a supervised prediction problem. For node $i$, the model considers the top-$K$ candidate neighbors (ranked by cosine similarity) and predicts a score $s_{i \to j} \in [0,1]$ for each candidate $j$, approximating the likelihood that $j$ shares the same identity as $i$.

During inference, the predicted Top-K is obtained by thresholding:

\mathrm{Top}\text{-}K_i = \max \{\, j \mid s_{i\to j} \geq \eta \,\}

where $\eta$ is typically set to 0.90.
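
As a small illustration (assuming candidate scores are stored in ranked order, with a 1-based rank index), the thresholding step can be sketched as:

def predicted_top_k(scores, eta=0.90):
    # scores[j-1] holds the predicted score for the j-th ranked candidate of node i
    qualifying = [j for j, s in enumerate(scores, start=1) if s >= eta]
    return max(qualifying) if qualifying else 0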

The input to the predictor consists of the embedding sequence $X_i = [f_i; f_{i,1}; f_{i,2}; \cdots; f_{i,K}]$ and corresponding positional or relative-distance encodings. These are projected into queries, keys, and values for Transformer-based processing.

Training employs a binary cross-entropy loss:

\mathcal{L} = -\frac{1}{N}\sum_{i}\sum_{j=1}^K \left[ y_{ij}\log s_{i \to j} + (1 - y_{ij})\log(1 - s_{i \to j}) \right]

with $y_{ij}$ denoting the ground-truth same-identity indicator. Training samples primarily target a window around the true decision boundary to maintain class balance (Zhang et al., 27 Dec 2025).
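
A hedged PyTorch sketch of this objective follows; the (N, K) tensor shapes and the explicit sum-over-candidates, mean-over-nodes reduction are assumptions about how the loss is batched.

import torch
import torch.nn.functional as F

def candidate_bce_loss(scores, labels):
    # scores: (N, K) predicted candidate scores; labels: (N, K) same-identity indicators
    per_pair = F.binary_cross_entropy(scores, labels.float(), reduction='none')
    return per_pair.sum(dim=1).mean()  # sum over K candidates, average over N nodes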

3. Transformer-Based Predictors and the Sparse Differential Transformer (SDT)

The initial prediction framework employs a vanilla Transformer architecture (three layers, eight self-attention heads, hidden size 1024) with standard self-attention. However, the vanilla Transformer tends to overvalue irrelevant relationships, introducing noise into the neighborhood prediction.

Differential Self-Attention

The attention computation is refined by splitting queries and keys, and applying a subtractive update:

F_{\rm DIFF} = \left[ S(Q_1 K_1^\top/\sqrt{d}) - \lambda S(Q_2 K_2^\top/\sqrt{d}) \right] V

where $S(\cdot)$ denotes the softmax and $\lambda$ is learnable.
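
A minimal single-head PyTorch sketch of this subtractive update is shown below; the split into (Q1, K1) and (Q2, K2) is assumed to happen upstream, and lambda is treated here as a plain scalar rather than a learnable parameter.

import torch

def differential_attention(Q1, K1, Q2, K2, V, lam):
    # Q1, K1, Q2, K2: (batch, seq, d); V: (batch, seq, d_v); lam: scalar
    d = Q1.size(-1)
    A1 = torch.softmax(Q1 @ K1.transpose(-2, -1) / d ** 0.5, dim=-1)
    A2 = torch.softmax(Q2 @ K2.transpose(-2, -1) / d ** 0.5, dim=-1)
    return (A1 - lam * A2) @ V  # subtraction damps attention mass on irrelevant keys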

Sparse Differential Transformer (SDT)

Sparsity is imposed by masking all but the top-K entries per row:

F_{\rm SDT} = \left[ S(M(Q_1 K_1^\top/\sqrt{d})) - \lambda S(M(Q_2 K_2^\top/\sqrt{d})) \right] V

with $M(\cdot)$ denoting the Top-K mask. This structure masks weak key–query pairs, focusing the model on the most informative local relationships and improving resilience to noise.
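
A sketch of the same computation with a row-wise Top-K mask applied to the attention logits before the softmax; masked entries are sent to -inf so they receive zero weight after normalization. The mask width k and tensor shapes are assumptions.

import torch

def topk_mask(logits, k):
    # keep only each row's k largest logits; mask the rest to -inf
    kth = logits.topk(k, dim=-1).values[..., -1:]
    return logits.masked_fill(logits < kth, float('-inf'))

def sparse_differential_attention(Q1, K1, Q2, K2, V, lam, k):
    d = Q1.size(-1)
    A1 = torch.softmax(topk_mask(Q1 @ K1.transpose(-2, -1) / d ** 0.5, k), dim=-1)
    A2 = torch.softmax(topk_mask(Q2 @ K2.transpose(-2, -1) / d ** 0.5, k), dim=-1)
    return (A1 - lam * A2) @ V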

Mixture-of-Experts SDT

MoE-SDT introduces a mixture over masks $M_{K-u}$, $M_K$, $M_{K+u}$ (with $u = 5$) and combines their outputs using learnable weights $(\alpha, \beta, \gamma)$:

\begin{aligned}
F_{\rm MoE\text{-}SDT} =\;& \alpha \left[ S(M_{K-u}(Q_1 K_1^\top)) - \lambda S(M_{K-u}(Q_2 K_2^\top)) \right] V \\
+\;& \beta \left[ S(M_K(Q_1 K_1^\top)) - \lambda S(M_K(Q_2 K_2^\top)) \right] V \\
+\;& \gamma \left[ S(M_{K+u}(Q_1 K_1^\top)) - \lambda S(M_{K+u}(Q_2 K_2^\top)) \right] V
\end{aligned}

This construct further mitigates prediction errors near the Top-K boundary (Zhang et al., 27 Dec 2025).
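
Under the same assumptions, the mixture can be sketched by reusing the sparse_differential_attention function from the previous sketch with three mask widths; the fixed mixing weights below stand in for the learnable (alpha, beta, gamma).

def moe_sdt(Q1, K1, Q2, K2, V, lam, K, u=5, alpha=1.0/3, beta=1.0/3, gamma=1.0/3):
    # assumes sparse_differential_attention from the sketch above is in scope
    out_narrow = sparse_differential_attention(Q1, K1, Q2, K2, V, lam, K - u)
    out_mid    = sparse_differential_attention(Q1, K1, Q2, K2, V, lam, K)
    out_wide   = sparse_differential_attention(Q1, K1, Q2, K2, V, lam, K + u)
    return alpha * out_narrow + beta * out_mid + gamma * out_wide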

4. Integration Into Clustering Workflow

The integration of the prediction-driven Top-K Jaccard coefficient into face clustering follows these algorithmic steps:

  1. Extract face embeddings $\{f_i\}$.
  2. For each node $i$, determine the top-$K$ candidate neighbors via cosine similarity.
  3. Process $[f_i; f_{i,1}; \ldots; f_{i,K}]$ with the SDT predictor, yielding scores $s_{i\to j}$.
  4. Compute $\mathrm{Top}\text{-}K_i = \max\{j \mid s_{i\to j} \geq \eta\}$.
  5. Round to $\hat{k}_i$ (e.g., the nearest multiple of 10, capped at $K$).
  6. Build refined neighborhoods $N_i^{\hat k_i}$ and compute $\widetilde p_{ij}$.
  7. Threshold $\widetilde p_{ij}$ to construct a sparse adjacency matrix.
  8. Apply the Map Equation codec for community detection.

Pseudocode for the procedure is:

for i in 1..N:
  neighbors = topK_by_cosine(f_i, all_f, K)        # candidates ranked by cosine similarity
  scores    = SDT_predictor([f_i; neighbors])      # s_{i->j} for each candidate
  TopK_i    = max{ j | scores[j] >= η }            # predicted Top-K
  k̂_i       = round_to_10(TopK_i) if TopK_i < K else K
  refined   = neighbors[:k̂_i]                      # purified neighborhood
  for j in refined:
    compute p̂_ij via the sigmoid distance transform and normalization above
    accumulate p̂_ij into the numerator/denominator of p̃_ij
  build edges where p̃_ij > τ_edge
clusters = MapEquation(adjacency)
(Zhang et al., 27 Dec 2025)

5. Empirical Results and Robustness

Evaluation on several benchmarks demonstrates the efficacy of the prediction-driven Top-K Jaccard approach.

MS-Celeb-1M Clustering:

Method         F_P (584K)   F_B (584K)   F_P (5.21M)   F_B (5.21M)
FC-ESER        95.28        93.85        89.40         88.80
Diff-Cluster   95.46        94.14        90.08         89.14

The sigmoid distance transform yields +0.18% $F_P$ and +0.16% $F_B$ over the exponential transform; adding Top-K filtering yields +0.35% $F_P$ and +0.46% $F_B$.

Transformer Variant Ablations:

Architecture                 F_P     F_B
Vanilla Transformer          94.25   92.73
Vanilla + Top-K Mask         94.78   93.25
Differential (no mask)       95.05   93.59
Differential + Top-K (SDT)   95.34   93.93
MoE-SDT                      95.46   94.14

Experiments with noisy similarity matrices (up to 40% simulated noise) show that the SDT maintains its pairwise F-score, while the vanilla Transformer degrades significantly.

Generalization evaluations confirm improvements across MSMT17 (person re-ID) and DeepFashion, as well as gains from substituting SDT into contemporary clustering pipelines (Zhang et al., 27 Dec 2025).

6. Impact and Significance

The prediction-driven Top-K Jaccard similarity coefficient constitutes an advancement in clustering methodology by dynamically adapting local connectivity based on learned data relationships rather than static heuristics. By defining neighborhood boundaries as a supervised prediction problem and employing SDT for robust selection, the approach produces neighborhood sets with enhanced purity, resulting in more reliable graph-based similarity estimation.

Empirical evidence establishes increases in overall clustering accuracy, robustness to random noise in similarity calculations, and improved generalization to domains beyond face clustering. A plausible implication is that prediction-driven Top-K filtering can be fruitfully adapted to related graph construction problems in metric learning and instance-level retrieval beyond the specific setting investigated.

In summary, this method advances the state of the art in graph-based face clustering with potential for transferability across adjacent domains (Zhang et al., 27 Dec 2025).
