Prediction-Driven Top-K Jaccard Coefficient
- The paper introduces a dynamic refinement of Jaccard similarity by predicting optimal Top-K neighbors using neural models and a Sparse Differential Transformer.
- It employs a supervised prediction task to adaptively determine neighborhood sizes, which significantly enhances robustness and discriminative power in clustering.
- Empirical results on large-scale datasets demonstrate improved face clustering performance and generalization across multiple domains.
The prediction-driven Top-K Jaccard similarity coefficient is a dynamic, data-adaptive refinement of the traditional Jaccard approach for measuring pairwise relationships in face clustering graphs. Central to this methodology is the replacement of a static, globally fixed neighbor count with an individually predicted optimal Top-K for each node, determined through a supervised prediction task powered by neural models and further stabilized via a Sparse Differential Transformer (SDT). The resulting framework achieves increased discriminative power, improved robustness to noise, and state-of-the-art clustering performance on multiple large-scale datasets (Zhang et al., 27 Dec 2025).
1. Mathematical Foundations of the Top-K Jaccard Similarity
The classical Jaccard similarity for two sets $A$ and $B$ is given by:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

In the context of face clustering, the sets $A$ and $B$ typically correspond to the $K$ nearest neighbors $\mathcal{N}_K(i)$ and $\mathcal{N}_K(j)$ of nodes $i$ and $j$ in the embedding space, measured, for instance, by cosine similarity:

$$J_{ij} = \frac{|\mathcal{N}_K(i) \cap \mathcal{N}_K(j)|}{|\mathcal{N}_K(i) \cup \mathcal{N}_K(j)|}$$
In the prediction-driven Top-K extension, the neighbor count for each node is not fixed but predicted. If $\hat{k}_i$ is the predicted number for node $i$ (rounded to $\hat{h}_i$), then

$$\hat{\mathcal{N}}_i = \mathcal{N}_{\hat{h}_i}(i),$$

with the intersection $\mathcal{I}_{ij} = \hat{\mathcal{N}}_i \cap \hat{\mathcal{N}}_j$.
The prediction-driven Top-K Jaccard edge probability is then:

$$\tilde{p}_{ij} = \frac{\sum_{k \in \hat{\mathcal{N}}_i \cap \hat{\mathcal{N}}_j} \left( \hat{p}_{ik} + \hat{p}_{jk} \right)}{\sum_{k \in \hat{\mathcal{N}}_i \cup \hat{\mathcal{N}}_j} \left( \hat{p}_{ik} + \hat{p}_{jk} \right)}$$

where $\hat{p}_{ij}$ is a normalized pairwise similarity, obtained via a distance-to-probability transformation such as the sigmoid mapping

$$\hat{p}_{ij} = \frac{1}{1 + e^{\alpha (d_{ij} - \beta)}},$$

with embedding distance $d_{ij}$ and scale/offset parameters $\alpha$, $\beta$.
This approach increases the reliability of similarity measurements by focusing on a purified, node-specific neighborhood (Zhang et al., 27 Dec 2025).
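As a concrete illustration, the following minimal NumPy sketch computes $\hat{p}_{ij}$ with a sigmoid distance-to-probability transform and the Top-K Jaccard edge probability for one node pair, assuming the weighted-Jaccard form written above; the helper names and parameter values (`sigmoid_prob`, `topk_jaccard`, `alpha`, `beta`) are illustrative assumptions, not the paper's reference implementation:

```python
import numpy as np

def sigmoid_prob(d, alpha=10.0, beta=0.5):
    """Sigmoid distance-to-probability transform (alpha/beta illustrative)."""
    return 1.0 / (1.0 + np.exp(alpha * (d - beta)))

def topk_jaccard(i, j, nbrs, p_hat):
    """Top-K Jaccard edge probability between nodes i and j.

    nbrs[i] -- refined neighborhood of node i (predicted size ĥ_i)
    p_hat   -- dict mapping (node, neighbor) -> edge probability p̂
    """
    Ni, Nj = set(nbrs[i]), set(nbrs[j])
    inter, union = Ni & Nj, Ni | Nj
    num = sum(p_hat.get((i, k), 0.0) + p_hat.get((j, k), 0.0) for k in inter)
    den = sum(p_hat.get((i, k), 0.0) + p_hat.get((j, k), 0.0) for k in union)
    return num / den if den > 0 else 0.0
```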
2. Data-Driven Prediction of Optimal Top-K
Rather than applying a fixed neighbor threshold, the optimal neighborhood size for each node is formalized as a supervised prediction problem. For node $i$, the model considers the top-$K_{\max}$ candidate neighbors (ranked by cosine similarity) and predicts a score $s_{ij}$ for each candidate $j$, approximating the likelihood that $j$ shares the same identity as $i$.
During inference, the predicted Top-K is obtained by thresholding:

$$\hat{k}_i = \max\{\, j \mid s_{ij} \ge \eta \,\},$$

where $\eta$ is typically set to 0.90.
The input to the predictor consists of the embedding sequence and corresponding positional or relative distance encodings. These are projected into queries, keys, and values for Transformer-based processing.
Training employs a binary cross-entropy loss:

$$\mathcal{L} = -\frac{1}{K_{\max}} \sum_{j=1}^{K_{\max}} \left[ y_{ij} \log s_{ij} + (1 - y_{ij}) \log(1 - s_{ij}) \right],$$

with $y_{ij} \in \{0, 1\}$ denoting the ground-truth same-identity indicator. Training samples primarily target a window around the true decision boundary to maintain class balance (Zhang et al., 27 Dec 2025).
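A minimal PyTorch sketch of the inference-time thresholding rule and the training objective, assuming the predictor outputs per-candidate probabilities (the function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def predict_top_k(scores: torch.Tensor, eta: float = 0.90) -> int:
    """Largest candidate rank whose predicted score clears eta.

    scores: (K_max,) tensor of per-candidate scores s_ij, ordered by
    cosine similarity to the anchor. Returns 0 if nothing clears eta.
    """
    hits = (scores >= eta).nonzero(as_tuple=True)[0]
    return int(hits.max().item()) + 1 if hits.numel() > 0 else 0

def bce_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy against same-identity indicators y_ij."""
    return F.binary_cross_entropy(scores, labels.float())
```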
3. Transformer-Based Predictors and the Sparse Differential Transformer (SDT)
The initial prediction framework employs a vanilla Transformer architecture (three layers, eight self-attention heads, hidden size 1024) with standard self-attention. However, the vanilla Transformer tends to overvalue irrelevant relationships, introducing noise into the neighborhood prediction.
Differential Self-Attention
The attention computation is refined by splitting queries and keys into two groups and applying a subtractive update:

$$\mathrm{DiffAttn}(X) = \left( \mathrm{softmax}\!\left( \frac{Q_1 K_1^\top}{\sqrt{d}} \right) - \lambda\, \mathrm{softmax}\!\left( \frac{Q_2 K_2^\top}{\sqrt{d}} \right) \right) V,$$

where $\mathrm{softmax}(\cdot)$ denotes the row-wise softmax and $\lambda$ is learnable.
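A single-head PyTorch sketch of this subtractive attention map; the shapes and the placement of the $\sqrt{d}$ scaling are standard choices, assumed here:

```python
import math
import torch

def diff_attention(q1, k1, q2, k2, v, lam):
    """Differential self-attention: subtract a second softmax map.

    q1, k1, q2, k2: (n, d) split query/key projections
    v: (n, d) values; lam: learnable scalar λ
    """
    d = q1.shape[-1]
    a1 = torch.softmax(q1 @ k1.transpose(-2, -1) / math.sqrt(d), dim=-1)
    a2 = torch.softmax(q2 @ k2.transpose(-2, -1) / math.sqrt(d), dim=-1)
    return (a1 - lam * a2) @ v
```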
Sparse Differential Transformer (SDT)
Sparsity is imposed by masking all but the Top-K entries per row of the attention logits:

$$\mathrm{SDTAttn}(X) = \left( \mathrm{softmax}\!\left( \mathcal{M}_K\!\left( \frac{Q_1 K_1^\top}{\sqrt{d}} \right) \right) - \lambda\, \mathrm{softmax}\!\left( \mathcal{M}_K\!\left( \frac{Q_2 K_2^\top}{\sqrt{d}} \right) \right) \right) V,$$

with $\mathcal{M}_K$ denoting the Top-K mask. This structure masks weak key–query pairs, focusing the model on the most informative local relationships and improving resilience to noise.
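A sketch of the Top-K sparsification, realized here by setting all non-Top-K logits in each row to $-\infty$ before the softmax, which is a common but assumed implementation of such masks; it reuses the single-head layout of the previous sketch:

```python
import math
import torch

def topk_mask_logits(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest logits per row; set the rest to -inf."""
    kth = logits.topk(k, dim=-1).values[..., -1:]  # k-th largest per row
    return logits.masked_fill(logits < kth, float("-inf"))

def sparse_diff_attention(q1, k1, q2, k2, v, lam, k):
    """Differential attention with per-row Top-K sparsification."""
    d = q1.shape[-1]
    l1 = topk_mask_logits(q1 @ k1.transpose(-2, -1) / math.sqrt(d), k)
    l2 = topk_mask_logits(q2 @ k2.transpose(-2, -1) / math.sqrt(d), k)
    return (torch.softmax(l1, dim=-1) - lam * torch.softmax(l2, dim=-1)) @ v
```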
Mixture-of-Experts SDT
MoE-SDT introduces a mixture over masks $\{\mathcal{M}_{K_m}\}_{m=1}^{M}$ (with distinct sparsity levels $K_1 < K_2 < \dots < K_M$) and combines their outputs using learnable weights $w_m$:

$$\mathrm{MoEAttn}(X) = \sum_{m=1}^{M} w_m\, \mathrm{SDTAttn}_{K_m}(X)$$
This construct further mitigates prediction errors near the Top-K boundary (Zhang et al., 27 Dec 2025).
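Assuming the experts differ only in their sparsity level $K_m$ and are combined via normalized learnable gates, the mixture can be sketched as follows (reusing `sparse_diff_attention` from the previous sketch; all names are illustrative):

```python
import torch

def moe_sdt_attention(q1, k1, q2, k2, v, lam, k_levels, gate_logits):
    """Mixture-of-Experts SDT: combine several Top-K sparsity levels.

    k_levels:    list of expert sparsity levels, e.g. [10, 20, 40]
    gate_logits: (len(k_levels),) learnable logits for the weights w_m
    """
    w = torch.softmax(gate_logits, dim=0)  # mixture weights sum to 1
    out = 0.0
    for w_m, k in zip(w, k_levels):
        out = out + w_m * sparse_diff_attention(q1, k1, q2, k2, v, lam, k)
    return out
```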
4. Integration Into Clustering Workflow
The integration of the prediction-driven Top-K Jaccard coefficient into face clustering follows these algorithmic steps:
- Extract face embeddings $f_i \in \mathbb{R}^d$ for all $N$ faces.
- For each node $i$, determine the top-$K_{\max}$ candidates via cosine similarity.
- Process the candidate sequence with the SDT predictor, yielding scores $s_{ij}$.
- Compute $\hat{k}_i = \max\{\, j \mid s_{ij} \ge \eta \,\}$.
- Round $\hat{k}_i$ to $\hat{h}_i$ (e.g., the nearest multiple of 10, up to $K_{\max}$).
- Build refined neighborhoods $\hat{\mathcal{N}}_i$ and compute $\tilde{p}_{ij}$.
- Threshold $\tilde{p}_{ij}$ at $\tau_{\text{edge}}$ to construct a sparse adjacency matrix.
- Apply the Map-Equation codec for community detection.
Pseudocode for the procedure is:
```python
for i in range(N):
    # Rank the K_max most similar faces to f_i by cosine similarity
    neighbors = topK_by_cosine(f[i], all_f, K_max)
    # Predict per-candidate same-identity scores with the SDT
    scores = SDT_predictor(concat(f[i], neighbors))
    # Largest candidate index whose score clears the threshold η
    top_k = max(j for j in range(K_max) if scores[j] >= eta)
    h = round_to_10(top_k) if top_k < K_max else K_max
    refined = neighbors[:h]
    # Accumulate p̂_ij into the weighted-Jaccard numerator/denominator
    for j in refined:
        accumulate_jaccard(i, j, p_hat(i, j))

# Keep edges whose Top-K Jaccard probability p̃_ij exceeds τ_edge
adjacency = build_edges(p_tilde, tau_edge)
clusters = MapEquation(adjacency)
```
5. Empirical Results and Robustness
Evaluation on several benchmarks demonstrates the efficacy of the prediction-driven Top-K Jaccard approach.
MS-Celeb-1M Clustering:
| Method | $F_P$ (584K) | $F_B$ (584K) | $F_P$ (5.21M) | $F_B$ (5.21M) |
|---|---|---|---|---|
| FC-ESER | 95.28 | 93.85 | 89.40 | 88.80 |
| Diff-Cluster | 95.46 | 94.14 | 90.08 | 89.14 |
The sigmoid distance-to-probability transform yields +0.18% $F_P$ and +0.16% $F_B$ over the exponential variant; adding prediction-driven Top-K filtering yields a further +0.35% $F_P$ and +0.46% $F_B$.
Transformer Variant Ablations:
| Architecture | $F_P$ | $F_B$ |
|---|---|---|
| Vanilla Transformer | 94.25 | 92.73 |
| Vanilla+Top-K Mask | 94.78 | 93.25 |
| Differential (no mask) | 95.05 | 93.59 |
| Differential+Top-K (SDT) | 95.34 | 93.93 |
| MoE-SDT | 95.46 | 94.14 |
Experiments with noisy similarity matrices simulate up to 40% corrupted entries: the SDT maintains its pairwise F-score, while the vanilla Transformer degrades significantly.
Generalization evaluations confirm improvements across MSMT17 (person re-ID) and DeepFashion, as well as gains from substituting SDT into contemporary clustering pipelines (Zhang et al., 27 Dec 2025).
6. Impact and Significance
The prediction-driven Top-K Jaccard similarity coefficient constitutes an advancement in clustering methodology by dynamically adapting local connectivity based on learned data relationships rather than static heuristics. By defining neighborhood boundaries as a supervised prediction problem and employing SDT for robust selection, the approach produces neighborhood sets with enhanced purity, resulting in more reliable graph-based similarity estimation.
Empirical evidence establishes gains in overall clustering accuracy, robustness to random noise in similarity calculations, and improved generalization to domains beyond face clustering. A plausible implication is that prediction-driven Top-K filtering can be fruitfully adapted to related graph construction problems in metric learning and instance-level retrieval beyond the specific setting investigated.
In summary, this method advances the state of the art in graph-based face clustering with potential for transferability across adjacent domains (Zhang et al., 27 Dec 2025).