DINO-vMF: Self-Supervised vMF Learning
- DINO-vMF is a self-supervised framework that extends DINO by modeling assignment probabilities with a full von Mises–Fisher mixture, enabling per-cluster precision.
- It introduces a principled log-normalizer correction to the softmax logits, allowing prototype norms to be learned while keeping training stable at scale.
- Empirical results demonstrate that DINO-vMF outperforms standard DINO and iBOT on image classification and downstream tasks, with richer cluster utilization.
DINO-vMF is a self-supervised representation learning framework that generalizes DINO by reinterpreting its assignment probabilities as arising from a full von Mises–Fisher (vMF) mixture model, thereby offering adaptive per-cluster precision and improved prototype utilization. It introduces a principled log-normalizer correction to the softmax logits, allowing the system to learn per-cluster sharpness and stabilizing training at scale, particularly with unnormalized prototypes in large models. Empirical evidence demonstrates that DINO-vMF yields superior representation quality across a broad suite of downstream tasks, consistently outperforming both original DINO and other DINO-derived variants such as iBOT (Govindarajan et al., 2024).
1. Probabilistic Interpretation as von Mises–Fisher Mixtures
Standard DINO computes assignment probabilities via the softmax of dot products between normalized sample embeddings and normalized prototypes. Formally, for $\ell_2$-normalized representations $y$ and prototypes $w_k$, the assignment probability is:

$$p(k \mid y) = \frac{\exp\left(w_k^\top y / \tau\right)}{\sum_{j=1}^{K} \exp\left(w_j^\top y / \tau\right)}$$

where $\tau$ is the sharpening temperature. When $\|w_k\| = 1$, this softmax can be interpreted as representing the posterior of a uniform mixture of von Mises–Fisher (vMF) distributions on the unit sphere:

$$p(k \mid y) = \frac{\exp\left(\kappa\, \mu_k^\top y\right)}{\sum_{j=1}^{K} \exp\left(\kappa\, \mu_j^\top y\right)}$$

with $\mu_k = w_k / \|w_k\|$ and $\kappa = 1/\tau$.
However, this formulation assumes all clusters share an equal, fixed concentration (precision) $\kappa = 1/\tau$, dictated solely by the temperature, thereby restricting the expressive capacity of the mixture model. In the general vMF mixture, each component would have its own precision $\kappa_k$, and the responsibilities take the form:

$$p(k \mid y) = \frac{C_d(\kappa_k) \exp\left(\kappa_k\, \mu_k^\top y\right)}{\sum_{j=1}^{K} C_d(\kappa_j) \exp\left(\kappa_j\, \mu_j^\top y\right)}$$

where $C_d(\kappa) = \dfrac{\kappa^{d/2-1}}{(2\pi)^{d/2}\, I_{d/2-1}(\kappa)}$ is the normalization constant for the vMF density. DINO omits this term, implicitly limiting the model.
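The contrast between the two posteriors can be made concrete in NumPy. The sketch below is illustrative (function names, shapes, and the default temperature are not from the paper); the `log_C` argument stands for the per-cluster log-normalizer $\log C_d(\kappa_k)$ discussed in the next section:

```python
import numpy as np

def dino_posterior(y, W, tau=0.1):
    """Standard DINO: softmax of dot products, shared precision 1/tau."""
    logits = W @ y / tau              # (K,)
    logits -= logits.max()            # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def vmf_posterior(y, W, tau=0.1, log_C=None):
    """General vMF mixture: per-cluster kappa_k = ||w_k|| / tau.
    If log_C (the per-cluster log-normalizer) is given, it is added
    to the logits, restoring fidelity to the mixture model."""
    norms = np.linalg.norm(W, axis=1)
    kappa = norms / tau               # per-cluster precision (K,)
    mu = W / norms[:, None]           # unit mean directions (K, d)
    logits = kappa * (mu @ y)         # kappa_k * mu_k^T y
    if log_C is not None:
        logits = logits + log_C
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()
```

With unit-norm prototypes and no log-normalizer, the two functions coincide, mirroring the interpretation above.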
2. Log-Normalizer Correction and Assignment Probabilities
DINO-vMF restores fidelity to the mixture model by adding $\log C_d(\kappa_k)$ to each assignment logit. Specifically, for each prototype,

$$\kappa_k = \frac{\|w_k\|}{\tau}$$

and

$$\log C_d(\kappa_k) = \left(\tfrac{d}{2} - 1\right) \log \kappa_k - \tfrac{d}{2} \log(2\pi) - \log I_{d/2-1}(\kappa_k),$$

where $I_\nu$ is the modified Bessel function of the first kind and $d$ is the embedding dimensionality.

The assignment probabilities become:

$$P_s(k \mid y) = \frac{\exp\left(w_k^\top y / \tau + \log C_d(\kappa_k)\right)}{\sum_{j=1}^{K} \exp\left(w_j^\top y / \tau + \log C_d(\kappa_j)\right)}$$

for the student, and similarly for the teacher branch, adjusted by a centering bias $c_k$. This correction guarantees that any increase in a prototype norm (i.e., cluster precision) must be justified by a commensurate fit to data points, enforcing balanced cluster assignment and preventing inflated assignment probabilities due to norm scaling alone.
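The log-normalizer never requires an exact Bessel evaluation in practice. The pure-Python sketch below keeps only the leading term of the uniform asymptotic expansion of $\log I_\nu$ for large order $\nu = d/2 - 1$; the exact truncation used by the authors may differ:

```python
import math

def approx_log_vmf_normalizer(kappa: float, d: int) -> float:
    """Approximate log C_d(kappa) = (d/2 - 1) log kappa - (d/2) log(2*pi)
                                    - log I_{d/2-1}(kappa),
    with log I_nu(kappa) from the leading term of its uniform asymptotic
    expansion for large order nu (accurate when d is large)."""
    nu = d / 2.0 - 1.0
    t = kappa / nu
    s = math.sqrt(1.0 + t * t)
    # log I_nu(kappa) ~ nu*eta - 0.5*log(2*pi*nu) - 0.25*log(1 + t^2),
    # with eta = sqrt(1 + t^2) + log(t) - log(1 + sqrt(1 + t^2))
    eta = s + math.log(t) - math.log(1.0 + s)
    log_bessel = nu * eta - 0.5 * math.log(2.0 * math.pi * nu) \
        - 0.25 * math.log(1.0 + t * t)
    return (d / 2.0 - 1.0) * math.log(kappa) \
        - (d / 2.0) * math.log(2.0 * math.pi) - log_bessel
```

Since $\log C_d(\kappa)$ decreases as $\kappa$ grows, adding it to the logits is what penalizes prototypes that inflate their norms without a compensating improvement in fit.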
3. Algorithmic Modifications
DINO-vMF deviates from DINO in three key prediction head aspects:
- Prototype Norms: Prototypes are not $\ell_2$-normalized; their learned magnitudes set per-cluster precisions $\kappa_k = \|w_k\|/\tau$.
- Adaptive $\kappa_k$ and Log-Normalizer: Each cluster learns a separate $\kappa_k$, and $\log C_d(\kappa_k)$ (approximated via the large-order asymptotic expansion of $I_\nu$) is added to the logits.
- Augmented Softmax: The log-normalizer correction is included in both student and teacher softmax calculations.
All other architectural and optimization settings remain as in standard DINO/iBOT, with DINO-vMF not requiring forced normalization on final-layer prototypes to maintain training stability, even for large ViT-Base backbones.
Training Workflow Pseudocode
```
For each batch {x_b}:
    Obtain augmented views x_s, x_t
    Compute y_s = normalize(f(x_s)), y_t = normalize(f(x_t))
    For each prototype w_k:
        kappa_k = norm(w_k) / tau
        log_C_k = approximate_log_vMF_normalizer(kappa_k)
        l_s[k] = (w_k^T y_s) / tau + log_C_k         # student logits
        l_t[k] = (w_k^T y_t - c_k) / tau_t + log_C_k # teacher logits
    P_s = softmax(l_s)
    P_t = softmax(l_t)
    loss = -sum_k P_t[k] * log P_s[k]
    Backpropagate, update student/teacher params and centering
```
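A single evaluation of this loss can be sketched in NumPy. This is a toy, single-sample sketch under the definitions above; the log-normalizer keeps only the leading asymptotic term, and all names and default temperatures are illustrative rather than the paper's settings:

```python
import numpy as np

def dino_vmf_loss(y_s, y_t, W, tau=0.1, tau_t=0.04, center=None):
    """Toy single-sample DINO-vMF cross-entropy.
    y_s, y_t: L2-normalized student/teacher embeddings, shape (d,)
    W:        unnormalized prototypes, shape (K, d)
    center:   teacher centering vector c, shape (K,)"""
    K, d = W.shape
    kappa = np.linalg.norm(W, axis=1) / tau          # per-cluster precision
    # leading term of the uniform asymptotic expansion of log I_nu(kappa)
    nu = d / 2.0 - 1.0
    t = kappa / nu
    s = np.sqrt(1.0 + t ** 2)
    log_bessel = nu * (s + np.log(t) - np.log(1.0 + s)) \
        - 0.5 * np.log(2.0 * np.pi * nu) - 0.25 * np.log(1.0 + t ** 2)
    log_C = (d / 2.0 - 1.0) * np.log(kappa) \
        - (d / 2.0) * np.log(2.0 * np.pi) - log_bessel
    if center is None:
        center = np.zeros(K)
    l_s = W @ y_s / tau + log_C                      # student logits
    l_t = (W @ y_t - center) / tau_t + log_C         # teacher logits
    P_s = np.exp(l_s - l_s.max()); P_s /= P_s.sum()  # stable softmax
    P_t = np.exp(l_t - l_t.max()); P_t /= P_t.sum()
    return -np.sum(P_t * np.log(P_s + 1e-12))        # cross-entropy
```

In a real implementation the backbone, EMA teacher update, and centering update surround this computation; only the logit construction differs from vanilla DINO.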
4. Practical Implementation Considerations
- Concentration Parameterization: All $\kappa_k$ are learned via prototype norms, removing the need for additional hyperparameters. Tying all $\kappa_k$ to a global scalar remains possible but is not required; per-cluster $\kappa_k$ enhances representational flexibility.
- Log-Normalizer Approximation: The uniform asymptotic expansion of the modified Bessel function $I_\nu$ is used for efficient and accurate calculation, with negligible error (≤0.01) for embedding dimensionalities typical of DINO prediction heads.
- Optimization and Hyperparameters: Temperatures, optimizer (AdamW), weight decay, multi-crop strategy, batch size, and learning rates replicate those from the DINO/iBOT canonical setups. The removal of $\ell_2$-normalization on prototypes is central to the method’s stability and generality.
5. Empirical Performance and Cluster Utilization
Extensive pre-training on ImageNet-1K followed by evaluation on multiple downstream tasks demonstrates robust improvements:
ImageNet Classification
| Method | kNN | Linear ([CLS]) | Linear (DINO) |
|---|---|---|---|
| DINO | 76.11 | 77.88 | 78.17 |
| DINO-vMF | 77.40 | 78.67 | 78.81 |
| iBOT | 77.10 | 79.33 | 79.50 |
| iBOT-vMF | 78.66 | 80.20 | 80.27 |
Few-Shot and Low-Data Regimes
| Pretrain | 1-shot | 2-shot | 5-shot | 1% data |
|---|---|---|---|---|
| DINO | 41.8 | 51.9 | 61.4 | 67.2 |
| DINO-vMF | 50.3 | 59.3 | 66.1 | 70.4 |
| iBOT | 46.0 | 56.0 | 64.7 | 69.9 |
| iBOT-vMF | 51.6 | 61.1 | 68.3 | 72.3 |
Prototype Utilization
Standard DINO with ViT-Base typically collapses most prototypes into a “void” cluster: approximately 83% of prototypes are nearly identical (pairwise cosine similarity > 0.9), limiting effective cluster usage. DINO-vMF, in contrast, utilizes approximately 900 out of 65,536 prototypes distinctly and evenly, indicating that more clusters contribute to the representation and yielding richer, more informative embeddings.
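Such utilization can be probed directly from the learned prototype matrix. The diagnostic below is a hypothetical sketch (not from the paper): it greedily groups prototypes whose directions nearly coincide and counts the resulting groups:

```python
import numpy as np

def count_distinct_prototypes(W, sim_threshold=0.9):
    """Greedy count of distinct prototype directions: a prototype joins an
    existing group if its cosine similarity to that group's representative
    exceeds sim_threshold; otherwise it starts a new group."""
    mu = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit directions
    reps = []
    for v in mu:
        if not any(v @ r > sim_threshold for r in reps):
            reps.append(v)
    return len(reps)
```

A collapsed head (most prototypes pointing one way) yields a count near 1, while a well-utilized head yields a count close to the number of prototypes.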
Downstream Transfer
On an aggregated suite of nine small benchmarks (Aircraft, Caltech101, CIFAR-10/100, DTD, Flowers, Food, Pets, SUN397), as well as tasks in image retrieval (Oxford, Paris) and video object segmentation (DAVIS), DINO-vMF and iBOT-vMF consistently match or exceed the performance of their vanilla counterparts, often by 0.5–2.0 percentage points.
6. Theoretical and Practical Implications
Including $\log C_d(\kappa_k)$ in the logits enforces that a prototype cannot simply boost its cluster’s assignment likelihood by increasing its norm (and thus concentration) without compacting the assigned examples. This induces a trade-off between cluster tightness and coverage in the learned latent space, which mitigates prototype collapse (where most samples are assigned to very few clusters) and yields more semantically meaningful, widely distributed clusters.
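This trade-off can be checked numerically. The self-contained sketch below uses the leading-term asymptotic approximation of $\log C_d(\kappa)$; the alignment values are illustrative:

```python
import math

def approx_log_vmf_normalizer(kappa: float, d: int) -> float:
    """Leading-term asymptotic approximation of log C_d(kappa)."""
    nu = d / 2.0 - 1.0
    t = kappa / nu
    s = math.sqrt(1.0 + t * t)
    eta = s + math.log(t) - math.log(1.0 + s)
    log_bessel = nu * eta - 0.5 * math.log(2.0 * math.pi * nu) \
        - 0.25 * math.log(1.0 + t * t)
    return (d / 2.0 - 1.0) * math.log(kappa) \
        - (d / 2.0) * math.log(2.0 * math.pi) - log_bessel

def corrected_logit(kappa: float, alignment: float, d: int) -> float:
    """Corrected logit of one prototype: kappa * mu^T y + log C_d(kappa).
    `alignment` plays the role of mu^T y, the cosine between the embedding
    and the prototype direction."""
    return kappa * alignment + approx_log_vmf_normalizer(kappa, d)
```

For a poorly aligned cluster, raising $\kappa$ lowers the corrected logit even though the uncorrected term $\kappa\, \mu^\top y$ keeps growing; for a well-aligned cluster, higher precision is still rewarded.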
A plausible implication is that removing the forced normalization and introducing vMF-based correction enables representation learning frameworks to scale robustly to wider, deeper models as well as higher numbers of clusters, with minimal risk of mode collapse or overfitting.
7. Summary and Impact
DINO-vMF provides a drop-in improvement over DINO and DINO-derived methods (e.g., iBOT) by explicitly aligning self-supervised learning assignments with the vMF mixture model. The algorithmic changes are concise: remove forced prototype normalization and add the (approximated) log-normalizer term to each logit. The impact is an enriched latent structure, stable and scalable training, and superior downstream transfer, particularly when evaluated on challenging or data-scarce settings. These advances highlight the significance of principled probabilistic corrections to clustering dynamics in self-distillation-based representation learning paradigms (Govindarajan et al., 2024).