
DINO-vMF: Self-Supervised vMF Learning

Updated 5 February 2026
  • DINO-vMF is a self-supervised framework that extends DINO by modeling assignment probabilities with a full von Mises–Fisher mixture, enabling per-cluster precision.
  • It introduces a principled log-normalization correction to softmax logits, allowing learned prototype norms and stable training at scale.
  • Empirical results demonstrate that DINO-vMF outperforms standard DINO and iBOT in image classification and downstream tasks, with enriched cluster utilization.

DINO-vMF is a self-supervised representation learning framework that generalizes DINO by reinterpreting its assignment probabilities as arising from a full von Mises–Fisher (vMF) mixture model, thereby offering adaptive per-cluster precision and improved prototype utilization. It introduces a principled log-normalization correction to the softmax logits, allowing the system to learn per-cluster sharpness and ensuring stable training at scale, particularly with unnormalized prototypes in large models. Empirical evidence shows that DINO-vMF yields superior representation quality across a broad suite of downstream tasks, consistently outperforming both the original DINO and DINO-derived variants such as iBOT (Govindarajan et al., 2024).

1. Probabilistic Interpretation as von Mises–Fisher Mixtures

Standard DINO computes assignment probabilities via the softmax of dot products between normalized sample embeddings and normalized prototypes. Formally, for $\ell_2$-normalized representations $y_i$ and prototypes $w_k$, the assignment probability is:

$$P(k \mid x_i) = \frac{\exp(w_k^T y_i / \tau)}{\sum_{j=1}^{K} \exp(w_j^T y_i / \tau)},$$

where $\tau$ is the sharpening temperature. When $\|w_k\| = 1$, this softmax can be interpreted as the posterior of a uniform mixture of von Mises–Fisher (vMF) distributions on the unit sphere:

$$P(k \mid x_i) = \frac{\exp(\kappa\, \mu_k^T y_i)}{\sum_j \exp(\kappa\, \mu_j^T y_i)}$$

with $\kappa = 1/\tau$ and $\mu_k = w_k / \|w_k\|$.

However, this formulation assumes all clusters share an equal, fixed concentration (precision) $\kappa$, dictated solely by the temperature, which restricts the expressive capacity of the mixture model. In a general vMF mixture, each component $k$ has its own precision $\kappa_k$, and the responsibilities take the form:

$$r_i^{(k)} = \frac{\pi_k\, C_p(\kappa_k)\, \exp(\kappa_k\, \mu_k^T y_i)}{\sum_j \pi_j\, C_p(\kappa_j)\, \exp(\kappa_j\, \mu_j^T y_i)},$$

where $C_p(\kappa)$ is the normalization constant of the vMF density. DINO omits this term, implicitly limiting the model.
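As a concrete illustration, these responsibilities can be computed directly with NumPy/SciPy. This is a toy sketch with random data and illustrative function names, not the authors' implementation; `scipy.special.ive` (the exponentially scaled Bessel function) is used for numerical stability:

```python
import numpy as np
from scipy.special import ive  # ive(v, x) = I_v(x) * exp(-x)

def log_vmf_normalizer(kappa, p):
    """log C_p(kappa) for the vMF density on the unit sphere in R^p.

    Uses log I_v(x) = log(ive(v, x)) + x to avoid overflow/underflow.
    """
    v = p / 2 - 1
    return v * np.log(kappa) - (p / 2) * np.log(2 * np.pi) - (np.log(ive(v, kappa)) + kappa)

def responsibilities(ys, mus, kappas, pis):
    """r_i^(k) ∝ pi_k * C_p(kappa_k) * exp(kappa_k * mu_k^T y_i)."""
    p = ys.shape[-1]
    logits = np.log(pis) + log_vmf_normalizer(kappas, p) + kappas * (ys @ mus.T)
    logits -= logits.max(axis=-1, keepdims=True)   # stabilize the softmax
    r = np.exp(logits)
    return r / r.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
p, K = 8, 4
mus = rng.normal(size=(K, p)); mus /= np.linalg.norm(mus, axis=1, keepdims=True)
ys = rng.normal(size=(5, p)); ys /= np.linalg.norm(ys, axis=1, keepdims=True)
r = responsibilities(ys, mus, kappas=rng.uniform(1, 20, size=K), pis=np.full(K, 1 / K))
```

Each row of `r` is a proper distribution over the $K$ components; dropping the `log_vmf_normalizer` term recovers DINO's uncorrected softmax.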

2. Log-Normalizer Correction and Assignment Probabilities

DINO-vMF restores fidelity to the mixture model by adding $\log C_p(\kappa_k)$ to each assignment logit. Specifically, for each prototype,

$$\kappa_k = \frac{\|w_k\|}{\tau}$$

and

$$\log C_p(\kappa_k) = \left(\frac{p}{2} - 1\right) \log \kappa_k - \frac{p}{2} \log(2\pi) - \log I_{p/2-1}(\kappa_k),$$

where $I_{\nu}$ is the modified Bessel function of the first kind and $p$ is the embedding dimensionality.

The assignment probabilities become:

$$P_s(k \mid x_i) \propto \exp\!\left(\frac{w_k^T y_i}{\tau} + \log C_p(\kappa_k)\right)$$

for the student, and similarly for the teacher branch, adjusted by a centering bias $c_k$. This correction ensures that any increase in a prototype norm (i.e., cluster precision) must be justified by a commensurately tighter fit to the assigned data points, enforcing balanced cluster assignment and preventing assignment probabilities from being inflated by norm scaling alone.
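A minimal sketch of the corrected probabilities (hypothetical helper names, not the paper's code; `scipy.special.ive` provides a numerically stable Bessel evaluation):

```python
import numpy as np
from scipy.special import ive  # ive(v, x) = I_v(x) * exp(-x)

def log_C_p(kappa, p):
    """log C_p(kappa) from the expression above; log I_v(x) = log(ive(v, x)) + x."""
    v = p / 2 - 1
    return v * np.log(kappa) - (p / 2) * np.log(2 * np.pi) - (np.log(ive(v, kappa)) + kappa)

def assignment_probs(w, y, tau, center=0.0):
    """P(k|x) ∝ exp((w_k^T y - c_k) / tau + log C_p(||w_k|| / tau)).

    `center` is the teacher's centering bias c_k (0 for the student branch).
    """
    kappa = np.linalg.norm(w, axis=1) / tau
    logits = (w @ y - center) / tau + log_C_p(kappa, w.shape[1])
    logits -= logits.max()                      # stabilize the softmax
    e = np.exp(logits)
    return e / e.sum()

rng = np.random.default_rng(0)
K, p = 6, 32
w = rng.normal(size=(K, p))                     # unnormalized prototypes
y = rng.normal(size=p); y /= np.linalg.norm(y)  # unit embedding
P_s = assignment_probs(w, y, tau=0.1)
```

When all prototype norms are equal, the $\log C_p$ term is constant across clusters and cancels in the softmax, recovering standard DINO probabilities.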

3. Algorithmic Modifications

DINO-vMF deviates from DINO in three key aspects of the prediction head:

  • Prototype Norms: Prototypes $w_k$ are not $\ell_2$-normalized; their learned magnitudes set per-cluster precisions.
  • Adaptive $\kappa_k$ and Log-Normalizer: Each cluster $k$ learns a separate $\kappa_k = \|w_k\| / \tau$, and $\log C_p(\kappa_k)$ (approximated via the large-$p$ expansion) is added to the logits.
  • Augmented Softmax: The log-normalizer correction is included in both the student and teacher softmax calculations.

All other architectural and optimization settings remain as in standard DINO/iBOT, with DINO-vMF not requiring forced normalization on final-layer prototypes to maintain training stability, even for large ViT-Base backbones.

Training Workflow Pseudocode

For each batch {x_b}:
    Obtain augmented views x_s (student) and x_t (teacher)
    Compute y_s = normalize(f_s(x_s)), y_t = normalize(f_t(x_t))
    For each prototype w_k:
        kappa_k = norm(w_k) / tau
        log_C_p = approximate_log_vMF_normalizer(kappa_k)
        l_s[k] = (w_k^T y_s) / tau + log_C_p          # student logits
        l_t[k] = (w_k^T y_t - c_k) / tau_t + log_C_p  # teacher logits (centered)
    P_s = softmax(l_s)
    P_t = softmax(l_t)
    loss = -sum_k P_t[k] * log P_s[k]                 # cross-entropy, teacher as target
    Update the student by backpropagation; update the teacher as an
    exponential moving average of the student; update the centering c_k
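The loop above can be sketched as runnable NumPy. This is a toy rendition (random tensors, no backpropagation or EMA updates; the helper names and hyperparameter values are illustrative, not the paper's):

```python
import numpy as np
from scipy.special import ive  # ive(v, x) = I_v(x) * exp(-x)

def log_C_p(kappa, p):
    v = p / 2 - 1
    return v * np.log(kappa) - (p / 2) * np.log(2 * np.pi) - (np.log(ive(v, kappa)) + kappa)

def vmf_logits(w, y, tau, center=0.0):
    """(w_k^T y - c_k) / tau + log C_p(||w_k|| / tau), vectorized over a batch."""
    kappa = np.linalg.norm(w, axis=1) / tau
    return (y @ w.T - center) / tau + log_C_p(kappa, w.shape[1])

def softmax(l):
    l = l - l.max(axis=-1, keepdims=True)
    e = np.exp(l)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
B, p, K = 4, 16, 32
w = rng.normal(size=(K, p))                            # unnormalized prototypes
y_s = rng.normal(size=(B, p)); y_s /= np.linalg.norm(y_s, axis=1, keepdims=True)
y_t = y_s + 0.05 * rng.normal(size=(B, p))             # teacher sees a nearby view
y_t /= np.linalg.norm(y_t, axis=1, keepdims=True)
c = np.zeros(K)                                        # centering bias (EMA in practice)

P_s = softmax(vmf_logits(w, y_s, tau=0.1))             # student
P_t = softmax(vmf_logits(w, y_t, tau=0.04, center=c))  # teacher, centered and sharper
loss = -(P_t * np.log(P_s + 1e-12)).sum(axis=1).mean() # cross-entropy loss
```

In a real training step the loss would be backpropagated through the student only, with the teacher tracked as an EMA, exactly as in DINO.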

4. Practical Implementation Considerations

  • Concentration Parameterization: All $\kappa_k$ are learned via prototype norms, removing the need for additional hyperparameters. Tying all $\kappa_k$ to a global scalar remains possible but is not required; per-cluster $\kappa_k$ enhances representational flexibility.
  • Log-Normalizer Approximation: The uniform asymptotic expansion of the Bessel function $I_\nu(\nu r)$ is used for efficient and accurate computation, with negligible error (≤ 0.01) for typical embedding sizes ($p = 256$, $\nu = 127$).
  • Optimization and Hyperparameters: Temperatures, optimizer (AdamW), weight decay, multi-crop strategy, batch size, and learning rates replicate the canonical DINO/iBOT setups. The removal of $\ell_2$-normalization on prototypes is central to the method's stability and generality.
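To see why the approximation is safe at these sizes, one can compare the leading term of the uniform asymptotic expansion of $\log I_\nu(\nu z)$ against SciPy's exact (scaled) Bessel function at $\nu = 127$. This check uses only the leading term, which may differ from the exact approximation order used in the paper:

```python
import numpy as np
from scipy.special import ive  # ive(v, x) = I_v(x) * exp(-x)

def log_bessel_approx(v, x):
    """Leading term of the uniform asymptotic expansion of log I_v(v*z), z = x/v."""
    z = x / v
    s = np.sqrt(1.0 + z * z)
    eta = s + np.log(z / (1.0 + s))
    return v * eta - 0.5 * np.log(2 * np.pi * v) - 0.25 * np.log(1.0 + z * z)

def log_bessel_exact(v, x):
    """Exact log I_v(x) via the exponentially scaled Bessel function."""
    return np.log(ive(v, x)) + x

nu = 127.0                          # p = 256  ->  nu = p/2 - 1 = 127
errors = [abs(log_bessel_approx(nu, x) - log_bessel_exact(nu, x))
          for x in (10.0, 50.0, 127.0, 500.0)]
```

Across this range of concentrations the absolute error in the log-normalizer stays well below the 0.01 figure quoted above.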

5. Empirical Performance and Cluster Utilization

Extensive pre-training on ImageNet-1K followed by evaluation on multiple downstream tasks demonstrates robust improvements:

ImageNet Classification

| Method   | kNN   | Linear ([CLS]) | Linear (DINO) |
|----------|-------|----------------|---------------|
| DINO     | 76.11 | 77.88          | 78.17         |
| DINO-vMF | 77.40 | 78.67          | 78.81         |
| iBOT     | 77.10 | 79.33          | 79.50         |
| iBOT-vMF | 78.66 | 80.20          | 80.27         |

Few-Shot and Low-Data Regimes

| Method   | 1-shot | 2-shot | 5-shot | 1% data |
|----------|--------|--------|--------|---------|
| DINO     | 41.8   | 51.9   | 61.4   | 67.2    |
| DINO-vMF | 50.3   | 59.3   | 66.1   | 70.4    |
| iBOT     | 46.0   | 56.0   | 64.7   | 69.9    |
| iBOT-vMF | 51.6   | 61.1   | 68.3   | 72.3    |

Prototype Utilization

With a ViT-Base backbone, standard DINO typically learns a “void” cluster that absorbs approximately 83% of prototypes (pairwise cosine similarity > 0.9), limiting effective cluster usage. DINO-vMF, in contrast, utilizes approximately 900 of its 65,536 prototypes distinctly and evenly, indicating that more clusters contribute to the representation and yielding richer, more informative embeddings.

Downstream Transfer

On an aggregated suite of nine small benchmarks (Aircraft, Caltech101, CIFAR-10/100, DTD, Flowers, Food, Pets, SUN397), as well as tasks in image retrieval (Oxford, Paris) and video object segmentation (DAVIS), DINO-vMF and iBOT-vMF consistently match or exceed the performance of their vanilla counterparts, often by 0.5–2.0 percentage points.

6. Theoretical and Practical Implications

Including $\log C_p(\kappa_k)$ in the logits ensures that a prototype cannot boost its cluster’s assignment likelihood simply by increasing its norm (and thus its concentration) without also compacting the examples assigned to it. This induces a trade-off between cluster tightness and coverage in the learned latent space, which mitigates prototype collapse (where most samples are assigned to very few clusters) and yields more semantically meaningful, widely distributed clusters.
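A toy numeric illustration of this trade-off (entirely hypothetical setup): scale one prototype's norm while holding its direction fixed, and compare its assignment probability with and without the log-normalizer term.

```python
import numpy as np
from scipy.special import ive  # ive(v, x) = I_v(x) * exp(-x)

def log_C_p(kappa, p):
    v = p / 2 - 1
    return v * np.log(kappa) - (p / 2) * np.log(2 * np.pi) - (np.log(ive(v, kappa)) + kappa)

def cluster0_prob(w, y, tau, corrected):
    """Softmax probability of cluster 0, with or without the log C_p term."""
    logits = (w @ y) / tau
    if corrected:
        logits = logits + log_C_p(np.linalg.norm(w, axis=1) / tau, w.shape[1])
    logits -= logits.max()
    e = np.exp(logits)
    return (e / e.sum())[0]

rng = np.random.default_rng(0)
p, K, tau = 64, 8, 0.1
y = rng.normal(size=p); y /= np.linalg.norm(y)
w = rng.normal(size=(K, p)); w /= np.linalg.norm(w, axis=1, keepdims=True)
# Give prototype 0 a fixed, modest alignment with y (cosine 0.2).
u = w[0] - (w[0] @ y) * y; u /= np.linalg.norm(u)
w[0] = 0.2 * y + np.sqrt(1 - 0.2**2) * u

p_plain, p_corrected = [], []
for scale in (1.0, 2.0, 4.0):
    w_s = w.copy(); w_s[0] *= scale        # inflate prototype 0's norm only
    p_plain.append(cluster0_prob(w_s, y, tau, corrected=False))
    p_corrected.append(cluster0_prob(w_s, y, tau, corrected=True))
```

Without the correction, prototype 0's probability climbs steadily as its norm grows even though its fit to the point is fixed; with the correction, the growing $-\log I_\nu$ penalty caps that gain.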

A plausible implication is that removing the forced normalization and introducing vMF-based correction enables representation learning frameworks to scale robustly to wider, deeper models as well as higher numbers of clusters, with minimal risk of mode collapse or overfitting.

7. Summary and Impact

DINO-vMF provides a drop-in improvement over DINO and DINO-derived methods (e.g., iBOT) by explicitly aligning self-supervised learning assignments with the vMF mixture model. The algorithmic changes are concise: remove forced prototype normalization and add the (approximated) log-normalizer term to each logit. The impact is an enriched latent structure, stable and scalable training, and superior downstream transfer, particularly when evaluated on challenging or data-scarce settings. These advances highlight the significance of principled probabilistic corrections to clustering dynamics in self-distillation-based representation learning paradigms (Govindarajan et al., 2024).
