DINO-vMF: Self-Supervised vMF Learning
- DINO-vMF is a self-supervised framework that extends DINO by modeling assignment probabilities with a full von Mises–Fisher mixture, enabling per-cluster precision.
- It introduces a principled log-normalizer correction to the softmax logits, allowing prototype norms to be learned while keeping training stable at scale.
- Empirical results demonstrate that DINO-vMF outperforms standard DINO and iBOT on image classification and downstream tasks, with richer cluster utilization.
DINO-vMF is a self-supervised representation learning framework that generalizes DINO by reinterpreting its assignment probabilities as arising from a full von Mises–Fisher (vMF) mixture model, thereby offering adaptive per-cluster precision and improved prototype utilization. It introduces a principled log-normalizer correction to the softmax logits, allowing the system to learn per-cluster sharpness and stabilizing training at scale, particularly with unnormalized prototypes in large models. Empirical evidence demonstrates that DINO-vMF yields superior representation quality across a broad suite of downstream tasks, consistently outperforming both original DINO and other DINO-derived variants such as iBOT (Govindarajan et al., 2024).
1. Probabilistic Interpretation as von Mises–Fisher Mixtures
Standard DINO computes assignment probabilities via the softmax of dot products between normalized sample embeddings and normalized prototypes. Formally, for $\ell_2$-normalized representations $y$ and prototypes $w_k$, the assignment probability is:

$$p(k \mid y) = \frac{\exp\left(w_k^\top y / \tau\right)}{\sum_{j=1}^{K} \exp\left(w_j^\top y / \tau\right)}$$

where $\tau$ is the sharpening temperature. When $\|w_k\| = 1$, this softmax can be interpreted as representing the posterior of a uniform mixture of von Mises–Fisher (vMF) distributions on the unit sphere:

$$p(k \mid y) = \frac{\exp\left(\kappa\, \mu_k^\top y\right)}{\sum_{j=1}^{K} \exp\left(\kappa\, \mu_j^\top y\right)}$$

with $\mu_k = w_k / \|w_k\|$ and $\kappa = 1/\tau$.
However, this formulation assumes all clusters share an equal, fixed concentration (precision) $\kappa = 1/\tau$, dictated solely by the temperature, thereby restricting the expressive capacity of the mixture model. In the general vMF mixture, each component would have its own precision $\kappa_k$, and the responsibilities take the form:

$$p(k \mid y) = \frac{C_d(\kappa_k) \exp\left(\kappa_k\, \mu_k^\top y\right)}{\sum_{j=1}^{K} C_d(\kappa_j) \exp\left(\kappa_j\, \mu_j^\top y\right)}$$

where $C_d(\kappa) = \dfrac{\kappa^{d/2-1}}{(2\pi)^{d/2}\, I_{d/2-1}(\kappa)}$ is the normalization constant for the vMF density. DINO omits this term, implicitly limiting the model.
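The contrast between the two posteriors can be made concrete in NumPy. The sketch below is illustrative (function names, shapes, and the default temperature are not from the paper); the `log_C` argument stands for the per-cluster log-normalizer $\log C_d(\kappa_k)$ discussed in the next section:

```python
import numpy as np

def dino_posterior(y, W, tau=0.1):
    """Standard DINO: softmax of dot products, shared precision 1/tau."""
    logits = W @ y / tau              # (K,)
    logits -= logits.max()            # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def vmf_posterior(y, W, tau=0.1, log_C=None):
    """General vMF mixture: per-cluster kappa_k = ||w_k|| / tau.
    If log_C (the per-cluster log-normalizer) is given, it is added
    to the logits, restoring fidelity to the mixture model."""
    norms = np.linalg.norm(W, axis=1)
    kappa = norms / tau               # per-cluster precision (K,)
    mu = W / norms[:, None]           # unit mean directions (K, d)
    logits = kappa * (mu @ y)         # kappa_k * mu_k^T y
    if log_C is not None:
        logits = logits + log_C
    logits -= logits.max()
    p = np.exp(logits)
    return p / p.sum()
```

With unit-norm prototypes and no log-normalizer, the two functions coincide, mirroring the interpretation above.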
2. Log-Normalizer Correction and Assignment Probabilities
DINO-vMF restores fidelity to the mixture model by adding $\log C_d(\kappa_k)$ to each assignment logit. Specifically, for each prototype,

$$\kappa_k = \frac{\|w_k\|}{\tau}$$

and

$$\log C_d(\kappa_k) = \left(\tfrac{d}{2} - 1\right) \log \kappa_k - \tfrac{d}{2} \log(2\pi) - \log I_{d/2-1}(\kappa_k),$$

where $I_\nu$ is the modified Bessel function of the first kind and $d$ is the embedding dimensionality.

The assignment probabilities become:

$$P_s(k \mid y) = \frac{\exp\left(w_k^\top y / \tau + \log C_d(\kappa_k)\right)}{\sum_{j=1}^{K} \exp\left(w_j^\top y / \tau + \log C_d(\kappa_j)\right)}$$

for the student, and similarly for the teacher branch, adjusted by a centering bias $c_k$. This correction guarantees that any increase in a prototype norm (i.e., cluster precision) must be justified by a commensurate fit to data points, enforcing balanced cluster assignment and preventing inflated assignment probabilities due to norm scaling alone.
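The log-normalizer never requires an exact Bessel evaluation in practice. The pure-Python sketch below keeps only the leading term of the uniform asymptotic expansion of $\log I_\nu$ for large order $\nu = d/2 - 1$; the exact truncation used by the authors may differ:

```python
import math

def approx_log_vmf_normalizer(kappa: float, d: int) -> float:
    """Approximate log C_d(kappa) = (d/2 - 1) log kappa - (d/2) log(2*pi)
                                    - log I_{d/2-1}(kappa),
    with log I_nu(kappa) from the leading term of its uniform asymptotic
    expansion for large order nu (accurate when d is large)."""
    nu = d / 2.0 - 1.0
    t = kappa / nu
    s = math.sqrt(1.0 + t * t)
    # log I_nu(kappa) ~ nu*eta - 0.5*log(2*pi*nu) - 0.25*log(1 + t^2),
    # with eta = sqrt(1 + t^2) + log(t) - log(1 + sqrt(1 + t^2))
    eta = s + math.log(t) - math.log(1.0 + s)
    log_bessel = nu * eta - 0.5 * math.log(2.0 * math.pi * nu) \
        - 0.25 * math.log(1.0 + t * t)
    return (d / 2.0 - 1.0) * math.log(kappa) \
        - (d / 2.0) * math.log(2.0 * math.pi) - log_bessel
```

Since $\log C_d(\kappa)$ decreases as $\kappa$ grows, adding it to the logits is what penalizes prototypes that inflate their norms without a compensating improvement in fit.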
3. Algorithmic Modifications
DINO-vMF deviates from DINO in three key prediction head aspects:
- Prototype Norms: Prototypes are not $\ell_2$-normalized; their learned magnitudes set per-cluster precisions $\kappa_k = \|w_k\|/\tau$.
- Adaptive $\kappa_k$ and Log-Normalizer: Each cluster learns a separate $\kappa_k$, and $\log C_d(\kappa_k)$ (approximated via the large-order asymptotic expansion of $I_\nu$) is added to the logits.
- Augmented Softmax: The log-normalizer correction is included in both student and teacher softmax calculations.
All other architectural and optimization settings remain as in standard DINO/iBOT, with DINO-vMF not requiring forced normalization on final-layer prototypes to maintain training stability, even for large ViT-Base backbones.
Training Workflow Pseudocode
```
For each batch {x_b}:
    Obtain augmented views x_s, x_t
    Compute y_s = normalize(f(x_s)), y_t = normalize(f(x_t))
    For each prototype w_k:
        kappa_k = norm(w_k) / tau
        log_C_k = approximate_log_vMF_normalizer(kappa_k)
        l_s[k] = (w_k^T y_s) / tau + log_C_k         # student logits
        l_t[k] = (w_k^T y_t - c_k) / tau_t + log_C_k # teacher logits
    P_s = softmax(l_s)
    P_t = softmax(l_t)
    loss = -sum_k P_t[k] * log P_s[k]
    Backpropagate, update student/teacher params and centering
```
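A single evaluation of this loss can be sketched in NumPy. This is a toy, single-sample sketch under the definitions above; the log-normalizer keeps only the leading asymptotic term, and all names and default temperatures are illustrative rather than the paper's settings:

```python
import numpy as np

def dino_vmf_loss(y_s, y_t, W, tau=0.1, tau_t=0.04, center=None):
    """Toy single-sample DINO-vMF cross-entropy.
    y_s, y_t: L2-normalized student/teacher embeddings, shape (d,)
    W:        unnormalized prototypes, shape (K, d)
    center:   teacher centering vector c, shape (K,)"""
    K, d = W.shape
    kappa = np.linalg.norm(W, axis=1) / tau          # per-cluster precision
    # leading term of the uniform asymptotic expansion of log I_nu(kappa)
    nu = d / 2.0 - 1.0
    t = kappa / nu
    s = np.sqrt(1.0 + t ** 2)
    log_bessel = nu * (s + np.log(t) - np.log(1.0 + s)) \
        - 0.5 * np.log(2.0 * np.pi * nu) - 0.25 * np.log(1.0 + t ** 2)
    log_C = (d / 2.0 - 1.0) * np.log(kappa) \
        - (d / 2.0) * np.log(2.0 * np.pi) - log_bessel
    if center is None:
        center = np.zeros(K)
    l_s = W @ y_s / tau + log_C                      # student logits
    l_t = (W @ y_t - center) / tau_t + log_C         # teacher logits
    P_s = np.exp(l_s - l_s.max()); P_s /= P_s.sum()  # stable softmax
    P_t = np.exp(l_t - l_t.max()); P_t /= P_t.sum()
    return -np.sum(P_t * np.log(P_s + 1e-12))        # cross-entropy
```

In a real implementation the backbone, EMA teacher update, and centering update surround this computation; only the logit construction differs from vanilla DINO.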
4. Practical Implementation Considerations
- Concentration Parameterization: All $\kappa_k$ are learned via prototype norms, removing the need for additional hyperparameters. Tying all $\kappa_k$ to a global scalar remains possible but is not required; per-cluster $\kappa_k$ enhances representational flexibility.
- Log-Normalizer Approximation: The uniform asymptotic expansion of the modified Bessel function $I_\nu$ is used for efficient and accurate calculation, with negligible error (≤0.01) for embedding dimensionalities typical of DINO prediction heads.
- Optimization and Hyperparameters: Temperatures, optimizer (AdamW), weight decay, multi-crop strategy, batch size, and learning rates replicate those from the DINO/iBOT canonical setups. The removal of $\ell_2$-normalization on prototypes is central to the method’s stability and generality.
5. Empirical Performance and Cluster Utilization
Extensive pre-training on ImageNet-1K followed by evaluation on multiple downstream tasks demonstrates robust improvements:
ImageNet Classification
| Method | kNN | Linear ([CLS]) | Linear (DINO) |
|---|---|---|---|
| DINO | 76.11 | 77.88 | 78.17 |
| DINO-vMF | 77.40 | 78.67 | 78.81 |
| iBOT | 77.10 | 79.33 | 79.50 |
| iBOT-vMF | 78.66 | 80.20 | 80.27 |
Few-Shot and Low-Data Regimes
| Pretrain | 1-shot | 2-shot | 5-shot | 1% data |
|---|---|---|---|---|
| DINO | 41.8 | 51.9 | 61.4 | 67.2 |
| DINO-vMF | 50.3 | 59.3 | 66.1 | 70.4 |
| iBOT | 46.0 | 56.0 | 64.7 | 69.9 |
| iBOT-vMF | 51.6 | 61.1 | 68.3 | 72.3 |
Prototype Utilization
Standard DINO with ViT-Base typically collapses most prototypes into a “void” cluster: approximately 83% of prototypes are nearly identical (pairwise cosine similarity > 0.9), limiting effective cluster usage. DINO-vMF, in contrast, utilizes approximately 900 out of 65,536 prototypes distinctly and evenly, indicating that more clusters contribute to the representation and yielding richer, more informative embeddings.
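Such utilization can be probed directly from the learned prototype matrix. The diagnostic below is a hypothetical sketch (not from the paper): it greedily groups prototypes whose directions nearly coincide and counts the resulting groups:

```python
import numpy as np

def count_distinct_prototypes(W, sim_threshold=0.9):
    """Greedy count of distinct prototype directions: a prototype joins an
    existing group if its cosine similarity to that group's representative
    exceeds sim_threshold; otherwise it starts a new group."""
    mu = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit directions
    reps = []
    for v in mu:
        if not any(v @ r > sim_threshold for r in reps):
            reps.append(v)
    return len(reps)
```

A collapsed head (most prototypes pointing one way) yields a count near 1, while a well-utilized head yields a count close to the number of prototypes.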
Downstream Transfer
On an aggregated suite of nine small benchmarks (Aircraft, Caltech101, CIFAR-10/100, DTD, Flowers, Food, Pets, SUN397), as well as tasks in image retrieval (Oxford, Paris) and video object segmentation (DAVIS), DINO-vMF and iBOT-vMF consistently match or exceed the performance of their vanilla counterparts, often by 0.5–2.0 percentage points.
6. Theoretical and Practical Implications
Including $\log C_d(\kappa_k)$ in the logits enforces that a prototype cannot simply boost its cluster’s assignment likelihood by increasing its norm (and thus concentration) without compacting the assigned examples. This induces a trade-off between cluster tightness and coverage in the learned latent space, which mitigates prototype collapse (where most samples are assigned to very few clusters) and yields more semantically meaningful, widely distributed clusters.
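This trade-off can be checked numerically. The self-contained sketch below uses the leading-term asymptotic approximation of $\log C_d(\kappa)$; the alignment values are illustrative:

```python
import math

def approx_log_vmf_normalizer(kappa: float, d: int) -> float:
    """Leading-term asymptotic approximation of log C_d(kappa)."""
    nu = d / 2.0 - 1.0
    t = kappa / nu
    s = math.sqrt(1.0 + t * t)
    eta = s + math.log(t) - math.log(1.0 + s)
    log_bessel = nu * eta - 0.5 * math.log(2.0 * math.pi * nu) \
        - 0.25 * math.log(1.0 + t * t)
    return (d / 2.0 - 1.0) * math.log(kappa) \
        - (d / 2.0) * math.log(2.0 * math.pi) - log_bessel

def corrected_logit(kappa: float, alignment: float, d: int) -> float:
    """Corrected logit of one prototype: kappa * mu^T y + log C_d(kappa).
    `alignment` plays the role of mu^T y, the cosine between the embedding
    and the prototype direction."""
    return kappa * alignment + approx_log_vmf_normalizer(kappa, d)
```

For a poorly aligned cluster, raising $\kappa$ lowers the corrected logit even though the uncorrected term $\kappa\, \mu^\top y$ keeps growing; for a well-aligned cluster, higher precision is still rewarded.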
A plausible implication is that removing the forced normalization and introducing vMF-based correction enables representation learning frameworks to scale robustly to wider, deeper models as well as higher numbers of clusters, with minimal risk of mode collapse or overfitting.
7. Summary and Impact
DINO-vMF provides a drop-in improvement over DINO and DINO-derived methods (e.g., iBOT) by explicitly aligning self-supervised learning assignments with the vMF mixture model. The algorithmic changes are concise: remove forced prototype normalization and add the (approximated) log-normalizer term to each logit. The impact is an enriched latent structure, stable and scalable training, and superior downstream transfer, particularly when evaluated on challenging or data-scarce settings. These advances highlight the significance of principled probabilistic corrections to clustering dynamics in self-distillation-based representation learning paradigms (Govindarajan et al., 2024).