DINO as a von Mises-Fisher mixture model (2405.10939v1)
Abstract: Self-distillation methods using Siamese networks are popular for self-supervised pre-training. DINO is one such method, based on a cross-entropy loss between $K$-dimensional probability vectors obtained by applying a softmax function to the dot product between representations and learned prototypes. Since the learned representations are $L_2$-normalized, we show that DINO and its derivatives, such as iBOT, can be interpreted as a mixture model of von Mises-Fisher components. Under this interpretation, DINO assumes equal precision for all components when the prototypes are also $L_2$-normalized. Using this insight, we propose DINO-vMF, which adds appropriate normalization constants when computing the cluster assignment probabilities. Unlike DINO, DINO-vMF remains stable even for the larger ViT-Base model with unnormalized prototypes. We show that the added flexibility of the mixture model yields better image representations: the DINO-vMF pre-trained model consistently outperforms DINO on a range of downstream tasks. We obtain similar improvements for iBOT-vMF over iBOT, showing that our proposed modification is also relevant for other methods derived from DINO.
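To make the modification concrete, the sketch below illustrates the idea from the abstract: cluster assignment probabilities are a softmax over dot products between an $L_2$-normalized representation and learned prototypes, and the vMF variant adds a per-prototype log-normalization constant $\log C_d(\kappa_k)$, where $\kappa_k$ is the norm of the (unnormalized) prototype $w_k$. This is a minimal sketch under stated assumptions (NumPy/SciPy, illustrative function names, and a DINO-style temperature applied to the corrected logits), not the authors' implementation.

```python
# Minimal sketch (not the authors' code): DINO-style cluster assignment probabilities
# with the vMF normalization-constant correction described in the abstract.
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function I_nu


def vmf_log_norm_const(kappa, d):
    """log C_d(kappa) for a d-dimensional vMF distribution,
    C_d(kappa) = kappa^{d/2 - 1} / ((2*pi)^{d/2} * I_{d/2 - 1}(kappa))."""
    nu = d / 2.0 - 1.0
    # log I_nu(kappa) = log(ive(nu, kappa)) + kappa, since ive(nu, x) = iv(nu, x) * exp(-x)
    log_bessel = np.log(ive(nu, kappa)) + kappa
    return nu * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - log_bessel


def cluster_assignment_probs(z, prototypes, temperature=0.1, vmf_correction=True):
    """Softmax assignment of one representation z (shape [d]) to K prototypes (shape [K, d]).

    With vmf_correction=False this reduces to a plain DINO-style head; with True, the
    per-prototype log-normalizer log C_d(||w_k||) is added, treating ||w_k|| as the
    concentration kappa_k. How the temperature interacts with the correction is an
    assumption of this sketch.
    """
    z = z / np.linalg.norm(z)            # representations are L2-normalized
    logits = prototypes @ z              # w_k^T z = kappa_k * mu_k^T z, with w_k = kappa_k * mu_k
    if vmf_correction:
        d = prototypes.shape[1]
        kappa = np.linalg.norm(prototypes, axis=1)
        logits = logits + vmf_log_norm_const(kappa, d)
    logits = logits / temperature
    logits = logits - logits.max()       # numerical stability
    p = np.exp(logits)
    return p / p.sum()


# Toy usage (DINO uses far more prototypes, e.g. 65536):
rng = np.random.default_rng(0)
z = rng.normal(size=256)
prototypes = rng.normal(size=(16, 256))
print(cluster_assignment_probs(z, prototypes).sum())  # ~1.0
```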
- Self-labelling via simultaneous clustering and representation learning. In ICLR, 2020.
- Semi-supervised learning of visual features by non-parametrically predicting view assignments with support samples. In ICCV, 2021.
- Masked siamese networks for label-efficient learning. In ECCV, 2022.
- Clustering on the unit hypersphere using von Mises-Fisher distributions. JMLR, 2005.
- BEiT: BERT pre-training of image transformers. In ICLR, 2021.
- Sparse mixture of von Mises-Fisher distribution. In ESANN, 2021.
- Food-101 – mining discriminative components with random forests. In ECCV, 2014.
- Signature verification using a "siamese" time delay neural network. NeurIPS, 1993.
- Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
- Unsupervised pre-training of image features on non-curated data. In ICCV, 2019.
- Unsupervised learning of visual features by contrasting cluster assignments. NeurIPS, 2020.
- Emerging properties in self-supervised vision transformers. In ICCV, 2021.
- A stochastic approximation type EM algorithm for the mixture problem. Stochastics: An International Journal of Probability and Stochastic Processes, 1992.
- A simple framework for contrastive learning of visual representations. In ICML, 2020a.
- Exploring simple siamese representation learning. In CVPR, 2021.
- Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020b.
- An empirical study of training self-supervised vision transformers. In ICCV, 2021.
- Describing textures in the wild. In CVPR, 2014.
- Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics, 1999.
- ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
- NIST Digital Library of Mathematical Functions (DLMF). Release 1.1.6 of 2022-06-30, 2022. URL http://dlmf.nist.gov/.
- Unsupervised visual representation learning by context prediction. In ICCV, 2015.
- An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
- How well do self-supervised models transfer? In CVPR, 2021.
- Whitening for self-supervised representation learning. In ICML, 2021.
- von Mises-Fisher clustering models. In ICML, 2014.
- Bootstrap your own latent: A new approach to self-supervised learning. NeurIPS, 2020.
- von Mises-Fisher mixture model-based deep learning: Application to face verification. arXiv preprint arXiv:1706.04264, 2017.
- Deep residual learning for image recognition. In CVPR, 2016.
- Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
- Masked autoencoders are scalable vision learners. In CVPR, 2022.
- Segsort: Segmentation by discriminative sorting of segments. In ICCV, 2019.
- Space-time correspondence as a contrastive random walk. NeurIPS, 2020.
- Learning image representations by completing damaged jigsaw puzzles. In WACV, 2018.
- Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
- Learning multiple layers of features from tiny images. Technical report, 2009.
- Learning representations for automatic colorization. In ECCV, 2016.
- Compressive visual representations. NeurIPS, 2021.
- Efficient self-supervised vision transformers for representation learning. In ICLR, 2021.
- Caltech 101, 2022. URL https://data.caltech.edu/records/20086.
- Prototypical contrastive learning of unsupervised representations. In ICLR, 2020.
- Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
- Fixing weight decay regularization in Adam. 2018.
- Fine-grained visual classification of aircraft. Technical report, 2013.
- Self-supervised learning of pretext-invariant representations. In CVPR, 2020.
- Automated flower classification over a large number of classes. In Sixth Indian Conference on Computer Vision, Graphics & Image Processing, 2008.
- Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
- Cats and dogs. In CVPR, 2012.
- BEiT v2: Masked image modeling with vector-quantized visual tokenizers. arXiv preprint arXiv:2208.06366, 2022.
- Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
- Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.
- The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
- Weight normalization: A simple reparameterization to accelerate training of deep neural networks. NeurIPS, 2016.
- What do we maximize in self-supervised learning? In First Workshop on Pre-training: Perspectives, Pitfalls, and Paths Forward at ICML, 2022.
- Bayesian estimation of the von Mises-Fisher mixture model with variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
- Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. NeurIPS, 2017.
- Contrastive multiview coding. In ECCV, 2020.
- Training data-efficient image transformers & distillation through attention. In ICML, 2021.
- DeiT III: Revenge of the ViT. In ECCV, 2022.
- Attention is all you need. NeurIPS, 2017.
- Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, 2020.
- Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.
- SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
- Self-supervised learning with swin transformers. arXiv preprint arXiv:2105.04553, 2021.
- Prototype mixture models for few-shot semantic segmentation. In ECCV, 2020.
- Barlow twins: Self-supervised learning via redundancy reduction. In ICML, 2021.
- Colorful image colorization. In ECCV, 2016.
- iBOT: Image BERT pre-training with online tokenizer. In ICLR, 2021.
- Local aggregation for unsupervised learning of visual embeddings. In ICCV, 2019.