- The paper proposes DINOv2, a self-supervised method that produces robust visual features for both image- and pixel-level tasks using a student-teacher framework with EMA.
- It employs a large, curated LVD-142M dataset along with technical innovations such as untying projection heads, Sinkhorn-Knopp normalization, and KoLeo regularization to boost performance.
- Results show DINOv2 outperforming prior SSL methods on benchmarks including ImageNet-1k, with strong generalization, large gains on retrieval, and fairness analyses indicating reduced, though not eliminated, geographic and income biases.
This paper introduces DINOv2, a self-supervised learning method for training computer vision models that produce robust, general-purpose visual features applicable across various tasks without requiring fine-tuning (DINOv2: Learning Robust Visual Features without Supervision, 2023). The goal is to create "foundation models" for vision, similar to those in NLP, capable of generating features useful for both image-level (e.g., classification) and pixel-level (e.g., segmentation) tasks directly.
The authors argue that while self-supervised learning (SSL) holds promise, prior work typically either trained on small curated datasets such as ImageNet-1k, which limits feature generality, or on large uncurated datasets, which degrades feature quality. DINOv2 addresses this by combining algorithmic advances in SSL with large-scale training on a carefully curated dataset.
Methodology:
- Self-Supervised Learning Approach: DINOv2 builds upon discriminative SSL methods, particularly DINO and iBOT. It employs a student-teacher framework with an exponential moving average (EMA) teacher. The core objective combines:
- Image-level Objective (DINO): Matching the features of the global [CLS] token between student and teacher networks viewing different crops of the same image, using a cross-entropy loss.
- Patch-level Objective (iBOT): Randomly masking patches in the student's input and predicting the features of these masked patches using the corresponding unmasked patch features from the teacher, again with a cross-entropy loss.
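To make the student-teacher setup above concrete, here is a minimal PyTorch-style sketch of its two core ingredients: the EMA teacher update and the cross-entropy between teacher and student prototype distributions. Function names, temperatures, and the momentum value are illustrative, not the released implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.994):
    """Teacher weights are an exponential moving average of the student's weights."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def self_distillation_loss(student_logits, teacher_logits,
                           student_temp=0.1, teacher_temp=0.07):
    """Cross-entropy between the sharpened teacher distribution over prototypes and the
    student distribution; the same form serves the image-level [CLS] objective (different
    crops) and the patch-level objective (masked patches). Teacher-target centering
    (e.g. Sinkhorn-Knopp, sketched later) is omitted here for brevity."""
    teacher_probs = F.softmax(teacher_logits / teacher_temp, dim=-1).detach()
    student_log_probs = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```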
- Technical Improvements: Several modifications are made to enhance stability and performance at scale:
- Untying the projection head weights between the image-level and patch-level objectives.
- Using Sinkhorn-Knopp normalization (from SwAV) to center the teacher targets instead of the softmax-centering used in DINO.
- Adding a KoLeo regularizer (derived from the Kozachenko-Leonenko differential entropy estimator) that encourages features to spread uniformly within a batch, which notably benefits retrieval.
- Adapting image resolution by adding a short fine-tuning phase at a higher resolution (518x518) towards the end of training, benefiting dense prediction tasks without the full cost of high-res training.
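Two of the components above can be sketched compactly: a SwAV-style Sinkhorn-Knopp normalization of teacher prototype scores, and a KoLeo-style regularizer that penalizes small nearest-neighbor distances within a batch. Both are simplified sketches with illustrative hyperparameters, not the paper's exact code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn_knopp(teacher_logits, n_iters=3, eps=0.05):
    """Turn teacher prototype scores (batch, K) into soft assignments whose rows sum
    to 1 by alternating row/column normalizations (SwAV-style)."""
    Q = torch.exp(teacher_logits / eps).t()     # (K, batch)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)         # each prototype used equally often
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)         # each sample is a distribution
        Q /= B
    return (Q * B).t()                          # (batch, K), rows sum to 1

def koleo_regularizer(features, eps=1e-8):
    """Push apart nearest neighbors within a batch: minimize -mean(log d_nn),
    computed on L2-normalized features."""
    x = F.normalize(features, p=2, dim=-1)
    sim = x @ x.t()
    sim.fill_diagonal_(-2.0)                    # exclude self-similarity
    nn_sim = sim.max(dim=1).values
    nn_dist = torch.sqrt((2.0 - 2.0 * nn_sim).clamp_min(0.0))
    return -torch.log(nn_dist + eps).mean()
```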
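Running the backbone at a higher resolution produces more patch tokens than the positional embedding was trained for. A common way to handle this, and a hedged sketch of what such an adaptation involves rather than the paper's exact code, is to resample the learned positional embeddings to the new grid:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """pos_embed: (1, 1 + old_grid**2, dim) learned positions with a leading [CLS] slot.
    Bicubically resamples the patch positions so a model pretrained at low resolution
    can run on a larger grid, e.g. when switching to 518x518 inputs."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = pos_embed.shape[-1]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)
```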
- Data Curation (LVD-142M Dataset): Recognizing the importance of data quality and diversity, the authors developed an automatic pipeline to build a large (142 million images), diverse, and curated dataset (LVD-142M) from a vast pool of 1.2 billion uncurated web images. This pipeline avoids manual annotation and metadata reliance:
- It starts with multiple curated sources (ImageNet-1k/22k, Google Landmarks, etc.) and the uncurated web data.
- Uses a pretrained self-supervised ViT to compute image embeddings.
- Performs large-scale deduplication on the uncurated data using image similarity (Faiss library).
- Retrieves images from the deduplicated uncurated pool that are nearest neighbors (based on embedding similarity) to images in the curated sources, balancing concept representation.
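The deduplication and retrieval steps of this pipeline can be approximated with Faiss as follows. This is a simplified sketch (exact-search index, cosine similarity on L2-normalized float32 embeddings, a single global threshold); the paper's pipeline is more elaborate, e.g. it keeps one representative per duplicate cluster and balances retrieval per curated source.

```python
import faiss
import numpy as np

def deduplicate(uncurated_emb: np.ndarray, threshold: float = 0.94) -> np.ndarray:
    """Return indices of uncurated images whose nearest other image stays below a
    cosine-similarity threshold. Embeddings are assumed L2-normalized, so inner
    product equals cosine similarity. (Greedy simplification: a real deduplication
    keeps one representative per duplicate group.)"""
    index = faiss.IndexFlatIP(uncurated_emb.shape[1])
    index.add(uncurated_emb)
    sims, _ = index.search(uncurated_emb, 2)    # column 0 is the image itself
    return np.flatnonzero(sims[:, 1] < threshold)

def retrieve_for_curated(curated_emb: np.ndarray, uncurated_emb: np.ndarray,
                         k: int = 4) -> np.ndarray:
    """For every curated image, pull its k nearest uncurated neighbors into the pool."""
    index = faiss.IndexFlatIP(uncurated_emb.shape[1])
    index.add(uncurated_emb)
    _, ids = index.search(curated_emb, k)
    return np.unique(ids.ravel())
```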
- Efficient Implementation: Significant engineering effort focused on enabling large-scale training:
- Custom memory-efficient attention mechanism (similar to FlashAttention).
- Sequence packing to batch variable-length sequences (from large and small crops) efficiently.
- An efficient stochastic depth implementation that skips the computation of dropped residuals rather than masking the result.
- Fully-Sharded Data Parallel (FSDP) in PyTorch to shard model parameters, gradients, and optimizer state across GPUs, enabling training of billion-parameter models (ViT-g) while reducing communication overhead.
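As an illustration of the stochastic-depth optimization mentioned above, the sketch below evaluates a residual sub-block only on the samples that keep it, instead of computing it for the whole batch and zeroing the dropped rows. A simplified per-sample version for illustration, not the paper's implementation.

```python
import torch

def residual_with_stochastic_depth(x, residual_fn, drop_prob, training=True):
    """x: (batch, tokens, dim); residual_fn: an attention or MLP sub-block.
    Only the kept subset of samples runs through residual_fn, saving compute
    roughly proportional to drop_prob."""
    if not training or drop_prob == 0.0:
        return x + residual_fn(x)
    keep = torch.rand(x.shape[0], device=x.device) >= drop_prob
    out = x.clone()
    if keep.any():
        # inverted scaling keeps the expected output consistent with evaluation
        out[keep] = x[keep] + residual_fn(x[keep]) / (1.0 - drop_prob)
    return out
```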
- Knowledge Distillation: Instead of training smaller models (ViT-S, ViT-B, ViT-L) from scratch, they are distilled from the largest trained model (ViT-g). This process uses the same student-teacher training setup but with the large ViT-g as a frozen teacher, yielding better performance than training from scratch.
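Distillation of the smaller models reuses the same cross-entropy form sketched earlier; the key difference is that the teacher is the pretrained ViT-g with frozen weights rather than an EMA copy of the student. A minimal sketch of one step, with all argument names illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_step(student, student_head, frozen_teacher, teacher_head,
                      images, optimizer, student_temp=0.1, teacher_temp=0.07):
    """One training step where a frozen large model provides the soft targets.
    `student`/`frozen_teacher` map images to embeddings; the heads map embeddings to
    prototype scores. Simplified: a single crop, no masking, no EMA of the teacher."""
    with torch.no_grad():
        targets = F.softmax(teacher_head(frozen_teacher(images)) / teacher_temp, dim=-1)
    log_probs = F.log_softmax(student_head(student(images)) / student_temp, dim=-1)
    loss = -(targets * log_probs).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```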
Results and Evaluation:
- Performance: DINOv2 models significantly outperform previous SSL methods across a wide array of benchmarks, including ImageNet-1k linear probe (+4.2% over iBOT), various fine-grained classification tasks, video action recognition, instance retrieval (e.g., +41% mAP on Oxford-Hard vs iBOT), semantic segmentation, and depth estimation.
- Comparison to Weakly-Supervised Models: DINOv2 features match or exceed the performance of strong open-source weakly-supervised models like OpenCLIP (ViT-G/14) on many tasks, particularly dense prediction tasks where DINOv2 shows a clear advantage. For instance, DINOv2 ViT-g achieves 86.5% top-1 linear probe accuracy on ImageNet-1k, surpassing OpenCLIP ViT-G (86.2%).
- Robustness and Generalization: The models demonstrate strong robustness to domain shifts (ImageNet-A/R/Sketch) and generalize well qualitatively to out-of-distribution images for tasks like depth estimation and segmentation.
- Emergent Properties: PCA analysis on patch features reveals an unsupervised ability to segment foreground objects and to match corresponding semantic parts across different instances and even categories (e.g., a bird's wing matching an airplane's wing); a minimal sketch of this PCA probe follows this list.
- Ablation Studies: These validate the positive impact of the curated LVD-142M dataset, the individual loss components (KoLeo, the iBOT masked-image-modeling term), the efficiency optimizations, scaling model size together with data, knowledge distillation, and the high-resolution adaptation step.
- Fairness Analysis: Evaluation on the Dollar Street and Casual Conversations datasets shows improved geographical and income fairness compared to prior SSL work (SEERv2), though biases toward Western, higher-income groups persist. No large harmful label associations were found across gender, skin tone, or age.
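The PCA probe referenced in the emergent-properties bullet can be reproduced in a few lines: project one image's patch features onto their first principal component and threshold it. A minimal sketch; the threshold and component count are illustrative.

```python
import torch

@torch.no_grad()
def first_pc_mask(patch_features: torch.Tensor, grid_h: int, grid_w: int,
                  threshold: float = 0.0):
    """patch_features: (grid_h * grid_w, dim) patch tokens of one image from a frozen
    backbone. Returns the first principal component reshaped to the patch grid and a
    coarse foreground mask obtained by thresholding it."""
    x = patch_features - patch_features.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(x, q=3)          # v: (dim, 3) principal directions
    pc1 = (x @ v[:, 0]).reshape(grid_h, grid_w)
    # the sign of a principal component is arbitrary, so the mask may need inverting
    return pc1, pc1 > threshold
```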
Conclusion:
DINOv2 demonstrates that self-supervised learning, when scaled effectively with large, curated datasets and efficient training techniques, can produce general-purpose visual features that rival or surpass those learned with weak supervision (like image-text pairs). These features work well "out-of-the-box" across diverse tasks without fine-tuning, paving the way for vision foundation models. The paper provides both the models and code, emphasizing the practical applicability of the research.
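To illustrate the "out-of-the-box" usage described in the conclusion, the sketch below extracts frozen DINOv2 features and fits a linear classifier on top. It assumes the hub entry names from the released repository (e.g. `dinov2_vits14`) and uses scikit-learn's logistic regression as a stand-in for the paper's linear-probe protocol; the placeholder datasets are not provided.

```python
import torch
import torchvision.transforms as T
from sklearn.linear_model import LogisticRegression

# Load a released backbone via torch.hub (assumes the repo's published entry point).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224),            # 224 is divisible by the 14-pixel patch size
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

@torch.no_grad()
def embed(pil_images):
    """Frozen image-level features; the backbone is never fine-tuned."""
    batch = torch.stack([preprocess(im) for im in pil_images])
    return backbone(batch).cpu().numpy()

# Linear probe: train only a linear classifier on the frozen features.
# `train_images`, `train_labels`, `test_images`, `test_labels` are placeholders.
# clf = LogisticRegression(max_iter=1000).fit(embed(train_images), train_labels)
# print("top-1:", clf.score(embed(test_images), test_labels))
```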