
DINOv2: Learning Robust Visual Features without Supervision (2304.07193v2)

Published 14 Apr 2023 in cs.CV

Abstract: The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.

Citations (2,020)

Summary

  • The paper proposes DINOv2, a self-supervised method that produces robust visual features for both image- and pixel-level tasks using a student-teacher framework with EMA.
  • It employs a large, curated LVD-142M dataset along with technical innovations such as untying projection heads, Sinkhorn-Knopp normalization, and KoLeo regularization to boost performance.
  • Results show DINOv2 outperforms prior SSL methods on benchmarks including ImageNet-1k, displaying strong generalization, enhanced retrieval metrics, and improved fairness analyses.

This paper introduces DINOv2, a self-supervised learning method for training computer vision models that produce robust, general-purpose visual features applicable across various tasks without requiring fine-tuning (DINOv2: Learning Robust Visual Features without Supervision, 2023). The goal is to create "foundation models" for vision, similar to those in NLP, capable of generating features useful for both image-level (e.g., classification) and pixel-level (e.g., segmentation) tasks directly.

The authors argue that while self-supervised learning (SSL) holds promise, prior work often suffered from either training on small curated datasets like ImageNet-1k, limiting feature generality, or on large uncurated datasets, which degraded feature quality. DINOv2 addresses this by combining advancements in SSL algorithms with a focus on large-scale training using a carefully curated dataset.

Methodology:

  1. Self-Supervised Learning Approach: DINOv2 builds upon discriminative SSL methods, particularly DINO and iBOT. It employs a student-teacher framework with an exponential moving average (EMA) teacher. The core objective combines:
    • Image-level Objective (DINO): Matching the features of the global [CLS] token between student and teacher networks viewing different crops of the same image, using a cross-entropy loss.
    • Patch-level Objective (iBOT): Randomly masking patches in the student's input and predicting the features of those masked patches from the teacher's corresponding unmasked patch features, again with a cross-entropy loss (both terms are sketched in code after this list).
  2. Technical Improvements: Several modifications are made to enhance stability and performance at scale:
    • Untying the projection head weights between the image-level and patch-level objectives.
    • Using Sinkhorn-Knopp batch normalization (from SwAV) for centering the teacher targets instead of softmax-centering.
    • Adding a KoLeo regularizer to encourage a uniform spread of features within each batch, which particularly improves retrieval tasks (both this and the Sinkhorn-Knopp step are sketched after this list).
    • Adapting image resolution by adding a short fine-tuning phase at a higher resolution (518x518) towards the end of training, benefiting dense prediction tasks without the full cost of high-res training.
  3. Data Curation (LVD-142M Dataset): Recognizing the importance of data quality and diversity, the authors developed an automatic pipeline to build a large (142 million images), diverse, and curated dataset (LVD-142M) from a vast pool of 1.2 billion uncurated web images. This pipeline avoids manual annotation and metadata reliance:
    • It starts with multiple curated sources (ImageNet-1k/22k, Google Landmarks, etc.) and the uncurated web data.
    • Uses a pretrained self-supervised ViT to compute image embeddings.
    • Performs large-scale deduplication on the uncurated data using image similarity (Faiss library).
    • Retrieves images from the deduplicated uncurated pool that are nearest neighbors (by embedding similarity) of images in the curated sources, balancing concept representation (see the retrieval sketch after this list).
  4. Efficient Implementation: Significant engineering effort focused on enabling large-scale training:
    • Custom memory-efficient attention mechanism (similar to FlashAttention).
    • Sequence packing to batch variable-length sequences (from large and small crops) efficiently.
    • Efficient stochastic depth implementation that skips computations for dropped blocks.
    • Fully-Sharded Data Parallel (FSDP) using PyTorch to distribute model and optimizer states across GPUs, enabling training of billion-parameter models (ViT-g) and reducing communication overhead (an FSDP wrapping sketch follows this list).
  5. Knowledge Distillation: Instead of training smaller models (ViT-S, ViT-B, ViT-L) from scratch, they are distilled from the largest trained model (ViT-g). This process uses the same student-teacher training setup, but with the large ViT-g as a frozen teacher, and yields better performance than training from scratch (a schematic distillation step is shown after this list).
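
The combined objective in item 1 can be pictured with a short PyTorch-style sketch. This is a simplified illustration rather than the authors' implementation: the tensor shapes, temperatures, and the `ema_update` helper are assumptions, and multi-crop handling, target normalization, and schedules are omitted.

```python
import torch
import torch.nn.functional as F

def dino_ibot_loss(student_cls, teacher_cls, student_patch, teacher_patch, mask,
                   t_student=0.1, t_teacher=0.04):
    """Schematic combination of the image-level (DINO) and patch-level (iBOT) terms.
    All tensors are prototype scores:
      student_cls / teacher_cls:     [B, K]    global [CLS]-token scores
      student_patch / teacher_patch: [B, N, K] per-patch scores
      mask:                          [B, N]    True where the student's input was masked
    """
    # Image-level term: the sharpened teacher distribution (no gradient) supervises the student.
    t_img = F.softmax(teacher_cls / t_teacher, dim=-1).detach()
    loss_img = -(t_img * F.log_softmax(student_cls / t_student, dim=-1)).sum(-1).mean()

    # Patch-level term: only masked student patches are trained to predict the
    # teacher's targets for the corresponding (unmasked) patches.
    t_pat = F.softmax(teacher_patch / t_teacher, dim=-1).detach()
    log_p = F.log_softmax(student_patch / t_student, dim=-1)
    loss_patch = -((t_pat * log_p).sum(-1) * mask).sum() / mask.sum().clamp(min=1)

    return loss_img + loss_patch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.994):
    # The teacher's weights track an exponential moving average of the student's.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1 - momentum)
```

In the paper's setup the two terms additionally pass through separate (untied) projection heads, and the teacher targets are normalized with Sinkhorn-Knopp rather than the plain softmax used here (see the next sketch).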
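
Two of the modifications in item 2 also lend themselves to compact sketches: the Sinkhorn-Knopp step iteratively rescales the teacher's prototype scores so that, within a batch, assignments are spread roughly uniformly across prototypes (as in SwAV), while the KoLeo regularizer maximizes the log-distance of each feature to its nearest neighbor in the batch. Both functions are simplified illustrations with assumed shapes, not the released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sinkhorn_knopp(teacher_scores, n_iters=3, temp=0.04):
    """Approximately doubly-stochastic normalization of teacher scores [B, K]."""
    q = torch.exp(teacher_scores / temp).t()   # [K, B]
    q /= q.sum()
    n_protos, n_samples = q.shape
    for _ in range(n_iters):
        q /= q.sum(dim=1, keepdim=True)        # rows: spread mass evenly over prototypes
        q /= n_protos
        q /= q.sum(dim=0, keepdim=True)        # columns: normalize each sample
        q /= n_samples
    return (q * n_samples).t()                 # [B, K] per-sample target distributions

def koleo_loss(features, eps=1e-8):
    """Kozachenko-Leonenko regularizer: push each feature away from its nearest
    neighbor in the batch, encouraging a uniform spread on the unit sphere."""
    x = F.normalize(features, dim=-1)
    sim = x @ x.t()
    sim.fill_diagonal_(-2.0)                   # exclude self-similarity
    nn_dist = torch.sqrt(2 - 2 * sim.max(dim=1).values + eps)
    return -torch.log(nn_dist + eps).mean()
```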
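
The retrieval stage of the curation pipeline in item 3 is essentially a nearest-neighbor search over self-supervised embeddings. Below is a minimal sketch using a flat inner-product Faiss index over L2-normalized (cosine) embeddings; the real pipeline runs distributed GPU indices over roughly 1.2 billion images and also handles deduplication and concept balancing, all omitted here, and the array names are placeholders.

```python
import numpy as np
import faiss  # similarity-search library; the paper's pipeline relies on its indices

def retrieve_neighbors(curated_emb: np.ndarray, uncurated_emb: np.ndarray, k: int = 4) -> np.ndarray:
    """For each curated image, return the ids of its k most similar uncurated images."""
    curated = np.ascontiguousarray(curated_emb, dtype=np.float32)
    uncurated = np.ascontiguousarray(uncurated_emb, dtype=np.float32)
    faiss.normalize_L2(curated)                 # cosine similarity via inner product
    faiss.normalize_L2(uncurated)
    index = faiss.IndexFlatIP(uncurated.shape[1])   # exact inner-product index
    index.add(uncurated)
    _, ids = index.search(curated, k)           # [n_curated, k] neighbor indices
    return ids

# Illustrative usage: the union of retrieved ids forms the "curated by retrieval" subset.
# selected_ids = np.unique(retrieve_neighbors(curated_emb, uncurated_emb, k=4))
```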
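
Of the efficiency items in 4, the FSDP sharding is the easiest to illustrate with PyTorch's built-in wrapper. The sketch below is generic (not the authors' training code): it shows how a large ViT could be wrapped so that model, gradient, and optimizer state are sharded across GPUs, with gradients reduced in lower precision; the exact dtypes and launch setup are assumptions.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def wrap_with_fsdp(model: torch.nn.Module) -> FSDP:
    """Shard parameters, gradients, and optimizer state across ranks."""
    if not dist.is_initialized():
        dist.init_process_group("nccl")          # assumes a torchrun-style launch
    mixed = MixedPrecision(
        param_dtype=torch.bfloat16,              # compute in reduced precision
        reduce_dtype=torch.float16,              # cheaper cross-GPU gradient reduction
        buffer_dtype=torch.bfloat16,
    )
    return FSDP(model, mixed_precision=mixed, device_id=torch.cuda.current_device())
```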
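
Finally, the distillation in item 5 reuses the same objective, except that the teacher is the finished ViT-g held frozen instead of an EMA copy of the student. A schematic step, reusing the hypothetical `dino_ibot_loss` above and assuming both backbones return ([CLS] scores, patch scores):

```python
import torch

def distill_step(student, frozen_vitg, batch, mask, optimizer):
    """One schematic distillation step: the frozen ViT-g supplies the targets."""
    with torch.no_grad():
        t_cls, t_patch = frozen_vitg(batch)      # frozen teacher: never updated, no EMA
    s_cls, s_patch = student(batch)
    loss = dino_ibot_loss(s_cls, t_cls, s_patch, t_patch, mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```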

Results and Evaluation:

  • Performance: DINOv2 models significantly outperform previous SSL methods across a wide array of benchmarks, including ImageNet-1k linear probe (+4.2% over iBOT), various fine-grained classification tasks, video action recognition, instance retrieval (e.g., +41% mAP on Oxford-Hard vs iBOT), semantic segmentation, and depth estimation.
  • Comparison to Weakly-Supervised Models: DINOv2 features match or exceed the performance of strong open-source weakly-supervised models like OpenCLIP (ViT-G/14) on many tasks, particularly dense prediction tasks where DINOv2 shows a clear advantage. For instance, DINOv2 ViT-g achieves 86.5% top-1 linear probe accuracy on ImageNet-1k, surpassing OpenCLIP ViT-G (86.2%).
  • Robustness and Generalization: The models demonstrate strong robustness to domain shifts (ImageNet-A/R/Sketch) and generalize well qualitatively to out-of-distribution images for tasks like depth estimation and segmentation.
  • Emergent Properties: PCA analysis on patch features reveals an unsupervised ability to segment foreground objects and identify corresponding semantic parts across different instances and even categories (e.g., bird wing matching plane wing).
  • Ablation Studies: Validate the positive impact of the curated LVD-142M dataset, the specific loss components (KoLeo, iBOT MIM), the efficiency optimizations, scaling model size with data, knowledge distillation, and the high-resolution adaptation step.
  • Fairness Analysis: Evaluation on the Dollar Street and Casual Conversations datasets shows improved geographical/income fairness compared to prior SSL work (SEERv2), but persistent biases towards Western/high-income groups remain. No major harmful label associations were found based on gender, skin tone, or age.

Conclusion:

DINOv2 demonstrates that self-supervised learning, when scaled effectively with large, curated datasets and efficient training techniques, can produce general-purpose visual features that rival or surpass those learned with weak supervision (like image-text pairs). These features work well "out-of-the-box" across diverse tasks without fine-tuning, paving the way for vision foundation models. The paper provides both the models and code, emphasizing the practical applicability of the research.
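
In practice, "out-of-the-box" use means extracting frozen features and training only a light head (linear probe, k-NN, or a small task decoder) on top. A minimal sketch, assuming the publicly released torch.hub entry points (e.g. `dinov2_vitb14`) and standard ImageNet-style preprocessing:

```python
import torch
from PIL import Image
from torchvision import transforms

# Load a distilled DINOv2 backbone from the public repository (assumed hub entry point).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),              # 224 = 16 x 14, matching the /14 patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    features = model(image)                  # global image embedding, used frozen
print(features.shape)                        # e.g. torch.Size([1, 768]) for ViT-B/14
```

For pixel-level tasks the per-patch tokens are used instead of this global embedding, again without updating the backbone.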
