Abstract: Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images -- using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models' flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.
The paper introduces a scalable self-supervised learning framework using Gram anchoring to preserve dense feature consistency while creating universal vision encoders.
It integrates composite loss functions, multi-crop augmentations, and high-resolution adaptation, leveraging a massive, curated dataset for robust global and dense task performance.
The work demonstrates efficient distillation and domain generalization, enabling SOTA applications from object detection to remote sensing without task-specific fine-tuning.
DINOv3: Scalable Self-Supervised Vision Foundation Models with Gram Anchoring
Introduction and Motivation
DINOv3 advances self-supervised learning (SSL) for vision foundation models by scaling both dataset and model size, introducing novel regularization for dense features, and providing a suite of distilled models for diverse deployment scenarios. The work demonstrates that SSL, when properly scaled and regularized, can match or surpass weakly- and fully-supervised approaches on both global and dense vision tasks, without requiring fine-tuning or task-specific adaptation. DINOv3 is positioned as a universal visual encoder, capable of robust generalization across domains, including natural and aerial imagery.
Figure 1: (a) Linear probing accuracy on ImageNet1k over time for SL, WSL, and SSL methods; (b) DINOv3 dense task performance vs. WSL; (c,d) PCA maps of DINOv3 features for natural and aerial images.
Data Scaling and Curation
DINOv3 leverages a massive, curated dataset (LVD-1689M) constructed from 17B Instagram images, using hierarchical k-means clustering and retrieval-based sampling to ensure both diversity and relevance for downstream tasks. The data pipeline mixes curated, retrieval, and raw datasets, with a batch sampling strategy that includes homogeneous ImageNet1k batches for optimization. Ablation studies confirm that this hybrid curation yields superior downstream performance compared to single-method curation.
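To make the curation step concrete, the sketch below shows a simplified two-stage hierarchical k-means with cluster-balanced sampling over precomputed image embeddings. The cluster counts, embedding dimensionality, and per-cluster quota are illustrative assumptions rather than the paper's settings, and the real pipeline additionally mixes retrieval-based and raw data.

```python
import numpy as np
from sklearn.cluster import KMeans

def hierarchical_kmeans(embeddings, levels=(200, 20), seed=0):
    """Cluster image embeddings coarse-to-fine; level l clusters the centers of level l-1."""
    labels_per_level, points = [], embeddings
    for k in levels:
        km = KMeans(n_clusters=k, random_state=seed, n_init="auto").fit(points)
        labels_per_level.append(km.labels_)   # level 0 indexes the raw embeddings
        points = km.cluster_centers_
    return labels_per_level

def balanced_sample(labels, per_cluster=25, seed=0):
    """Draw roughly the same number of images from every cluster to balance the pool."""
    rng = np.random.default_rng(seed)
    picked = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        picked.extend(rng.choice(idx, size=min(per_cluster, len(idx)), replace=False))
    return np.asarray(picked)

# Toy usage: 10k images embedded into 256-d features by some pretrained encoder.
feats = np.random.randn(10_000, 256).astype(np.float32)
leaf_labels = hierarchical_kmeans(feats)[0]
curated_subset = balanced_sample(leaf_labels)
```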
Model Architecture and Training
The main DINOv3 backbone is a custom ViT-7B (6.7B parameters, 40 blocks, patch size 16, axial RoPE positional embeddings with jittering), trained with a composite SSL objective: a global DINO loss, a local iBOT loss, and a distributed Koleo regularizer. Training uses constant hyperparameter schedules and multi-crop augmentation, allowing the run to be extended indefinitely while remaining stable. Register tokens mitigate high-norm patch outliers, and layer normalization applied to the backbone outputs of both local and global crops improves both kNN and dense task metrics.
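As a rough illustration of the composite objective, the following PyTorch sketch combines a DINO-style cross-entropy term, the same term reused for masked (iBOT-style) patch tokens, and a Koleo spreading regularizer. Teacher centering/normalization is omitted, and the temperatures and weights shown are placeholders, not the paper's values.

```python
import torch
import torch.nn.functional as F

def dino_ce(student_logits, teacher_logits, tau_s=0.1, tau_t=0.05):
    """Cross-entropy between the sharpened teacher and the student prototype distributions."""
    t = F.softmax(teacher_logits / tau_t, dim=-1).detach()
    return -(t * F.log_softmax(student_logits / tau_s, dim=-1)).sum(dim=-1).mean()

def koleo(x, eps=1e-8):
    """Kozachenko-Leonenko regularizer: push each feature away from its nearest neighbor in the batch."""
    x = F.normalize(x, dim=-1)
    sim = x @ x.t()
    sim.fill_diagonal_(-2.0)                      # exclude self-matches
    nn_dist = torch.sqrt(torch.clamp(2 - 2 * sim.max(dim=-1).values, min=eps))
    return -torch.log(nn_dist + eps).mean()

def composite_loss(global_s, global_t, masked_s, masked_t, cls_feats,
                   w_dino=1.0, w_ibot=1.0, w_koleo=0.1):
    """Weighted sum of image-level (DINO), patch-level (iBOT-style) and Koleo terms."""
    return (w_dino * dino_ce(global_s, global_t)
            + w_ibot * dino_ce(masked_s, masked_t)
            + w_koleo * koleo(cls_feats))
```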
Gram Anchoring: Regularization for Dense Features
Extended SSL training improves global metrics but degrades dense feature quality due to loss of patch-level consistency. DINOv3 introduces Gram anchoring, a regularization phase that aligns the Gram matrix (pairwise patch similarities) of the student to that of an early-stage teacher (Gram teacher), using the loss:
$$\mathcal{L}_{\mathrm{Gram}} = \left\lVert X_S X_S^{\top} - X_G X_G^{\top} \right\rVert_F^{2}$$
where $X_S$ and $X_G$ are the L2-normalized patch features of the student and the Gram teacher, respectively. The loss is introduced as a late refinement phase, after 1M iterations, and the Gram teacher is periodically updated. High-resolution Gram anchoring further improves dense feature quality by computing the teacher's features on upsampled images and downsampling them back to the student's patch grid.
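A minimal PyTorch sketch of the Gram anchoring loss and its high-resolution variant is given below. The `gram_teacher(images)` call returning (B, N, D) patch features is an assumed interface, and the upsampling factor is illustrative; the pooling step simply brings the high-resolution teacher patch map back onto the student's grid before the Gram matrices are compared.

```python
import torch
import torch.nn.functional as F

def gram_anchoring_loss(student_patches, gram_teacher_patches):
    """L_Gram = || X_S X_S^T - X_G X_G^T ||_F^2 over L2-normalized patch features (B, N, D)."""
    xs = F.normalize(student_patches, dim=-1)
    xg = F.normalize(gram_teacher_patches, dim=-1).detach()
    gram_s = xs @ xs.transpose(1, 2)              # (B, N, N) pairwise patch similarities
    gram_g = xg @ xg.transpose(1, 2)
    return (gram_s - gram_g).pow(2).sum(dim=(1, 2)).mean()

def highres_gram_targets(gram_teacher, images, scale=2):
    """High-resolution variant: run the Gram teacher on an upsampled image, then pool
    its patch map back down to the student's patch grid before building the Gram matrix."""
    hr = F.interpolate(images, scale_factor=scale, mode="bicubic", align_corners=False)
    feats = gram_teacher(hr)                      # assumed to return (B, N_hr, D) patch features
    b, n, d = feats.shape
    side = int(n ** 0.5)                          # assumes a square patch grid
    fmap = feats.transpose(1, 2).reshape(b, d, side, side)
    fmap = F.avg_pool2d(fmap, kernel_size=scale)  # back to the low-resolution grid
    return fmap.flatten(2).transpose(1, 2)        # (B, N, D)
```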
Figure 2: Evolution of cosine similarities and task accuracy for ViT-g and ViT-7B; segmentation performance peaks while patch-to-class-token similarities remain low, then degrades as those similarities increase.
Figure 3: Gram matrices at different input resolutions; downsampling high-res features preserves patch-level consistency.
Figure 4: Cosine similarity maps before and after Gram anchoring; the high-resolution refinement objective $\mathcal{L}_{\mathrm{HRef}}$ yields cleaner, more localized features.
Post-Training: Resolution Adaptation and Distillation
A high-resolution adaptation phase enables DINOv3 to generalize across input sizes, using mixed-resolution crops and Gram anchoring. Empirically, this step is essential for maintaining dense feature quality at high resolutions, with models supporting inference up to 4096×4096 pixels.
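The resolution-adaptation phase can be caricatured as resampling crop sizes every iteration so the backbone sees a mixture of global and local resolutions while the Gram anchoring loss stays active. The sizes below are illustrative placeholders, not the paper's schedule.

```python
import random
import torch.nn.functional as F

def mixed_resolution_crops(global_crops, local_crops,
                           global_sizes=(512, 768), local_sizes=(112, 168, 224)):
    """Each iteration, resize the global and local crop batches to randomly chosen resolutions."""
    gs, ls = random.choice(global_sizes), random.choice(local_sizes)
    g = F.interpolate(global_crops, size=(gs, gs), mode="bicubic", align_corners=False)
    l = F.interpolate(local_crops, size=(ls, ls), mode="bicubic", align_corners=False)
    return g, l
```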
Distillation transfers knowledge from the 7B teacher to smaller ViT and ConvNeXt variants, using a multi-student pipeline that shares teacher inference across GPUs for efficiency. Distilled models (ViT-S, B, L, H+, CNX-T/B/L) achieve performance close to the teacher, with ViT-H+ nearly matching ViT-7B despite 10x fewer parameters.
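The key efficiency idea is that one forward pass through the expensive teacher supervises several students at once. The single-process sketch below uses an illustrative cosine distillation objective rather than the paper's full SSL losses; in the actual pipeline, teacher inference is additionally shared across GPUs and each student group synchronizes separately.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, students, optimizers, images):
    """One step where a single teacher forward pass supervises several smaller students."""
    with torch.no_grad():
        t_feats = F.normalize(teacher(images), dim=-1)         # computed once, reused by every student
    losses = []
    for student, opt in zip(students, optimizers):
        s_feats = F.normalize(student(images), dim=-1)
        loss = (1 - (s_feats * t_feats).sum(dim=-1)).mean()    # illustrative cosine distillation loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses
```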
Figure 5: Multi-student distillation: teacher inference shared across all nodes, students trained in parallel with synchronized groups.
Figure 6: DINOv3 family of models: parameter counts and FLOPs for ViT and ConvNeXt variants.
Dense Feature Quality and Stability
DINOv3 produces high-quality, stable dense features across resolutions, outperforming both self- and weakly-supervised baselines (DINOv2, SigLIP2, PEspatial, AM-RADIO) on segmentation, depth estimation, 3D correspondence, object discovery, and video tracking. Dense features are visualized via PCA, showing sharp, semantically coherent maps with minimal noise.
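The PCA visualizations can be reproduced with a few lines: project each patch feature onto the top three principal components and map them to RGB. The helper below assumes a square patch grid and uses `torch.pca_lowrank`; it is a visualization aid, not part of the training pipeline.

```python
import torch

def pca_rgb_map(patch_features, grid_hw):
    """Project (N, D) patch features onto their top-3 principal components
    and rescale to [0, 1] so the patch grid can be shown as an RGB image."""
    h, w = grid_hw
    x = patch_features.float()
    x = x - x.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(x, q=3)           # columns of v are principal directions
    proj = x @ v                                  # (N, 3)
    lo, hi = proj.min(dim=0).values, proj.max(dim=0).values
    proj = (proj - lo) / (hi - lo + 1e-8)
    return proj.reshape(h, w, 3)
```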
Figure 7: Cosine similarity maps for 4096×4096 input; DINOv3 features are highly localized and consistent.
Figure 8: PCA visualization of dense features at increasing resolutions; DINOv3 maintains semantic structure and crispness.
Figure 9: Feature stability across resolutions for ViT-S, S+, B, L, H+; features remain consistent before drifting at extreme sizes.
System-Level Applications
DINOv3 serves as a frozen backbone for state-of-the-art systems in object detection (Plain-DETR), semantic segmentation (Mask2Former + ViT-Adapter), monocular depth estimation (Depth Anything v2), and 3D scene understanding (VGGT). In all cases, DINOv3-based systems match or exceed prior SOTA, often with fewer trainable parameters and no backbone fine-tuning.
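A minimal instance of the frozen-backbone protocol is a per-patch linear probe for segmentation, sketched below; the paper's strongest systems instead attach heavier heads (e.g. Mask2Former with a ViT-Adapter) to the same frozen features. The `backbone` returning (B, N, D) patch tokens is an assumed interface.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenBackboneSegmenter(nn.Module):
    """Per-patch linear probe on top of a frozen encoder; only the 1x1 conv head is trained."""
    def __init__(self, backbone, feat_dim, num_classes, patch_size=16):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)               # the encoder stays frozen
        self.patch_size = patch_size
        self.head = nn.Conv2d(feat_dim, num_classes, kernel_size=1)

    def forward(self, images):
        b, _, h, w = images.shape
        with torch.no_grad():
            tokens = self.backbone(images)        # assumed to return (B, N, D) patch tokens
        gh, gw = h // self.patch_size, w // self.patch_size
        fmap = tokens.transpose(1, 2).reshape(b, -1, gh, gw)
        logits = self.head(fmap)                  # (B, num_classes, gh, gw)
        return F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)
```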
Domain Generalization: Geospatial and Remote Sensing
DINOv3 is applied to satellite imagery (SAT-493M, Open-Canopy), achieving SOTA in canopy height estimation, semantic segmentation, and object detection, outperforming domain-specific models (Prithvi-v2, DOFA) even with RGB-only input. Both web- and satellite-pretrained DINOv3 models generalize well, with domain-specific pretraining yielding best results for metric tasks.
Figure 10: DINOv3 features and segmentation for remote sensing; PCA maps show finer details than DINOv2, segmentation and canopy height prediction performed with frozen backbone.
Figure 11: Qualitative comparison of DINOv3 7B satellite model to prior work on Open-Canopy; DINOv3 yields more accurate height maps.
Zero-Shot and Multimodal Alignment
A text encoder is trained via LiT-style contrastive alignment to DINOv3 features, enabling zero-shot classification and open-vocabulary segmentation. DINOv3-based dino.txt achieves competitive global alignment and SOTA dense alignment, outperforming CLIP and EVA-02-CLIP on segmentation tasks.
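The LiT-style alignment keeps the image tower frozen and trains only the text tower with a symmetric contrastive loss, roughly as in the sketch below; the temperature, batch construction, and the dino.txt-specific dense alignment terms are simplified away.

```python
import torch
import torch.nn.functional as F

def lit_contrastive_loss(image_encoder, text_encoder, images, token_ids, temperature=0.07):
    """LiT-style alignment: frozen image tower, trainable text tower, symmetric InfoNCE loss."""
    with torch.no_grad():
        img = F.normalize(image_encoder(images), dim=-1)       # frozen DINOv3 global features
    txt = F.normalize(text_encoder(token_ids), dim=-1)         # trainable text embeddings
    logits = img @ txt.t() / temperature                       # (B, B) image-text similarities
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```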
Implementation Considerations
Compute: Training ViT-7B requires 61,440 GPU hours (H100), with a carbon footprint of ~18 tCO2eq per model.
Scaling: Gram anchoring and register tokens are essential for stability and dense feature quality at scale.
Distillation: Multi-student distillation is efficient and enables deployment across resource budgets.
Resolution: High-res adaptation and RoPE positional embeddings allow inference at arbitrary resolutions.
Domain Transfer: SSL recipe is generic; domain-specific pretraining improves metric tasks, but web-pretrained models generalize well for semantic tasks.
Implications and Future Directions
DINOv3 demonstrates that SSL, when scaled and regularized, can produce universal vision encoders with robust, high-quality dense and global features. The Gram anchoring method resolves a key limitation of prior SSL scaling, enabling indefinite training without dense feature collapse. The model family supports deployment from edge devices to large-scale servers, and the approach generalizes to specialized domains such as remote sensing.
Future work may explore:
Further scaling of model and data size, leveraging unlabeled data from diverse domains.
Integration of multimodal alignment during pretraining, rather than post-hoc.
Efficient quantization and deployment strategies for transformer-based vision models.
Application to lifelong learning and continual adaptation scenarios.
Extension to video and 3D modalities, leveraging DINOv3's strong temporal and geometric consistency.
Conclusion
DINOv3 sets a new standard for self-supervised vision foundation models, achieving SOTA on dense and global tasks with a frozen backbone, scalable architecture, and robust regularization. The Gram anchoring technique is critical for maintaining dense feature quality at scale, and the distilled model family enables practical deployment. The approach generalizes across domains and tasks, supporting both universal and specialized applications in computer vision.