- The paper introduces FastDINOv2, a frequency-based curriculum that cuts pretraining time by 1.6× and FLOPs by 2.25× while maintaining competitive performance.
- The method employs a two-stage training process with low-resolution pretraining followed by full-resolution refinement with Gaussian noise patching to boost robustness.
- Empirical results on ImageNet-1K, ImageNet-C, and related benchmarks show that FastDINOv2 retains competitive clean accuracy while improving corruption robustness at substantially lower computational cost.
The paper introduces FastDINOv2, a curriculum learning framework designed to accelerate the pretraining of Vision Transformer (ViT) models in self-supervised learning (SSL) settings, while simultaneously improving robustness to common image corruptions. The approach is motivated by the computational demands of large-scale SSL models such as DINOv2, which often require extensive resources for pretraining, limiting accessibility and reproducibility. FastDINOv2 addresses these challenges by leveraging a frequency-based curriculum and targeted data augmentations.
Methodology
FastDINOv2 employs a two-stage training curriculum:
- Low-Frequency Pretraining (Stage 1): For the initial 75% of training epochs, the model is trained exclusively on downsampled images, emphasizing low-frequency components. This is implemented via bicubic downsampling of image crops, reducing the input resolution (e.g., from 224×224 to 112×112 for global crops). This stage encourages the model to learn coarse, structural features efficiently, reducing both computational cost and memory requirements.
- Full-Resolution Training with Gaussian Noise Patching (Stage 2): In the final 25% of training, the model transitions to full-resolution images. During this phase, Gaussian noise patching is applied: random image patches are replaced with Gaussian noise, introducing high-frequency perturbations. This augmentation is designed to counteract the high-frequency bias induced by the curriculum and to enhance robustness to a broader spectrum of corruptions. Both stages are illustrated in the sketch after this list.
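To make the two stages concrete, the following is a minimal PyTorch sketch of both augmentations. It is an illustration under assumptions rather than the authors' code: the crop sizes follow the 224×224 → 112×112 example above, while `patch`, `num_patches`, and `sigma` are hypothetical placeholders, not the paper's settings.

```python
# Illustrative sketch of the two-stage inputs (not the authors' implementation).
import torch
import torch.nn.functional as F

def stage1_low_freq(global_crops: torch.Tensor, low_res: int = 112) -> torch.Tensor:
    """Stage 1: bicubic downsampling retains mostly low-frequency content.

    global_crops: (B, C, 224, 224) -> (B, C, low_res, low_res)
    """
    return F.interpolate(global_crops, size=(low_res, low_res),
                         mode="bicubic", align_corners=False)

def stage2_noise_patch(images: torch.Tensor, patch: int = 16,
                       num_patches: int = 8, sigma: float = 1.0) -> torch.Tensor:
    """Stage 2: overwrite random patches with Gaussian noise, injecting
    high-frequency content. Hyperparameters here are placeholders."""
    images = images.clone()
    batch, channels, height, width = images.shape
    for b in range(batch):
        for _ in range(num_patches):
            y = torch.randint(0, height - patch + 1, (1,)).item()
            x = torch.randint(0, width - patch + 1, (1,)).item()
            images[b, :, y:y + patch, x:x + patch] = sigma * torch.randn(channels, patch, patch)
    return images
```

In this reading, `stage1_low_freq` is applied to every global crop for the first 75% of epochs, and `stage2_noise_patch` to full-resolution crops for the remainder.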
A key implementation detail is the use of a constant batch size across both stages to ensure training stability. Positional embeddings are interpolated when transitioning between resolutions, with bicubic interpolation of the learned embeddings from the low-resolution stage to the full-resolution stage yielding optimal results.
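The interpolation step can be sketched as below, assuming a standard ViT layout with a class token at index 0 and learned positional embeddings; with a /16 patch size, 112×112 inputs yield a 7×7 token grid and 224×224 inputs a 14×14 grid. The function name and structure are illustrative, not the paper's code.

```python
# Illustrative resizing of learned ViT positional embeddings between resolutions.
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    """pos_embed: (1, 1 + old_grid**2, D); the class token is left unchanged."""
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    # (1, N, D) -> (1, D, g, g) so spatial bicubic interpolation applies.
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    # Back to token layout: (1, D, G, G) -> (1, G*G, D).
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_tok, patch_pos], dim=1)

# e.g., moving from 112x112 to 224x224 inputs with /16 patches:
# new_pe = resize_pos_embed(old_pe, old_grid=7, new_grid=14)
```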
Empirical Results
The proposed method is evaluated on ImageNet-100, ImageNet-1K, and their corresponding corruption benchmarks (ImageNet-100-C, ImageNet-C), using ViT-S/16 and ViT-B/16 backbones. The main findings are as follows:
- Training Efficiency: FastDINOv2 reduces pretraining time by 1.6× and FLOPs by 2.25× compared to the DINOv2 baseline, with a corresponding reduction in GPU memory requirements during the low-resolution stage (e.g., 9.47 GB vs. 33.5 GB per GPU at batch size 128).
- Linear Probing Performance: The method achieves competitive linear probing accuracy on clean validation data, with only a marginal drop (e.g., 78.4% vs. 78.6% on ImageNet-1K with ViT-B/16).
- Robustness to Corruptions: On ImageNet-C, FastDINOv2 matches or exceeds the baseline (e.g., 56.7% vs. 56.5% accuracy), with significant improvements on high-frequency corruptions when Gaussian noise patching is included.
- Instance Recognition and Segmentation: FastDINOv2 demonstrates improved or comparable performance on instance-level recognition (Oxford/Paris datasets) and semantic segmentation (ADE20K), indicating that the curriculum does not compromise fine-grained or pixel-level understanding.
Fourier heatmaps and Grad-CAM visualizations reveal that the curriculum induces a high-frequency feature bias, which is balanced by the introduction of Gaussian noise patching in the second stage. This results in a more spectrally balanced model, with improved robustness across a range of corruption types.
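For context, a Fourier heat map (Yin et al., 2019) measures a model's error rate when inputs are perturbed with single-frequency Fourier basis noise, one spectrum location at a time. A rough sketch of the procedure, with `model`, `loader`, and the perturbation strength `eps` as placeholders:

```python
# Rough sketch of one cell of a Fourier heat map; placeholders throughout.
import torch

def fourier_basis(h: int, w: int, i: int, j: int) -> torch.Tensor:
    """Real-valued image whose spectrum has energy concentrated at (i, j)."""
    freq = torch.zeros(h, w, dtype=torch.complex64)
    freq[i, j] = 1.0
    basis = torch.fft.ifft2(freq).real
    return basis / basis.norm()

@torch.no_grad()
def heatmap_cell(model, loader, i: int, j: int, eps: float = 4.0) -> float:
    """Error rate under noise at frequency (i, j); fills one heat-map cell."""
    errors, total = 0, 0
    for x, y in loader:                       # x: (B, C, H, W), y: labels
        noise = eps * fourier_basis(x.shape[-2], x.shape[-1], i, j)
        preds = model(x + noise).argmax(dim=1)
        errors += (preds != y).sum().item()
        total += y.numel()
    return errors / total
```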
Numerical Highlights
| Metric | DINOv2 | FastDINOv2 |
|---|---|---|
| Training time (ViT-B/16, ImageNet-1K) | 16.64 days | 10.32 days |
| FLOPs (ViT-B/16, ImageNet-1K) | 493.76 GFLOPs | 219.92 GFLOPs |
| ImageNet-C corruption accuracy | 56.5% | 56.7% |
| Clean linear probing accuracy | 77.8% | 76.2% |
| Instance recognition (Oxford, easy split) | 28.38% mAP | 32.11% mAP |
| Semantic segmentation (ADE20K) | 19.2% mIoU | 19.16% mIoU |
Implications and Discussion
The results demonstrate that frequency-based curriculum learning, when combined with targeted augmentations, can yield substantial improvements in training efficiency and robustness without requiring extreme model or dataset scale. This has several practical implications:
- Accessibility: The reduction in computational and memory requirements enables broader adoption of SSL pretraining, making state-of-the-art vision models more accessible to academic and industrial practitioners with limited resources.
- Engineered Robustness: The findings challenge the notion that robustness in SSL models is solely an emergent property of scale; instead, robustness can be explicitly engineered through curriculum and augmentation strategies, providing a more principled approach to model design.
- Spectral Control: Explicit manipulation of frequency content during training offers a mechanism to control the spectral bias of learned representations, which can be tuned to the requirements of downstream tasks or deployment environments.
- Generalization to Other Architectures: While the paper focuses on DINOv2 and ViT backbones, the curriculum and augmentation principles are broadly applicable to other SSL frameworks and vision architectures.
Limitations and Future Directions
A limitation of the current approach is the use of a fixed schedule for transitioning between low- and high-resolution training. Adaptive scheduling, potentially informed by convergence metrics or validation performance, could further optimize the trade-off between efficiency and robustness. Additionally, integrating this curriculum with other robust training paradigms, such as adversarial training, may yield further gains.
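As one hypothetical realization of such adaptive scheduling, which the paper does not implement, the resolution switch could be triggered when the SSL loss plateaus:

```python
# Hypothetical plateau-based stage switch; the paper uses a fixed 75%/25% split.
def should_switch(loss_history: list[float], window: int = 5, tol: float = 1e-3) -> bool:
    """Switch to full resolution once the mean per-epoch loss improvement
    over the last `window` epochs falls below `tol`."""
    if len(loss_history) < 2 * window:
        return False
    recent = sum(loss_history[-window:]) / window
    prior = sum(loss_history[-2 * window:-window]) / window
    return (prior - recent) < tol
```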
Future research may explore:
- Automated curriculum scheduling based on training dynamics.
- Extension to multi-modal or non-visual domains.
- Joint optimization of curriculum and augmentation strategies for task-specific robustness.
Conclusion
FastDINOv2 provides a practical and effective recipe for efficient and robust SSL pretraining in vision transformers. By structuring the learning process around frequency content and augmentations, it achieves strong empirical results in both efficiency and robustness, with minimal compromise in downstream performance. This work underscores the value of curriculum design and spectral analysis in the development of scalable, robust foundation models for computer vision.