- The paper presents a novel training framework that uses a two-stage curriculum (denoised pretraining followed by a transition to the original noisy data) to develop inherent noise robustness.
- It adds an optional teacher-guided regularization that aligns the student's embeddings on noisy inputs with a frozen teacher's embeddings on denoised inputs, preserving performance under high and extreme noise levels.
- Experiments on synthetically corrupted datasets show that the proposed method matches or outperforms the traditional denoiser pipeline (N2N + DINOv2) in accuracy while removing the denoiser from inference, simplifying deployment.
Training self-supervised learning (SSL) models like DINOv2 often relies on large, clean datasets, which are not always available in real-world applications such as medical imaging, astrophysics, or finance, where data is frequently noisy. The paper "Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum" (18 May 2025) addresses the challenge of training powerful SSL models directly on noisy data without requiring an external denoiser during inference or downstream tasks. The standard approach of pre-processing noisy data with a denoiser before feeding it to the SSL model (referred to as the N2N + DINOv2 baseline) adds computational overhead, latency, and complexity to the deployment pipeline. The authors propose a novel, fully self-supervised framework that encourages the SSL model to internalize noise robustness.
The core of their approach is a data-curriculum training strategy combined with an optional teacher-guided regularization. The curriculum involves two main stages (a code sketch follows the list):
- Denoised Pretraining: An SSL denoiser (like Neighbor2Neighbor [neighbor2neighbor]) is first trained on the noisy dataset. This denoiser is then used to create a denoised version of the dataset. The SSL backbone (e.g., DINOv2) is initialized and trained on this denoised dataset for a certain number of epochs. This stage allows the model to learn robust features from a lower-entropy, less corrupted data distribution, providing a stable initialization.
- Noisy Curriculum Transition: The training is then "restarted" on the original noisy dataset, using the weights learned in the first stage as initialization. "Restart" implies resetting training dynamics like learning rates. Training continues on the noisy data, allowing the model to adapt and learn representations directly from corrupted inputs. The paper highlights that this curriculum, moving from easier (denoised) to harder (noisy) data, is crucial for developing intrinsic noise robustness. At the end of this stage, the denoiser is discarded, and the SSL model is ready for downstream tasks.
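To make the schedule concrete, below is a minimal PyTorch-style sketch of the two-stage curriculum. The callables `train_denoiser`, `build_backbone`, and `train_ssl_epoch` are placeholders for the actual Neighbor2Neighbor and DINOv2 training code (not reproduced here), and the epoch counts are illustrative rather than the paper's settings.

```python
import copy

def noise_curriculum(noisy_dataset, train_denoiser, build_backbone, train_ssl_epoch,
                     denoised_epochs=100, noisy_epochs=100):
    """Two-stage noise curriculum; the injected callables stand in for the
    Neighbor2Neighbor and DINOv2 training loops (hypothetical signatures)."""
    # Fit a self-supervised denoiser (e.g. Neighbor2Neighbor) on the noisy data,
    # then materialize a denoised copy of the dataset.
    denoiser = train_denoiser(noisy_dataset)
    denoised_dataset = [denoiser(x) for x in noisy_dataset]

    # Stage 1: denoised pretraining, giving a stable initialization learned from
    # the lower-entropy, less corrupted distribution.
    model = build_backbone()  # e.g. DINOv2 with a ViT-S/16 backbone
    for epoch in range(denoised_epochs):
        train_ssl_epoch(model, denoised_dataset, epoch, total_epochs=denoised_epochs)

    # Keep a frozen snapshot of the Stage-1 model; only needed if the optional
    # NCT regularizer is used during Stage 2.
    frozen_teacher = copy.deepcopy(model)

    # Stage 2: "restart" on the original noisy data. Schedules (learning rate,
    # teacher momentum, etc.) are reset, but the Stage-1 weights are kept.
    for epoch in range(noisy_epochs):
        train_ssl_epoch(model, noisy_dataset, epoch, total_epochs=noisy_epochs)

    # The denoiser is discarded after training; only `model` is used downstream.
    return model, frozen_teacher
```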
For downstream tasks like classification or instance recognition on noisy images, the learned representation $g_\theta(z)$ of a noisy input $z$ can be fed directly into a task-specific head $h_\theta$. The entire model $h_\theta(g_\theta(z))$ can be fine-tuned end-to-end on the noisy data, and inference occurs without any intermediate denoising step:
$\hat{y} = h_\theta(g_\theta(z))$
This denoiser-free pipeline significantly simplifies deployment compared to the N2N + DINOv2 baseline, which requires applying the denoiser $f_\theta$ before inference: $\hat{y} = h_\theta(g_\theta(f_\theta(z)))$.
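To illustrate the difference at inference time, here is a short sketch, assuming `backbone`, `head`, and `denoiser` are already-trained `nn.Module`s (the function names are illustrative, not the authors' API):

```python
import torch

@torch.no_grad()
def predict_denoiser_free(z, backbone, head):
    """Proposed pipeline: the noisy input z goes straight through the backbone."""
    return head(backbone(z))            # y_hat = h_theta(g_theta(z))

@torch.no_grad()
def predict_n2n_baseline(z, denoiser, backbone, head):
    """N2N + DINOv2 baseline: an extra denoising forward pass at inference time."""
    return head(backbone(denoiser(z)))  # y_hat = h_theta(g_theta(f_theta(z)))
```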
Under high or extreme noise levels, the authors introduce an enhanced method called DINOv2 w/ NCT (Noise Curriculum Teacher). This method adds a regularization term during the noisy training stage (Stage 2). It utilizes a frozen teacher model $T_{\mathrm{dn}}$ whose weights are fixed from the end of the denoised pretraining stage. This frozen teacher processes the denoised version of the input image ($x_{\mathrm{dn}}$), while the trainable student $S$ processes the noisy version ($x$). Crucially, identical augmentations are applied to both $x_{\mathrm{dn}}$ and $x$ when processed by $T_{\mathrm{dn}}$ and $S$, respectively, to ensure alignment between their embeddings. The regularization loss encourages the student's output scores to be close to the frozen teacher's output scores on their respective (identically augmented) inputs. The total loss during noisy training becomes:
$L = L_{\mathrm{dinov2}} + \lambda \, L_{\mathrm{dino\text{-}ibot}}\!\left(T_{\mathrm{dn}}\!\left(\tau_t\left(x_{\mathrm{dn}}\right)\right),\, S\!\left(\tau_s(x)\right)\right)$
where $L_{\mathrm{dinov2}}$ is the standard DINOv2 loss between the trainable teacher and student on the noisy input, and the second term is the regularization, with $\lambda$ controlling its strength. This regularization anchors the representations learned from noisy data to the more stable representations learned from denoised data, preventing degradation under severe noise.
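A hedged sketch of how this Stage-2 objective could be assembled is given below; `dinov2_loss` stands in for the standard DINO/iBOT objective, and the `augment` helper with shared sampled parameters is a simplification of the identical-augmentation requirement (all names are assumptions, not the authors' code).

```python
import torch
import torch.nn.functional as F

def nct_total_loss(student, trainable_teacher, frozen_teacher,
                   x_noisy, x_denoised, augment, dinov2_loss, lam=1.0):
    """Total loss for noisy training with the NCT regularizer (illustrative).

    The models are assumed to output prototype scores (pre-softmax logits), and
    `augment(x, params)` applies the same sampled augmentation parameters to
    whichever input it is given.
    """
    # Standard DINOv2 loss between the trainable teacher and the student on the noisy input.
    loss_dinov2 = dinov2_loss(trainable_teacher, student, x_noisy)

    # NCT term: the frozen teacher sees the denoised input, the student sees the
    # noisy input, and both receive identically parameterized augmentations.
    params = augment.sample_params()
    with torch.no_grad():
        t_scores = frozen_teacher(augment(x_denoised, params))  # T_dn(tau_t(x_dn))
    s_scores = student(augment(x_noisy, params))                 # S(tau_s(x))

    # Cross-entropy between the teacher's and student's score distributions,
    # in the spirit of the DINO/iBOT losses.
    reg = torch.sum(-F.softmax(t_scores, dim=-1) * F.log_softmax(s_scores, dim=-1),
                    dim=-1).mean()

    return loss_dinov2 + lam * reg
```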
Implementation involves training on datasets like ImageNet-100 and ImageNet-1k corrupted with synthetic Gaussian, Shot, and Speckle noise, which mimic real-world degradations found in various applications. Noise is added to raw images before standard preprocessing steps, increasing the practical challenge. ViT-S/16 and ViT-B/16 architectures are used for the DINOv2 backbone. Training requires considerable computational resources, with experiments run on RTX 4090 or L40S GPUs, taking hours to days depending on the dataset and duration.
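As a point of reference, synthetic corruptions of this kind are typically generated along the following lines; the severity parameterization here is illustrative and may differ from the paper's exact settings.

```python
import numpy as np

def corrupt(img, noise_type="gaussian", severity=0.2, rng=None):
    """Add synthetic noise to a raw image with values in [0, 1], *before* any
    standard preprocessing (resizing, normalization) is applied."""
    rng = rng or np.random.default_rng()
    img = np.asarray(img, dtype=np.float32)
    if noise_type == "gaussian":          # additive Gaussian noise
        noisy = img + rng.normal(0.0, severity, img.shape)
    elif noise_type == "shot":            # Poisson (shot) noise
        photons = max(1.0, 1.0 / severity ** 2)
        noisy = rng.poisson(img * photons) / photons
    elif noise_type == "speckle":         # multiplicative speckle noise
        noisy = img * (1.0 + rng.normal(0.0, severity, img.shape))
    else:
        raise ValueError(f"unknown noise type: {noise_type}")
    return np.clip(noisy, 0.0, 1.0)
```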
The experimental results demonstrate the effectiveness of the proposed methods:
- Linear Probing: On noisy validation sets, DINOv2 w/ NC (the noise curriculum without the teacher regularizer) consistently and significantly outperforms DINOv2 trained directly on noisy data. DINOv2 w/ NCT provides further substantial improvements under extreme noise conditions.
- Clean Data Performance: Surprisingly, models trained with DINOv2 w/ NC or NCT often outperform the N2N + DINOv2 baseline when evaluated on clean validation sets. This indicates that the proposed methods learn more accurate and generalizable representations compared to pipelines that rely on explicit denoising during pretraining. This is a key practical implication for scenarios where noisy data is abundant for pretraining but clean data is used for fine-tuning or inference.
- Instance Recognition: Evaluation on Oxford and Paris datasets shows similar trends, with DINOv2 w/ NC/NCT outperforming the noisy baseline and often matching or exceeding N2N + DINOv2, validating the versatility beyond classification.
- General Applicability: The noise curriculum approach (DINOv2 w/ NC) was shown to improve performance when applied to other SSL models (SimCLR, MoCo v3, SimSiam, iBOT, DINO), suggesting its broader utility across different SSL paradigms.
Ablation studies confirm the importance of both stages of the curriculum and of the specific regularization design: denoised pretraining provides a crucial initialization, the restart allows adaptation to noise, and aligned regularization is critical under extreme noise. While scaling training duration and dataset size helps standard DINOv2 under moderate noise, the proposed methods provide a more significant boost, especially under high noise, and potentially converge faster.
Limitations of the approach include the dependency on an effective self-supervised denoiser being available for the specific noise type and the need to tune the curriculum schedule (e.g., how many epochs to train on denoised data). Future work aims to address these limitations, potentially through adaptive curriculum strategies, and to extend the methodology to other data modalities.