Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum (2505.12191v1)

Published 18 May 2025 in cs.CV, cs.AI, and cs.LG

Abstract: Self-Supervised Learning (SSL) has become a powerful solution to extract rich representations from unlabeled data. Yet, SSL research is mostly focused on clean, curated and high-quality datasets. As a result, applying SSL on noisy data remains a challenge, despite being crucial to applications such as astrophysics, medical imaging, geophysics or finance. In this work, we present a fully self-supervised framework that enables noise-robust representation learning without requiring a denoiser at inference or downstream fine-tuning. Our method first trains an SSL denoiser on noisy data, then uses it to construct a denoised-to-noisy data curriculum (i.e., training first on denoised, then noisy samples) for pretraining a SSL backbone (e.g., DINOv2), combined with a teacher-guided regularization that anchors noisy embeddings to their denoised counterparts. This process encourages the model to internalize noise robustness. Notably, the denoiser can be discarded after pretraining, simplifying deployment. On ImageNet-1k with ViT-B under extreme Gaussian noise ($\sigma=255$, SNR = 0.72 dB), our method improves linear probing accuracy by 4.8% over DINOv2, demonstrating that denoiser-free robustness can emerge from noise-aware pretraining. The code is available at https://github.com/wenquanlu/noisy_dinov2.

Summary

  • The paper presents a novel training framework that uses a two-stage curriculum (denoised pretraining followed by a transition to noisy data) to develop inherent noise robustness.
  • It employs an optional teacher-guided regularization to align embeddings between denoised and noisy inputs, optimizing performance across various noise levels.
  • Experimental results on synthetic noisy datasets demonstrate that the proposed method outperforms traditional denoiser pipelines in both accuracy and efficiency.

Training self-supervised learning (SSL) models like DINOv2 typically relies on large, clean datasets, which are not always available in real-world applications such as medical imaging, astrophysics, or finance, where data is frequently noisy. This paper, "Ditch the Denoiser: Emergence of Noise Robustness in Self-Supervised Learning from Data Curriculum" (2505.12191, 18 May 2025), addresses the challenge of training powerful SSL models directly on noisy data without requiring an external denoiser at inference or for downstream tasks. The standard approach of pre-processing noisy data with a denoiser before feeding it to the SSL model (the N2N + DINOv2 baseline) adds computational overhead, latency, and complexity to the deployment pipeline. The authors propose a fully self-supervised framework that encourages the SSL model to internalize noise robustness.

The core of their approach is a data curriculum training strategy combined with an optional teacher-guided regularization. The curriculum involves two main stages:

  1. Denoised Pretraining: An SSL denoiser (like Neighbor2Neighbor [neighbor2neighbor]) is first trained on the noisy dataset. This denoiser is then used to create a denoised version of the dataset. The SSL backbone (e.g., DINOv2) is initialized and trained on this denoised dataset for a certain number of epochs. This stage allows the model to learn robust features from a lower-entropy, less corrupted data distribution, providing a stable initialization.
  2. Noisy Curriculum Transition: The training is then "restarted" on the original noisy dataset, using the weights learned in the first stage as initialization; "restart" means resetting training dynamics such as the learning-rate schedule. Training continues on the noisy data, allowing the model to adapt and learn representations directly from corrupted inputs. The paper highlights that this curriculum, moving from easier (denoised) to harder (noisy) data, is crucial for developing intrinsic noise robustness. At the end of this stage, the denoiser is discarded, and the SSL model is ready for downstream tasks. A minimal sketch of the two-stage schedule appears after this list.
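
To make the curriculum concrete, here is a minimal sketch of the two-stage schedule. The routine names (`train_ssl`, `build_dinov2`), the `reset_schedule` flag, and the epoch counts are illustrative placeholders under assumed interfaces, not the authors' code.

```python
import torch

def run_noise_curriculum(noisy_dataset, denoiser, build_dinov2, train_ssl,
                         denoised_epochs=100, noisy_epochs=100):
    """Two-stage denoised-to-noisy curriculum (illustrative sketch)."""
    # Stage 1: denoise the corpus once with the frozen SSL denoiser,
    # then pretrain the backbone on the denoised images.
    denoiser.eval()
    with torch.no_grad():
        denoised_dataset = [denoiser(img.unsqueeze(0)).squeeze(0)
                            for img in noisy_dataset]

    model = build_dinov2()  # e.g. ViT-S/16 or ViT-B/16
    train_ssl(model, denoised_dataset, epochs=denoised_epochs)

    # Stage 2: "restart" on the original noisy data, keeping the Stage-1
    # weights but resetting the learning-rate schedule.
    train_ssl(model, noisy_dataset, epochs=noisy_epochs, reset_schedule=True)

    # The denoiser is discarded here; only `model` goes to downstream tasks.
    return model
```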

For downstream tasks like classification or instance recognition using noisy images, the learned representation $g_\theta(z)$ for a noisy input $z$ can be directly fed into a task-specific head $h_\theta$. The entire model $h_\theta(g_\theta(z))$ can be fine-tuned end-to-end on the noisy data, and inference occurs without any intermediate denoising step:

$\hat{y} = h_\theta(g_\theta(z))$

This denoiser-free pipeline significantly simplifies deployment compared to the N2N + DINOv2 baseline, which requires applying the denoiser $f_\theta$ before inference: $\hat{y} = h_\theta(g_\theta(f_\theta(z)))$.
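
As an illustration only, a denoiser-free inference step might look like the following, where `backbone` and `head` stand in for $g_\theta$ and $h_\theta$:

```python
import torch

def predict(noisy_batch, backbone, head):
    """Denoiser-free inference: y_hat = h_theta(g_theta(z)).

    `backbone` is the pretrained SSL encoder and `head` a task-specific
    module (e.g. a linear classifier); both names are placeholders.
    """
    with torch.no_grad():
        features = backbone(noisy_batch)   # g_theta(z)
        return head(features)              # h_theta(g_theta(z))

# The N2N + DINOv2 baseline would instead compute
# head(backbone(denoiser(noisy_batch))), adding a denoiser pass at inference.
```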

Under high or extreme noise levels, the authors introduce an enhanced method called DINOv2 w/ NCT (Noise Curriculum Teacher). This method adds a regularization term during the noisy training stage (Stage 2). It utilizes a frozen teacher model ($T_{\mathrm{dn}}$) whose weights are fixed from the end of the denoised pretraining stage. This frozen teacher processes the denoised version of the input image ($x_{\mathrm{dn}}$), while the trainable student ($S$) processes the noisy version ($x$). Crucially, identical augmentations are applied to both $x_{\mathrm{dn}}$ and $x$ when processed by $T_{\mathrm{dn}}$ and $S$, respectively, to ensure alignment between their embeddings. The regularization loss encourages the student's output scores to be close to the frozen teacher's output scores on their respective (identically augmented) inputs. The total loss during noisy training becomes:

$L = L_{\mathrm{dinov2}} + \lambda\, L_{\mathrm{dino+ibot}}\left(T_{\mathrm{dn}}\left(\tau_t\left(x_{\mathrm{dn}}\right)\right),\, S\left(\tau_s(x)\right)\right)$

where $L_{\mathrm{dinov2}}$ is the standard DINOv2 loss between the trainable teacher and student on the noisy input, and the second term is the regularization with $\lambda$ controlling its strength. This regularization anchors the representations learned from noisy data to the more stable representations learned from denoised data, preventing degradation under severe noise.
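
A schematic of one NCT training step under these definitions might look as follows; `paired_augment`, `dinov2_loss`, and `distill_loss` are assumed helpers, not the released implementation.

```python
import torch

def nct_step(student, ema_teacher, frozen_dn_teacher,
             x_noisy, x_denoised, paired_augment,
             dinov2_loss, distill_loss, lam=1.0):
    """One hypothetical training step with the NCT regularizer."""
    # The same random crops/flips must be applied to the noisy image and its
    # denoised counterpart so the two embeddings stay aligned.
    view_noisy, view_denoised = paired_augment(x_noisy, x_denoised)

    # Standard DINOv2 objective on the noisy input (EMA teacher vs. student).
    base = dinov2_loss(student, ema_teacher, x_noisy)

    # Regularizer: anchor the student's scores on the noisy view to the
    # frozen denoised-pretrained teacher's scores on the denoised view.
    with torch.no_grad():
        target = frozen_dn_teacher(view_denoised)      # T_dn(tau_t(x_dn))
    pred = student(view_noisy)                         # S(tau_s(x))
    return base + lam * distill_loss(target, pred)     # + lambda * regularizer
```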

Implementation involves training on datasets like ImageNet-100 and ImageNet-1k corrupted with synthetic Gaussian, Shot, and Speckle noise, which mimic real-world degradations found in various applications. Noise is added to raw images before standard preprocessing steps, increasing the practical challenge. ViT-S/16 and ViT-B/16 architectures are used for the DINOv2 backbone. Training requires considerable computational resources, with experiments run on RTX 4090 or L40S GPUs, taking hours to days depending on the dataset and duration.
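
For reference, the Gaussian-noise setting can be reproduced by corrupting raw uint8 images before any resizing or normalization; a small sketch follows (the clipping convention is a common choice and may differ from the paper's exact pipeline):

```python
import numpy as np

def add_gaussian_noise(image_uint8: np.ndarray, sigma: float = 255.0) -> np.ndarray:
    """Corrupt a raw HxWxC uint8 image with Gaussian noise of std `sigma`.

    Noise is injected before standard preprocessing; sigma = 255 corresponds
    to the extreme setting (SNR about 0.72 dB) reported on ImageNet-1k.
    """
    img = image_uint8.astype(np.float32)
    noisy = img + np.random.normal(loc=0.0, scale=sigma, size=img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```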

The experimental results demonstrate the effectiveness of the proposed methods:

  • Linear Probing: On noisy validation sets, DINOv2 w/ NC (Noise Curriculum) consistently and significantly outperforms DINOv2 trained directly on noisy data, and DINOv2 w/ NCT provides further substantial improvements under extreme noise conditions (a minimal probing sketch follows this list).
  • Clean Data Performance: Surprisingly, models trained with DINOv2 w/ NC or NCT often outperform the N2N + DINOv2 baseline when evaluated on clean validation sets. This indicates that the proposed methods learn more accurate and generalizable representations compared to pipelines that rely on explicit denoising during pretraining. This is a key practical implication for scenarios where noisy data is abundant for pretraining but clean data is used for fine-tuning or inference.
  • Instance Recognition: Evaluation on Oxford and Paris datasets shows similar trends, with DINOv2 w/ NC/NCT outperforming the noisy baseline and often matching or exceeding N2N + DINOv2, validating the versatility beyond classification.
  • General Applicability: The noise curriculum approach (DINOv2 w/ NC) was shown to improve performance when applied to other SSL models (SimCLR, MoCo v3, SimSiam, iBOT, DINO), suggesting its broader utility across different SSL paradigms.
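
For context, linear probing freezes the pretrained backbone and fits only a linear classifier on its features; a minimal sketch over pre-extracted embeddings (array names are hypothetical):

```python
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats, train_labels, val_feats, val_labels):
    """Fit a linear classifier on frozen-backbone features (N x D arrays)
    and report top-1 accuracy on the (noisy or clean) validation split."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(val_feats, val_labels)
```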

Ablation studies confirm the importance of both curriculum stages and the specific regularization design: the denoised pretraining provides a crucial initialization, the restart allows adaptation to noise, and aligned regularization is critical under extreme noise. While scaling training duration and dataset size helps standard DINOv2 under moderate noise, the proposed methods provide a more significant boost, especially under high noise, and potentially converge faster.

Limitations of the approach include the dependency on an effective self-supervised denoiser being available for the specific noise type and the need to tune the curriculum schedule (e.g., how many epochs to train on denoised data). Future work aims to address these limitations, potentially through adaptive curriculum strategies, and to extend the methodology to other data modalities.
