- The paper presents an innovative content-oriented framework, RAM++, that leverages adaptive masking and foundation model regularization to robustly restore images across diverse degradations.
- The methodology integrates Adaptive Semantic-Aware Mask pre-training, Mask Attribute Conductance fine-tuning, and DINOv2-based Robust Feature Regularization to achieve balanced performance and state-of-the-art results.
- Experimental results demonstrate improved PSNR/SSIM metrics and strong generalization on both seen and out-of-distribution degradations, validating its efficacy in unified image restoration.
RAM++: Robust Representation Learning via Adaptive Mask for All-in-One Image Restoration
Introduction and Motivation
The RAM++ framework addresses the persistent challenges in all-in-one image restoration, where a single model is required to handle diverse and often strongly coupled degradations (e.g., rain, haze, noise, blur, compression artifacts). Prior approaches predominantly focus on degradation-oriented modeling, which leads to overfitting, unbalanced task performance, and poor generalization to unseen or mixed degradations. RAM++ proposes a content-oriented paradigm, emphasizing the extraction of robust, intrinsic image representations that are invariant to degradation types. The framework is built upon three core innovations: Adaptive Semantic-Aware Mask (AdaSAM) pre-training, Mask Attribute Conductance (MAC) guided fine-tuning, and Robust Feature Regularization (RFR) leveraging DINOv2 features.
Methodology
Adaptive Semantic-Aware Mask (AdaSAM) Pre-training
RAM++ introduces a two-stage training pipeline. In the pre-training stage, AdaSAM generates pixel-level masks that target semantically and texturally rich regions of degraded images. Unlike random or coarse patch-based masking, AdaSAM derives region importance from attention maps, propagates these scores to the pixel level, and applies multinomial sampling to select high-information pixels for masking. The restoration network is then trained on paired degraded-clean data to reconstruct the clean content at the masked locations, with an L1 loss restricted to the masked regions. This adversarial-like interplay, in which the masking network seeks out information-rich regions and the restoration network must reconstruct them, compels the model to learn both generative and content priors, aggregating diverse degradations into a unified, semantically consistent latent space.
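The masking-and-reconstruction loop can be made concrete with a short sketch. The following is a minimal PyTorch illustration, not the authors' code: it assumes the attention-derived importance scores have already been propagated to a per-pixel map, and the function names (`adaptive_pixel_mask`, `masked_l1_loss`) and zero-fill masking are illustrative choices.

```python
import torch

def adaptive_pixel_mask(importance, mask_ratio=0.5):
    """Sample a pixel-level binary mask from an importance map.

    importance: (B, 1, H, W) non-negative scores, e.g. attention maps
    propagated to pixel resolution. Higher-scoring (information-rich)
    pixels are more likely to be masked.
    """
    b, _, h, w = importance.shape
    probs = importance.flatten(1)
    probs = probs / probs.sum(dim=1, keepdim=True).clamp_min(1e-8)
    num_masked = int(mask_ratio * h * w)
    idx = torch.multinomial(probs, num_masked, replacement=False)
    mask = torch.zeros(b, h * w, device=importance.device)
    mask.scatter_(1, idx, 1.0)            # 1 = masked pixel
    return mask.view(b, 1, h, w)

def masked_l1_loss(pred, clean, mask):
    """L1 reconstruction loss restricted to the masked pixels."""
    return (mask * (pred - clean).abs()).sum() / mask.sum().clamp_min(1.0)

# toy usage: mask a degraded image and supervise only masked locations
degraded = torch.rand(2, 3, 64, 64)
clean = torch.rand(2, 3, 64, 64)
importance = torch.rand(2, 1, 64, 64)     # stand-in for attention scores
mask = adaptive_pixel_mask(importance)
masked_input = degraded * (1 - mask)      # zero-fill masked pixels
pred = masked_input                       # stand-in for the restoration net
loss = masked_l1_loss(pred, clean, mask)
```

Sampling without replacement from the normalized importance distribution is what biases the mask toward textured, semantic regions rather than uniform coverage.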
Mask Attribute Conductance (MAC) Fine-tuning
A key challenge in MIM-based pre-training for low-level vision is the input integrity gap: the model is pre-trained on masked images but must operate on full images at inference. To bridge this gap, RAM++ introduces MAC, a gradient-based attribution method that quantifies each network layer's contribution to resolving the input integrity gap. MAC extends integrated gradients and neuron conductance to a mask-attribute path, identifying the layers most critical to fine-tune. By updating only the top-k% of layers (empirically as low as 30%), RAM++ preserves the pre-trained priors while adapting to full-image input, achieving strong performance with minimal overfitting.
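A rough sketch of the idea follows. The paper defines MAC via integrated gradients and neuron conductance; this minimal PyTorch approximation instead accumulates parameter-gradient magnitudes along an interpolation path from the masked input to the full input, scoring parameter tensors as a proxy for layers. It is a simplification under those assumptions, not the exact MAC formula.

```python
import torch
import torch.nn as nn

def mac_layer_scores(model, degraded, clean, mask, loss_fn, steps=8):
    """Score layers by gradient magnitude along the masked->full path.

    Interpolates the input from the masked image (pre-training condition)
    to the full image (inference condition); layers that accumulate large
    gradients along this path are most involved in closing the input
    integrity gap.
    """
    scores = {name: 0.0 for name, _ in model.named_parameters()}
    masked = degraded * (1 - mask)
    for step in range(1, steps + 1):
        alpha = step / steps
        x = masked + alpha * (degraded - masked)   # path: masked -> full
        model.zero_grad()
        loss_fn(model(x), clean).backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                scores[name] += p.grad.abs().sum().item() / steps
    return scores

def freeze_all_but_top_k(model, scores, k=0.3):
    """Leave only the top-k fraction of parameter tensors trainable."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = set(ranked[: max(1, int(k * len(ranked)))])
    for name, p in model.named_parameters():
        p.requires_grad = name in keep

# toy usage with a stand-in restoration network
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(8, 3, 3, padding=1))
degraded, clean = torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32)
mask = (torch.rand(1, 1, 32, 32) > 0.5).float()
scores = mac_layer_scores(net, degraded, clean, mask, nn.L1Loss())
freeze_all_but_top_k(net, scores, k=0.3)
```

The key design point survives the simplification: attribution is computed along the masked-to-full input path rather than the usual black-to-image baseline, so the selected layers are precisely those implicated in the integrity gap.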
Robust Feature Regularization (RFR) with DINOv2
RAM++ further incorporates DINOv2, a vision foundation model known for its semantic consistency and degradation-invariant representations. During fine-tuning, multi-level DINOv2 features are dynamically fused with the restoration network's features via a lightweight gating and projection mechanism. This fusion enhances the model's ability to extract essential information, stabilizes feature representations, and improves generalization across both seen and unseen degradations. The fusion is performed at multiple hierarchical levels, and a simple 1×1 convolution is used to align and integrate the features.
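The fusion step can be sketched as below, assuming DINOv2 patch tokens have already been reshaped into a spatial feature map. The exact gating form here (a sigmoid gate over concatenated features) is an assumption, since the text only specifies a lightweight gating/projection mechanism with 1×1 convolutions; `GatedDINOFusion` is an illustrative name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedDINOFusion(nn.Module):
    """Fuse a frozen DINOv2 feature map into a restoration feature map.

    A 1x1 conv projects the DINOv2 channels to the restoration width, and
    a learned per-pixel gate controls how much foundation-model semantics
    is injected at this level.
    """
    def __init__(self, dino_dim, feat_dim):
        super().__init__()
        self.proj = nn.Conv2d(dino_dim, feat_dim, kernel_size=1)
        self.gate = nn.Conv2d(feat_dim * 2, feat_dim, kernel_size=1)

    def forward(self, feat, dino_feat):
        # resize DINOv2 tokens-as-feature-map to the restoration resolution
        dino = F.interpolate(dino_feat, size=feat.shape[-2:],
                             mode='bilinear', align_corners=False)
        dino = self.proj(dino)
        g = torch.sigmoid(self.gate(torch.cat([feat, dino], dim=1)))
        return feat + g * dino      # gated residual injection

# toy usage: one fusion module per hierarchical level
fuse = GatedDINOFusion(dino_dim=768, feat_dim=64)
feat = torch.rand(1, 64, 64, 64)          # restoration-network feature
dino_feat = torch.rand(1, 768, 16, 16)    # DINOv2 patch features as a map
fused = fuse(feat, dino_feat)
```

The gated residual form lets the network fall back to its own features (gate near zero) where the foundation-model semantics are unhelpful, which fits the stabilization role the paper ascribes to RFR.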
Experimental Results
RAM++ is evaluated on both 3-task and 7-task all-in-one restoration settings, covering dehazing, deraining, denoising, motion deblurring, low-light enhancement, kernel deblurring, and JPEG artifact removal. Across all benchmarks, RAM++ achieves state-of-the-art or highly competitive results, with notable improvements in PSNR and SSIM over prior methods. For example, in the 7-task setting, RAM++ outperforms the next-best method by up to 0.70 dB in average PSNR. The model maintains balanced performance across tasks, exhibiting the lowest variance in PSNR/SSIM as the number of tasks increases.
Generalization and Robustness
RAM++ demonstrates strong generalization to out-of-distribution (OOD) degradations, including unseen noise types and underwater image enhancement. On the Urban100 dataset with OOD noise, RAM++ achieves an average PSNR gain of 1.84 dB over the Restormer baseline. On the UIEB underwater dataset, RAM++ outperforms all prior all-in-one and pre-training-based methods, indicating effective transfer of learned priors to novel domains.
Ablation and Analysis
Ablation studies confirm the necessity of each component:
- AdaSAM: Outperforms random masking strategies, with pixel-level adaptive masking yielding the best results.
- MAC: Selective fine-tuning based on MAC outperforms both random and integrated gradient-based selection.
- RFR: Multi-level DINOv2 feature fusion provides significant gains over single-level or alternative fusion strategies.
- Fine-tuning Ratio: Updating a small fraction of layers (10–30%) achieves near-optimal performance and superior generalization (lower SRGA, a generalization metric where lower is better), while full fine-tuning can lead to overfitting.
Interpretability analyses using Causal Effect Maps (CEM) show that RAM++ maintains a broad and effective receptive field, accurately discriminates between positive and negative information, and prioritizes background structure reconstruction over degradation removal. This content-oriented learning is critical for robust generalization.
Implications and Future Directions
RAM++ establishes a new paradigm for all-in-one image restoration by shifting the focus from degradation classification to robust content representation. The integration of adaptive masking, selective fine-tuning, and foundation model regularization enables balanced, scalable, and generalizable restoration across a wide range of degradations. The framework's modularity allows for straightforward extension to new tasks and domains.
Potential future directions include:
- Multi-task learning and data mixing: Addressing inherent conflicts between diverse degradations in mixed datasets.
- Extension to video restoration: Incorporating temporal consistency for sequential data.
- Further leveraging foundation models: Exploring larger vision foundation models or multimodal vision-language models for richer priors.
Conclusion
RAM++ demonstrates that robust, content-oriented representation learning—enabled by adaptive masking, selective fine-tuning, and foundation model regularization—can overcome the limitations of degradation-oriented approaches in all-in-one image restoration. The framework achieves state-of-the-art performance, strong generalization, and balanced results across diverse and challenging scenarios, providing a solid foundation for future research in unified image restoration systems.