UIR-2.5M: Universal Image Restoration Dataset

Updated 17 October 2025
  • UIR-2.5M is a comprehensive paired image dataset consisting of approximately 2.5M low- and high-quality image pairs covering 19 degradation types with over 200 intensity levels.
  • It combines synthetic data (roughly 97% of samples) with approximately 3% real-world data to support robust model training, benchmarking, and pre-training for complex image restoration tasks.
  • Empirical evaluations show significant gains in PSNR and perceptual quality (a minimum PSNR improvement of 3.77 dB in all-in-one restoration), demonstrating the dataset’s effectiveness in generalizing to unseen degradation scenarios.

The UIR-2.5M dataset is a large-scale, diverse paired image dataset explicitly constructed to advance research in universal image restoration. It consists of approximately 2.5 million low-quality and high-quality image pairs, encompassing a broad spectrum of real-world and synthetic degradation types and levels. UIR-2.5M was released by the authors of "Universal Image Restoration Pre-training via Masked Degradation Classification" (Hu et al., 15 Oct 2025) as a resource for training, benchmarking, and pre-training models capable of handling complex restoration tasks within a unified framework.

1. Composition and Structure

UIR-2.5M comprises roughly 2,483,000 paired samples, where each sample consists of a degradation-affected low-quality (LQ) image and its corresponding high-quality (HQ) reference. Its composition is structured to maximize diversity and coverage of restoration scenarios:

  • Degradation Types: 19 distinct degradation categories are included, corresponding to restoration tasks such as deraining, dehazing, denoising, deblurring, and low-light enhancement. These categories span both canonical and complex, mixed degradation effects.
  • Degradation Levels: Each degradation type is sampled at over 200 specific intensity levels, enabling fine-grained control and analysis of restoration performance across degradations.
  • Dataset Segments:
    • Single Degradation Segment: Approximately 1,774,975 image pairs, each degraded by a single degradation process.
    • Mixed Degradation Segment: Approximately 708,013 image pairs, where multiple degradation types co-occur within a single image.
  • Synthetic and Real-world Data: Synthetic data forms the majority, augmented with approximately 3% real-world samples to boost realism and facilitate generalization to natural imaging conditions.

A summary is shown below:

Segment            | Sample Count | Description
Single degradation | 1,774,975    | One degradation type per pair
Mixed degradation  | 708,013      | Multiple degradations per pair
Total              | ~2,483,000   | 19 types, >200 levels
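
For illustration, a paired dataset of this form is typically consumed as matched LQ/HQ tensors. The sketch below is a minimal PyTorch-style loader; the lq/ and hq/ folder layout and matching filenames are assumptions for illustration, not the documented UIR-2.5M structure.

```python
# Minimal paired-image dataset sketch (PyTorch). The lq/ and hq/ folder
# layout and matching filenames are illustrative assumptions, not the
# documented UIR-2.5M structure.
from pathlib import Path

import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision.transforms.functional import to_tensor


class PairedRestorationDataset(Dataset):
    def __init__(self, root: str):
        self.lq_paths = sorted(Path(root, "lq").glob("*.png"))
        self.hq_paths = sorted(Path(root, "hq").glob("*.png"))
        assert len(self.lq_paths) == len(self.hq_paths)

    def __len__(self) -> int:
        return len(self.lq_paths)

    def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        # Both tensors are in [0, 1] with shape (3, H, W).
        lq = to_tensor(Image.open(self.lq_paths[idx]).convert("RGB"))
        hq = to_tensor(Image.open(self.hq_paths[idx]).convert("RGB"))
        return lq, hq
```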

2. Generation and Curation Methodology

The curation of UIR-2.5M leverages a combination of rigorous dataset assembly and synthetic image generation:

  • Base Image Selection: Images from diverse existing low-level vision datasets were selected, emphasizing wide coverage over content and image statistics.
  • Synthetic Degradation: High-quality images were processed by carefully designed degradation models. These models simulate both classical effects (e.g., Gaussian noise, atmospheric distortion) and mixed real-world conditions (weather, sensor noise, motion blur). Over 200 degradation levels per type were created to ensure a fine-grained, representative sampling of possible intensity variations; a simplified synthesis recipe is sketched after this list.
  • Real-world Data Integration: Around 3% of samples originate from real-world scenarios, reflecting naturally occurring degradations. These provide benchmarks for evaluating restoration in authentic conditions and are incorporated to enhance the generalization capability of trained models.
  • Alignment Strategy: Direct paired acquisition of degraded–clean real-world images is often infeasible; thus, the dataset relies primarily on synthetic pairs, with real-world data carefully filtered and validated for representational diversity.
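
To make the recipe concrete, the sketch below samples a degradation type and an intensity level from a fine grid, optionally composing several types for a mixed sample. The operators (add_gaussian_noise, gamma_darken) and the level grids are simplified stand-ins, not the paper's actual degradation models.

```python
# Illustrative degradation synthesis: sample a type and an intensity level,
# optionally composing several types for a "mixed" sample. Operators and
# level grids are simplified stand-ins for the paper's degradation models.
import numpy as np


def add_gaussian_noise(img: np.ndarray, sigma: float) -> np.ndarray:
    noisy = img + np.random.randn(*img.shape) * sigma
    return np.clip(noisy, 0.0, 1.0)


def gamma_darken(img: np.ndarray, gamma: float) -> np.ndarray:
    # Crude low-light simulation via gamma compression (img in [0, 1]).
    return np.power(img, gamma)


# Each type is sampled on a fine grid of intensity levels (>200 in UIR-2.5M).
DEGRADATIONS = {
    "noise": (add_gaussian_noise, np.linspace(5 / 255, 50 / 255, 200)),
    "low_light": (gamma_darken, np.linspace(1.5, 4.0, 200)),
}


def synthesize_lq(hq: np.ndarray, mixed: bool = False) -> np.ndarray:
    rng = np.random.default_rng()
    n = rng.integers(2, 4) if mixed else 1  # compose 2-3 types when mixed
    lq = hq.copy()
    for name in rng.choice(list(DEGRADATIONS), size=n, replace=False):
        fn, levels = DEGRADATIONS[name]
        lq = fn(lq, rng.choice(levels))
    return lq
```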

3. Applications and Research Use Cases

UIR-2.5M is tailored for several research applications related to universal image restoration:

  • Model Training: Enables supervised learning for restoration networks targeting a broad spectrum of degradations within a unified framework.
  • Pre-training and Fine-tuning: Serves as an effective basis for large-scale pre-training approaches (e.g., Masked Degradation Classification Pre-training) that learn generalizable feature representations robust across degradation types and levels.
  • Handling Unseen Scenarios: The diversity, both in degradation types and levels, allows models trained on UIR-2.5M to generalize effectively to previously unseen distortions and combinations, a critical aspect for robust deployment in varied real-world environments.

This diversity is particularly exploited in the MaskDCPT pre-training approach, which leverages the dataset’s comprehensive coverage for learning representations that transfer well across tasks.
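
The paper's exact MaskDCPT formulation is not reproduced here; the sketch below only illustrates the general idea suggested by the method's name, namely an encoder that sees a masked degraded image and is trained to classify which degradation produced it. The backbone, masking scheme, and loss are all illustrative assumptions.

```python
# Hedged sketch of a masked-degradation-classification objective, inferred
# from the method's name only; the actual MaskDCPT losses and architecture
# may differ.
import torch
import torch.nn as nn

NUM_DEGRADATION_TYPES = 19  # matches the 19 degradation types in UIR-2.5M


class MaskedDegradationClassifier(nn.Module):
    def __init__(self, mask_ratio: float = 0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.encoder = nn.Sequential(  # stand-in for a real backbone
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(128, NUM_DEGRADATION_TYPES)

    def forward(self, lq: torch.Tensor) -> torch.Tensor:
        # Zero out random 8x8 patches of the degraded input before encoding.
        b, _, h, w = lq.shape
        keep = (torch.rand(b, 1, h // 8, w // 8, device=lq.device)
                > self.mask_ratio).float()
        keep = nn.functional.interpolate(keep, size=(h, w), mode="nearest")
        return self.head(self.encoder(lq * keep))


# One training step on random stand-in data.
model = MaskedDegradationClassifier()
lq_batch = torch.rand(4, 3, 64, 64)                     # degraded inputs
labels = torch.randint(0, NUM_DEGRADATION_TYPES, (4,))  # degradation types
loss = nn.CrossEntropyLoss()(model(lq_batch), labels)
loss.backward()
```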

4. Performance Metrics and Restoration Evaluation

Models trained and evaluated with UIR-2.5M are assessed using standard fidelity and perceptual quality metrics:

$$\text{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right)$$

where $\mathrm{MAX}_I$ is the maximum possible pixel value and $\mathrm{MSE}$ is the mean squared error between the restored and ground-truth images.
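
A direct implementation of this formula, assuming images scaled to [0, 1] so that $\mathrm{MAX}_I = 1$:

```python
# PSNR for images scaled to [0, 1] (MAX_I = 1). Matches the formula above.
import numpy as np


def psnr(restored: np.ndarray, reference: np.ndarray,
         max_i: float = 1.0) -> float:
    mse = np.mean((restored.astype(np.float64)
                   - reference.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_i ** 2 / mse)
```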

  • Perception-based Image Quality Evaluator (PIQE): A no-reference metric evaluating perceptual quality and distortion.
  • Empirical Results: MaskDCPT pre-trained models on UIR-2.5M achieve a minimum PSNR improvement of 3.77 dB in the all-in-one restoration task, with a 34.8% reduction in PIQE compared to baselines in real-world scenarios.

These benchmarks underscore the value of large-scale, diverse training data in enhancing both objective and subjective restoration performance.

5. Generalization and Robustness to Novel Conditions

A principal advantage of UIR-2.5M lies in its facilitation of generalization:

  • Unseen Degradations: Networks pre-trained (using MaskDCPT) on UIR-2.5M have demonstrated strong generalization to both “in-domain” and “out-of-domain” degradation levels, e.g., Gaussian denoising at noise levels not represented during training (see the evaluation sweep sketched after this list).
  • Ablation Analysis: Models trained with UIR-2.5M outperform baseline approaches when exposed to degradation types and levels absent from the training distribution, indicating robust transfer capabilities.
  • Mixed Degradation Handling: The inclusion of “mixed” segment samples allows for effective restoration in highly complex imaging scenarios, vital in real-world low-level vision tasks where degradations co-occur.
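
As a concrete example of such an out-of-distribution check, one can sweep Gaussian noise levels, including sigmas absent from training, and track PSNR. The harness below is hypothetical: model stands in for any trained restorer, and psnr refers to the implementation sketched in Section 4.

```python
# Hypothetical out-of-distribution sweep: evaluate a trained restorer at
# Gaussian noise levels, some of which were absent from training.
import numpy as np

TRAIN_SIGMAS = {15 / 255, 25 / 255, 50 / 255}           # seen during training
TEST_SIGMAS = [10 / 255, 15 / 255, 35 / 255, 70 / 255]  # includes unseen levels


def evaluate(model, clean_images: list[np.ndarray]) -> dict[float, float]:
    rng = np.random.default_rng(0)
    results = {}
    for sigma in TEST_SIGMAS:
        scores = []
        for hq in clean_images:
            lq = np.clip(hq + rng.normal(0, sigma, hq.shape), 0, 1)
            scores.append(psnr(model(lq), hq))  # psnr() from Section 4
        tag = "seen" if sigma in TRAIN_SIGMAS else "unseen"
        results[sigma] = float(np.mean(scores))
        print(f"sigma={sigma:.3f} ({tag}): PSNR={results[sigma]:.2f} dB")
    return results
```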

6. Accessibility and Licensing

UIR-2.5M is openly released for academic and research purposes:

  • Repository: Accessible at https://github.com/MILab-PKU/MaskDCPT, which includes the dataset, associated MaskDCPT code, and models.
  • Licensing and Usage: The paper does not detail licensing terms; however, such resources are typically provided under research-friendly licenses. Prospective users should consult the repository’s documentation for specifics on academic and commercial usage restrictions.

UIR-2.5M’s open distribution, comprehensive coverage, and careful design make it a central resource for universal image restoration and pre-training research, facilitating broad reproducibility, benchmarking, and methodological innovation in the field.

References

1. Hu et al., "Universal Image Restoration Pre-training via Masked Degradation Classification," 15 October 2025. https://github.com/MILab-PKU/MaskDCPT
