UIR-2.5M: Universal Image Restoration Dataset
- UIR-2.5M is a comprehensive paired image dataset consisting of approximately 2.5M low- and high-quality image pairs covering 19 degradation types with over 200 intensity levels.
- It combines predominantly synthetic data with roughly 3% real-world samples to support robust model training, benchmarking, and pre-training for complex image restoration tasks.
- Empirical evaluations report a minimum PSNR gain of 3.77 dB on all-in-one restoration and a 34.8% PIQE reduction in real-world scenarios, demonstrating the dataset's effectiveness in generalizing to unseen degradation conditions.
The UIR-2.5M dataset is a large-scale, diverse paired image dataset explicitly constructed to advance research in universal image restoration. It consists of approximately 2.5 million low-quality and high-quality image pairs, encompassing a broad spectrum of real-world and synthetic degradation types and levels. UIR-2.5M was released by the authors of "Universal Image Restoration Pre-training via Masked Degradation Classification" (Hu et al., 15 Oct 2025) as a resource for training, benchmarking, and pre-training models capable of handling complex restoration tasks within a unified framework.
1. Composition and Structure
UIR-2.5M comprises roughly 2,483,000 paired samples, where each sample consists of a degradation-affected low-quality (LQ) image and its corresponding high-quality (HQ) reference. Its composition is structured to maximize diversity and coverage of restoration scenarios:
- Degradation Types: 19 distinct degradation categories are included, such as deraining, dehazing, denoising, deblurring, and low-light enhancement. These types span both canonical and complex, mixed degradation effects.
- Degradation Levels: Each degradation type is sampled at over 200 specific intensity levels, enabling fine-grained control and analysis of restoration performance across degradations.
- Dataset Segments:
  - Single Degradation Segment: Approximately 1,774,975 image pairs, each degraded by a single degradation process.
  - Mixed Degradation Segment: Approximately 708,013 image pairs, where multiple degradation types co-occur within a single image.
- Synthetic and Real-world Data: Synthetic data forms the majority, augmented with approximately 3% real-world samples to boost realism and facilitate generalization to natural imaging conditions.
A summary is shown below, followed by a sketch of how a sample record might be represented:
| Segment | Sample Count | Description |
|---|---|---|
| Single degradation | 1,774,975 | One degradation type per pair |
| Mixed degradation | 708,013 | Multiple degradations per pair |
| Total | ~2,483,000 | 19 types, >200 levels |
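
To make the segment structure concrete, the following minimal sketch shows one way a paired sample might be represented in code. The field names (`lq_path`, `degradations`, `levels`, and so on) are illustrative assumptions, not the dataset's actual metadata schema.

```python
from dataclasses import dataclass
from enum import Enum

class Segment(Enum):
    SINGLE = "single"   # one degradation process per pair
    MIXED = "mixed"     # multiple co-occurring degradations

@dataclass
class PairedSample:
    lq_path: str                # low-quality (degraded) image
    hq_path: str                # high-quality reference
    degradations: list[str]     # e.g. ["gaussian_noise"] or ["rain", "haze"]
    levels: list[float]         # one intensity level per degradation
    segment: Segment
    is_real: bool = False       # ~3% of samples are real-world captures

# Example record from the single-degradation segment:
sample = PairedSample(
    lq_path="lq/000001.png",
    hq_path="hq/000001.png",
    degradations=["gaussian_noise"],
    levels=[25.0],              # e.g. noise sigma on a 0-255 scale
    segment=Segment.SINGLE,
)
```

A mixed-segment record would simply carry multiple entries in `degradations` and `levels`, which keeps single and mixed samples under one schema.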
2. Generation and Curation Methodology
The curation of UIR-2.5M leverages a combination of rigorous dataset assembly and synthetic image generation:
- Base Image Selection: Images from diverse existing low-level vision datasets were selected, emphasizing wide coverage over content and image statistics.
- Synthetic Degradation: High-quality images were processed by carefully designed degradation models. These models simulate both classical effects (e.g., Gaussian noise, atmospheric distortion) and mixed real-world conditions (weather, sensor noise, motion blur). Over 200 degradation levels per type were created to ensure a fine-grained, representative sampling of possible intensity variations (see the sketch after this list).
- Real-world Data Integration: Around 3% of samples originate from real-world scenarios, reflecting naturally occurring degradations. These provide benchmarks for evaluating restoration in authentic conditions and are incorporated to enhance the generalization capability of trained models.
- Alignment Strategy: Direct paired acquisition of degraded–clean real-world images is often infeasible; thus, the dataset relies primarily on synthetic pairs, with real-world data carefully filtered and validated for representational diversity.
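
As an illustration of the level-grid design, the sketch below synthesizes an LQ image from an HQ one by sampling a Gaussian-noise intensity from a fine-grained grid. The grid bounds and the `degrade_gaussian_noise` helper are hypothetical; the dataset's actual degradation models are more varied.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical fine-grained intensity grid: the dataset samples each
# degradation type at over 200 discrete levels; the bounds here are assumed.
NOISE_SIGMAS = np.linspace(1.0, 50.0, 200)  # sigma on a 0-255 pixel scale

def degrade_gaussian_noise(hq: np.ndarray, sigma: float) -> np.ndarray:
    """Synthesize a low-quality image by adding i.i.d. Gaussian noise."""
    noise = rng.normal(0.0, sigma, size=hq.shape)
    return np.clip(hq.astype(np.float64) + noise, 0, 255).astype(np.uint8)

# Stand-in HQ image; in practice this would be a clean dataset image.
hq = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
sigma = float(rng.choice(NOISE_SIGMAS))  # pick one discrete level
lq = degrade_gaussian_noise(hq, sigma)
```

The same pattern extends to other degradation types: each type defines its own parameter grid, and mixed-segment samples chain several such operators on one image.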
3. Applications and Research Use Cases
UIR-2.5M is tailored for several research applications related to universal image restoration:
- Model Training: Enables supervised learning for restoration networks targeting a broad spectrum of degradations within a unified framework (a generic training step is sketched at the end of this section).
- Pre-training and Fine-tuning: Serves as an effective basis for large-scale pre-training approaches (e.g., Masked Degradation Classification Pre-training) that learn generalizable feature representations robust across degradation types and levels.
- Handling Unseen Scenarios: Diversity in both degradation types and levels allows models trained on UIR-2.5M to generalize to previously unseen distortions and combinations, which is critical for robust deployment in varied real-world environments.
This diversity is particularly exploited in the MaskDCPT pre-training approach, which leverages the dataset’s comprehensive coverage for learning representations that transfer well across tasks.
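
For orientation, the following is a minimal, generic supervised training step over LQ/HQ pairs. It is not the MaskDCPT procedure; the toy network, the L1 loss choice, and the `train_step` helper are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical restoration network; any image-to-image model fits here.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.L1Loss()  # a common fidelity loss for restoration

def train_step(lq: torch.Tensor, hq: torch.Tensor) -> float:
    """One supervised step: predict a restored image from LQ, regress to HQ."""
    optimizer.zero_grad()
    restored = model(lq)
    loss = criterion(restored, hq)
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch in place of real UIR-2.5M pairs.
lq = torch.rand(4, 3, 128, 128)
hq = torch.rand(4, 3, 128, 128)
print(train_step(lq, hq))
```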
4. Performance Metrics and Restoration Evaluation
Models trained and evaluated with UIR-2.5M are assessed using standard fidelity and perceptual quality metrics:
- Peak Signal-to-Noise Ratio (PSNR): Quantifies restoration fidelity with respect to the ground truth. Defined as $\mathrm{PSNR} = 10 \log_{10}\left(\mathrm{MAX}^2 / \mathrm{MSE}\right)$, where $\mathrm{MAX}$ is the maximum possible pixel value and $\mathrm{MSE}$ is the mean squared error between the restored and ground-truth images (a reference implementation is sketched at the end of this section).
- Perception-based Image Quality Evaluator (PIQE): A no-reference metric evaluating perceptual quality and distortion.
- Empirical Results: MaskDCPT pre-trained models on UIR-2.5M achieve a minimum PSNR improvement of 3.77 dB in the all-in-one restoration task, with a 34.8% reduction in PIQE compared to baselines in real-world scenarios.
These benchmarks underscore the value of large-scale, diverse training data in enhancing both objective and subjective restoration performance.
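
As a sanity check on the PSNR definition above, here is a minimal reference implementation; the `psnr` helper and the 8-bit `max_val` default are assumptions for illustration.

```python
import numpy as np

def psnr(restored: np.ndarray, reference: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR = 10 * log10(MAX^2 / MSE), in dB."""
    mse = np.mean((restored.astype(np.float64) - reference.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Sanity check on a synthetic noisy image:
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
noisy = np.clip(ref + rng.normal(0, 10, ref.shape), 0, 255).astype(np.uint8)
print(f"{psnr(noisy, ref):.2f} dB")
```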
5. Generalization and Robustness to Novel Conditions
A principal advantage of UIR-2.5M lies in its facilitation of generalization:
- Unseen Degradations: Networks pre-trained (using MaskDCPT) on UIR-2.5M have demonstrated strong generalization to both “in-domain” and “out-of-domain” degradation levels, e.g., Gaussian denoising at noise levels not represented during training.
- Ablation Analysis: Models trained with UIR-2.5M outperform baseline approaches when exposed to degradation types and levels absent from the training distribution, indicating robust transfer capabilities.
- Mixed Degradation Handling: The inclusion of “mixed” segment samples allows for effective restoration in highly complex imaging scenarios, vital in real-world low-level vision tasks where degradations co-occur.
6. Accessibility and Licensing
UIR-2.5M is openly released for academic and research purposes:
- Repository: Accessible at https://github.com/MILab-PKU/MaskDCPT, which includes the dataset, associated MaskDCPT code, and models.
- Licensing and Usage: The paper does not detail licensing terms; however, such resources are typically provided under research-friendly licenses. Prospective users should consult the repository’s documentation for specifics on academic and commercial usage restrictions.
UIR-2.5M’s open distribution, comprehensive coverage, and careful design make it a central resource for universal image restoration and pre-training research, facilitating broad reproducibility, benchmarking, and methodological innovation in the field.