UIR-2.5M: Universal Image Restoration Dataset
- UIR-2.5M is a comprehensive paired image dataset consisting of approximately 2.5M low- and high-quality image pairs covering 19 degradation types with over 200 intensity levels.
- It combines predominantly synthetic data with roughly 3% real-world samples to support robust model training, benchmarking, and pre-training for complex image restoration tasks.
- Empirical evaluations report a minimum PSNR gain of 3.77 dB on all-in-one restoration and a 34.8% PIQE reduction in real-world scenarios, demonstrating the dataset's effectiveness in generalizing to unseen degradation conditions.
The UIR-2.5M dataset is a large-scale, diverse paired image dataset explicitly constructed to advance research in universal image restoration. It consists of approximately 2.5 million low-quality and high-quality image pairs, encompassing a broad spectrum of real-world and synthetic degradation types and levels. UIR-2.5M was released by the authors of "Universal Image Restoration Pre-training via Masked Degradation Classification" (Hu et al., 15 Oct 2025) as a resource for training, benchmarking, and pre-training models capable of handling complex restoration tasks within a unified framework.
1. Composition and Structure
UIR-2.5M comprises roughly 2,483,000 paired samples, where each sample consists of a degradation-affected low-quality (LQ) image and its corresponding high-quality (HQ) reference. Its composition is structured to maximize diversity and coverage of restoration scenarios:
- Degradation Types: 19 distinct degradation categories are included, such as deraining, dehazing, denoising, deblurring, and low-light enhancement. These types span both canonical and complex, mixed degradation effects.
- Degradation Levels: Each degradation type is sampled at over 200 specific intensity levels, enabling fine-grained control and analysis of restoration performance across degradations.
- Dataset Segments:
  - Single Degradation Segment: Approximately 1,774,975 image pairs, each degraded by a single degradation process.
  - Mixed Degradation Segment: Approximately 708,013 image pairs, where multiple degradation types co-occur within a single image.
- Synthetic and Real-world Data: Synthetic data forms the majority, augmented with approximately 3% real-world samples to boost realism and facilitate generalization to natural imaging conditions.
A summary is shown below, followed by a sketch of how a sample record might be represented:
| Segment | Sample Count | Description |
|---|---|---|
| Single degradation | 1,774,975 | One degradation type per pair |
| Mixed degradation | 708,013 | Multiple degradations per pair |
| Total | ~2,483,000 | 19 types, >200 levels |
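
To make the segment structure concrete, the following minimal sketch shows one way a paired sample might be represented in code. The field names (`lq_path`, `degradations`, `levels`, and so on) are illustrative assumptions, not the dataset's actual metadata schema.

```python
from dataclasses import dataclass
from enum import Enum

class Segment(Enum):
    SINGLE = "single"   # one degradation process per pair
    MIXED = "mixed"     # multiple co-occurring degradations

@dataclass
class PairedSample:
    lq_path: str                # low-quality (degraded) image
    hq_path: str                # high-quality reference
    degradations: list[str]     # e.g. ["gaussian_noise"] or ["rain", "haze"]
    levels: list[float]         # one intensity level per degradation
    segment: Segment
    is_real: bool = False       # ~3% of samples are real-world captures

# Example record from the single-degradation segment:
sample = PairedSample(
    lq_path="lq/000001.png",
    hq_path="hq/000001.png",
    degradations=["gaussian_noise"],
    levels=[25.0],              # e.g. noise sigma on a 0-255 scale
    segment=Segment.SINGLE,
)
```

A mixed-segment record would simply carry multiple entries in `degradations` and `levels`, which keeps single and mixed samples under one schema.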
2. Generation and Curation Methodology
The curation of UIR-2.5M leverages a combination of rigorous dataset assembly and synthetic image generation:
- Base Image Selection: Images from diverse existing low-level vision datasets were selected, emphasizing wide coverage over content and image statistics.
- Synthetic Degradation: High-quality images were processed by carefully designed degradation models. These models simulate both classical effects (e.g., Gaussian noise, atmospheric distortion) and mixed real-world conditions (weather, sensor noise, motion blur). Over 200 degradation levels per type were created to ensure a fine-grained, representative sampling of possible intensity variations (see the sketch after this list).
- Real-world Data Integration: Around 3% of samples originate from real-world scenarios, reflecting naturally occurring degradations. These provide benchmarks for evaluating restoration in authentic conditions and are incorporated to enhance the generalization capability of trained models.
- Alignment Strategy: Direct paired acquisition of degraded–clean real-world images is often infeasible; thus, the dataset relies primarily on synthetic pairs, with real-world data carefully filtered and validated for representational diversity.
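
As an illustration of the level-grid design, the sketch below synthesizes an LQ image from an HQ one by sampling a Gaussian-noise intensity from a fine-grained grid. The grid bounds and the `degrade_gaussian_noise` helper are hypothetical; the dataset's actual degradation models are more varied.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical fine-grained intensity grid: the dataset samples each
# degradation type at over 200 discrete levels; the bounds here are assumed.
NOISE_SIGMAS = np.linspace(1.0, 50.0, 200)  # sigma on a 0-255 pixel scale

def degrade_gaussian_noise(hq: np.ndarray, sigma: float) -> np.ndarray:
    """Synthesize a low-quality image by adding i.i.d. Gaussian noise."""
    noise = rng.normal(0.0, sigma, size=hq.shape)
    return np.clip(hq.astype(np.float64) + noise, 0, 255).astype(np.uint8)

# Stand-in HQ image; in practice this would be a clean dataset image.
hq = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
sigma = float(rng.choice(NOISE_SIGMAS))  # pick one discrete level
lq = degrade_gaussian_noise(hq, sigma)
```

The same pattern extends to other degradation types: each type defines its own parameter grid, and mixed-segment samples chain several such operators on one image.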
3. Applications and Research Use Cases
UIR-2.5M is tailored for several research applications related to universal image restoration:
- Model Training: Enables supervised learning for restoration networks targeting a broad spectrum of degradations within a unified framework (a generic training step is sketched at the end of this section).
- Pre-training and Fine-tuning: Serves as an effective basis for large-scale pre-training approaches (e.g., Masked Degradation Classification Pre-training) that learn generalizable feature representations robust across degradation types and levels.
- Handling Unseen Scenarios: Diversity in both degradation types and levels allows models trained on UIR-2.5M to generalize to previously unseen distortions and combinations, which is critical for robust deployment in varied real-world environments.
This diversity is particularly exploited in the MaskDCPT pre-training approach, which leverages the dataset’s comprehensive coverage for learning representations that transfer well across tasks.
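
For orientation, the following is a minimal, generic supervised training step over LQ/HQ pairs. It is not the MaskDCPT procedure; the toy network, the L1 loss choice, and the `train_step` helper are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical restoration network; any image-to-image model fits here.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.L1Loss()  # a common fidelity loss for restoration

def train_step(lq: torch.Tensor, hq: torch.Tensor) -> float:
    """One supervised step: predict a restored image from LQ, regress to HQ."""
    optimizer.zero_grad()
    restored = model(lq)
    loss = criterion(restored, hq)
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy batch in place of real UIR-2.5M pairs.
lq = torch.rand(4, 3, 128, 128)
hq = torch.rand(4, 3, 128, 128)
print(train_step(lq, hq))
```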
4. Performance Metrics and Restoration Evaluation
Models trained and evaluated with UIR-2.5M are assessed using standard fidelity and perceptual quality metrics:
- Peak Signal-to-Noise Ratio (PSNR): Quantifies restoration fidelity with respect to the ground truth. Defined as $\mathrm{PSNR} = 10 \log_{10}\left(\mathrm{MAX}^2 / \mathrm{MSE}\right)$, where $\mathrm{MAX}$ is the maximum possible pixel value and $\mathrm{MSE}$ is the mean squared error between the restored and ground-truth images (a reference implementation is sketched at the end of this section).
- Perception-based Image Quality Evaluator (PIQE): A no-reference metric evaluating perceptual quality and distortion.
- Empirical Results: MaskDCPT pre-trained models on UIR-2.5M achieve a minimum PSNR improvement of 3.77 dB in the all-in-one restoration task, with a 34.8% reduction in PIQE compared to baselines in real-world scenarios.
These benchmarks underscore the value of large-scale, diverse training data in enhancing both objective and subjective restoration performance.
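
As a sanity check on the PSNR definition above, here is a minimal reference implementation; the `psnr` helper and the 8-bit `max_val` default are assumptions for illustration.

```python
import numpy as np

def psnr(restored: np.ndarray, reference: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR = 10 * log10(MAX^2 / MSE), in dB."""
    mse = np.mean((restored.astype(np.float64) - reference.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Sanity check on a synthetic noisy image:
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
noisy = np.clip(ref + rng.normal(0, 10, ref.shape), 0, 255).astype(np.uint8)
print(f"{psnr(noisy, ref):.2f} dB")
```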
5. Generalization and Robustness to Novel Conditions
A principal advantage of UIR-2.5M lies in its facilitation of generalization:
- Unseen Degradations: Networks pre-trained (using MaskDCPT) on UIR-2.5M have demonstrated strong generalization to both “in-domain” and “out-of-domain” degradation levels, e.g., Gaussian denoising at noise levels not represented during training.
- Ablation Analysis: Models trained with UIR-2.5M outperform baseline approaches when exposed to degradation types and levels absent from the training distribution, indicating robust transfer capabilities.
- Mixed Degradation Handling: The inclusion of “mixed” segment samples allows for effective restoration in highly complex imaging scenarios, vital in real-world low-level vision tasks where degradations co-occur.
6. Accessibility and Licensing
UIR-2.5M is openly released for academic and research purposes:
- Repository: Accessible at https://github.com/MILab-PKU/MaskDCPT, which includes the dataset, associated MaskDCPT code, and models.
- Licensing and Usage: The paper does not detail licensing terms; however, such resources are typically provided under research-friendly licenses. Prospective users should consult the repository’s documentation for specifics on academic and commercial usage restrictions.
UIR-2.5M’s open distribution, comprehensive coverage, and careful design make it a central resource for universal image restoration and pre-training research, facilitating broad reproducibility, benchmarking, and methodological innovation in the field.