FoundIR-v2: Diffusion-Based Image Restoration
- FoundIR-v2 is a unified diffusion-based model that restores images by dynamically optimizing the mixture of task-specific data.
- It leverages a Mixture-of-Experts scheduler with a Stable Diffusion XL backbone to adaptively generate priors for tasks like deblurring and super-resolution.
- The model achieves significant performance improvements in metrics such as PSNR and SSIM across over 50 diverse image restoration sub-tasks.
FoundIR-v2 is a high-capacity diffusion-based image restoration foundation model that leverages dynamic pre-training data mixture optimization and a Mixture-of-Experts (MoE) scheduler to address over 50 sub-tasks such as deblurring, dehazing, denoising, super-resolution, deraining, desnowing, and low-light enhancement within a unified framework. It is predicated on the observation that the proportions of task-specific datasets in the pre-training mixture directly affect multi-task performance, motivating the development of a generalizable architecture that couples diffusion modeling with data equilibrium scheduling for large-scale restoration (Chen et al., 10 Dec 2025).
1. Architectural Design and Objectives
FoundIR-v2 is constructed to serve as an all-in-one restoration foundation model, supporting over 50 diverse sub-tasks. Its design jointly optimizes (i) the mixture ratios of task-specific data via "data equilibrium scheduling" (DES) to prevent task imbalance, and (ii) an MoE-driven diffusion scheduler that provides task-adaptive generative priors in latent space. The architecture integrates a Stable Diffusion XL (SDXL) backbone for latent-space denoising, a frozen VAE encoder-decoder for mapping between image and latent domains, and a learned MoE scheduler applied at each diffusion timestep.
Key components include:
- VAE Encoder: Transforms the low-quality input into latent codes, with stochastic resolution alignment for super-resolution tasks.
- SDXL Denoiser: Pre-trained backbone that produces clean latent codes.
- MoE Scheduler: Injected at each diffusion step; selects among expert blocks to condition the denoiser on both the low-quality latent and the current noisy latent.
- Data Equilibrium Scheduler: Adjusts task sampling weights every T steps to maintain balanced learning signals across tasks.
- VAE Decoder: Maps the final denoised latent code back to RGB space.
This configuration, together with multi-modal (including text) prompts, is intended to enhance the model's ability to generalize across heterogeneous and previously unseen image degradations.
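The data flow through these components can be sketched as follows. This is only an illustrative skeleton under stated assumptions: the function name `restore` and the callables `encode`, `schedule`, `denoise_step`, and `decode` are hypothetical stand-ins for the VAE encoder, MoE scheduler, SDXL denoiser update, and VAE decoder, and latents are represented as flat lists of floats for simplicity.

```python
from typing import Callable, List

Latent = List[float]


def restore(
    lq_image: List[float],
    encode: Callable[[List[float]], Latent],
    schedule: Callable[[Latent, Latent, int], Latent],
    denoise_step: Callable[[Latent, Latent, int], Latent],
    decode: Callable[[Latent], List[float]],
    init_noise: Latent,
    num_steps: int = 20,
) -> List[float]:
    """Encode, run scheduler-conditioned denoising steps, then decode.

    All callables are hypothetical placeholders, not the authors' modules.
    """
    z_lq = encode(lq_image)   # stand-in for the VAE encoder
    z_t = list(init_noise)    # start from a noisy latent
    for step in reversed(range(num_steps)):
        # The scheduler sees both the low-quality latent and the noisy latent.
        prior = schedule(z_lq, z_t, step)
        # One denoiser update conditioned on the scheduled prior.
        z_t = denoise_step(z_t, prior, step)
    return decode(z_t)        # stand-in for the VAE decoder
```

The point of the sketch is the ordering: the scheduler is called inside the loop, once per diffusion step, rather than once before sampling begins.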
2. Data Equilibrium Scheduling Paradigm
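Central to FoundIR-v2 is the DES paradigm, which seeks optimal proportions in the mixture of task-specific datasets, with one mixing weight λ_i per task i. The overall sampling distribution is the λ-weighted mixture of the per-task datasets. The model is trained to minimize a reconstruction loss between the restored output and the ground truth.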
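Sampling from a weighted mixture of task datasets can be illustrated with a short sketch. The function name `sample_mixture_batch`, the dictionary layout, and the two-stage procedure (pick a task, then pick a sample uniformly within it) are assumptions made for illustration; the source does not specify how batches are drawn.

```python
import random
from typing import Dict, List, Sequence, Tuple


def sample_mixture_batch(
    datasets: Dict[str, Sequence],
    weights: Dict[str, float],
    batch_size: int,
    rng: random.Random,
) -> List[Tuple[str, object]]:
    """Draw a batch from a weighted mixture of per-task datasets.

    Each item is drawn by choosing a task with probability proportional to
    its mixing weight, then choosing a sample uniformly from that task.
    """
    tasks = list(datasets)
    total = sum(weights[t] for t in tasks)
    probs = [weights[t] / total for t in tasks]  # normalize defensively
    batch = []
    for task in rng.choices(tasks, weights=probs, k=batch_size):
        batch.append((task, rng.choice(datasets[task])))
    return batch


if __name__ == "__main__":
    toy = {"deblur": [0, 1, 2], "dehaze": [10, 11]}
    lam = {"deblur": 0.75, "dehaze": 0.25}
    print(sample_mixture_batch(toy, lam, batch_size=8, rng=random.Random(0)))
```

With weights of 0.75 and 0.25, about three quarters of a large batch come from the first task, regardless of how many images each dataset holds.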
Every T training steps, the scheduler evaluates held-out task reference sets to compute the score differentials Δs_i = s_i⁽ᵗ⁾ − s_i⁽ᵗ⁻ᵀ⁾ and updates the mixing weights λ_i via softmax re-weighting, which involves a tunable coefficient.
Pseudocode for DES:
```
Input: initial λ⁽⁰⁾, interval T, max steps I
for t = 1 to I do
    Sample a mini-batch from each task i according to λ⁽ᵗ⁾
    Take T steps of diffusion-based training on those samples
    if t mod T == 0 then
        Evaluate s_i⁽ᵗ⁾ on D_ref for each task i
        Compute Δs_i = s_i⁽ᵗ⁾ - s_i⁽ᵗ⁻ᵀ⁾
        Update λ_i⁽ᵗ⁺¹⁾ by softmax re-weighting
    end if
end for
```
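The re-weighting step can be sketched in a few lines. The exact update rule is not given here, so the form below is an assumption: the new weights are a softmax over the negated score differentials, scaled by a hypothetical coefficient `eta`, so that tasks whose reference score improved least receive more sampling weight. The direction of the update and the name `des_update` are illustrative choices, not the authors' stated method.

```python
import math
from typing import List


def des_update(
    prev_scores: List[float],
    curr_scores: List[float],
    eta: float = 1.0,
) -> List[float]:
    """Return new mixing weights from per-task score differentials.

    Assumed rule: weights = softmax(-eta * delta_s), which favors tasks
    that improved the least since the last evaluation.
    """
    deltas = [c - p for p, c in zip(prev_scores, curr_scores)]
    logits = [-eta * d for d in deltas]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


if __name__ == "__main__":
    # Three tasks: the first improved a lot, the last not at all.
    print(des_update([20.0, 20.0, 20.0], [21.5, 20.2, 20.0]))
```

A larger `eta` makes the weights react more sharply to differences between tasks, while `eta = 0` reduces the update to uniform sampling.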
3. MoE-Driven Diffusion Scheduler
At each diffusion timestep, and for each sub-task, the MoE scheduler fuses the low-quality latent code and the current noisy latent by concatenation. The fused feature is processed by a set of expert blocks, each of which implements a specialized form of attention (spatial, channel, sparse, etc.). The scheduler computes soft gate weights over the experts and uses them to form the scheduled feature, which is passed to the SDXL noise predictor.
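Soft gating over experts can be sketched as below. This is a toy version under several assumptions: latents are flat lists, the gate is a linear map followed by a softmax, the scheduled feature is the gate-weighted sum of expert outputs, and the experts are arbitrary callables rather than attention blocks. The names `moe_schedule` and `gate_weights` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def moe_schedule(
    z_lq: Vector,
    z_t: Vector,
    experts: Sequence[Callable[[Vector], Vector]],
    gate_weights: Sequence[Sequence[float]],
) -> Tuple[Vector, Vector]:
    """Fuse two latents, gate softly over experts, and mix their outputs.

    gate_weights holds one row per expert; each row has the same length as
    the fused vector. Returns (scheduled_feature, gate_probabilities).
    """
    fused = list(z_lq) + list(z_t)  # fusion by concatenation
    logits = [sum(w * x for w, x in zip(row, fused)) for row in gate_weights]
    gates = softmax(logits)         # soft gate: every expert contributes
    outputs = [expert(fused) for expert in experts]
    scheduled = [
        sum(g * out[i] for g, out in zip(gates, outputs))
        for i in range(len(fused))
    ]
    return scheduled, gates


if __name__ == "__main__":
    experts = [lambda v: v, lambda v: [2.0 * x for x in v]]
    gate = [[0.0, 0.0], [0.0, 0.0]]  # zero gate weights give uniform gates
    print(moe_schedule([1.0], [2.0], experts, gate))
```

A hard MoE variant would instead keep only the expert with the largest gate value; the soft version lets every expert contribute in proportion to its gate.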
Diffusion training follows the denoising score matching objective. Training proceeds in two stages: the MoE head is first pre-trained with the SDXL backbone frozen, and the full model is then fine-tuned jointly end to end.
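For reference, the noise-prediction form of this objective that is standard for latent diffusion models is written out below. It is given as an assumption about the general shape of the loss, not as the exact expression used for FoundIR-v2. The symbols are introduced here for illustration: z_0 is the clean latent, z_t its noised version at timestep t, ε_θ the noise predictor, c the conditioning (here, the scheduled feature from the MoE scheduler), and ᾱ_t the cumulative noise schedule.

```latex
\mathcal{L}_{\text{diff}}
  = \mathbb{E}_{z_0,\; \epsilon \sim \mathcal{N}(0, I),\; t}
    \left[ \left\| \epsilon - \epsilon_\theta\!\left(z_t,\, t,\, c\right) \right\|_2^2 \right],
\qquad
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon
```

Under the two-stage protocol, the first stage would update only the parameters that produce c, and the second stage would also update ε_θ.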
4. Training Protocol and Implementation
FoundIR-v2 is trained on a combination of publicly available datasets encompassing 50+ real-world sub-tasks, including but not limited to 4KRD (motion deblur), LSDIR (defocus deblur), PolyU (denoise), Dense-HAZE/NH-HAZE (dehaze), CSTNet HQ-NightRain (derain), UAV-Rain1k (raindrop removal), WeatherBench (desnow), UHD-LL (low-light), DIV2K/Flickr2K/DIV8K (super-resolution), FFHQ (faces), RealPhoto60 (real SR), and RealDeg (old-photo/face restoration). High-quality ground truth filtering is performed using deep multi-modal IQA metrics such as DA-CLIP and DepictQA.
The key hyperparameters are:
- Hardware: 2× NVIDIA H20 GPUs (96 GB each)
- Batch size: 32 (random crops)
- Optimizer: AdamW with default weight decay
- Learning rate: set separately for the VAE encoder and for the other modules, with cosine annealing
- Total iterations: 150k; DES evaluation every T iterations (T on the order of thousands), with 10 reference samples per task
- Diffusion inference: Euler sampler, 20 steps, classifier-free guidance scale = 5; AdaIN color fix for SR tasks
This protocol supports scalable, balanced exposure to the full spectrum of restoration phenomena.
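Two of the inference settings above can be made concrete with a short sketch. The first function is the usual classifier-free guidance combination of unconditional and conditional noise predictions, shown with the guidance scale of 5. The second is one plausible reading of an AdaIN color fix, assumed here to match the mean and standard deviation of a restored channel to those of a reference channel such as the low-quality input; the source does not spell out this step, and both function names are hypothetical.

```python
import math
from typing import List, Tuple


def cfg_combine(
    eps_uncond: List[float],
    eps_cond: List[float],
    scale: float = 5.0,
) -> List[float]:
    """Classifier-free guidance: push the prediction away from the
    unconditional estimate and toward the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def _mean_std(values: List[float]) -> Tuple[float, float]:
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)


def adain_color_fix(
    restored: List[float],
    reference: List[float],
    eps: float = 1e-8,
) -> List[float]:
    """Re-normalize one color channel so that its mean and standard
    deviation match those of the reference channel (assumed behavior)."""
    m_r, s_r = _mean_std(restored)
    m_f, s_f = _mean_std(reference)
    return [(v - m_r) / (s_r + eps) * s_f + m_f for v in restored]


if __name__ == "__main__":
    print(cfg_combine([0.0, 1.0], [1.0, 1.0]))
    print(adain_color_fix([0.0, 2.0, 4.0], [10.0, 11.0, 12.0]))
```

In practice the color fix would be applied per channel to the decoded image, after the 20 sampling steps have finished.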
5. Empirical Results, Ablations, and Analysis
FoundIR-v2 achieves leading or near-leading performance on 80% or more of evaluated tasks, using the following metrics:
- Full-reference: PSNR↑, SSIM↑, LPIPS↓
- No-reference: MUSIQ↑, CLIPIQA+↑, PIQE↓, MANIQA↑, PaQ-2-PiQ↑
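As a reminder of what the PSNR figures measure, the metric can be computed as in the sketch below. Images are represented as flat lists of intensities in [0, 1] for simplicity; the function name `psnr` and the `max_val` parameter are illustrative.

```python
import math
from typing import Sequence


def psnr(
    reference: Sequence[float],
    restored: Sequence[float],
    max_val: float = 1.0,
) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - o) ** 2 for r, o in zip(reference, restored)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


if __name__ == "__main__":
    # A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01.
    print(psnr([0.5] * 4, [0.6] * 4))
```

Because the scale is logarithmic, a gain of +1.5 dB corresponds to a mean squared error that is roughly 29% lower, since 10^(-0.15) ≈ 0.708.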
Table: Representative performance (DES vs. static mixing)
| Task | Static mix PSNR (dB) | DES PSNR (dB) | Δ (dB) |
|---|---|---|---|
| Deblurring | 18.91 | 20.41 | +1.50 |
| Dehazing | 18.69 | 19.93 | +1.24 |
| Low-light | 19.93 | 20.41 | +0.48 |
| SR | 18.91 | 20.09 | +1.18 |
Ablation studies indicate:
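- DES provides roughly +0.5 to +1.5 dB PSNR gain over static mixing on the representative tasks in the table (+1.2 to +1.5 dB on deblurring, dehazing, and SR; +0.48 dB on low-light enhancement).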
- Soft MoE scheduling yields +0.3–0.7 dB vs. single-prior or hard MoE variants.
- Filtering low-quality ground-truth images out of the training data increases PSNR by ~0.2–0.4 dB.
Qualitative results demonstrate restoration of sharp edges in motion blur, detail retention in SR, superior handling of heterogeneous murals (outperforming GPT-5 and HYPIR baselines), and effective simultaneous resolution of cascaded tasks such as deraining plus SR. FoundIR-v2 also outperforms pipeline architectures (FoundIR + SUPIR) in joint restoration settings.
Generalization extends to medical imaging domains; limited fine-tuning enables superior recovery of diagnostic structures in laparoscopy and microscopy compared to the prior FoundIR.
6. Significance, Limitations, and Open Directions
FoundIR-v2 establishes the critical importance of dynamic data mixture balancing, formalized as the "Data Mixing Law", in achieving robust all-in-one restoration. Coupling it with an MoE-driven diffusion scheduler enables task-adaptive prior generation in latent space, promoting strong generalization across a diverse sub-task landscape and favorable zero-shot transfer.
Limitations remain for some extreme degradations, where task-specialized models still outperform the all-in-one approach. The two schedulers also add modest overhead and complexity, which leaves room for lighter-weight MoE variants.
Open directions include:
- Developing adaptation signals beyond PSNR or MUSIQ for DES updates.
- Scaling to temporal (video) or multi-modal (e.g., depth, semantics) restoration contexts.
- Enabling continual learning for incremental addition of new tasks.
- Investigating compact MoE schedulers for resource-constrained deployment.
FoundIR-v2 offers a scalable foundation for multi-task, real-world restoration, and highlights the role of adaptive data mixing for foundation models in image processing (Chen et al., 10 Dec 2025).