FoundIR-v2: Diffusion-Based Image Restoration

Updated 11 December 2025

FoundIR-v2 is a unified diffusion-based image restoration model whose pre-training dynamically optimizes the mixture of task-specific data.
It combines a Mixture-of-Experts scheduler with a Stable Diffusion XL backbone to adaptively generate priors for tasks like deblurring and super-resolution.
The model reports gains in metrics such as PSNR and SSIM across more than 50 image restoration sub-tasks; for example, PSNR improves by roughly 0.5 to 1.5 dB over static data mixing on the representative tasks tabulated in Section 5.

FoundIR-v2 is a high-capacity diffusion-based image restoration foundation model that leverages dynamic pre-training data mixture optimization and a Mixture-of-Experts (MoE) scheduler to address over 50 sub-tasks such as deblurring, dehazing, denoising, super-resolution, deraining, desnowing, and low-light enhancement within a unified framework. It is predicated on the observation that the proportions of task-specific datasets in the pre-training mixture directly affect multi-task performance, motivating the development of a generalizable architecture that couples diffusion modeling with data equilibrium scheduling for large-scale restoration (Chen et al., 10 Dec 2025).

1. Architectural Design and Objectives

FoundIR-v2 is constructed to serve as an all-in-one restoration foundation model, supporting over 50 diverse sub-tasks. Its design jointly optimizes (i) the mixture ratios of task-specific data via "data equilibrium scheduling" (DES) to prevent task imbalance, and (ii) an MoE-driven diffusion scheduler that provides task-adaptive generative priors in latent space. The architecture integrates a Stable Diffusion XL (SDXL) backbone for latent-space denoising, a frozen VAE encoder-decoder for mapping between image and latent domains, and a learned MoE scheduler applied at each diffusion timestep.

Key components include:

VAE Encoder ( $E_{\mathrm{VAE}}$ ): Transforms low-quality input $I_{\mathrm{LQ}}$ to latent codes $f^{\mathrm{LQ}}$ , with stochastic resolution alignment for super-resolution tasks.
SDXL Denoiser ( $D_\theta$ ): Pre-trained backbone that produces clean latent codes $f^{\mathrm{HQ}}$ .
MoE Scheduler: Injected at each diffusion step, selects among $n$ expert blocks to condition the denoiser on both $f^{\mathrm{LQ}}$ and the noisy latent $x_t^{\mathrm{HQ}}$ .
Data Equilibrium Scheduler: Adjusts task sampling weights $\{\lambda_i\}$ every $T$ steps to maintain balanced learning signals across tasks.
VAE Decoder ( $D_{\mathrm{VAE}}$ ): Maps the final denoised latent code back to RGB space.

This configuration, together with multi-modal (including text) prompts, is intended to enhance the model's ability to generalize across heterogeneous and previously unseen image degradations.
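The data flow implied by the component list above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `restore` and all of its callable parameters are hypothetical names, the toy stand-ins at the bottom are plain arithmetic rather than neural networks, and starting from a zero latent is a placeholder (a real sampler would start from noise).

```python
def restore(image_lq, encode, moe_schedule, denoise_step, decode,
            init_latent, num_steps=20):
    """Encode -> iterative conditioned denoising -> decode (hypothetical sketch)."""
    f_lq = encode(image_lq)                # low-quality latent f^LQ
    x_t = init_latent(f_lq)                # starting latent (placeholder choice)
    for t in reversed(range(num_steps)):
        # The MoE scheduler is applied at every step and sees both latents.
        cond = moe_schedule(f_lq, x_t, t)
        x_t = denoise_step(x_t, t, cond)
    return decode(x_t)                     # back to image space


# Toy stand-ins so the sketch runs end to end (not real models).
def toy_encode(image):
    return [v / 2.0 for v in image]

def toy_decode(latent):
    return [2.0 * v for v in latent]

def toy_moe(f_lq, x_t, t):
    return list(f_lq)                      # condition = the LQ latent itself

def toy_denoise_step(x_t, t, cond):
    # Move the latent toward the condition; lands on it at t == 0.
    return [x + (c - x) / (t + 1) for x, c in zip(x_t, cond)]

def toy_init(f_lq):
    return [0.0 for _ in f_lq]

image = [0.2, 0.8]
restored = restore(image, toy_encode, toy_moe, toy_denoise_step,
                   toy_decode, toy_init)
```

Passing the components as callables keeps the loop structure visible without committing to any particular VAE, denoiser, or sampler.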

2. Data Equilibrium Scheduling Paradigm

Central to FoundIR-v2 is the DES paradigm, which seeks optimal proportions in the mixture of $k$ task-specific datasets $\{\mathcal{D}_i\}_{i=1}^k$ . The overall sampling distribution is:

$P_\lambda = \sum_{i=1}^k \lambda_i\,\mathrm{Unif}(\mathcal{D}_i), \quad \sum_{i=1}^k \lambda_i = 1$

The model $\mathcal{M}_\theta$ is trained to minimize an $\ell_1$ reconstruction loss:

$\mathcal{L}_\mathrm{rec}(\theta) = \mathbb{E}_{(I_{\mathrm{LQ}},I_{\mathrm{HQ}})\sim P_\lambda} \left\|I_{\mathrm{HQ}} - \mathcal{M}_\theta(I_{\mathrm{LQ}})\right\|_1$
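Sampling from $P_\lambda$ is a two-stage draw: pick a task with probability $\lambda_i$ , then pick a sample uniformly from that task's dataset. The sketch below illustrates this with the standard library; the function name, the toy dataset contents, and the weights are illustrative assumptions, not values from the paper.

```python
import random

def sample_from_mixture(datasets, weights, rng):
    """Draw one item from P_lambda: choose a task, then a uniform sample."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    task = rng.choices(range(len(datasets)), weights=weights, k=1)[0]
    return task, rng.choice(datasets[task])

# Toy stand-ins for task-specific datasets (hypothetical sample names).
datasets = [
    ["blur_0", "blur_1", "blur_2"],
    ["haze_0", "haze_1"],
    ["rain_0", "rain_1", "rain_2", "rain_3"],
]
weights = [0.5, 0.3, 0.2]

rng = random.Random(0)
num_draws = 10_000
counts = [0] * len(datasets)
for _ in range(num_draws):
    task, _item = sample_from_mixture(datasets, weights, rng)
    counts[task] += 1

# Empirical task frequencies should track the mixture weights.
fractions = [c / num_draws for c in counts]
```

Note that dataset size does not affect how often a task is drawn; only $\lambda_i$ does, which is what lets the scheduler rebalance tasks with very different amounts of data.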

Every $T$ training steps, the scheduler evaluates held-out task reference sets to obtain per-task scores $s_i^{(t)}$ , computes score differentials $\Delta s_i^{(t)} = s_i^{(t)} - s_i^{(t-T)}$ , and updates the mixing weights via softmax re-weighting:

$\lambda_i^{(t+1)} = \frac{ \lambda_i^{(t)} \exp(-\alpha\,\Delta s_i^{(t)}) } {\sum_{j=1}^k \lambda_j^{(t)} \exp(-\alpha\,\Delta s_j^{(t)}) }$

where $\alpha>0$ is a tunable coefficient.
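As a worked instance of this update (illustrative numbers, not from the paper, and assuming a higher score is better): take $k=2$ , $\lambda^{(t)}=(0.5,\,0.5)$ , $\alpha=1$ , and $\Delta s^{(t)}=(1,\,0)$ , so task 1 improved while task 2 stalled. The unnormalized weights are $0.5\,e^{-1}\approx 0.184$ and $0.5\,e^{0}=0.5$ , which gives

$\lambda_1^{(t+1)} \approx \frac{0.184}{0.184+0.5} \approx 0.269, \quad \lambda_2^{(t+1)} \approx 0.731$

The task that improved is sampled less in the next interval, and the stalled task is sampled more. When all $\Delta s_i^{(t)}$ are equal, the exponential factors cancel in the normalization and the weights are unchanged.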

Pseudocode for DES:

Input: initial weights λ⁽⁰⁾, evaluation interval T, max steps I, coefficient α
λ ← λ⁽⁰⁾
Evaluate initial scores s_i⁽⁰⁾ on D_ref for each task i
for t = 1 to I do
  Sample a mini-batch from the mixture P_λ (task i is drawn with probability λ_i)
  Take one diffusion training step on that mini-batch
  if t mod T == 0 then
    Evaluate s_i⁽ᵗ⁾ on D_ref for each task i
    Compute Δs_i = s_i⁽ᵗ⁾ - s_i⁽ᵗ⁻ᵀ⁾
    Update λ_i ∝ λ_i · exp(-α · Δs_i), then renormalize so that Σ_i λ_i = 1
  end if
end for

This iterative dynamic re-weighting enforces the "Data Mixing Law," which underlies FoundIR-v2's balanced multi-task convergence.
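The loop can be made concrete with a toy simulation. This is a sketch under stated assumptions rather than the authors' code: it reads the pseudocode as one training step per iteration, and `run_des`, `des_update`, `train_step`, `evaluate`, and the per-task `gain` values are all hypothetical stand-ins for real training and evaluation.

```python
import math
import random

def des_update(weights, deltas, alpha):
    """Scale each weight by exp(-alpha * delta) and renormalize."""
    scaled = [w * math.exp(-alpha * d) for w, d in zip(weights, deltas)]
    total = sum(scaled)
    return [s / total for s in scaled]

def run_des(num_tasks, interval, max_steps, alpha, train_step, evaluate, rng):
    weights = [1.0 / num_tasks] * num_tasks          # uniform start (assumption)
    prev_scores = [evaluate(i) for i in range(num_tasks)]
    for step in range(1, max_steps + 1):
        task = rng.choices(range(num_tasks), weights=weights, k=1)[0]
        train_step(task)                             # one step on the drawn task
        if step % interval == 0:
            scores = [evaluate(i) for i in range(num_tasks)]
            deltas = [s - p for s, p in zip(scores, prev_scores)]
            weights = des_update(weights, deltas, alpha)
            prev_scores = scores
    return weights

# Toy "model": each task has a score that rises by a fixed gain per step.
skill = [0.0, 0.0, 0.0]
gain = [0.02, 0.01, 0.005]    # task 0 learns fastest, task 2 slowest

def train_step(task):
    skill[task] += gain[task]

def evaluate(task):
    return skill[task]

final_weights = run_des(num_tasks=3, interval=100, max_steps=2000, alpha=1.0,
                        train_step=train_step, evaluate=evaluate,
                        rng=random.Random(0))
```

In this toy setting the fast-learning task is progressively down-weighted and the slow-learning task is up-weighted, so the per-interval improvements of the tasks are pushed toward each other.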

3. MoE-Driven Diffusion Scheduler

At each diffusion timestep $t$ and for sub-task $k$ (here $k$ indexes a sub-task, whereas in Section 2 it denotes the number of datasets), the MoE scheduler fuses the low-quality latent and the noisy latent by concatenation:

$\mathbf{z}_t^{(k)} = \phi(f^{\mathrm{LQ}_k}, x_{t,k}^{\mathrm{HQ}})$

Each of the $n$ expert blocks $E_i(\cdot)$ implements a specialized attention mechanism (spatial, channel, sparse, etc.). The scheduler computes soft-gate weights from gating vectors $\mathbf{g}_i$ :

$w_i^{(k)} = \frac{ \exp(\mathbf{g}_i^\top\,\mathbf{z}_t^{(k)}) } { \sum_{j=1}^n \exp(\mathbf{g}_j^\top\,\mathbf{z}_t^{(k)}) }, \quad \sum_{i=1}^n w_i^{(k)}=1$

and forms the scheduled feature

$\mathcal{F}^{(k)}(\mathbf{z}_t^{(k)}) = \sum_{i=1}^n w_i^{(k)} E_i(\mathbf{z}_t^{(k)})$

which is passed to the SDXL noise predictor.
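The soft gating above amounts to a softmax over gate logits followed by a weighted sum of expert outputs. The sketch below shows that computation on small vectors; the experts are trivial arithmetic functions standing in for attention blocks, and the gate vectors and latent values are made-up numbers chosen only for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_moe(z, gate_vectors, experts):
    """Return (gate weights, weighted sum of expert outputs) for feature z."""
    logits = [sum(g_d * z_d for g_d, z_d in zip(g, z)) for g in gate_vectors]
    w = softmax(logits)
    outputs = [expert(z) for expert in experts]
    fused = [sum(w_i * out[d] for w_i, out in zip(w, outputs))
             for d in range(len(z))]
    return w, fused

# Toy experts (stand-ins for spatial / channel / sparse attention blocks).
experts = [
    lambda z: list(z),                  # identity
    lambda z: [2.0 * v for v in z],     # scaling
    lambda z: [-v for v in z],          # negation
]
gate_vectors = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

f_lq = [1.0]          # toy low-quality latent
x_t = [2.0]           # toy noisy latent
z = f_lq + x_t        # fusion by concatenation -> [1.0, 2.0]

gate_weights, scheduled_feature = soft_moe(z, gate_vectors, experts)
```

Because every expert contributes with a nonzero weight, the output varies smoothly with the input, in contrast to hard routing, which would select a single expert.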

Diffusion training follows the denoising score matching objective. The forward process is

$q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar\alpha_t}\,x_0, (1-\bar\alpha_t)\mathbf{I})$

and the noise predictor $\epsilon_\theta$ is trained, with $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ , to minimize

$\mathcal{L}_\mathrm{diff} = \mathbb{E}_{t,\epsilon,x_0} \left\| \epsilon - \epsilon_\theta(x_t, t, \mathcal{F}^{(k)}) \right\|_2^2$

Training proceeds in two stages: the MoE head is first pre-trained with the SDXL backbone frozen, and the full model is then fine-tuned jointly end to end.
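To make the objective concrete, the sketch below draws $x_t$ from the forward process and evaluates the squared-error noise loss. It assumes a DDPM-style linear beta schedule purely for illustration (the source does not state the schedule used), omits the conditioning feature, and uses an "oracle" predictor that inverts the forward process in place of a trained network.

```python
import math
import random

def alpha_bar(betas, t):
    """Cumulative product of (1 - beta_s) for s = 0..t."""
    prod = 1.0
    for beta in betas[: t + 1]:
        prod *= 1.0 - beta
    return prod

def q_sample(x0, t, betas, rng):
    """Sample x_t ~ N(sqrt(a_bar) * x0, (1 - a_bar) * I); also return the noise."""
    a_bar = alpha_bar(betas, t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(x0, eps)]
    return x_t, eps

def eps_loss(eps, eps_pred):
    """Squared L2 distance between true and predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

# Assumed linear schedule (illustrative, not taken from the source).
num_steps = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (num_steps - 1) for i in range(num_steps)]

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0]
t = 500
x_t, eps = q_sample(x0, t, betas, rng)

# Oracle predictor: recover the noise exactly from (x_t, x0).
a_bar_t = alpha_bar(betas, t)
eps_oracle = [(xt - math.sqrt(a_bar_t) * x) / math.sqrt(1.0 - a_bar_t)
              for xt, x in zip(x_t, x0)]

loss_oracle = eps_loss(eps, eps_oracle)       # essentially zero
loss_zero = eps_loss(eps, [0.0] * len(eps))   # baseline: predict no noise
```

A trained $\epsilon_\theta$ sits between these two extremes: it sees only $x_t$ , $t$ , and the conditioning, not $x_0$ .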

4. Training Protocol and Implementation

FoundIR-v2 is trained on a combination of publicly available datasets encompassing 50+ real-world sub-tasks, including but not limited to 4KRD (motion deblur), LSDIR (defocus deblur), PolyU (denoise), Dense-HAZE/NH-HAZE (dehaze), CSTNet HQ-NightRain (derain), UAV-Rain1k (raindrop removal), WeatherBench (desnow), UHD-LL (low-light), DIV2K/Flickr2K/DIV8K (super-resolution), FFHQ (faces), RealPhoto60 (real SR), and RealDeg (old-photo/face restoration). High-quality ground truth filtering is performed using deep multi-modal IQA metrics such as DA-CLIP and DepictQA.

The key hyperparameters are:

Hardware: 2× NVIDIA H20 GPUs (96 GB each)
Batch size: 32 (random $512\times512$ crops)
Optimizer: AdamW with default weight decay
Learning rate: VAE encoder $5\times10^{-6}$ , others $5\times10^{-5}$ (cosine annealing)
Total iterations: 150k; DES evaluation interval $T=1\mathrm{k}$ steps, with 10 reference samples per task
Diffusion inference: Euler sampler, 20 steps, classifier-free guidance scale = 5; AdaIN color fix for SR tasks

This protocol supports scalable, balanced exposure to the full spectrum of restoration phenomena.
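The learning-rate settings above can be illustrated with a plain cosine annealing schedule. This is a sketch assuming annealing to zero with no warmup, neither of which is stated in the source; `cosine_lr` is a hypothetical helper, not a library function.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine annealing from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

TOTAL_ITERS = 150_000
EVAL_INTERVAL = 1_000
VAE_ENCODER_LR = 5e-6
OTHER_LR = 5e-5

# (VAE encoder lr, other-parameter lr) at the start, midpoint, and end.
schedule = {
    step: (cosine_lr(step, TOTAL_ITERS, VAE_ENCODER_LR),
           cosine_lr(step, TOTAL_ITERS, OTHER_LR))
    for step in (0, TOTAL_ITERS // 2, TOTAL_ITERS)
}

# Number of DES weight updates implied by the stated interval.
num_des_updates = TOTAL_ITERS // EVAL_INTERVAL
```

With the stated values, the mixture weights are re-estimated 150 times over the course of training, and the two parameter groups keep a fixed 10:1 learning-rate ratio throughout under this schedule.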

5. Empirical Results, Ablations, and Analysis

FoundIR-v2 achieves leading or near-leading performance on 80% or more of evaluated tasks, using the following metrics:

Full-reference: PSNR↑, SSIM↑, LPIPS↓
No-reference: MUSIQ↑, CLIPIQA+↑, PIQE↓, MANIQA↑, PaQ-2-PiQ↑
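For reference, PSNR is defined as $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$ . The sketch below computes it on flat lists of pixel values, assuming 8-bit images ( $\mathrm{MAX}=255$ ); the pixel values are made up for illustration, and published evaluations may differ in color space or cropping conventions.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(restored):
        raise ValueError("inputs must have the same length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf            # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

reference = [10.0, 128.0, 200.0, 64.0]
restored = [15.0, 123.0, 205.0, 59.0]    # every pixel off by 5 -> MSE = 25

value_db = psnr(reference, restored)

# A PSNR gain of g dB corresponds to an MSE ratio of 10 ** (-g / 10).
mse_ratio_for_1p5_db = 10 ** (-1.50 / 10)
```

Because PSNR is logarithmic in MSE, a gain of +1.50 dB corresponds to the mean squared error shrinking to about 0.71 of its previous value.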

Table: Representative performance (DES vs. static mixing)

Task	Static mix PSNR (dB)	DES PSNR (dB)	Δ PSNR (dB)
Deblurring	18.91	20.41	+1.50
Dehazing	18.69	19.93	+1.24
Low-light	19.93	20.41	+0.48
SR	18.91	20.09	+1.18

Ablation studies indicate:

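DES provides gains of about +1.2 to +1.5 dB over static mixing on deblurring, dehazing, and SR, and a smaller gain of about +0.5 dB on low-light enhancement (see the table above).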
Soft MoE scheduling yields +0.3–0.7 dB vs. single-prior or hard MoE variants.
Removing low-quality ground truth increases PSNR by ~0.2–0.4 dB.

Qualitative results demonstrate restoration of sharp edges in motion blur, detail retention in SR, superior handling of heterogeneous murals (outperforming GPT-5 and HYPIR baselines), and effective simultaneous resolution of cascaded tasks such as deraining plus SR. FoundIR-v2 also outperforms pipeline architectures (FoundIR + SUPIR) in joint restoration settings.

Generalization extends to medical imaging domains; limited fine-tuning enables superior recovery of diagnostic structures in laparoscopy and microscopy compared to the prior FoundIR.

6. Significance, Limitations, and Open Directions

FoundIR-v2 argues that dynamic data mixture balancing, formalized as the "Data Mixing Law," is critical to robust all-in-one restoration. Coupling it with an MoE-driven diffusion scheduler enables task-adaptive prior generation in latent space, which promotes generalization across a diverse set of sub-tasks and favorable zero-shot transfer.

Limitations remain for some extreme degradations, where task-specialized models still outperform the all-in-one approach. The scheduling machinery adds modest overhead and complexity, and there is room for lighter-weight MoE variants.

Open directions include:

Developing adaptation signals beyond PSNR or MUSIQ for DES updates.
Scaling to temporal (video) or multi-modal (e.g., depth, semantics) restoration contexts.
Enabling continual learning for incremental addition of new tasks.
Investigating compact MoE schedulers for resource-constrained deployment.

FoundIR-v2 offers a scalable foundation for multi-task, real-world restoration, and highlights the role of adaptive data mixing for foundation models in image processing (Chen et al., 10 Dec 2025).


References (1)

FoundIR-v2: Optimizing Pre-Training Data Mixtures for Image Restoration Foundation Model (2025)
