
FoundIR-v2: Diffusion-Based Image Restoration

Updated 11 December 2025
  • FoundIR-v2 is a unified diffusion-based model that restores images by dynamically optimizing the mixture of task-specific data.
  • It leverages a Mixture-of-Experts scheduler with a Stable Diffusion XL backbone to adaptively generate priors for tasks like deblurring and super-resolution.
  • The model achieves significant performance improvements in metrics such as PSNR and SSIM across over 50 diverse image restoration sub-tasks.

FoundIR-v2 is a high-capacity diffusion-based image restoration foundation model that leverages dynamic pre-training data mixture optimization and a Mixture-of-Experts (MoE) scheduler to address over 50 sub-tasks such as deblurring, dehazing, denoising, super-resolution, deraining, desnowing, and low-light enhancement within a unified framework. It is predicated on the observation that the proportions of task-specific datasets in the pre-training mixture directly affect multi-task performance, motivating the development of a generalizable architecture that couples diffusion modeling with data equilibrium scheduling for large-scale restoration (Chen et al., 10 Dec 2025).

1. Architectural Design and Objectives

FoundIR-v2 is constructed to serve as an all-in-one restoration foundation model, supporting over 50 diverse sub-tasks. Its design jointly optimizes (i) the mixture ratios of task-specific data via "data equilibrium scheduling" (DES) to prevent task imbalance, and (ii) an MoE-driven diffusion scheduler that provides task-adaptive generative priors in latent space. The architecture integrates a Stable Diffusion XL (SDXL) backbone for latent-space denoising, a frozen VAE encoder-decoder for mapping between image and latent domains, and a learned MoE scheduler applied at each diffusion timestep.

Key components include:

  • VAE Encoder ($E_{\mathrm{VAE}}$): Transforms the low-quality input $I_{\mathrm{LQ}}$ into latent codes $f^{\mathrm{LQ}}$, with stochastic resolution alignment for super-resolution tasks.
  • SDXL Denoiser ($D_\theta$): Pre-trained backbone that produces clean latent codes $f^{\mathrm{HQ}}$.
  • MoE Scheduler: Injected at each diffusion step, selects among $n$ expert blocks to condition the denoiser on both $f^{\mathrm{LQ}}$ and the noisy latent $x_t^{\mathrm{HQ}}$.
  • Data Equilibrium Scheduler: Adjusts task sampling weights $\{\lambda_i\}$ every $T$ steps to maintain balanced learning signals across tasks.
  • VAE Decoder ($D_{\mathrm{VAE}}$): Maps the final denoised latent code back to RGB space.

This configuration, together with multi-modal (including text) prompts, is intended to enhance the model's ability to generalize across heterogeneous and previously unseen image degradations.
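
The dataflow through these components at inference time can be sketched as follows. This is a minimal illustration only: the function `restore` and all of its callable arguments are hypothetical stand-ins (the real encoder, MoE scheduler, SDXL denoiser, and decoder are neural networks), and the toy lambdas exist solely so the sketch runs end to end.

```python
def restore(image_lq, vae_encode, init_latent, moe_scheduler,
            denoise_step, vae_decode, num_steps):
    """Dataflow only: encode -> MoE-conditioned iterative denoising -> decode.

    Every callable is a hypothetical placeholder for a trained network.
    """
    f_lq = vae_encode(image_lq)              # low-quality latent f^LQ
    x_t = init_latent(f_lq)                  # starting (noisy) latent
    for t in range(num_steps, 0, -1):
        cond = moe_scheduler(f_lq, x_t, t)   # task-adaptive prior for step t
        x_t = denoise_step(x_t, t, cond)     # one denoiser update
    return vae_decode(x_t)                   # back to image space


# Toy stand-ins operating on lists of floats (assumptions, not the real model).
toy_output = restore(
    image_lq=[2.0, 4.0],
    vae_encode=lambda img: [p / 2.0 for p in img],
    init_latent=lambda f: [0.0 for _ in f],
    moe_scheduler=lambda f, x, t: f,
    denoise_step=lambda x, t, c: [xi + (ci - xi) / t for xi, ci in zip(x, c)],
    vae_decode=lambda z: [2.0 * v for v in z],
    num_steps=3,
)
```

The point of the sketch is the ordering: the MoE scheduler is called inside the denoising loop, once per timestep, rather than once up front.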

2. Data Equilibrium Scheduling Paradigm

Central to FoundIR-v2 is the DES paradigm, which seeks optimal proportions in the mixture of $k$ task-specific datasets $\{\mathcal{D}_i\}_{i=1}^k$, where the overall sampling distribution is

$$P_\lambda = \sum_{i=1}^k \lambda_i\,\mathrm{Unif}(\mathcal{D}_i), \quad \sum_{i=1}^k \lambda_i = 1.$$

The model $\mathcal{M}_\theta$ is trained to minimize an $\ell_1$ reconstruction loss:

$$\mathcal{L}_\mathrm{rec}(\theta) = \mathbb{E}_{(I_{\mathrm{LQ}},I_{\mathrm{HQ}})\sim P_\lambda} \left\|I_{\mathrm{HQ}} - \mathcal{M}_\theta(I_{\mathrm{LQ}})\right\|_1.$$
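
Sampling from the mixture is a two-stage draw: pick a task with probability $\lambda_i$, then pick a sample uniformly from that task's dataset. The sketch below illustrates this with toy data; `sample_from_mixture` and `l1_loss` are hypothetical helper names, the strings stand in for real image pairs, and the loss is averaged over pixels (which differs from the $\ell_1$ norm only by a constant factor).

```python
import random


def sample_from_mixture(datasets, weights, batch_size, rng):
    """Draw a batch from P_lambda = sum_i lambda_i * Unif(D_i)."""
    if len(datasets) != len(weights):
        raise ValueError("one weight per dataset is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    batch = []
    for _ in range(batch_size):
        # Stage 1: choose a task index i with probability lambda_i.
        i = rng.choices(range(len(datasets)), weights=weights, k=1)[0]
        # Stage 2: choose one sample uniformly from that task's dataset.
        batch.append((i, rng.choice(datasets[i])))
    return batch


def l1_loss(pred, target):
    """Mean absolute error between two equal-length pixel lists."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


# Toy placeholder datasets: each string stands in for an (I_LQ, I_HQ) pair.
datasets = [["blur_0", "blur_1"], ["haze_0"], ["rain_0", "rain_1", "rain_2"]]
weights = [0.5, 0.2, 0.3]
batch = sample_from_mixture(datasets, weights, batch_size=8, rng=random.Random(0))
```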

Every $T$ training steps, the scheduler evaluates held-out task reference sets to compute score differentials $\Delta s_i^{(t)}$ and updates mixing weights via softmax re-weighting:

$$\lambda_i^{(t+1)} = \frac{\lambda_i^{(t)} \exp(-\alpha\,\Delta s_i^{(t)})}{\sum_{j=1}^k \lambda_j^{(t)} \exp(-\alpha\,\Delta s_j^{(t)})}$$

where $\alpha>0$ is a tunable coefficient.
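
A minimal sketch of this update rule follows. The function name `des_update` and the example numbers are assumptions for illustration. If $\Delta s_i$ is read as a task's recent score improvement, the negative exponent shifts sampling weight toward tasks that improved least.

```python
import math


def des_update(weights, score_deltas, alpha):
    """lambda_i <- lambda_i * exp(-alpha * ds_i), renormalized to sum to 1."""
    if len(weights) != len(score_deltas):
        raise ValueError("one score differential per task is required")
    exponents = [-alpha * d for d in score_deltas]
    # Subtracting the max exponent leaves the result unchanged after
    # normalization but avoids overflow for large alpha or large deltas.
    shift = max(exponents)
    unnormalized = [w * math.exp(e - shift) for w, e in zip(weights, exponents)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical example: three tasks, uniform start, made-up differentials.
old_weights = [1 / 3, 1 / 3, 1 / 3]
deltas = [0.8, 0.1, 0.4]
new_weights = des_update(old_weights, deltas, alpha=1.0)
```

In this example the second task has the smallest differential and therefore receives the largest share of the next sampling budget. Equal differentials across tasks leave the weights unchanged.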

Pseudocode for DES:

Input: initial weights λ^(0), evaluation interval T, max steps I, coefficient α
for t = 1 to I do
  Sample a mini-batch across tasks according to λ
  Take one diffusion training step on that mini-batch
  if t mod T == 0 then
    Evaluate s_i^(t) on D_ref for each task i
    Compute Δs_i^(t) = s_i^(t) - s_i^(t-T)
    Update λ_i^(t+1) by softmax re-weighting
  end if
end for

This iterative dynamic re-weighting enforces the "Data Mixing Law," which underlies FoundIR-v2's balanced multi-task convergence.

3. MoE-Driven Diffusion Scheduler

At each diffusion timestep $t$ and for sub-task $k$, the MoE scheduler fuses latent codes by concatenation:

$$\mathbf{z}_t^{(k)} = \phi(f^{\mathrm{LQ}_k}, x_{t,k}^{\mathrm{HQ}})$$

Each of the $n$ expert blocks $E_i(\cdot)$ implements specialized attention (spatial, channel, sparse, etc.). The scheduler computes soft-gate weights

$$w_i^{(k)} = \frac{\exp(\mathbf{g}_i^\top\,\mathbf{z}_t^{(k)})}{\sum_{j=1}^n \exp(\mathbf{g}_j^\top\,\mathbf{z}_t^{(k)})}, \quad \sum_i w_i^{(k)} = 1$$

and forms the scheduled feature

$$\mathcal{F}^{(k)}(\mathbf{z}_t) = \sum_{i=1}^n w_i^{(k)} E_i(\mathbf{z}_t^{(k)})$$

which is passed to the SDXL noise predictor.
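
The soft gating can be sketched in a few lines. Everything here is a toy assumption: `moe_schedule` is a hypothetical name, the fused latent is a short list of floats, and the three lambdas stand in for attention-based expert blocks that are assumed to preserve the input dimension.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_schedule(z, gate_vectors, experts):
    """Return (gate weights, fused feature) with fused = sum_i w_i * E_i(z)."""
    # Gate logit for expert i is the dot product g_i^T z.
    logits = [sum(g_d * z_d for g_d, z_d in zip(g, z)) for g in gate_vectors]
    gate_weights = softmax(logits)
    outputs = [expert(z) for expert in experts]
    fused = [
        sum(w * out[d] for w, out in zip(gate_weights, outputs))
        for d in range(len(z))
    ]
    return gate_weights, fused


# Toy experts standing in for spatial/channel/sparse attention blocks.
experts = [
    lambda z: [2.0 * v for v in z],
    lambda z: [v + 1.0 for v in z],
    lambda z: list(reversed(z)),
]
gate_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
gate_weights, fused = moe_schedule([0.2, -0.4], gate_vectors, experts)
```

Because every expert contributes in proportion to its gate weight, this is a soft mixture; a hard variant would instead keep only the top-scoring expert.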

Diffusion training follows the denoising score matching objective:

$$q(x_t \mid x_0) = \mathcal{N}\left(x_t; \sqrt{\bar\alpha_t}\,x_0, (1-\bar\alpha_t)\mathbf{I}\right)$$

$$\mathcal{L}_\mathrm{diff} = \mathbb{E}_{t,\epsilon,x_0} \left\| \epsilon - \epsilon_\theta(x_t, t, \mathcal{F}^{(k)}) \right\|_2^2$$

Training proceeds by first pre-training the MoE head in isolation (with SDXL frozen), followed by joint end-to-end fine-tuning.
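
The forward noising step and the noise-prediction loss can be illustrated as below. This is a sketch under stated assumptions: the linear beta schedule is the common DDPM default and is not specified in the source, and the helper names `alpha_bar`, `q_sample`, and `diffusion_loss` are hypothetical. A real model would supply the predicted noise from the conditioned SDXL network.

```python
import math
import random


def alpha_bar(t, betas):
    """Cumulative product of (1 - beta_s) over the first t steps."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod


def q_sample(x0, t, betas, rng):
    """Sample x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I); return (x_t, eps)."""
    abar = alpha_bar(t, betas)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(abar) * x + math.sqrt(1.0 - abar) * e for x, e in zip(x0, eps)]
    return xt, eps


def diffusion_loss(eps, eps_pred):
    """Squared L2 distance between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


# Assumed linear schedule over 1000 steps (not taken from the source).
num_steps = 1000
betas = [1e-4 + (0.02 - 1e-4) * s / (num_steps - 1) for s in range(num_steps)]
x0 = [0.5, -1.0, 0.25]
xt, eps = q_sample(x0, t=500, betas=betas, rng=random.Random(0))
```

Given `xt` and the exact `eps`, the clean latent is recoverable in closed form, which is why predicting the noise is equivalent to predicting $x_0$.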

4. Training Protocol and Implementation

FoundIR-v2 is trained on a combination of publicly available datasets encompassing 50+ real-world sub-tasks, including but not limited to 4KRD (motion deblur), LSDIR (defocus deblur), PolyU (denoise), Dense-HAZE/NH-HAZE (dehaze), CSTNet HQ-NightRain (derain), UAV-Rain1k (raindrop removal), WeatherBench (desnow), UHD-LL (low-light), DIV2K/Flickr2K/DIV8K (super-resolution), FFHQ (faces), RealPhoto60 (real SR), and RealDeg (old-photo/face restoration). High-quality ground truth filtering is performed using deep multi-modal IQA metrics such as DA-CLIP and DepictQA.

The key hyperparameters are:

  • Hardware: 2× NVIDIA H20 GPUs (96 GB each)
  • Batch size: 32 (random $512\times512$ crops)
  • Optimizer: AdamW with default weight decay
  • Learning rate: VAE encoder $5\times10^{-6}$, others $5\times10^{-5}$ (cosine annealing)
  • Total iterations: 150k; evaluation interval $T = 1$k, 10 reference samples per task
  • Diffusion inference: Euler sampler, 20 steps, classifier-free guidance scale = 5; AdaIN color fix for SR tasks

This protocol supports scalable, balanced exposure to the full spectrum of restoration phenomena.
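
Two of the listed settings have simple closed forms, sketched below. The assumptions are labeled: the source gives only "cosine annealing", so a minimum learning rate of zero and no warmup are guesses, and `cosine_lr` and `cfg_combine` are hypothetical helper names. The guidance formula is the standard classifier-free guidance combination.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine annealing from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


TOTAL_STEPS = 150_000                                 # from the protocol above
BASE_LRS = {"vae_encoder": 5e-6, "others": 5e-5}      # from the protocol above
GUIDANCE_SCALE = 5.0                                  # from the protocol above

for name, base_lr in BASE_LRS.items():
    schedule = [cosine_lr(s, TOTAL_STEPS, base_lr) for s in (0, 75_000, 150_000)]
    print(name, schedule)
```

With a guidance scale of 1 the combination reduces to the conditional prediction; larger values extrapolate further away from the unconditional one.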

5. Empirical Results, Ablations, and Analysis

FoundIR-v2 achieves leading or near-leading performance on 80% or more of evaluated tasks, using the following metrics:

  • Full-reference: PSNR↑, SSIM↑, LPIPS↓
  • No-reference: MUSIQ↑, CLIPIQA+↑, PIQE↓, MANIQA↑, PaQ-2-PiQ↑

Table: Representative performance (DES vs. static mixing)

Task         Static mix PSNR (dB)   DES PSNR (dB)   Δ (dB)
Deblurring   18.91                  20.41           +1.50
Dehazing     18.69                  19.93           +1.24
Low-light    19.93                  20.41           +0.48
SR           18.91                  20.09           +1.18
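
To give the dB differences a more concrete reading: since PSNR is $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$, a gain of $g$ dB on a single image corresponds to the mean squared error shrinking by a factor of $10^{g/10}$. The sketch below recomputes the Δ column from the table and converts each gain to that factor; the helper names are hypothetical, and because the reported values are averages over test sets, the factors are indicative only.

```python
import math


def psnr(mse, max_val=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_from_gain(gain_db):
    """Factor by which MSE shrinks for a PSNR gain of gain_db decibels."""
    return 10.0 ** (gain_db / 10.0)


# (static mix PSNR, DES PSNR) in dB, copied from the table above.
rows = {
    "Deblurring": (18.91, 20.41),
    "Dehazing": (18.69, 19.93),
    "Low-light": (19.93, 20.41),
    "SR": (18.91, 20.09),
}

gains = {task: des - static for task, (static, des) in rows.items()}
for task, gain in gains.items():
    print(f"{task}: +{gain:.2f} dB, MSE smaller by x{mse_ratio_from_gain(gain):.2f}")
```

For instance, the +1.50 dB deblurring gain corresponds to an error roughly 1.41 times smaller.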

Ablation studies indicate:

  • DES provides roughly +1.2–1.5 dB gain vs. static mixing on most tasks; the low-light gain in the table above is smaller (+0.48 dB).
  • Soft MoE scheduling yields +0.3–0.7 dB vs. single-prior or hard MoE variants.
  • Removing low-quality ground truth increases PSNR by ~0.2–0.4 dB.

Qualitative results demonstrate restoration of sharp edges in motion blur, detail retention in SR, superior handling of heterogeneous murals (outperforming GPT-5 and HYPIR baselines), and effective simultaneous resolution of cascaded tasks such as deraining plus SR. FoundIR-v2 also outperforms pipeline architectures (FoundIR + SUPIR) in joint restoration settings.

Generalization extends to medical imaging domains; limited fine-tuning enables superior recovery of diagnostic structures in laparoscopy and microscopy compared to the prior FoundIR.

6. Significance, Limitations, and Open Directions

FoundIR-v2 establishes the critical importance of dynamic data mixture balancing—formalized as the "Data Mixing Law"—in achieving robust all-in-one restoration. Coupling with an MoE-driven diffusion scheduler enables task-adaptive prior generation in latent space, promoting strong generalization across a diverse sub-task landscape and favorable zero-shot transfer.

Limitations persist with respect to some extreme degradations where task-specialized models outperform the all-in-one approach. Scheduling overhead incurs modest added complexity, and scope remains for lighter-weight MoE variants.

Open directions include:

  • Developing adaptation signals beyond PSNR or MUSIQ for DES updates.
  • Scaling to temporal (video) or multi-modal (e.g., depth, semantics) restoration contexts.
  • Enabling continual learning for incremental addition of new tasks.
  • Investigating compact MoE schedulers for resource-constrained deployment.

FoundIR-v2 offers a scalable foundation for multi-task, real-world restoration, and highlights the role of adaptive data mixing for foundation models in image processing (Chen et al., 10 Dec 2025).
