
Dual-Level RL Framework

Updated 14 November 2025
  • Dual-Level RL Framework is a hierarchical approach combining local perturbation-driven adaptation with global meta-control for specialist scheduling.
  • The framework decomposes the restoration task into two interacting MDPs, enabling targeted image quality optimization and coordinated scene restoration.
  • Empirical evaluations demonstrate significant IQA gains and effective restoration of real-world adverse weather images, supported by cold-start pretraining on the HFLS-Weather dataset.

A dual-level reinforcement learning (RL) framework is a hierarchical architecture wherein two distinct RL processes, typically operating at different temporal, spatial, or cognitive scales, are orchestrated to solve complex decision problems. In the context of real-world adverse weather image restoration, such a framework is precisely formulated to couple low-level, weather-specific model adaptation (local MDP) with high-level, scene-adaptive meta-control (global MDP), enabling robust and continuous adaptation to nonstationary, unpaired real-world degradations (Liu et al., 7 Nov 2025).

1. Two-Level Markov Decision Process Decomposition

The dual-level RL framework formalizes the restoration problem as a pair of interacting MDPs, each with distinct state, action, transition and reward structures.

Local-Level MDP (Perturbation-Driven Image Quality Optimization):

  • State: $s = (x, \theta)$, where $x$ is a degraded input image and $\theta \in \mathbb{R}^d$ are the current specialist-model parameters.
  • Action: Small Gaussian perturbations $\Delta \sim \mathcal{N}(0, \sigma^2 I)$ applied to $\theta$, i.e., the local policy explores parameter space stochastically.
  • Transition: Deterministic update $\theta_{t+1} = \theta_t + \alpha \cdot g_t$ (gradient-based), after which the environment supplies a new image $x_{t+1}$.
  • Reward: A composite no-reference image quality assessment (IQA) reward is computed for each perturbed output:

$$r_i = w_1 \cdot \text{LIQE}(\hat{I}_i) + w_2 \cdot \text{CLIP-IQA}(\hat{I}_i) + w_3 \cdot \text{Q-Align}(\hat{I}_i)$$

Only perturbations yielding MUSIQ increases over the baseline are retained.
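
To make the reward and filtering step concrete, the sketch below shows one way they might be computed; the scorer callables and the weight values are illustrative assumptions, not values from the paper.

```python
def composite_reward(restored, liqe, clip_iqa, q_align, weights=(0.4, 0.3, 0.3)):
    """Weighted sum of no-reference IQA scores, mirroring r_i above."""
    w1, w2, w3 = weights
    return w1 * liqe(restored) + w2 * clip_iqa(restored) + w3 * q_align(restored)


def musiq_filter(candidate_outputs, baseline_output, musiq):
    """Indices of perturbed outputs whose MUSIQ score exceeds the baseline's."""
    base_score = musiq(baseline_output)
    return [i for i, img in enumerate(candidate_outputs) if musiq(img) > base_score]
```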

Global-Level MDP (Meta-Controller for Agent Scheduling):

  • State: Scene encoding $d = \text{CLIP\_emb}(x)$ together with historical per-agent success rates $\{\rho_j\}$.
  • Action: Selection of a restoration specialist/model $m \in M = \{\text{derain, dehaze, desnow, ...}\}$ and its position in the execution sequence.
  • Transition: Applying the selected model transforms the image and updates $d$ and $\{\rho_j\}$.
  • Reward: Gain in PIQO-style IQA after restoration, clipped to zero when negative.

This dual-MDP approach effectively disentangles the adaptation of restoration specialists to specific, complex degradations (local) from the optimal selection and ordering of these specialists according to high-level scene descriptors (global).
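
As a simplified illustration of this decomposition, the containers below sketch what the two state spaces might hold; the field names are assumptions for exposition, not the paper's data structures.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class LocalState:
    """Local MDP state s = (x, theta): degraded image plus specialist parameters."""
    image: np.ndarray   # degraded input x
    theta: np.ndarray   # specialist parameters, shape (d,)


@dataclass
class GlobalState:
    """Global MDP state: scene descriptor plus historical per-agent success rates."""
    scene_embedding: np.ndarray                        # CLIP embedding of the image
    success_rates: dict = field(default_factory=dict)  # {specialist_name: rho_j}
```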

2. Policy Optimization: Local and Global Interactions

Local-Level: Perturbation-Driven Reinforcement

The local-level optimization relies on a Gaussian policy over parameter perturbations. For a sampled image $x$, $N$ perturbations $\{\Delta_i\}_{i=1}^{N}$ are generated and applied, and the corresponding rewards $r_i$ are computed as above. The normalized advantage for each perturbation is

$$A_i = \frac{r_i - \bar{r}}{\sigma_r + \epsilon}.$$

The policy gradient is estimated as

$$g = -\frac{1}{|S|} \sum_{i \in S} A_i \cdot \Delta_i,$$

where $S$ indexes the perturbations that pass the MUSIQ-score filter. To enforce a trust-region constraint, the approximate parameter-space KL divergence is

$$\text{KL}_\text{approx} = \frac{1}{|S|}\sum_{i \in S}\frac{\|\Delta_i\|^2}{d}.$$

If $\text{KL}_\text{approx} > \tau$, the gradient is downscaled:

$$\text{scale} = \begin{cases} \sqrt{\tau/\text{KL}_\text{approx}} & \text{if } \text{KL}_\text{approx} > \tau \\ 1 & \text{otherwise.} \end{cases}$$

The parameters are then updated as $\theta \leftarrow \theta + \eta \cdot \text{scale} \cdot g$.
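
The following sketch implements one such local update under the definitions above; the restore, reward_fn, and musiq callables and all hyperparameter defaults are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def local_piqo_step(theta, x, restore, reward_fn, musiq,
                    n_pert=8, sigma=1e-3, eta=1e-2, tau=1e-4, eps=1e-8):
    """One perturbation-driven update of specialist parameters theta on image x."""
    d = theta.size
    deltas = np.random.normal(0.0, sigma, size=(n_pert, d))
    outputs = [restore(theta + delta, x) for delta in deltas]
    rewards = np.array([reward_fn(out) for out in outputs])

    # Normalized advantages A_i across all perturbations.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)

    # Keep only perturbations whose MUSIQ score beats the unperturbed baseline.
    base_musiq = musiq(restore(theta, x))
    keep = [i for i, out in enumerate(outputs) if musiq(out) > base_musiq]
    if not keep:
        return theta  # no accepted perturbations; skip the update

    # Policy-gradient estimate over the accepted set S (sign as written in the text).
    g = -np.mean([adv[i] * deltas[i] for i in keep], axis=0)

    # Approximate parameter-space KL and trust-region scaling.
    kl = np.mean([deltas[i] @ deltas[i] / d for i in keep])
    scale = np.sqrt(tau / kl) if kl > tau else 1.0
    return theta + eta * scale * g
```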

Global-Level: Meta-Controller Policy Gradient

The global meta-controller maintains a stochastic policy $\pi_\text{global}(m \mid S; \phi)$. At each step, after specialist execution and observation of the IQA improvement $R_t$, the policy is updated using the REINFORCE gradient

$$\nabla_\phi J \approx \sum_t \bigl(R_t - b(S_t)\bigr)\, \nabla_\phi \log \pi_\text{global}(m_t \mid S_t; \phi),$$

where $b(S_t)$ is a learned baseline for variance reduction. Optionally, PPO-style clipping or further trust-region regularization can be applied.
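
A minimal PyTorch sketch of this update, assuming the policy and baseline are small networks over the state encoding and that the replay buffer yields (state, action, reward) tuples; all names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F


def reinforce_update(policy_net, baseline_net, optimizer, batch):
    """One REINFORCE step for the meta-controller from replayed (S, m, R) tuples."""
    states = torch.stack([s for s, _, _ in batch])                    # (B, state_dim)
    actions = torch.tensor([m for _, m, _ in batch])                  # (B,)
    rewards = torch.tensor([r for _, _, r in batch], dtype=torch.float32)

    log_probs = F.log_softmax(policy_net(states), dim=-1)             # (B, num_specialists)
    chosen_log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

    baseline = baseline_net(states).squeeze(-1)                       # b(S_t)
    advantage = rewards - baseline.detach()

    policy_loss = -(advantage * chosen_log_probs).mean()              # REINFORCE objective
    baseline_loss = F.mse_loss(baseline, rewards)                     # fit the baseline

    optimizer.zero_grad()
    (policy_loss + baseline_loss).backward()
    optimizer.step()
```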

This decoupling of adaptation (exploration in parameter space) from agent scheduling (exploration of model combinations) allows for simultaneous exploitation of specialist restoration expertise and efficient global model orchestration.

3. Cold-Start Initialization with Physics-Driven Ground Truth

The effectiveness of the dual-level architecture fundamentally depends on high-quality initialization of the specialist models. The HFLS-Weather dataset, comprising one million paired clean-degraded images generated via physics- and depth-aware synthesis, serves this purpose:

  • Supervised pretraining: Each specialist $f_\theta$ minimizes

$$L_\text{sup}(\theta) = \mathbb{E}_{(J, I_w) \sim \text{HFLS}} \bigl[\|f_\theta(I_w) - J\|^2\bigr] + \lambda \|\theta\|^2$$

where $J$ is the clean image and $I_w$ its degraded counterpart.

  • Adversarial regularization: A small adversarial loss on $f_\theta(x)$ encourages realism,

$$L_\text{adv} = \mathbb{E}_x\bigl[-\log D(f_\theta(x))\bigr]$$

with total loss $L_\text{pre} = L_\text{sup} + \lambda_\text{adv} L_\text{adv}$.

Each specialist thus receives a task-specific, data-driven cold start with strong generalization to diverse, real-world degradations.
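
A hedged sketch of how the combined pretraining objective could be assembled for one batch, assuming the specialist and a logit-output discriminator are standard PyTorch modules; the weighting values are placeholders.

```python
import torch
import torch.nn.functional as F


def pretrain_loss(specialist, discriminator, degraded, clean,
                  weight_decay=1e-4, lambda_adv=1e-3):
    """Supervised reconstruction + parameter regularization + small adversarial term."""
    restored = specialist(degraded)

    l_sup = F.mse_loss(restored, clean)
    l_reg = weight_decay * sum(p.pow(2).sum() for p in specialist.parameters())

    # -log D(f_theta(x)), written for a logit-output discriminator.
    d_logits = discriminator(restored)
    l_adv = F.binary_cross_entropy_with_logits(d_logits, torch.ones_like(d_logits))

    return l_sup + l_reg + lambda_adv * l_adv
```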

4. Training and Inference: Local-Global RL Loop

Training Procedure

  1. Cold-Start: Pretrain all specialists $f_{\theta,k}$ on HFLS-Weather via $L_\text{pre}$.
  2. Initialization: Instantiate the global meta-controller ($\phi$) and all local specialist parameters.
  3. Epoch Loop (see the schematic sketch after this list):
    a. Sample real-world images $\{x_i\}$.
    b. For each $x_i$:
      i. The meta-controller proposes $m \sim \pi_\text{global}(\cdot \mid S_i; \phi)$.
      ii. Apply the local PIQO procedure: sample $\{\Delta_j\}$, generate outputs, compute rewards and advantages, and update $\theta_m$ using the scaled gradient.
      iii. Observe the IQA improvement and store $(S_i, m, R)$ in the replay buffer.
    c. Update $\phi$ by the global policy gradient using the stored experience.
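
The epoch loop could be organized roughly as below; the meta_controller, specialists, and scoring callables are hypothetical interfaces tying together the sketches above, not the authors' code.

```python
def train_epoch(images, meta_controller, specialists, local_update,
                scene_encode, composite_iqa, replay_buffer):
    """One epoch of the joint local-global loop; all collaborators are injected."""
    for x in images:
        state = scene_encode(x)                    # CLIP embedding + success rates
        m = meta_controller.sample(state)          # step 3b.i: propose a specialist
        before = composite_iqa(x)

        # Step 3b.ii: local PIQO adaptation of the chosen specialist on this image.
        specialists[m].theta = local_update(specialists[m].theta, x)

        # Step 3b.iii: observe the IQA improvement and store the transition.
        restored = specialists[m].restore(x)
        reward = max(composite_iqa(restored) - before, 0.0)
        replay_buffer.append((state, m, reward))

    # Step 3c: update the meta-controller from the stored experience.
    meta_controller.update(list(replay_buffer))
```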

Inference Procedure

At test time, the inference loop is non-perturbative and strictly greedy:

  1. Initialize $x^0 = x$, $k = 0$.
  2. While $k < 3$ and degradation is detected:
    a. Compute $S = (\text{CLIP\_emb}(x^k), \{\rho_j\})$.
    b. Select $m = \arg\max_j \pi_\text{global}(j \mid S; \phi)$.
    c. Set $x^{k+1} = f_{\theta_m}(x^k)$.
    d. If the IQA score does not improve, remove $m$ from the candidate set; otherwise set $k \leftarrow k+1$.
  3. Return the best $x^k$ encountered (see the sketch below).
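
A minimal sketch of this greedy loop, with the scene encoder, IQA scorer, and meta-controller interface passed in as assumptions; the degradation detector is omitted and the candidate set is used as the stopping proxy.

```python
def greedy_inference(x, meta_controller, specialists, scene_encode, iqa,
                     max_steps=3):
    """Greedy multi-specialist restoration: keep applying agents while IQA improves."""
    current, best_score = x, iqa(x)
    candidates = set(specialists.keys())
    k = 0
    while k < max_steps and candidates:
        state = scene_encode(current)
        # Highest-probability specialist among the remaining candidates.
        ranked = meta_controller.rank(state)       # specialists ordered by pi_global
        m = next(j for j in ranked if j in candidates)

        proposal = specialists[m].restore(current)
        score = iqa(proposal)
        if score <= best_score:
            candidates.discard(m)                  # this specialist did not help
        else:
            current, best_score = proposal, score
            k += 1
    return current                                 # best restoration encountered
```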

This process provides modular specialist selection together with sample-efficient, continual adaptation.

5. Quantitative Evaluation and Empirical Insights

Cold-Start Source and Component Ablation

  • Models pretrained on HFLS-Weather yield gains of up to $+0.5$ Q-Align, $+0.07$ CLIP-IQA, and $+6$ MUSIQ over baselines trained on synthetic datasets (Table 3).
  • Progressive addition of the meta-controller and PIQO consistently yields substantial CLIP-IQA and Q-Align improvements:
    • For Snow: CLIP-IQA improves from $0.477$ (Basic) to $0.591$ (Full), and Q-Align from $3.66$ to $3.95$ (Table 6).

State-of-the-Art Comparison

  • On real-world Snow, Haze, Rain:
    • DRL achieves Snow Q-Align $3.9569$ vs. next-best $3.6492$ and Rain CLIP-IQA $0.5623$ vs. next-best $0.4656$.
    • Outperforms previous bests from Chen et al., WGWS, PromptIR, OneRestore, DA-CLIP, DFPIR, and JarvisIR across all metrics (Tables 4–5).

Resource Efficiency and Latency

  • Inference latency is $570$ ms (multi-agent) versus $17$–$208$ ms for single-model baselines; this is still over an order of magnitude faster than other multi-agent systems (e.g., DA-CLIP: $6543$ ms, JarvisIR: $15250$ ms).

These results underscore that dual-level RL with cold-start and PIQO/meta-control bridges the empirical gap between synthetic training and true adverse weather generalization.

6. Significance, Generalization, and Limitations

The framework demonstrates that hierarchical RL decomposition—where parameter-space exploration (local) and agent selection order (global) are learned jointly—enables sample-efficient, label-free, and robust adaptation for challenging real-world restoration tasks. By leveraging a physics-based cold start, it avoids domain gap overfitting and supports continual learning without paired supervision or domain adaptation modules.

The approach is limited by the increased per-inference latency of multi-agent orchestration relative to single-model approaches, though this cost is offset by the significant restoration-quality improvements and remains far lower than that of previous ensemble-based systems.

Potential future research directions include further reducing inference complexity via agent pruning, extending the dual-level RL paradigm to other sensor modalities, and generalizing the PIQO/meta-control scheme to non-vision tasks with analogous compositional specialist architectures.

In summary, the dual-level reinforcement learning framework achieves state-of-the-art generalization for real-world adverse weather restoration by combining high-fidelity cold-start pretraining, perturbation-driven local model adaptation, and global meta-control for agent scheduling and execution order selection, producing significant empirical gains in restoration quality and robustness over alternative single- and multi-agent baselines.
