Dual-Level RL Framework
- Dual-Level RL Framework is a hierarchical approach combining local perturbation-driven adaptation with global meta-control for specialist scheduling.
- The framework decomposes the restoration task into two interacting MDPs, enabling targeted image quality optimization and coordinated scene restoration.
- Empirical evaluations demonstrate significant IQA gains and effective restoration of real-world adverse weather images, with cold-start pretraining provided by the HFLS-Weather dataset.
A dual-level reinforcement learning (RL) framework is a hierarchical architecture wherein two distinct RL processes, typically operating at different temporal, spatial, or cognitive scales, are orchestrated to solve complex decision problems. In the context of real-world adverse weather image restoration, such a framework is precisely formulated to couple low-level, weather-specific model adaptation (local MDP) with high-level, scene-adaptive meta-control (global MDP), enabling robust and continuous adaptation to nonstationary, unpaired real-world degradations (Liu et al., 7 Nov 2025).
1. Two-Level Markov Decision Process Decomposition
The dual-level RL framework formalizes the restoration problem as a pair of interacting MDPs, each with distinct state, action, transition and reward structures.
Local-Level MDP (Perturbation-Driven Image Quality Optimization):
- State: $s_t = (x_t, \theta_t)$, where $x_t$ is a degraded input image and $\theta_t$ are the current specialist-model parameters.
- Action: Small Gaussian perturbations $\epsilon_k \sim \mathcal{N}(0, \sigma^2 I)$ applied to $\theta_t$, i.e., the local policy explores parameter space stochastically.
- Transition: Deterministic (gradient-based) parameter update, after which the environment supplies a new image $x_{t+1}$.
- Reward: A composite no-reference image quality assessment (IQA) reward $r_k$ is computed for each perturbed output; only perturbations yielding MUSIQ increases over the unperturbed baseline are retained.
Global-Level MDP (Meta-Controller for Agent Scheduling):
- State: Scene encoding $z_t$ and historical per-agent success rates $h_t$.
- Action: Selection of a restoration specialist/model $a_t$ and its position in the execution sequence.
- Transition: Applying the selected specialist mutates the image and updates $z_t$ and $h_t$.
- Reward: The gain in the PIQO-style IQA score after restoration, clipped at zero so that negative changes contribute no reward.
This dual-MDP approach effectively disentangles the adaptation of restoration specialists to specific, complex degradations (local) from the optimal selection and ordering of these specialists according to high-level scene descriptors (global).
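To make the decomposition concrete, the following minimal Python sketch shows one way the two state/action spaces could be represented; the class and field names are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of the two state/action spaces; all names are illustrative,
# not taken from the paper's implementation.
from dataclasses import dataclass, field
from typing import Dict
import numpy as np


@dataclass
class LocalState:
    """Local MDP state: degraded image x_t and current specialist parameters theta_t."""
    image: np.ndarray   # H x W x 3 degraded input
    theta: np.ndarray   # flattened specialist parameters


@dataclass
class LocalAction:
    """Local MDP action: a Gaussian perturbation epsilon ~ N(0, sigma^2 I) of theta."""
    epsilon: np.ndarray


@dataclass
class GlobalState:
    """Global MDP state: scene encoding z_t and per-agent historical success rates h_t."""
    scene_encoding: np.ndarray                 # e.g., a learned scene descriptor
    success_rates: Dict[str, float] = field(default_factory=dict)


@dataclass
class GlobalAction:
    """Global MDP action: which restoration specialist to apply next."""
    specialist_id: str
```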
2. Policy Optimization: Local and Global Interactions
Local-Level: Perturbation-Driven Reinforcement
The local-level optimization relies on a Gaussian policy over parameter perturbations. For a sampled image $x$, perturbations $\{\epsilon_k\}_{k=1}^{K}$ are generated, applied, and the corresponding rewards $r_k$ computed as above. The normalized advantage for each perturbation is $\hat{A}_k = (r_k - \bar{r})/(\sigma_r + \varepsilon)$, where $\bar{r}$ and $\sigma_r$ are the mean and standard deviation of the retained rewards. The policy gradient is estimated as $g = \frac{1}{|\mathcal{K}|\,\sigma} \sum_{k \in \mathcal{K}} \hat{A}_k\, \epsilon_k$, where $\mathcal{K}$ indexes the perturbations passing the MUSIQ-score filter. To enforce a trust-region constraint, the approximate parameter-space KL divergence of the proposed step $\Delta\theta = \eta\, g$ is monitored, $\widehat{D}_{\mathrm{KL}} \approx \|\Delta\theta\|^2 / (2\sigma^2)$.
If $\widehat{D}_{\mathrm{KL}} > \delta$, the gradient is downscaled by $\sqrt{\delta/\widehat{D}_{\mathrm{KL}}}$. Update: $\theta \leftarrow \theta + \eta\, g$.
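A compact sketch of this perturbation-driven (PIQO) update is given below, assuming NumPy and treating the specialist forward pass (`restore`) and the no-reference IQA scorers (`composite_iqa`, `musiq`) as externally supplied callables; all names and hyperparameter values are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the perturbation-driven local update with MUSIQ filtering and a
# trust-region check; `restore`, `composite_iqa`, and `musiq` are placeholders.
import numpy as np

def piqo_update(theta, x, restore, composite_iqa, musiq,
                num_perturbations=16, sigma=0.01, lr=0.1, kl_limit=0.01):
    baseline_musiq = musiq(restore(theta, x))      # MUSIQ of the unperturbed output
    kept_eps, rewards = [], []
    for _ in range(num_perturbations):
        eps = np.random.normal(0.0, sigma, size=theta.shape)
        out = restore(theta + eps, x)
        if musiq(out) > baseline_musiq:            # keep only MUSIQ-improving perturbations
            kept_eps.append(eps)
            rewards.append(composite_iqa(out))     # composite no-reference IQA reward
    if not kept_eps:
        return theta                               # no accepted perturbation this step
    rewards = np.array(rewards)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # normalized advantages
    grad = sum(a * e for a, e in zip(adv, kept_eps)) / (len(kept_eps) * sigma)
    step = lr * grad
    kl = np.sum(step ** 2) / (2.0 * sigma ** 2)    # Gaussian parameter-space KL estimate
    if kl > kl_limit:                              # trust region: downscale the step
        step *= np.sqrt(kl_limit / kl)
    return theta + step
```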
Global-Level: Meta-Controller Policy Gradient
The global meta-controller maintains a stochastic policy $\pi_\phi(a \mid z, h)$ over restoration specialists. At each step, after specialist execution and observing the IQA improvement $\Delta\mathrm{IQA}$, the policy is updated using the REINFORCE gradient $\nabla_\phi J = \big(\Delta\mathrm{IQA} - b(z, h)\big)\, \nabla_\phi \log \pi_\phi(a \mid z, h)$, where $b(z, h)$ is a trained baseline for variance reduction. Optionally, PPO-style clipping or further trust-region regularization can be used.
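The meta-controller update can be sketched as a standard REINFORCE step with a learned baseline, as below; the network sizes, shared optimizer, and value-loss weighting are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of the meta-controller's REINFORCE update with a learned baseline;
# architecture and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class MetaController(nn.Module):
    def __init__(self, state_dim=512, num_specialists=4):
        super().__init__()
        self.policy = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                    nn.Linear(256, num_specialists))
        self.baseline = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                      nn.Linear(256, 1))

    def forward(self, state):
        # Categorical distribution over restoration specialists
        return torch.distributions.Categorical(logits=self.policy(state))

def reinforce_step(controller, optimizer, state, action, iqa_gain):
    """One REINFORCE update from a single (state, action, IQA-gain) transition."""
    dist = controller(state)
    value = controller.baseline(state).squeeze(-1)
    advantage = iqa_gain - value.detach()           # baseline for variance reduction
    policy_loss = -(advantage * dist.log_prob(action))
    value_loss = (value - iqa_gain) ** 2            # regress the baseline toward the reward
    loss = (policy_loss + 0.5 * value_loss).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```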
This decoupling of adaptation (exploration in parameter space) from agent scheduling (exploration of model combinations) allows for simultaneous exploitation of specialist restoration expertise and efficient global model orchestration.
3. Cold-Start Initialization with Physics-Driven Ground Truth
The effectiveness of the dual-level architecture fundamentally depends on high-quality initialization of the specialist models. The HFLS-Weather dataset, comprising one million paired clean-degraded images generated via physics- and depth-aware synthesis, serves this purpose:
- Supervised pretraining: Each specialist $f_\theta$ minimizes a pixel-wise reconstruction loss $\mathcal{L}_{\mathrm{rec}}\big(f_\theta(x_d), x_c\big)$ ($x_c$ clean, $x_d$ degraded).
- Adversarial regularization: A small adversarial loss $\mathcal{L}_{\mathrm{adv}}$ on the restored output $f_\theta(x_d)$ encourages realism, with total loss $\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda\, \mathcal{L}_{\mathrm{adv}}$.
Each specialist thus receives a task-specific, data-driven cold start with strong generalization to diverse, real-world degradations.
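A minimal sketch of such a cold-start objective is shown below, assuming an L1 reconstruction term and a non-saturating adversarial term; `specialist`, `discriminator`, and `lambda_adv` are placeholder names, not the paper's API.

```python
# Sketch of the cold-start objective on HFLS-Weather pairs; the exact loss terms
# and the weight `lambda_adv` are illustrative assumptions.
import torch
import torch.nn.functional as F

def cold_start_loss(specialist, discriminator, degraded, clean, lambda_adv=0.01):
    restored = specialist(degraded)
    rec_loss = F.l1_loss(restored, clean)          # pixel-wise reconstruction term (L1 here)
    logits = discriminator(restored)               # adversarial term pushes outputs toward "real"
    adv_loss = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return rec_loss + lambda_adv * adv_loss        # total loss L = L_rec + lambda * L_adv
```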
4. Training and Inference: Local-Global RL Loop
Training Procedure
- Cold-Start: Pretrain all specialists on HFLS-Weather via the combined loss $\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda\, \mathcal{L}_{\mathrm{adv}}$.
- Initialization: Instantiate the global meta-controller $\pi_\phi$ and all local specialist parameters $\{\theta_i\}$.
- Epoch Loop (a code sketch follows this list):
  a. Sample a batch of real-world images $\{x_j\}$.
  b. For each $x_j$:
     i. The meta-controller proposes a specialist $a_j \sim \pi_\phi(\cdot \mid z_j, h)$.
     ii. Apply the local PIQO procedure: sample perturbations $\{\epsilon_k\}$, generate the perturbed outputs, compute rewards and advantages, and update $\theta_{a_j}$ with the scaled gradient.
     iii. Observe the IQA improvement and store $(z_j, a_j, \Delta\mathrm{IQA}_j)$ in a replay buffer.
  c. Update $\phi$ by the global policy gradient using the stored experience.
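One epoch of this loop might be tied together as in the following sketch, which reuses the `piqo_update`-style local adaptation and the `reinforce_step` sketch above; the specialist interface (`adapt`, `restore`, `name`), the scene encoder, and the running success-rate update are illustrative assumptions.

```python
# Sketch of one training epoch of the local-global RL loop; all interfaces
# (adapt/restore/name, encode_scene, composite_iqa) are illustrative placeholders.
def train_epoch(images, specialists, controller, optimizer,
                encode_scene, composite_iqa, reinforce_step, history):
    replay = []
    for x in images:                                    # a. sample real-world images
        state = encode_scene(x, history)                #    scene code + per-agent success rates
        action = controller(state).sample()             # b.i  meta-controller proposes a specialist
        spec = specialists[action.item()]
        before = composite_iqa(x)
        spec.adapt(x)                                   # b.ii local PIQO update of this specialist
        gain = composite_iqa(spec.restore(x)) - before  # b.iii observed IQA improvement
        history[spec.name] = 0.9 * history.get(spec.name, 0.0) + 0.1 * float(gain > 0)
        replay.append((state, action, gain))            #      store experience for the global update
    for state, action, gain in replay:                  # c. update the meta-controller
        reinforce_step(controller, optimizer, state, action, gain)
```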
Inference Procedure
At test time, the inference loop is non-perturbative and strictly greedy:
- Initialize $x_0 \leftarrow x$, $t \leftarrow 0$.
- While $t < T_{\max}$ and degradation is still detected:
  a. Compute the scene encoding $z_t$.
  b. Select $a_t = \arg\max_a \pi_\phi(a \mid z_t, h_t)$.
  c. Apply the specialist: $x_{t+1} = f_{\theta_{a_t}}(x_t)$.
  d. If the IQA score does not improve, remove $a_t$ from the candidate set; otherwise set $t \leftarrow t + 1$.
- Return the best restored image encountered (a code sketch follows this list).
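A sketch of this greedy inference loop is given below, reusing the illustrative interfaces from the training sketch; the "degradation detected" test is approximated here by simply checking whether any candidate specialists remain.

```python
# Sketch of the greedy, non-perturbative inference loop; interfaces and the
# termination test are illustrative assumptions.
def restore_image(x, specialists, controller, encode_scene, composite_iqa,
                  history, max_steps=4):
    candidates = list(specialists)
    best, best_score, t = x, composite_iqa(x), 0
    while t < max_steps and candidates:
        state = encode_scene(x, history)
        probs = controller(state).probs                 # greedy: pick the most probable candidate
        spec = max(candidates, key=lambda s: probs[s.index].item())
        out = spec.restore(x)
        score = composite_iqa(out)
        if score <= composite_iqa(x):                   # no IQA gain: discard this specialist
            candidates.remove(spec)
        else:                                           # accept the step and continue
            x, t = out, t + 1
            if score > best_score:
                best, best_score = out, score
    return best
```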
This design combines modular specialist selection at inference time with the sample-efficient, continual adaptation acquired during training.
5. Quantitative Evaluation and Empirical Insights
Cold-Start Source and Component Ablation
- Models pretrained on HFLS-Weather yield consistent Q-Align, CLIP-IQA, and MUSIQ gains over baselines trained on existing synthetic datasets (Table 3).
- Progressive addition of the meta-controller and PIQO consistently yields substantial CLIP-IQA and Q-Align improvements:
- For Snow: CLIP-IQA improves from $0.477$ (Basic) to $0.591$ (Full), with a corresponding Q-Align gain (Table 6).
State-of-the-Art Comparison
- On real-world Snow, Haze, Rain:
- The dual-level RL framework (DRL) achieves Snow Q-Align $3.9569$ vs. the next-best $3.6492$, and Rain CLIP-IQA $0.5623$ vs. the next-best $0.4656$.
- Outperforms previous bests from Chen et al., WGWS, PromptIR, OneRestore, DA-CLIP, DFPIR, and JarvisIR across all metrics (Tables 4–5).
Resource Efficiency and Latency
- Inference latency is $570$ ms (multi-agent) versus $17$–$208$ ms for single-model baselines; this is roughly an order of magnitude faster than other multi-agent systems (e.g., DA-CLIP: $6543$ ms, JarvisIR: $15250$ ms).
These results underscore that dual-level RL with cold-start and PIQO/meta-control bridges the empirical gap between synthetic training and true adverse weather generalization.
6. Significance, Generalization, and Limitations
The framework demonstrates that hierarchical RL decomposition—where parameter-space exploration (local) and agent selection order (global) are learned jointly—enables sample-efficient, label-free, and robust adaptation for challenging real-world restoration tasks. By leveraging a physics-based cold start, it avoids domain gap overfitting and supports continual learning without paired supervision or domain adaptation modules.
The approach is limited by the increased per-inference latency of multi-agent orchestration relative to single-model approaches, though this is amortized by the significant restoration quality improvements and is far lower than previous ensemble-based systems.
Potential future research directions include further reducing inference complexity via agent pruning, extending the dual-level RL paradigm to other sensor modalities, and generalizing the PIQO/meta-control scheme to non-vision tasks with analogous compositional specialist architectures.
In summary, the dual-level reinforcement learning framework achieves state-of-the-art generalization for real-world adverse weather restoration by combining high-fidelity cold-start pretraining, perturbation-driven local model adaptation, and global meta-control for agent scheduling and execution order selection, producing significant empirical gains in restoration quality and robustness over alternative single- and multi-agent baselines.