Dual-Level RL Framework
- Dual-Level RL Framework is a hierarchical approach combining local perturbation-driven adaptation with global meta-control for specialist scheduling.
- The framework decomposes the restoration task into two interacting MDPs, enabling targeted image quality optimization and coordinated scene restoration.
- Empirical evaluations demonstrate significant IQA gains and effective restoration of real-world adverse weather images, with cold-start pretraining provided by the HFLS-Weather dataset.
A dual-level reinforcement learning (RL) framework is a hierarchical architecture wherein two distinct RL processes, typically operating at different temporal, spatial, or cognitive scales, are orchestrated to solve complex decision problems. In the context of real-world adverse weather image restoration, such a framework is precisely formulated to couple low-level, weather-specific model adaptation (local MDP) with high-level, scene-adaptive meta-control (global MDP), enabling robust and continuous adaptation to nonstationary, unpaired real-world degradations (Liu et al., 7 Nov 2025).
1. Two-Level Markov Decision Process Decomposition
The dual-level RL framework formalizes the restoration problem as a pair of interacting MDPs, each with distinct state, action, transition and reward structures.
Local-Level MDP (Perturbation-Driven Image Quality Optimization):
- State: $s_t = (x_t, \theta_t)$, where $x_t$ is a degraded input image and $\theta_t$ are the current specialist-model parameters.
- Action: Small Gaussian perturbations $\epsilon_k \sim \mathcal{N}(0, \sigma^2 I)$ applied to $\theta_t$, i.e., the local policy explores parameter space stochastically.
- Transition: Deterministic (gradient-based) parameter update, after which the environment supplies a new image $x_{t+1}$.
- Reward: A composite no-reference image quality assessment (IQA) reward $r_k$ is computed for each perturbed output; only perturbations yielding MUSIQ increases over the unperturbed baseline are retained.
Global-Level MDP (Meta-Controller for Agent Scheduling):
- State: Scene encoding $z_t$ and historical per-agent success rates $h_t$.
- Action: Selection of a restoration specialist/model $a_t$ and its position in the execution sequence.
- Transition: Applying the selected specialist mutates the image and updates $z_t$ and $h_t$.
- Reward: The gain in the PIQO-style IQA score after restoration, clipped at zero so that negative changes contribute no reward.
This dual-MDP approach effectively disentangles the adaptation of restoration specialists to specific, complex degradations (local) from the optimal selection and ordering of these specialists according to high-level scene descriptors (global).
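To make the decomposition concrete, the following minimal Python sketch shows one way the two state/action spaces could be represented; the class and field names are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of the two state/action spaces; all names are illustrative,
# not taken from the paper's implementation.
from dataclasses import dataclass, field
from typing import Dict
import numpy as np


@dataclass
class LocalState:
    """Local MDP state: degraded image x_t and current specialist parameters theta_t."""
    image: np.ndarray   # H x W x 3 degraded input
    theta: np.ndarray   # flattened specialist parameters


@dataclass
class LocalAction:
    """Local MDP action: a Gaussian perturbation epsilon ~ N(0, sigma^2 I) of theta."""
    epsilon: np.ndarray


@dataclass
class GlobalState:
    """Global MDP state: scene encoding z_t and per-agent historical success rates h_t."""
    scene_encoding: np.ndarray                 # e.g., a learned scene descriptor
    success_rates: Dict[str, float] = field(default_factory=dict)


@dataclass
class GlobalAction:
    """Global MDP action: which restoration specialist to apply next."""
    specialist_id: str
```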
2. Policy Optimization: Local and Global Interactions
Local-Level: Perturbation-Driven Reinforcement
The local-level optimization relies on a Gaussian policy over parameter perturbations. For a sampled image $x$, perturbations $\{\epsilon_k\}_{k=1}^{K}$ are generated, applied, and the corresponding rewards $r_k$ computed as above. The normalized advantage for each perturbation is $\hat{A}_k = (r_k - \bar{r})/(\sigma_r + \varepsilon)$, where $\bar{r}$ and $\sigma_r$ are the mean and standard deviation of the retained rewards. The policy gradient is estimated as $g = \frac{1}{|\mathcal{K}|\,\sigma} \sum_{k \in \mathcal{K}} \hat{A}_k\, \epsilon_k$, where $\mathcal{K}$ indexes the perturbations passing the MUSIQ-score filter. To enforce a trust-region constraint, the approximate parameter-space KL divergence of the proposed step $\Delta\theta = \eta\, g$ is monitored, $\widehat{D}_{\mathrm{KL}} \approx \|\Delta\theta\|^2 / (2\sigma^2)$.
If $\widehat{D}_{\mathrm{KL}} > \delta$, the gradient is downscaled by $\sqrt{\delta/\widehat{D}_{\mathrm{KL}}}$. Update: $\theta \leftarrow \theta + \eta\, g$.
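A compact sketch of this perturbation-driven (PIQO) update is given below, assuming NumPy and treating the specialist forward pass (`restore`) and the no-reference IQA scorers (`composite_iqa`, `musiq`) as externally supplied callables; all names and hyperparameter values are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the perturbation-driven local update with MUSIQ filtering and a
# trust-region check; `restore`, `composite_iqa`, and `musiq` are placeholders.
import numpy as np

def piqo_update(theta, x, restore, composite_iqa, musiq,
                num_perturbations=16, sigma=0.01, lr=0.1, kl_limit=0.01):
    baseline_musiq = musiq(restore(theta, x))      # MUSIQ of the unperturbed output
    kept_eps, rewards = [], []
    for _ in range(num_perturbations):
        eps = np.random.normal(0.0, sigma, size=theta.shape)
        out = restore(theta + eps, x)
        if musiq(out) > baseline_musiq:            # keep only MUSIQ-improving perturbations
            kept_eps.append(eps)
            rewards.append(composite_iqa(out))     # composite no-reference IQA reward
    if not kept_eps:
        return theta                               # no accepted perturbation this step
    rewards = np.array(rewards)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # normalized advantages
    grad = sum(a * e for a, e in zip(adv, kept_eps)) / (len(kept_eps) * sigma)
    step = lr * grad
    kl = np.sum(step ** 2) / (2.0 * sigma ** 2)    # Gaussian parameter-space KL estimate
    if kl > kl_limit:                              # trust region: downscale the step
        step *= np.sqrt(kl_limit / kl)
    return theta + step
```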
Global-Level: Meta-Controller Policy Gradient
The global meta-controller maintains a stochastic policy $\pi_\phi(a \mid z, h)$ over restoration specialists. At each step, after specialist execution and observing the IQA improvement $\Delta\mathrm{IQA}$, the policy is updated using the REINFORCE gradient $\nabla_\phi J = \big(\Delta\mathrm{IQA} - b(z, h)\big)\, \nabla_\phi \log \pi_\phi(a \mid z, h)$, where $b(z, h)$ is a trained baseline for variance reduction. Optionally, PPO-style clipping or further trust-region regularization can be used.
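The meta-controller update can be sketched as a standard REINFORCE step with a learned baseline, as below; the network sizes, shared optimizer, and value-loss weighting are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of the meta-controller's REINFORCE update with a learned baseline;
# architecture and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class MetaController(nn.Module):
    def __init__(self, state_dim=512, num_specialists=4):
        super().__init__()
        self.policy = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                    nn.Linear(256, num_specialists))
        self.baseline = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                      nn.Linear(256, 1))

    def forward(self, state):
        # Categorical distribution over restoration specialists
        return torch.distributions.Categorical(logits=self.policy(state))

def reinforce_step(controller, optimizer, state, action, iqa_gain):
    """One REINFORCE update from a single (state, action, IQA-gain) transition."""
    dist = controller(state)
    value = controller.baseline(state).squeeze(-1)
    advantage = iqa_gain - value.detach()           # baseline for variance reduction
    policy_loss = -(advantage * dist.log_prob(action))
    value_loss = (value - iqa_gain) ** 2            # regress the baseline toward the reward
    loss = (policy_loss + 0.5 * value_loss).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```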
This decoupling of adaptation (exploration in parameter space) from agent scheduling (exploration of model combinations) allows for simultaneous exploitation of specialist restoration expertise and efficient global model orchestration.
3. Cold-Start Initialization with Physics-Driven Ground Truth
The effectiveness of the dual-level architecture fundamentally depends on high-quality initialization of the specialist models. The HFLS-Weather dataset, comprising one million paired clean-degraded images generated via physics- and depth-aware synthesis, serves this purpose:
- Supervised pretraining: Each specialist $f_\theta$ minimizes a pixel-wise reconstruction loss $\mathcal{L}_{\mathrm{rec}}\big(f_\theta(x_d), x_c\big)$ ($x_c$ clean, $x_d$ degraded).
- Adversarial regularization: A small adversarial loss $\mathcal{L}_{\mathrm{adv}}$ on the restored output $f_\theta(x_d)$ encourages realism, with total loss $\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda\, \mathcal{L}_{\mathrm{adv}}$.
Each specialist thus receives a task-specific, data-driven cold start with strong generalization to diverse, real-world degradations.
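A minimal sketch of such a cold-start objective is shown below, assuming an L1 reconstruction term and a non-saturating adversarial term; `specialist`, `discriminator`, and `lambda_adv` are placeholder names, not the paper's API.

```python
# Sketch of the cold-start objective on HFLS-Weather pairs; the exact loss terms
# and the weight `lambda_adv` are illustrative assumptions.
import torch
import torch.nn.functional as F

def cold_start_loss(specialist, discriminator, degraded, clean, lambda_adv=0.01):
    restored = specialist(degraded)
    rec_loss = F.l1_loss(restored, clean)          # pixel-wise reconstruction term (L1 here)
    logits = discriminator(restored)               # adversarial term pushes outputs toward "real"
    adv_loss = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return rec_loss + lambda_adv * adv_loss        # total loss L = L_rec + lambda * L_adv
```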
4. Training and Inference: Local-Global RL Loop
Training Procedure
- Cold-Start: Pretrain all specialists on HFLS-Weather via the combined loss $\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda\, \mathcal{L}_{\mathrm{adv}}$.
- Initialization: Instantiate the global meta-controller $\pi_\phi$ and all local specialist parameters $\{\theta_i\}$.
- Epoch Loop (a code sketch follows this list):
  a. Sample a batch of real-world images $\{x_j\}$.
  b. For each $x_j$:
     i. The meta-controller proposes a specialist $a_j \sim \pi_\phi(\cdot \mid z_j, h)$.
     ii. Apply the local PIQO procedure: sample perturbations $\{\epsilon_k\}$, generate the perturbed outputs, compute rewards and advantages, and update $\theta_{a_j}$ with the scaled gradient.
     iii. Observe the IQA improvement and store $(z_j, a_j, \Delta\mathrm{IQA}_j)$ in a replay buffer.
  c. Update $\phi$ by the global policy gradient using the stored experience.
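One epoch of this loop might be tied together as in the following sketch, which reuses the `piqo_update`-style local adaptation and the `reinforce_step` sketch above; the specialist interface (`adapt`, `restore`, `name`), the scene encoder, and the running success-rate update are illustrative assumptions.

```python
# Sketch of one training epoch of the local-global RL loop; all interfaces
# (adapt/restore/name, encode_scene, composite_iqa) are illustrative placeholders.
def train_epoch(images, specialists, controller, optimizer,
                encode_scene, composite_iqa, reinforce_step, history):
    replay = []
    for x in images:                                    # a. sample real-world images
        state = encode_scene(x, history)                #    scene code + per-agent success rates
        action = controller(state).sample()             # b.i  meta-controller proposes a specialist
        spec = specialists[action.item()]
        before = composite_iqa(x)
        spec.adapt(x)                                   # b.ii local PIQO update of this specialist
        gain = composite_iqa(spec.restore(x)) - before  # b.iii observed IQA improvement
        history[spec.name] = 0.9 * history.get(spec.name, 0.0) + 0.1 * float(gain > 0)
        replay.append((state, action, gain))            #      store experience for the global update
    for state, action, gain in replay:                  # c. update the meta-controller
        reinforce_step(controller, optimizer, state, action, gain)
```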
Inference Procedure
At test time, the inference loop is non-perturbative and strictly greedy:
- Initialize $x_0 \leftarrow x$, $t \leftarrow 0$.
- While $t < T_{\max}$ and degradation is still detected:
  a. Compute the scene encoding $z_t$.
  b. Select $a_t = \arg\max_a \pi_\phi(a \mid z_t, h_t)$.
  c. Apply the specialist: $x_{t+1} = f_{\theta_{a_t}}(x_t)$.
  d. If the IQA score does not improve, remove $a_t$ from the candidate set; otherwise set $t \leftarrow t + 1$.
- Return the best restored image encountered (a code sketch follows this list).
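A sketch of this greedy inference loop is given below, reusing the illustrative interfaces from the training sketch; the "degradation detected" test is approximated here by simply checking whether any candidate specialists remain.

```python
# Sketch of the greedy, non-perturbative inference loop; interfaces and the
# termination test are illustrative assumptions.
def restore_image(x, specialists, controller, encode_scene, composite_iqa,
                  history, max_steps=4):
    candidates = list(specialists)
    best, best_score, t = x, composite_iqa(x), 0
    while t < max_steps and candidates:
        state = encode_scene(x, history)
        probs = controller(state).probs                 # greedy: pick the most probable candidate
        spec = max(candidates, key=lambda s: probs[s.index].item())
        out = spec.restore(x)
        score = composite_iqa(out)
        if score <= composite_iqa(x):                   # no IQA gain: discard this specialist
            candidates.remove(spec)
        else:                                           # accept the step and continue
            x, t = out, t + 1
            if score > best_score:
                best, best_score = out, score
    return best
```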
This design combines modular specialist selection at inference time with the sample-efficient, continual adaptation acquired during training.
5. Quantitative Evaluation and Empirical Insights
Cold-Start Source and Component Ablation
- Models pretrained on HFLS-Weather yield consistent Q-Align, CLIP-IQA, and MUSIQ gains over baselines trained on existing synthetic datasets (Table 3).
- Progressive addition of the meta-controller and PIQO consistently yields substantial CLIP-IQA and Q-Align improvements:
- For Snow: CLIP-IQA improves from $0.477$ (Basic) to $0.591$ (Full), with a corresponding Q-Align gain (Table 6).
State-of-the-Art Comparison
- On real-world Snow, Haze, Rain:
- The dual-level RL framework (DRL) achieves Snow Q-Align $3.9569$ vs. the next-best $3.6492$, and Rain CLIP-IQA $0.5623$ vs. the next-best $0.4656$.
- Outperforms previous bests from Chen et al., WGWS, PromptIR, OneRestore, DA-CLIP, DFPIR, and JarvisIR across all metrics (Tables 4–5).
Resource Efficiency and Latency
- Inference latency is $570$ ms (multi-agent) versus $17$–$208$ ms for single-model baselines; this is roughly an order of magnitude faster than other multi-agent systems (e.g., DA-CLIP: $6543$ ms, JarvisIR: $15250$ ms).
These results underscore that dual-level RL with cold-start and PIQO/meta-control bridges the empirical gap between synthetic training and true adverse weather generalization.
6. Significance, Generalization, and Limitations
The framework demonstrates that hierarchical RL decomposition—where parameter-space exploration (local) and agent selection order (global) are learned jointly—enables sample-efficient, label-free, and robust adaptation for challenging real-world restoration tasks. By leveraging a physics-based cold start, it avoids domain gap overfitting and supports continual learning without paired supervision or domain adaptation modules.
The approach is limited by the increased per-inference latency of multi-agent orchestration relative to single-model approaches, though this is amortized by the significant restoration quality improvements and is far lower than previous ensemble-based systems.
Potential future research directions include further reducing inference complexity via agent pruning, extending the dual-level RL paradigm to other sensor modalities, and generalizing the PIQO/meta-control scheme to non-vision tasks with analogous compositional specialist architectures.
In summary, the dual-level reinforcement learning framework achieves state-of-the-art generalization for real-world adverse weather restoration by combining high-fidelity cold-start pretraining, perturbation-driven local model adaptation, and global meta-control for agent scheduling and execution order selection, producing significant empirical gains in restoration quality and robustness over alternative single- and multi-agent baselines.