Salience-Invariant Consistent Policy Learning
- The paper introduces SCPL, a framework that mitigates spurious correlations in visual RL by applying saliency-guided regularization to focus on task-relevant features.
- It combines value consistency, dynamics modeling, and policy consistency modules to ensure robust predictions across clean and visually perturbed observations.
- Empirical evaluations on benchmarks like DMC-GB, Robotic Manipulation, and CARLA demonstrate significant performance gains, validating its modular invariance-driven approach.
Salience-Invariant Consistent Policy Learning (SCPL) is a rigorous framework for improving zero-shot generalization in visual deep reinforcement learning (RL). SCPL addresses the core challenge that policies trained on fixed visual environments often fail to generalize under changes in appearance (e.g., background, distractors, weather) due to the agent inadvertently exploiting spurious correlations. SCPL enforces invariance to visual salience shifts and consistency between policy/value predictions on both original and perturbed observations, and does so through a modular design encompassing value regularization, dynamics modeling, and policy consistency mechanisms (Sun et al., 12 Feb 2025).
1. Problem Formulation and Core Concepts
SCPL is formulated in the context of visually partially observable Markov decision processes, where the agent receives image observations $o_t$, pixel renderings of latent states $s_t$. The principal generalization problem arises when training data and test-time perturbations (via augmentations such as random convolution or overlays) differ in task-irrelevant visual content. The key objective is to learn policies $\pi_\theta(a \mid f_\phi(o))$ robust to such visual domain shifts, where $f_\phi$ is a convolutional encoder and $\theta$ are the policy parameters.
Central to SCPL is the notion of visual salience: a saliency map $M(o)$ identifies high-task-relevance pixels by thresholding the magnitude of the gradient of the Q-value with respect to the input observation. This supports regularization ensuring that value predictions, and hence the policy, are driven by task-relevant image features only.
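The thresholding step can be sketched with a toy linear critic, so the input gradient is available in closed form; the quantile-based threshold and the zero-masking rule below are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def saliency_mask(obs, q_weights, quantile=0.75):
    """Gradient-based saliency: for a toy linear critic Q(o) = w . o,
    the per-pixel gradient |dQ/do| is just |w|.  Pixels whose gradient
    magnitude exceeds the given quantile are marked task-relevant."""
    grad = np.abs(q_weights)                      # |dQ/do| for the linear toy Q
    threshold = np.quantile(grad, quantile)       # data-driven threshold
    return (grad >= threshold).astype(obs.dtype)  # binary mask M(o)

# Toy example: a 4x4 "image" whose critic weights concentrate on one corner.
rng = np.random.default_rng(0)
obs = rng.random((4, 4))
w = np.zeros((4, 4))
w[:2, :2] = 1.0                                   # task-relevant region
mask = saliency_mask(obs, w, quantile=0.75)
masked_obs = obs * mask                           # non-salient pixels zeroed
```

With a deep critic, the same mask would be obtained by backpropagating the Q-value to the input pixels instead of reading off linear weights.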
2. Algorithmic Structure and Learning Objectives
The SCPL training objective is a sum of three principal modules, each targeting a distinct failure mode in visual RL generalization:
a) Value Consistency Module
The value consistency module enforces invariance of the critic across clean and perturbed images and focuses value predictions on salient, task-relevant regions. It introduces the following components:
- Standard Bellman losses on both the clean observation $o$ and its augmented counterpart $o^{\text{aug}}$:

$$\mathcal{L}_{Q} = \mathbb{E}\Big[\big(Q_\theta(f_\phi(o), a) - y\big)^2 + \big(Q_\theta(f_\phi(o^{\text{aug}}), a) - y\big)^2\Big],$$

where $y = r + \gamma\,\mathbb{E}_{a' \sim \pi}\big[Q_{\bar\theta}(f_{\bar\phi}(o'), a') - \alpha \log \pi(a' \mid o')\big]$ is the shared soft Bellman target.

- Saliency-guided consistency losses:

$$\mathcal{L}_{\text{sal}} = \mathbb{E}\Big[\big(Q_\theta(f_\phi(o \odot M(o)), a) - Q_\theta(f_\phi(o), a)\big)^2\Big],$$

$$\mathcal{L}_{\text{sal}}^{\text{aug}} = \mathbb{E}\Big[\big(Q_\theta(f_\phi(o^{\text{aug}} \odot M(o)), a) - Q_\theta(f_\phi(o^{\text{aug}}), a)\big)^2\Big],$$

enforcing that the Q-value is unchanged when masking out non-salient pixels.

The total value loss is

$$\mathcal{L}_{V} = \mathcal{L}_{Q} + \lambda_{\text{sal}}\big(\mathcal{L}_{\text{sal}} + \mathcal{L}_{\text{sal}}^{\text{aug}}\big),$$

where $\lambda_{\text{sal}}$ is a hyperparameter controlling the saliency loss strength.
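Numerically, the value-consistency objective reduces to a weighted sum of squared errors. A minimal sketch, assuming squared-error forms for both the Bellman and saliency terms and a hypothetical weight `lam`:

```python
import numpy as np

def value_consistency_loss(q_clean, q_aug, q_clean_masked, q_aug_masked,
                           td_target, lam=0.5):
    """Toy version of the value-consistency objective: TD (Bellman) errors on
    clean and augmented Q-values, plus saliency-masked consistency terms
    weighted by lam (a stand-in for the paper's saliency hyperparameter)."""
    bellman = np.mean((q_clean - td_target) ** 2) \
            + np.mean((q_aug - td_target) ** 2)
    sal = np.mean((q_clean_masked - q_clean) ** 2) \
        + np.mean((q_aug_masked - q_aug) ** 2)
    return bellman + lam * sal

q = np.array([1.0, 2.0])
perfect = value_consistency_loss(q, q, q, q, td_target=q)      # all terms zero
drift = value_consistency_loss(q, q + 1.0, q, q, td_target=q)  # augmented view penalized
```

Note that an inconsistent augmented prediction (`q + 1.0`) is penalized twice: once through its TD error and once through its masked-consistency term.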
b) Dynamics Module
To avoid representations focusing on static or reward-irrelevant features, SCPL utilizes a dynamics module with a next-embedding predictor $h_\psi$ and a reward predictor $g_\psi$. The objective jointly trains on clean and perturbed data:

$$\mathcal{L}_{\text{dyn}} = \sum_{\tilde{o} \in \{o,\, o^{\text{aug}}\}} \mathbb{E}\Big[\big\|h_\psi(f_\phi(\tilde{o}), a) - f_{\bar\phi}(o')\big\|_2^2 + \big(g_\psi(f_\phi(\tilde{o}), a) - r\big)^2\Big].$$

This module forces the encoder $f_\phi$ to capture information pertinent to the environment’s controllable and reward-relevant structure.
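The dynamics objective can be illustrated with linear stand-ins for the two heads (the actual module would use small MLPs over encoder features; the weight matrices below are hypothetical):

```python
import numpy as np

def dynamics_loss(z, a, z_next, r, W_h, w_g):
    """Toy dynamics module with linear heads: W_h predicts the next embedding
    from the concatenated (embedding, action) vector, w_g predicts the
    scalar reward."""
    za = np.concatenate([z, a])
    z_pred = W_h @ za                    # next-embedding prediction h(z, a)
    r_pred = w_g @ za                    # reward prediction g(z, a)
    return np.mean((z_pred - z_next) ** 2) + (r_pred - r) ** 2

# A predictor that copies the current embedding and reads the reward off the
# action channel incurs zero loss on this hand-built transition.
z, a = np.array([1.0, 2.0]), np.array([0.5])
W_h = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0]])
w_g = np.array([0.0, 0.0, 2.0])
loss = dynamics_loss(z, a, z_next=z, r=1.0, W_h=W_h, w_g=w_g)  # 0.0
```

Summing this loss over clean and augmented embeddings, as in the objective above, ties both views of the observation to the same predictive structure.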
c) Policy Consistency Module
A core theoretical insight is that generalization to unseen perturbations is limited by policy divergence across domains. SCPL constrains this using a KL divergence penalty:
$$\mathcal{L}_{\text{KL}} = \mathbb{E}\Big[D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid f_\phi(o)) \,\big\|\, \pi_\theta(\cdot \mid f_\phi(o^{\text{aug}}))\big)\Big],$$

which is combined with the standard maximum entropy RL loss:

$$\mathcal{L}_{\text{SAC}} = \mathbb{E}\big[\alpha \log \pi_\theta(a \mid f_\phi(o)) - Q_\theta(f_\phi(o), a)\big],$$

forming the policy loss

$$\mathcal{L}_{\pi} = \mathcal{L}_{\text{SAC}} + \beta\,\mathcal{L}_{\text{KL}},$$

with $\beta$ tuning the consistency regularization.
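For a SAC-style policy with a diagonal Gaussian action head, the KL term has a closed form. A sketch, assuming that parameterization and illustrative coefficients `alpha` and `beta`:

```python
import numpy as np

def diag_gauss_kl(mu1, sig1, mu2, sig2):
    """Closed-form KL divergence between two diagonal Gaussian action
    distributions, the usual parameterization of a SAC policy head."""
    return float(np.sum(np.log(sig2 / sig1)
                        + (sig1 ** 2 + (mu1 - mu2) ** 2) / (2 * sig2 ** 2)
                        - 0.5))

def actor_loss(q, log_pi, kl, alpha=0.2, beta=1.0):
    """Max-entropy SAC actor loss plus the consistency penalty beta * KL."""
    return np.mean(alpha * log_pi - q) + beta * kl

mu, sig = np.array([0.1, -0.3]), np.array([1.0, 2.0])
same = diag_gauss_kl(mu, sig, mu, sig)           # identical policies -> 0
shifted = diag_gauss_kl(mu, sig, mu + 0.5, sig)  # cross-domain drift penalized
```

Driving `shifted` toward zero is exactly what the consistency penalty asks of the policy on clean versus augmented observations.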
d) Overall Objective
All modules are trained jointly, solving
$$\min_{\phi,\,\theta,\,\psi}\; \mathcal{L}_{V} + \mathcal{L}_{\text{dyn}} + \mathcal{L}_{\pi},$$

with hyperparameters $\lambda_{\text{sal}}$, $\beta$, and others per module.
3. Theoretical Properties
SCPL’s policy consistency regularizer is underpinned by a generalization bound. If the maximal per-embedding KL divergence between policies on clean and perturbed observations satisfies

$$\max_{o}\, D_{\mathrm{KL}}\big(\pi(\cdot \mid f_\phi(o)) \,\big\|\, \pi(\cdot \mid f_\phi(o^{\text{aug}}))\big) \le \epsilon,$$

then the return gap is upper bounded:

$$\big|J(\pi) - J^{\text{aug}}(\pi)\big| \le \frac{\sqrt{2\epsilon}\,\gamma\, A_{\max}}{(1-\gamma)^2},$$

where $A_{\max} = \max_{s,a}\lvert A^{\pi}(s,a)\rvert$ is the maximal advantage and $\gamma$ is the discount factor. This result motivates KL control as a direct constraint on zero-shot generalization error.
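A quick numeric reading of the bound, assuming a Pinsker/TRPO-style form $\sqrt{2\epsilon}\,\gamma A_{\max}/(1-\gamma)^2$, shows how shrinking the cross-domain KL tightens the zero-shot guarantee:

```python
def return_gap_bound(eps_kl, a_max, gamma):
    """Upper bound on the clean-vs-perturbed return gap under the assumed
    form: sqrt(2 * eps) * gamma * A_max / (1 - gamma)^2."""
    return (2 * eps_kl) ** 0.5 * gamma * a_max / (1 - gamma) ** 2

# A 100x reduction in policy KL shrinks the bound by a factor of 10.
loose = return_gap_bound(eps_kl=0.5, a_max=1.0, gamma=0.99)
tight = return_gap_bound(eps_kl=0.005, a_max=1.0, gamma=0.99)
```

The $(1-\gamma)^{-2}$ dependence also explains why the penalty matters most in long-horizon tasks, where $\gamma$ is close to 1.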
4. Training Algorithm
The practical implementation of SCPL integrates seamlessly with off-policy SAC. The main steps per iteration are:
1. Sample a batch of transitions from the replay buffer and generate augmented views of the observations (e.g., random convolution or overlay).
2. Compute gradient-based saliency masks from the critic for both views.
3. Update the critic with Bellman losses on clean and augmented observations plus the saliency-masked consistency terms.
4. Update the dynamics module (next-embedding and reward predictors) on both views.
5. Update the actor with the maximum entropy loss plus the KL consistency penalty.
6. Soft-update the target networks.

This algorithm leverages off-policy optimization, saliency-guided augmentation, and joint objective minimization.
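One iteration's data flow can be sketched end to end with linear stand-ins for every network; all weights, shapes, and the unit-variance Gaussian policy head below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

# Linear stand-ins for the critic, policy, and dynamics heads.
w_q = rng.standard_normal(DIM)            # critic weights
w_pi = rng.standard_normal(DIM)           # policy mean head
W_h = rng.standard_normal((DIM, DIM))     # next-embedding predictor
w_r = rng.standard_normal(DIM)            # reward predictor

def augment(o):
    """Stand-in for overlay / random-convolution augmentation."""
    return o + 0.1 * rng.standard_normal(o.shape)

def saliency_mask(quantile=0.75):
    """Binary mask from critic gradients; |dQ/do| = |w_q| for a linear critic."""
    g = np.abs(w_q)
    return (g >= np.quantile(g, quantile)).astype(float)

def scpl_step(o, z_next, r, td_target, lam=0.5, beta=1.0):
    """One schematic SCPL iteration returning the three module losses."""
    o_aug, m = augment(o), saliency_mask()
    # (1) value consistency: Bellman on both views + saliency-masked terms
    q, q_aug = w_q @ o, w_q @ o_aug
    value = (q - td_target) ** 2 + (q_aug - td_target) ** 2 \
        + lam * ((w_q @ (o * m) - q) ** 2 + (w_q @ (o_aug * m) - q_aug) ** 2)
    # (2) dynamics: next-embedding and reward prediction on both views
    dyn = sum(np.mean((W_h @ v - z_next) ** 2) + (w_r @ v - r) ** 2
              for v in (o, o_aug))
    # (3) actor: value term plus KL consistency (unit-variance Gaussian heads)
    actor = -q + beta * 0.5 * (w_pi @ o - w_pi @ o_aug) ** 2
    return value, dyn, actor

o = rng.standard_normal(DIM)
value, dyn, actor = scpl_step(o, z_next=np.zeros(DIM), r=0.0, td_target=0.0)
total = value + dyn + actor
```

In a real implementation each loss would be minimized by gradient descent on its own module's parameters, with the encoder shared across all three.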
5. Empirical Evaluation
SCPL’s effectiveness is demonstrated on several challenging control benchmarks:
| Benchmark | SCPL Avg. Score | Best Baseline | Relative Gain |
|---|---|---|---|
| DMC-GB Video Hard | 853 | ~747 | 14% |
| Robotic Manipulation | 65.1 | 42.0 | 39% |
| CARLA (Test Avg.) | 352 | 208 | 69% |
SCPL is trained in a single visually-clean environment for 200K–500K steps and tested zero-shot in held-out visually shifted scenarios (e.g., novel backgrounds, weather). It consistently outperforms previous methods (SVEA, SGQN, MaDi, CNSN), validating the necessity of all three modules. Saliency-invariant value regularization plus dynamics-guided embedding and explicit policy consistency constitute an empirically robust approach to achieving high generalization in visual RL (Sun et al., 12 Feb 2025).
6. Relation to Invariant and Causal Policy Learning
The design of SCPL aligns with insights from causal-invariance literature (Saengkyongam et al., 2021). A policy is robust if it relies on d-invariant features—coordinates of the observation whose reward-relevance is stable across environments. Saliency-guided masking in SCPL empirically restricts evaluation and policy training to such invariant features by construction, sidestepping failure modes where agents overfit to task-irrelevant visual cues that shift across environments. This foundational link ensures that, under appropriate causal assumptions, SCPL-trained policies are maximally robust within the class of invariant policies.
7. Extensions and Variants
Salience-invariant and consistent policy learning has also been extended to unsupervised RL and successor representation settings, as in "Saliency-Guided Representation with Consistency Policy Learning" where an auxiliary saliency-guided dynamics head and one-step consistent policy estimation enable robust generalization in the zero-shot unsupervised regime (Sun et al., 7 Apr 2026). This variant further demonstrates the modularity and adaptability of the SCPL blueprint: saliency-guided representation learning, decoupled objectives, and policy consistency penalties together provide systematic generalization improvements across disparate control challenges.
8. Summary and Significance
SCPL advances zero-shot generalization in visual RL by regularizing value, dynamics, and policy modules to focus on task- and dynamics-relevant, saliency-invariant features. The approach leverages gradient-based saliency masking, data augmentation, and explicit policy regularization via KL divergence. The resulting framework is theoretically guaranteed—under bounded KL divergence—to limit generalization error, and experimentally establishes new state-of-the-art results in both simulated control and real-world-inspired robotics and driving benchmarks. Its principles integrate both classical invariance ideas and modern deep RL techniques, opening avenues for robust and transferable visuomotor agents (Sun et al., 12 Feb 2025).