
Salience-Invariant Consistent Policy Learning

Updated 22 April 2026
  • The paper introduces SCPL, a framework that mitigates spurious correlations in visual RL by applying saliency-guided regularization to focus on task-relevant features.
  • It combines value consistency, dynamics modeling, and policy consistency modules to ensure robust predictions across clean and visually perturbed observations.
  • Empirical evaluations on benchmarks like DMC-GB, Robotic Manipulation, and CARLA demonstrate significant performance gains, validating its modular invariance-driven approach.

Salience-Invariant Consistent Policy Learning (SCPL) is a rigorous framework for improving zero-shot generalization in visual deep reinforcement learning (RL). SCPL addresses the core challenge that policies trained on fixed visual environments often fail to generalize under changes in appearance (e.g., background, distractors, weather) due to the agent inadvertently exploiting spurious correlations. SCPL enforces invariance to visual salience shifts and consistency between policy/value predictions on both original and perturbed observations, and does so through a modular design encompassing value regularization, dynamics modeling, and policy consistency mechanisms (Sun et al., 12 Feb 2025).

1. Problem Formulation and Core Concepts

SCPL is formulated in the context of visual partially observable Markov decision processes, $M=\langle S,O,A,P,r,\gamma\rangle$, where the agent receives image observations $x\in O$, pixel renderings of latent states $s\in S$. The principal generalization problem arises when the training observations $x$ and test-time perturbations $\tilde x$ (produced via augmentations such as random convolution or overlays) differ in task-irrelevant visual content. The key objective is to learn policies $\pi_\psi(\cdot\mid\phi(x))$ that are robust to such visual domain shifts, where $\phi$ is a convolutional encoder and $\psi$ are the policy parameters.
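
As a concrete illustration of this perturbation model, the sketch below implements the two augmentation families named above (random convolution and overlay) in PyTorch. The tensor shapes, mixing weight, and output squashing are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch of the two augmentations mentioned above (random convolution
# and overlay). Shapes and the mixing weight alpha are illustrative assumptions.
import torch
import torch.nn.functional as F

def random_conv(x: torch.Tensor) -> torch.Tensor:
    """Pass a batch of images (B, C, H, W) through a freshly sampled 3x3 conv."""
    b, c, h, w = x.shape
    weight = torch.randn(c, c, 3, 3, device=x.device) / (3 * 3 * c) ** 0.5
    return torch.sigmoid(F.conv2d(x, weight, padding=1))  # keep outputs in [0, 1]

def overlay(x: torch.Tensor, distractor: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Blend each observation with an unrelated image (e.g. a natural-video frame)."""
    return alpha * x + (1.0 - alpha) * distractor

if __name__ == "__main__":
    obs = torch.rand(8, 9, 84, 84)        # frame-stacked RGB observations
    distract = torch.rand_like(obs)       # stand-in for distractor frames
    x_aug = overlay(random_conv(obs), distract)
    print(x_aug.shape)                    # torch.Size([8, 9, 84, 84])
```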

Central to SCPL is the notion of visual salience: a saliency map $S(x)$ identifies task-relevant pixels by thresholding the gradient magnitude $|\nabla_x Q_\zeta(\phi(x),a)|$ of the Q-value with respect to the input image. This supports regularization that ensures value predictions, and hence the policy, are driven only by task-relevant image features.
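
A minimal sketch of how such a gradient-based saliency mask can be computed is shown below; the encoder/critic callables and the top-quantile threshold are assumptions for illustration, not the authors' exact procedure.

```python
# Gradient-based saliency mask: take |dQ/dx| for the chosen action and keep
# the top-quantile pixels. Encoder/critic interfaces and the 0.95 quantile
# are illustrative assumptions.
import torch

def saliency_mask(encoder, critic, x, a, quantile: float = 0.95):
    x = x.clone().requires_grad_(True)
    q = critic(encoder(x), a).sum()               # scalar so we can backprop
    (grad,) = torch.autograd.grad(q, x)
    score = grad.abs().amax(dim=1, keepdim=True)  # per-pixel relevance (B,1,H,W)
    thresh = torch.quantile(score.flatten(1), quantile, dim=1).view(-1, 1, 1, 1)
    return (score >= thresh).float()              # binary mask S(x)

# Masking out non-salient pixels: x_masked = x * saliency_mask(encoder, critic, x, a)
```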

2. Algorithmic Structure and Learning Objectives

The SCPL training objective is a sum of three principal modules, each targeting a distinct failure mode in visual RL generalization:

a) Value Consistency Module

The value consistency module enforces invariance of the critic across clean and perturbed images and focuses value predictions on salient, task-relevant regions. It introduces the following components:

  • Standard Bellman losses on both the clean observation $x$ and the augmented observation $\tilde x$:

$$\mathcal{L}_Q(x) = \mathbb{E}\bigl[\bigl(Q_\zeta(\phi(x),a) - y\bigr)^2\bigr], \qquad \mathcal{L}_Q(\tilde x) = \mathbb{E}\bigl[\bigl(Q_\zeta(\phi(\tilde x),a) - y\bigr)^2\bigr],$$

where $y = r + \gamma\,\mathbb{E}_{a'\sim\pi_\psi}\bigl[Q_{\bar\zeta}(\phi(x'),a') - \alpha\log\pi_\psi(a'\mid\phi(x'))\bigr]$ is the soft Bellman target computed from the clean next observation $x'$ with target critic $Q_{\bar\zeta}$.

  • Saliency-guided consistency losses:

$$\mathcal{L}_{\mathrm{sal}}(x) = \mathbb{E}\Bigl[\bigl(Q_\zeta(\phi(x\odot S(x)),a) - Q_\zeta(\phi(x),a)\bigr)^2\Bigr],$$

$$\mathcal{L}_{\mathrm{sal}}(\tilde x) = \mathbb{E}\Bigl[\bigl(Q_\zeta(\phi(\tilde x\odot S(x)),a) - Q_\zeta(\phi(x),a)\bigr)^2\Bigr],$$

enforcing that the Q-value is unchanged when masking out non-salient pixels.

The total value loss is

$$\mathcal{L}_{\mathrm{value}} = \mathcal{L}_Q(x) + \mathcal{L}_Q(\tilde x) + \lambda\bigl(\mathcal{L}_{\mathrm{sal}}(x) + \mathcal{L}_{\mathrm{sal}}(\tilde x)\bigr),$$

where $\lambda$ is a hyperparameter controlling the strength of the saliency consistency terms.
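
The sketch below assembles this value objective as reconstructed above (Bellman terms on clean and augmented views plus the saliency consistency terms). All module interfaces, tensor shapes, and the weights gamma, alpha, and lam are illustrative assumptions; actor_dist(z) is assumed to return a torch Distribution over actions.

```python
# Value-consistency objective sketch: Bellman losses on clean and augmented
# observations plus saliency terms tying Q(x * S(x)) to Q(x). Names, shapes,
# and weights are assumptions for illustration.
import torch
import torch.nn.functional as F

def value_loss(encoder, critic, target_critic, actor_dist, batch, x_aug,
               saliency_mask, gamma: float = 0.99, alpha: float = 0.1, lam: float = 1.0):
    x, a, r, x_next, done = batch                      # r and done assumed shape (B,)

    with torch.no_grad():
        # Soft Bellman target from the clean next observation only.
        z_next = encoder(x_next)
        pi_next = actor_dist(z_next)
        a_next = pi_next.sample()
        logp_next = pi_next.log_prob(a_next).sum(-1)
        y = r + gamma * (1.0 - done) * (target_critic(z_next, a_next).squeeze(-1)
                                        - alpha * logp_next)

    # Bellman losses on the clean and the augmented observation.
    bellman = F.mse_loss(critic(encoder(x), a).squeeze(-1), y) \
            + F.mse_loss(critic(encoder(x_aug), a).squeeze(-1), y)

    # Saliency consistency: masking out non-salient pixels should not change Q.
    mask = saliency_mask(encoder, critic, x, a)
    q_ref = critic(encoder(x), a).squeeze(-1).detach()
    sal = F.mse_loss(critic(encoder(x * mask), a).squeeze(-1), q_ref) \
        + F.mse_loss(critic(encoder(x_aug * mask), a).squeeze(-1), q_ref)

    return bellman + lam * sal
```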

b) Dynamics Module

To avoid representations that focus on static or reward-irrelevant features, SCPL utilizes a dynamics module with a next-embedding predictor $h$ and a reward predictor $g$. The objective jointly trains on clean and perturbed data:

$$\mathcal{L}_{\mathrm{dyn}} = \sum_{o\in\{x,\tilde x\}} \mathbb{E}\Bigl[\bigl\|h(\phi(o),a) - \mathrm{sg}\bigl(\phi(x')\bigr)\bigr\|_2^2 + \bigl(g(\phi(o),a) - r\bigr)^2\Bigr],$$

where $\mathrm{sg}(\cdot)$ denotes a stop-gradient. This module forces the encoder $\phi$ to capture information pertinent to the environment’s controllable and reward-relevant structure.
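
A possible rendering of this module is sketched below; the two-layer MLP heads, latent dimensions, and the stop-gradient target are assumptions consistent with the description, not the authors' architecture.

```python
# Dynamics-module sketch: latent transition and reward prediction on both
# clean and perturbed observations. Architecture details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicsHeads(nn.Module):
    """Latent transition and reward predictors on top of the shared encoder."""
    def __init__(self, z_dim: int, a_dim: int, hidden: int = 256):
        super().__init__()
        self.transition = nn.Sequential(
            nn.Linear(z_dim + a_dim, hidden), nn.ReLU(), nn.Linear(hidden, z_dim))
        self.reward = nn.Sequential(
            nn.Linear(z_dim + a_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

def dynamics_loss(encoder, heads, x, x_aug, a, r, x_next):
    z_next = encoder(x_next).detach()            # stop-gradient target embedding
    loss = x.new_zeros(())
    for obs in (x, x_aug):                       # train on clean and perturbed views
        za = torch.cat([encoder(obs), a], dim=-1)
        loss = loss + F.mse_loss(heads.transition(za), z_next) \
                    + F.mse_loss(heads.reward(za).squeeze(-1), r)
    return loss
```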

c) Policy Consistency Module

A core theoretical insight is that generalization to unseen perturbations is limited by policy divergence across domains. SCPL constrains this using a KL divergence penalty:

$$\mathcal{L}_{\mathrm{cons}} = \mathbb{E}_x\Bigl[D_{\mathrm{KL}}\bigl(\pi_\psi(\cdot\mid\phi(x))\,\big\|\,\pi_\psi(\cdot\mid\phi(\tilde x))\bigr)\Bigr],$$

which is combined with the standard maximum entropy RL loss:

$$\mathcal{L}_{\mathrm{SAC}} = \mathbb{E}_{x,\,a\sim\pi_\psi}\bigl[\alpha\log\pi_\psi(a\mid\phi(x)) - Q_\zeta(\phi(x),a)\bigr],$$

forming the policy loss

$$\mathcal{L}_{\pi} = \mathcal{L}_{\mathrm{SAC}} + \beta\,\mathcal{L}_{\mathrm{cons}},$$

with $\beta$ tuning the strength of the consistency regularization.
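
The following sketch combines the SAC actor term with the KL consistency penalty described above; the diagonal-Gaussian policy head and the detach on the clean branch are illustrative design assumptions.

```python
# Policy objective sketch: SAC actor loss on the clean view plus a KL penalty
# pulling the policy on the augmented view toward the (detached) clean policy.
from torch.distributions import kl_divergence

def policy_loss(encoder, actor_dist, critic, x, x_aug,
                alpha: float = 0.1, beta: float = 1.0):
    """actor_dist(z) is assumed to return a diagonal Gaussian (Normal) over actions."""
    z, z_aug = encoder(x), encoder(x_aug)

    # Standard maximum-entropy (SAC) actor loss on the clean observation.
    pi = actor_dist(z)
    a = pi.rsample()                                   # reparameterized action
    sac = (alpha * pi.log_prob(a).sum(-1) - critic(z, a).squeeze(-1)).mean()

    # Policy consistency across clean and perturbed views; detaching the clean
    # branch is a design choice so only the augmented branch is pulled in.
    consistency = kl_divergence(actor_dist(z.detach()), actor_dist(z_aug)).sum(-1).mean()

    return sac + beta * consistency
```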

d) Overall Objective

All modules are trained jointly, solving

$$\min_{\phi,\,\zeta,\,\psi,\,h,\,g}\;\mathcal{L}_{\mathrm{SCPL}} = \mathcal{L}_{\mathrm{value}} + \mathcal{L}_{\mathrm{dyn}} + \mathcal{L}_{\pi},$$

with hyperparameters $\lambda$ and $\beta$ weighting the saliency and consistency terms in their respective modules.

3. Theoretical Properties

SCPL’s policy consistency regularizer is underpinned by a generalization bound. Let $\epsilon = \max_{x} D_{\mathrm{KL}}\bigl(\pi_\psi(\cdot\mid\phi(x))\,\|\,\pi_\psi(\cdot\mid\phi(\tilde x))\bigr)$ denote the maximal per-embedding KL divergence between the policies induced by clean and perturbed observations. Then the gap in expected return between the two domains is upper bounded:

$$\bigl|\,\eta(\pi_\psi) - \tilde\eta(\pi_\psi)\,\bigr| \;\le\; \frac{2\gamma\,A_{\max}}{(1-\gamma)^2}\,\sqrt{\frac{\epsilon}{2}},$$

where $\eta$ and $\tilde\eta$ are the expected returns under clean and perturbed observations, $A_{\max}$ is the maximal absolute advantage, and $\gamma$ is the discount factor (the $\sqrt{\epsilon/2}$ factor converts the KL bound to total variation via Pinsker’s inequality). This result motivates KL control as a direct constraint on zero-shot generalization error.
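
The exact constants depend on the paper's theorem; assuming the bound takes the form reconstructed above, the toy calculation below (arbitrary values for epsilon, A_max, and gamma) illustrates how tightening the KL budget shrinks the guaranteed return gap.

```python
# Numerical illustration, assuming the bound has the reconstructed form
# 2*gamma*A_max/(1-gamma)^2 * sqrt(eps/2). All numbers are arbitrary.
def return_gap_bound(eps_kl: float, a_max: float, gamma: float) -> float:
    return 2.0 * gamma * a_max / (1.0 - gamma) ** 2 * (eps_kl / 2.0) ** 0.5

for eps in (0.1, 0.01, 0.001):
    print(f"eps={eps:<6} bound={return_gap_bound(eps, a_max=1.0, gamma=0.99):.2f}")
```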

4. Training Algorithm

The practical implementation of SCPL integrates seamlessly with off-policy SAC. The main steps per iteration are:

  1. Sample a batch of transitions $(x, a, r, x', \mathrm{done})$ from the replay buffer.
  2. Generate perturbed observations $\tilde x$ via data augmentation (e.g., random convolution or overlay) and compute the saliency mask $S(x)$ from the critic’s input gradients.
  3. Compute the value loss $\mathcal{L}_{\mathrm{value}}$, dynamics loss $\mathcal{L}_{\mathrm{dyn}}$, and policy loss $\mathcal{L}_{\pi}$ on both clean and perturbed data.
  4. Update the encoder, critic, dynamics heads, and actor on the joint objective, and Polyak-average the target critic.

This algorithm leverages off-policy optimization, saliency-guided augmentation, and joint objective minimization.
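
Putting the pieces together, a hypothetical per-iteration update that reuses the sketches from Section 2 might look as follows. The single optimizer over all modules and the Polyak coefficient tau are simplifications; real SAC-style implementations typically update the actor with the critic parameters frozen.

```python
# One SCPL-style training iteration on top of SAC, combining the three module
# losses sketched in Sections 2a-2c. Buffer/network/optimizer names are
# illustrative assumptions, not the authors' implementation.
import torch

def scpl_train_step(buffer, encoder, critic, target_critic, actor_dist, dyn_heads,
                    optimizer, augment, saliency_mask, tau: float = 0.01):
    # 1) Sample transitions and build the augmented view.
    batch = buffer.sample()                       # (x, a, r, x_next, done)
    x, a, r, x_next, done = batch
    x_aug = augment(x)

    # 2) Joint SCPL objective: value + dynamics + policy modules.
    loss = value_loss(encoder, critic, target_critic, actor_dist, batch,
                      x_aug, saliency_mask) \
         + dynamics_loss(encoder, dyn_heads, x, x_aug, a, r, x_next) \
         + policy_loss(encoder, actor_dist, critic, x, x_aug)

    # 3) One gradient step on all modules (simplified: a single optimizer).
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 4) Polyak-average the target critic.
    with torch.no_grad():
        for p, tp in zip(critic.parameters(), target_critic.parameters()):
            tp.mul_(1.0 - tau).add_(tau * p)
    return loss.item()
```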

5. Empirical Evaluation

SCPL’s effectiveness is demonstrated on several challenging control benchmarks:

Benchmark               SCPL Avg. Score    Best Baseline    Relative Gain
DMC-GB Video Hard       853                ~747             +14%
Robotic Manipulation    65.1               42.0             +39%
CARLA (Test Avg.)       352                208              +69%

SCPL is trained in a single visually-clean environment for 200K–500K steps and tested zero-shot in held-out visually shifted scenarios (e.g., novel backgrounds, weather). It consistently outperforms previous methods (SVEA, SGQN, MaDi, CNSN), validating the necessity of all three modules. Saliency-invariant value regularization plus dynamics-guided embedding and explicit policy consistency constitute an empirically robust approach to achieving high generalization in visual RL (Sun et al., 12 Feb 2025).

6. Relation to Invariant and Causal Policy Learning

The design of SCPL aligns with insights from causal-invariance literature (Saengkyongam et al., 2021). A policy is robust if it relies on d-invariant features—coordinates of the observation whose reward-relevance is stable across environments. Saliency-guided masking in SCPL empirically restricts evaluation and policy training to such invariant features by construction, sidestepping failure modes where agents overfit to task-irrelevant visual cues that shift across environments. This foundational link ensures that, under appropriate causal assumptions, SCPL-trained policies are maximally robust within the class of invariant policies.

7. Extensions and Variants

Salience-invariant and consistent policy learning has also been extended to unsupervised RL and successor representation settings, as in "Saliency-Guided Representation with Consistency Policy Learning" where an auxiliary saliency-guided dynamics head and one-step consistent policy estimation enable robust generalization in the zero-shot unsupervised regime (Sun et al., 7 Apr 2026). This variant further demonstrates the modularity and adaptability of the SCPL blueprint: saliency-guided representation learning, decoupled objectives, and policy consistency penalties together provide systematic generalization improvements across disparate control challenges.

8. Summary and Significance

SCPL advances zero-shot generalization in visual RL by regularizing value, dynamics, and policy modules to focus on task- and dynamics-relevant, saliency-invariant features. The approach leverages gradient-based saliency masking, data augmentation, and explicit policy regularization via KL divergence. The resulting framework is theoretically guaranteed—under bounded KL divergence—to limit generalization error, and experimentally establishes new state-of-the-art results in both simulated control and real-world-inspired robotics and driving benchmarks. Its principles integrate both classical invariance ideas and modern deep RL techniques, opening avenues for robust and transferable visuomotor agents (Sun et al., 12 Feb 2025).
