Contrastive Successor Features in RL

Updated 3 July 2026
  • Contrastive Successor Features (CSF) is a unifying RL framework that integrates unsupervised skill discovery, contrastive representation learning, and temporal abstraction via successor features.
  • It employs contrastive mutual information maximization and InfoNCE losses to induce quasimetric temporal distances, enabling efficient exploration and combinatorial generalization.
  • Empirical findings show that CSF matches or surpasses prior state-of-the-art methods on continuous-control benchmarks, highlighting its potential for planning and hierarchical control.

Contrastive Successor Features (CSF) is a unifying framework in reinforcement learning (RL) that connects unsupervised skill discovery, contrastive representation learning, and temporal abstraction through the lens of successor features. CSF achieves efficient exploration, skill diversity, and control via a combination of contrastive mutual information maximization and successor feature critics, and it induces quasimetric temporal distances between states that directly support combinatorial generalization and planning. The following sections synthesize the technical formulation, methodological principles, algorithmic details, theoretical foundations, empirical findings, and open questions, drawing primarily from the formalizations in "Can a MISL Fly?" (Zheng et al., 2024) and "Learning Temporal Distances: Contrastive Successor Features Can Provide a Metric Structure for Decision-Making" (Myers et al., 2024).

1. Formal Definition: Successor Features in Contrastive Skill Learning

CSF combines classical successor features with modern contrastive learning objectives for unsupervised RL and goal-conditioned control. Let $\varphi: \mathcal{S} \to \mathbb{R}^d$ be a learned embedding mapping states into a $d$-dimensional feature space. For a state transition $s \rightarrow s'$, the instantaneous feature is defined as

$$f(s, s') \triangleq \varphi(s') - \varphi(s) \in \mathbb{R}^d.$$

A skill variable $z \in \mathbb{R}^d$ parameterizes intrinsic rewards via the inner product

$$r(s, s'; z) = [\varphi(s') - \varphi(s)]^\top z.$$
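The two definitions above can be illustrated with a minimal sketch. It assumes a hypothetical stand-in encoder `phi` (the identity on 2-D states) in place of a learned network; the function names are illustrative and not taken from either paper.

```python
def transition_feature(phi, s, s_next):
    """f(s, s') = phi(s') - phi(s), computed elementwise."""
    return [b - a for a, b in zip(phi(s), phi(s_next))]


def intrinsic_reward(phi, s, s_next, z):
    """r(s, s'; z) = (phi(s') - phi(s))^T z."""
    feature = transition_feature(phi, s, s_next)
    return sum(f_i * z_i for f_i, z_i in zip(feature, z))


def phi(s):
    # Hypothetical encoder: identity on 2-D states (a learned MLP in practice).
    return list(s)


z = [1.0, 0.0]  # skill pointing along the first feature dimension
r_aligned = intrinsic_reward(phi, (0.0, 0.0), (0.5, 0.2), z)
r_orthogonal = intrinsic_reward(phi, (0.0, 0.0), (0.0, 0.3), z)
```

Under these assumptions, a transition that moves the embedding along `z` earns a positive reward, while one that moves orthogonally to `z` earns none.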

The successor features for a policy $\pi$ and skill $z$ are defined recursively:

$$\psi^\pi(s, a, z) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\Big[ \varphi(s') - \varphi(s) + \gamma\,\mathbb{E}_{a' \sim \pi(\cdot \mid s', z)} [\psi^\pi(s', a', z)]\Big].$$

A parametric critic $\psi_\omega(s, a, z)$ is fit by minimizing the squared Bellman error across transitions and skills. Policy updates maximize the expected intrinsic reward $\psi_\omega(s, a, z)^\top z$, typically via an off-policy RL algorithm such as SAC.
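As a rough sketch of how the recursion becomes a training signal, the snippet below forms a single-sample Bellman target and its squared error for one transition. The vectors stand in for network outputs, and scoring the actor by the inner product of the critic output with the skill is an assumption of this sketch rather than a confirmed detail of the published implementation.

```python
def sf_td_target(phi_s, phi_next, psi_next_target, gamma):
    """Single-sample target: phi(s') - phi(s) + gamma * psi_target(s', a', z)."""
    return [
        (pn - ps) + gamma * pt
        for ps, pn, pt in zip(phi_s, phi_next, psi_next_target)
    ]


def squared_bellman_error(psi_pred, target):
    """Squared distance between the critic output and the (fixed) target."""
    return sum((p - t) ** 2 for p, t in zip(psi_pred, target))


def actor_objective(psi_pred, z):
    """Assumed actor score: inner product of successor features with the skill."""
    return sum(p * z_i for p, z_i in zip(psi_pred, z))


# Hypothetical values for one transition with d = 2.
target = sf_td_target(
    phi_s=[0.0, 0.0], phi_next=[1.0, 0.0],
    psi_next_target=[2.0, 2.0], gamma=0.5,
)
error = squared_bellman_error([2.0, 0.0], target)
score = actor_objective([2.0, 0.0], z=[1.0, 0.0])
```

In a full agent the target would be computed from a target network and treated as a constant when differentiating the error with respect to the critic parameters.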

The same machinery, with a pair of encoders, generalizes to contrastive successor features defined for a state-action pair $(s, a)$ and a goal $g$; these approximate a quantity expressed through the discounted successor state density (Myers et al., 2024).

2. Contrastive Learning Objective and Mutual Information

CSF derives its representation learning objective from a contrastive lower bound on the mutual information (MI) between transitions $(s, s')$ and skills $z$ under the behavior policy. In this bound the skill prior is uniform on the sphere, the objective includes a scaling coefficient (Zheng et al., 2024, recommend a specific value), and the second term implements in-batch negative sampling as a variant of InfoNCE. Proposition 2 in (Zheng et al., 2024) shows this is a second-order Taylor approximation of an InfoNCE lower bound on the MI. Policy-step updates are then tied to a lower bound on an information-bottleneck objective, promoting discriminable skills without mode collapse.
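The following is a minimal sketch of a contrastive loss of this general shape, not the exact objective from Zheng et al. (2024). It assumes an inner-product score between transition features and skills, uses the other skills in the batch as negatives, and places a hypothetical coefficient `xi` on the negative term; that placement and the default value are assumptions made for illustration.

```python
import math
import random


def sample_skill(d, rng):
    """Draw z uniformly on the unit sphere by normalizing a Gaussian sample."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def logsumexp(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def contrastive_loss(features, skills, xi=1.0):
    """features[i] = phi(s'_i) - phi(s_i); skills[i] generated transition i."""
    n = len(features)
    total = 0.0
    for i, f in enumerate(features):
        logits = [sum(a * b for a, b in zip(f, z)) for z in skills]
        positive = logits[i]                         # score of the true skill
        negative = logsumexp(logits) - math.log(n)   # log-mean-exp over batch
        total += -(positive - xi * negative)
    return total / n


skills = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[3.0 * x for x in z] for z in skills]
shuffled = aligned[1:] + aligned[:1]  # each feature paired with a wrong skill
loss_aligned = contrastive_loss(aligned, skills)
loss_shuffled = contrastive_loss(shuffled, skills)
z_random = sample_skill(4, random.Random(0))
```

Features that point along the skill that produced them yield a lower loss than the same features paired with the wrong skills, which is the behavior the representation update is meant to encourage.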

For learning temporal distances, a similar InfoNCE-based contrastive objective is imposed on the critic, with positive samples drawn from discounted future states and negatives drawn from the batch, symmetrizing over both anchor and goal positions (Myers et al., 2024).

3. Induced Temporal Structure: Quasimetrics and Compositionality

By a change of variables, CSF-induced successor features can be converted into a temporal (quasimetric) distance, which reduces to a simpler expression in uncontrolled (single-action) settings. Writing $d(x, y)$ for this temporal distance from $x$ to $y$, it satisfies:

  • Non-negativity: $d(x, y) \geq 0$
  • Identity of indiscernibles: $d(x, y) = 0 \iff x = y$
  • Triangle inequality: $d(x, w) \leq d(x, y) + d(y, w)$

Thus, $d$ is a quasimetric (not generally symmetric), which is critical for enabling compositional generalization and shortest-path planning even in stochastic environments (Myers et al., 2024). This property stands in contrast to prior temporal proximity estimators that violate the triangle inequality and therefore prevent stitching of trajectories.
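To make the three properties concrete, the sketch below checks them on a small distance table. It uses shortest-path costs on a hypothetical directed graph as a stand-in for a learned temporal distance; this is an illustration of what a quasimetric is, not the CSF estimator itself.

```python
import math
from itertools import product


def shortest_path_distances(n, edges):
    """All-pairs shortest paths (Floyd-Warshall) on a directed graph."""
    d = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (u, v), w in edges.items():
        d[u][v] = min(d[u][v], w)
    # product() varies its last index fastest, so k is the outermost loop.
    for k, i, j in product(range(n), repeat=3):
        d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    return d


def is_quasimetric(d, tol=1e-9):
    """Check non-negativity, identity of indiscernibles, triangle inequality."""
    n = len(d)
    pairs = list(product(range(n), repeat=2))
    nonneg = all(d[i][j] >= 0 for i, j in pairs)
    identity = all((d[i][j] == 0) == (i == j) for i, j in pairs)
    triangle = all(
        d[i][k] <= d[i][j] + d[j][k] + tol
        for i, j, k in product(range(n), repeat=3)
    )
    return nonneg and identity and triangle


# Directed cycle: going 0 -> 2 is cheap via 1, going 2 -> 0 is expensive.
dist = shortest_path_distances(3, {(0, 1): 1.0, (1, 2): 1.0, (2, 0): 5.0})
# A symmetric "proximity" table that breaks the triangle inequality.
proximity = [[0.0, 1.0, 5.0], [1.0, 0.0, 1.0], [5.0, 1.0, 0.0]]
dist_ok = is_quasimetric(dist)
proximity_ok = is_quasimetric(proximity)
```

The first table is asymmetric yet passes all three checks, and its entry from node 0 to node 2 is obtained by stitching the two shorter hops. The second table fails because the direct estimate from 0 to 2 exceeds the sum of the two hops, which is the kind of inconsistency that blocks stitching.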

4. Algorithmic Realization and Implementation Specifics

The key implementation components for CSF include:

  • Policy $\pi(a \mid s, z)$: One-hidden-layer MLP (1024 hidden units, tanh) trained via SAC (or PPO for select hierarchies) with automatic entropy tuning.
  • Representation $\varphi$: One-hidden-layer MLP (1024 hidden units, ReLU) producing $\varphi(s) \in \mathbb{R}^d$.
  • Successor features $\psi_\omega$: One-hidden-layer MLP (1024 hidden units, ReLU) outputting $\psi_\omega(s, a, z) \in \mathbb{R}^d$; maintains a target network updated by an exponential moving average (EMA).
  • Replay buffer: Batch size 256, typically with 50 gradient steps per 8 new trajectories, using 8 parallel actors.
  • Negative sampling: In-batch skills serve as negatives for the second (negative-sampling) term of the contrastive loss.
  • Intrinsic reward computation: $r(s, s'; z) = [\varphi(s') - \varphi(s)]^\top z$.
  • Zero-shot skill inference: For a goal $g$ to be reached from state $s$, infer $z = (\varphi(g) - \varphi(s)) / \lVert \varphi(g) - \varphi(s) \rVert_2$.
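Two of the components listed above lend themselves to a short sketch: zero-shot skill inference and the EMA target update. The skill is taken here to be the normalized embedding difference between goal and current state, which is consistent with skills living on the unit sphere but is an assumption of this sketch; the rate `tau` and the function names are likewise hypothetical.

```python
import math


def infer_skill(phi_s, phi_g, eps=1e-8):
    """Unit vector pointing from phi(s) toward phi(g)."""
    diff = [g - s for s, g in zip(phi_s, phi_g)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / max(norm, eps) for x in diff]  # eps guards against s == g


def ema_update(target_params, online_params, tau):
    """target <- (1 - tau) * target + tau * online, elementwise."""
    return [
        (1.0 - tau) * t + tau * o
        for t, o in zip(target_params, online_params)
    ]


# Hypothetical embeddings of the current state and the goal.
z_goal = infer_skill(phi_s=[1.0, 1.0], phi_g=[4.0, 5.0])
new_target = ema_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

Conditioning the policy on `z_goal` then asks it to move the embedding in the direction of the goal, with no additional training for that particular goal.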

In contrastive metric distillation (CMD-1) for temporal distances, the algorithm trains the critic using symmetric contrastive losses over anchor and goal positions, then extracts the quasimetric from the trained critic. The cost of InfoNCE with in-batch negatives is quadratic in the batch size, but can be reduced via approximations (Myers et al., 2024).
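A symmetrized InfoNCE loss over a batch can be sketched as follows, assuming the critic scores are already arranged in a square matrix whose diagonal holds the positive pairs. How CMD parameterizes the critic is not reproduced here; the matrix layout and function names are assumptions for illustration.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x)))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def symmetric_infonce(logits):
    """logits[i][j] scores anchor i against goal j; diagonal entries are positives."""
    b = len(logits)
    loss = 0.0
    for i in range(b):
        row = logits[i]                           # anchor i vs. every goal
        col = [logits[k][i] for k in range(b)]    # goal i vs. every anchor
        loss += logsumexp(row) - logits[i][i]
        loss += logsumexp(col) - logits[i][i]
    return loss / (2 * b)


uninformative = [[0.0] * 3 for _ in range(3)]
confident = [[5.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
loss_uninformative = symmetric_infonce(uninformative)
loss_confident = symmetric_infonce(confident)
```

Because every anchor is scored against every goal in the batch, the matrix has one entry per pair, which is where the quadratic per-batch cost comes from.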

5. Empirical Performance and Ablations

CSF matches or surpasses prior state-of-the-art methods in unsupervised exploration, zero-shot goal reaching, and hierarchical control on six continuous-control benchmarks. Empirical ablations confirm:

  • The unconstrained contrastive loss in CSF (“METRA-C”) matches the exploration performance of the more complex METRA method.
  • Removing the "anti-exploration" term (negative actor reward) is essential to maintain exploration diversity; maximizing the unregularized objective collapses skills.
  • The specific inner-product critic $[\varphi(s') - \varphi(s)]^\top z$ is crucial for performance; replacing it with deeper MLPs or kernelized forms degrades results.
  • Skill dimensionality $d$ must be environment-specific; both METRA and CSF are sensitive to this value.
  • The scaling coefficient in the contrastive loss substantially boosts performance when set to the empirically chosen value.

On goal-reaching and planning tasks, CMD-based methods relying on CSF-induced quasimetric distances demonstrate strong combinatorial generalization (“stitching” unseen state transitions) and sample efficiency superior to contrastive RL, Q-learning with HER, and behavior cloning (Myers et al., 2024).

| Method | AntMaze-umaze | AntMaze-umaze-diverse | AntMaze-large-diverse |
| --- | --- | --- | --- |
| CMD-1 | 90.3 ± 4.2 | 90.3 ± 4.6 | 78.0 ± 4.0 |
| CMD-2 | 97.0 ± 0.4 | 90.5 ± 1.4 | 72.3 ± 2.6 |
| Quasimetric RL | 76.8 ± 2.3 | 80.1 ± 1.3 | 76.5 ± 2.1 |
| CPC (CRL) | 79.8 ± 1.6 | 77.6 ± 2.8 | 72.6 ± 2.9 |
| GCBC | 65.4 ± 8.7 | 60.9 ± 6.2 | 58.1 ± 7.2 |

6. Theoretical Properties and Guarantees

CSF is grounded in several theoretical findings:

  • The contrastive representation loss maximizes a variational lower bound on mutual information between transitions and skill variables (Prop. 2 in (Zheng et al., 2024)).
  • The induced quasimetric temporal distance satisfies non-negativity, identity of indiscernibles, and the triangle inequality (see Lemma 4.1 and following in (Myers et al., 2024)).
  • The information bottleneck linkage (Prop. 3 in (Zheng et al., 2024)) shows the intrinsic reward structure drives policy learning to maximize a lower bound on the mutual information between transitions and skills, penalized by the information captured in the representation $\varphi(s') - \varphi(s)$.
  • KKT analysis demonstrates that (in the METRA construction) the Wasserstein constraint is saturated, and the unconstrained contrastive loss of CSF achieves similar regularization implicitly.

Together, these results give CSF a theoretical basis for skill diversity, transferability, and planning via the induced state-space structure.

7. Extensions, Practical Considerations, and Open Questions

Several limitations and natural extensions are noted:

  • The quasimetric proofs assume discrete state spaces; in continuous domains the properties are supported only empirically, without a full measure-theoretic treatment.
  • In non-ergodic MDPs, or whenever a goal has zero probability of being reached, the induced distance diverges; proper handling of unreachable goals is required.
  • Estimating self-returns may become noisy in high-dimensional spaces, potentially requiring improved estimation or bootstrapping techniques.
  • Integrating CSF distances into hierarchical RL or graph-based planning in the latent space is a natural extension.
  • Further work could leverage distributional RL to obtain not just expected transit times but concentration inequalities and uncertainty bounds on the temporal distance, and to address irreversible (asymmetric) dynamics more deliberately.
  • Both METRA and CSF are sensitive to the dimension $d$ of the learned skills; it must be tuned to environment complexity.

A plausible implication is that bridging CSF with scalable planning methods and hierarchical abstraction will further enhance combinatorial generalization and control in large-scale RL problems. CSF thus operationalizes a minimalistic, theoretically principled, and empirically validated approach to self-supervised skill learning, exploration, and planning (Zheng et al., 2024, Myers et al., 2024).
