Contrastive Successor Features in RL
- Contrastive Successor Features (CSF) is a unifying RL framework that integrates unsupervised skill discovery, contrastive representation learning, and temporal abstraction via successor features.
- It uses contrastive mutual information maximization and InfoNCE losses to induce quasimetric temporal distances, enabling efficient exploration and combinatorial generalization.
- Empirical findings show that CSF matches or outperforms prior state-of-the-art methods on continuous-control benchmarks, highlighting its potential for planning and hierarchical control.
Contrastive Successor Features (CSF) is a unifying framework in reinforcement learning (RL) that connects unsupervised skill discovery, contrastive representation learning, and temporal abstraction through the lens of successor features. CSF achieves efficient exploration, skill diversity, and control via a combination of contrastive mutual information maximization and successor feature critics, and it induces quasimetric temporal distances between states that directly support combinatorial generalization and planning. The following sections synthesize the technical formulation, methodological principles, algorithmic details, theoretical foundations, empirical findings, and open questions, drawing primarily from the formalizations in "Can a MISL Fly?" (Zheng et al., 2024) and "Learning Temporal Distances: Contrastive Successor Features Can Provide a Metric Structure for Decision-Making" (Myers et al., 2024).
1. Formal Definition: Successor Features in Contrastive Skill Learning
CSF combines classical successor features with modern contrastive learning objectives for unsupervised RL and goal-conditioned control. Let $\phi: \mathcal{S} \to \mathbb{R}^d$ be a learned embedding mapping states into a $d$-dimensional feature space. For a state transition $(s, s')$, the instantaneous feature is defined as
$$\phi(s') - \phi(s).$$
A skill variable $z \in \mathbb{R}^d$ parameterizes intrinsic rewards via the inner product
$$r(s, s', z) = (\phi(s') - \phi(s))^\top z.$$
The successor features for a policy $\pi$ and skill $z$ are defined recursively:
$$\psi^\pi(s, a, z) = \mathbb{E}_{s', a'}\left[\phi(s') - \phi(s) + \gamma\, \psi^\pi(s', a', z)\right].$$
A parametric critic $\psi_\theta$ is fit by minimizing the squared Bellman error across transitions and skills. Policy updates maximize the expected intrinsic reward $\psi_\theta(s, a, z)^\top z$, typically via off-policy RL such as SAC.
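The same machinery, with an encoder pair $(\phi, \psi)$ over state-action pairs and goals (here $\psi$ denotes a goal encoder rather than the skill-conditioned critic), generalizes to defining contrastive successor features for a state-action pair $(s, a)$ and a goal $g$ as $f(s, a, g) = \phi(s, a)^\top \psi(g)$, which approximates $\log \frac{p_\gamma^\pi(g \mid s, a)}{p(g)}$ up to an additive constant, where $p_\gamma^\pi$ denotes the discounted successor state density (Myers et al., 2024).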
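The skill-conditioned quantities in this section can be sketched numerically. The snippet below is a minimal illustration, assuming the instantaneous feature is $\phi(s') - \phi(s)$, the intrinsic reward is its inner product with the skill $z$, and the critic is regressed onto a one-step bootstrap target. Random arrays stand in for the learned networks, and all sizes and the discount value are illustrative choices rather than values from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
d, batch, gamma = 4, 8, 0.99  # illustrative sizes, not values from the source

# Hypothetical stand-ins for learned networks: random arrays play the role of
# phi(s), phi(s'), the critic psi(s, a, z), and the target critic psi_bar(s', a', z).
phi_s = rng.normal(size=(batch, d))
phi_next = rng.normal(size=(batch, d))
psi_pred = rng.normal(size=(batch, d))
psi_target_next = rng.normal(size=(batch, d))

# Skills drawn uniformly from the unit sphere (normalized Gaussian samples).
z = rng.normal(size=(batch, d))
z /= np.linalg.norm(z, axis=1, keepdims=True)


def intrinsic_reward(phi_s, phi_next, z):
    """r(s, s', z) = (phi(s') - phi(s))^T z, one value per batch row."""
    return np.sum((phi_next - phi_s) * z, axis=1)


def sf_bellman_error(psi_pred, phi_s, phi_next, psi_target_next, gamma):
    """Mean squared error between psi(s, a, z) and its one-step bootstrap target."""
    target = (phi_next - phi_s) + gamma * psi_target_next
    return np.mean(np.sum((psi_pred - target) ** 2, axis=1))


r = intrinsic_reward(phi_s, phi_next, z)
loss = sf_bellman_error(psi_pred, phi_s, phi_next, psi_target_next, gamma)
actor_objective = np.sum(psi_pred * z, axis=1)  # psi(s, a, z)^T z per sample
```

In an actual agent the target would be treated as a constant during the critic update, and `actor_objective` would be maximized with respect to the policy's actions.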
2. Contrastive Learning Objective and Mutual Information
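CSF derives its representation learning objective from a contrastive lower bound on the mutual information (MI) $I^\beta(S, S'; Z)$ under the behavior policy $\beta$:
$$\mathbb{E}_{p^\beta(s, s', z)}\left[(\phi(s') - \phi(s))^\top z\right] - \xi\, \mathbb{E}_{p^\beta(s, s')}\left[\log \mathbb{E}_{p(z')} \exp\left((\phi(s') - \phi(s))^\top z'\right)\right],$$
where $p(z)$ is uniform on the sphere, $\xi$ is a scaling coefficient ($\xi = 5$ recommended), and the second term implements in-batch negative sampling as a variant of InfoNCE. Proposition 2 in (Zheng et al., 2024) shows this is a second-order Taylor approximation of an InfoNCE lower bound on $I^\beta(S, S'; Z)$. Policy-step updates are then tied to a lower bound on $I^\pi(S, S'; Z) - I^\pi(S, S'; \phi(S') - \phi(S))$ (information bottleneck), promoting discriminable skills without mode collapse.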
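A minimal sketch of this contrastive objective with in-batch negatives follows. It assumes the positive term is the inner product between each transition's feature difference and the skill that produced it, and that the scaling coefficient multiplies the negative (log-expectation) term. The function names, the default `xi=5.0`, and the batch shapes are assumptions made for illustration.

```python
import numpy as np


def logsumexp(x, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def csf_representation_objective(delta_phi, z, xi=5.0):
    """Contrastive objective to be maximized with respect to phi.

    delta_phi: (B, d) array of phi(s') - phi(s), one row per transition.
    z:         (B, d) array of the skills that generated each transition.
    xi:        weight on the negative term (assumed placement).
    """
    batch = delta_phi.shape[0]
    logits = delta_phi @ z.T          # (B, B): every transition against every skill
    positive = np.diag(logits)        # matched (transition, skill) pairs
    # Log of the in-batch average of exp(logits) over candidate skills.
    negative = logsumexp(logits, axis=1) - np.log(batch)
    return np.mean(positive) - xi * np.mean(negative)


rng = np.random.default_rng(0)
z = rng.normal(size=(16, 4))
z /= np.linalg.norm(z, axis=1, keepdims=True)

# Feature differences aligned with their own skill score higher than
# differences aligned with a neighbouring sample's skill.
aligned = csf_representation_objective(3.0 * z, z)
shuffled = csf_representation_objective(3.0 * np.roll(z, 1, axis=0), z)
```

The two calls share the same negative term up to a row permutation, so the gap between `aligned` and `shuffled` comes entirely from the positive pairs.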
For learning temporal distances, a similar InfoNCE-based contrastive objective is imposed on the critic $f(s, a, g)$, with positive and negative samples defined via discounted future states and batch-based negatives, symmetrizing over both anchor and goal positions (Myers et al., 2024).
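The symmetrized loss can be sketched on a square matrix of critic values, where entry $(i, j)$ scores anchor $i$ against goal $j$ and the diagonal holds the positive pairs. This is an illustrative version assuming an inner-product critic over hypothetical embeddings; it is not claimed to match the authors' exact loss.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    return x - np.log(np.sum(np.exp(x), axis=axis, keepdims=True))


def symmetric_infonce(logits):
    """Symmetric InfoNCE loss; logits[i, j] scores anchor i against goal j."""
    idx = np.arange(logits.shape[0])
    over_goals = -log_softmax(logits, axis=1)[idx, idx]    # pick the goal for each anchor
    over_anchors = -log_softmax(logits, axis=0)[idx, idx]  # pick the anchor for each goal
    return np.mean(over_goals + over_anchors)


rng = np.random.default_rng(0)
anchor_emb = rng.normal(size=(8, 16))  # hypothetical embeddings of (s, a)
goal_emb = rng.normal(size=(8, 16))    # hypothetical embeddings of g
logits = anchor_emb @ goal_emb.T       # (B, B) matrix, hence quadratic cost in B
loss = symmetric_infonce(logits)
```

Because every anchor is scored against every goal in the batch, the cost of one loss evaluation grows quadratically with the batch size.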
3. Induced Temporal Structure: Quasimetrics and Compositionality
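By a change of variables, CSF-induced successor features can be converted into a temporal (quasimetric) distance:
$$d_{\mathrm{SD}}((s, a), g) = \log \frac{p_\gamma^\pi(g \mid g)}{p_\gamma^\pi(g \mid s, a)},$$
where $p_\gamma^\pi(g \mid g)$ is the discounted density of returning to $g$ when starting from $g$. In uncontrolled (single action) settings this reduces to $d_{\mathrm{SD}}(s, g) = \log \frac{p_\gamma(g \mid g)}{p_\gamma(g \mid s)}$. This temporal distance $d_{\mathrm{SD}}$ satisfies:
- Non-negativity: $d_{\mathrm{SD}}(x, y) \ge 0$
- Identity of indiscernibles: $d_{\mathrm{SD}}(x, y) = 0 \iff x = y$
- Triangle inequality: $d_{\mathrm{SD}}(x, z) \le d_{\mathrm{SD}}(x, y) + d_{\mathrm{SD}}(y, z)$
Thus, $d_{\mathrm{SD}}$ is a quasimetric (not generally symmetric), which is critical for enabling compositional generalization and shortest-path planning even in stochastic environments (Myers et al., 2024). This property stands in contrast to prior temporal proximity estimators that violate the triangle inequality and therefore cannot stitch trajectories together.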
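These properties can be checked numerically on a small uncontrolled Markov chain. The sketch below assumes the distance takes the log-ratio form $d(x, y) = \log p_\gamma(y \mid y) - \log p_\gamma(y \mid x)$, with $p_\gamma$ the discounted state occupancy that includes the starting step. The chain, its size, and the discount are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9  # illustrative chain size and discount

# Random row-stochastic transition matrix with strictly positive entries,
# so every state is reachable from every other state.
P = rng.random((n, n)) + 0.05
P /= P.sum(axis=1, keepdims=True)

# Discounted state occupancy: M[x, y] = (1 - gamma) * sum_k gamma^k P^k[x, y].
M = (1.0 - gamma) * np.linalg.inv(np.eye(n) - gamma * P)

# Assumed distance form: D[x, y] = log M[y, y] - log M[x, y].
D = np.log(np.diag(M))[None, :] - np.log(M)

nonneg = bool(np.all(D >= -1e-12))
zero_diag = bool(np.allclose(np.diag(D), 0.0))

# Triangle inequality: d(x, z) <= d(x, y) + d(y, z) for all x, y, z.
via = D[:, :, None] + D[None, :, :]  # via[x, y, z] = d(x, y) + d(y, z)
triangle = bool(np.all(D[:, None, :] <= via + 1e-12))

# Symmetry is not required of a quasimetric; this flag only reports it.
symmetric = bool(np.allclose(D, D.T))
```

The triangle inequality holds here because $M[x, y] / M[y, y]$ equals the expected discount accumulated before first hitting $y$ from $x$, and reaching $z$ by way of $y$ is only one of the ways to reach $z$.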
4. Algorithmic Realization and Implementation Specifics
The key implementation components for CSF include:
- Policy $\pi(a \mid s, z)$: One-hidden-layer MLP (1024 hidden, tanh) trained via SAC (or PPO for select hierarchies) with automatic entropy tuning.
- Representation $\phi$: One-hidden-layer MLP (1024 hidden, ReLU) producing $\phi(s) \in \mathbb{R}^d$.
- Successor features $\psi$: One-hidden-layer MLP (1024 hidden, ReLU) outputting $\psi(s, a, z) \in \mathbb{R}^d$; maintains a target network $\bar{\psi}$ updated with an exponential moving average (EMA).
- Replay buffer: Batch size 256, typically with 50 gradient steps per 8 new trajectories, using 8 parallel actors.
- Negative sampling: In-batch skills $z'$ for the log-expectation term in the contrastive loss.
- Intrinsic reward computation: $r(s, s', z) = (\phi(s') - \phi(s))^\top z$.
- Zero-shot skill inference: For goal $g$ from state $s$, infer $z = (\phi(g) - \phi(s)) / \lVert \phi(g) - \phi(s) \rVert_2$.
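Two of the components listed above are simple enough to sketch directly: zero-shot skill inference and the EMA target update. The snippet assumes the directional inference rule $z = (\phi(g) - \phi(s)) / \lVert \phi(g) - \phi(s) \rVert_2$ and a standard Polyak-style update. The helper names and the value `tau=0.005` are hypothetical and not taken from the source.

```python
import numpy as np


def infer_skill(phi_s, phi_g, eps=1e-8):
    """Unit-norm skill pointing from the current state's embedding toward the goal's."""
    diff = phi_g - phi_s
    return diff / (np.linalg.norm(diff) + eps)  # eps guards against phi_g == phi_s


def ema_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau of the way toward the online one."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


# Example with hand-picked embeddings (illustrative values only).
z = infer_skill(np.array([0.0, 0.0]), np.array([3.0, 4.0]))

online = [np.ones((2, 2)), np.ones(2)]
target = [np.zeros((2, 2)), np.zeros(2)]
target = ema_update(target, online, tau=0.005)  # tau is an assumed value
```

The inferred skill is then held fixed and passed to the skill-conditioned policy until the goal is reached or the episode ends.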
In contrastive metric distillation (CMD-1) for temporal distances, the algorithm trains the critic $f(s, a, g)$ using symmetric contrastive losses over anchors $(s, a)$ and goals $g$, then extracts the quasimetric $d_{\mathrm{SD}}$ from the trained critic. The computational complexity is $O(B^2)$ per batch of size $B$ for InfoNCE, but can be reduced via approximations (Myers et al., 2024).
5. Empirical Performance and Ablations
CSF matches or surpasses prior state-of-the-art methods on unsupervised exploration, zero-shot goal reaching, and hierarchical control across six continuous-control benchmarks. Empirical ablations confirm:
- The unconstrained contrastive loss in CSF (“METRA-C”) matches the exploration performance of the more complex METRA method.
- Retaining the "anti-exploration" term (negative actor reward) is essential to maintain exploration diversity; maximizing $I(S, S'; Z)$ without this regularizer collapses skills.
- The specific inner-product critic $(\phi(s') - \phi(s))^\top z$ is crucial for performance; replacing it with deeper MLPs or kernelized forms degrades results.
- Skill dimensionality $d$ must be tuned per environment; both METRA and CSF are sensitive to this value.
- The scaling coefficient $\xi$ (empirically $\xi = 5$) substantially boosts performance.
On goal-reaching and planning tasks, CMD-based methods relying on CSF-induced quasimetric distances demonstrate strong combinatorial generalization (“stitching” unseen state transitions) and sample efficiency superior to contrastive RL, Q-learning with HER, and behavior cloning (Myers et al., 2024).
| Method | AntMaze-umaze | AntMaze-umaze-diverse | AntMaze-large-diverse |
|---|---|---|---|
| CMD-1 | 90.3 ± 4.2 | 90.3 ± 4.6 | 78.0 ± 4.0 |
| CMD-2 | 97.0 ± 0.4 | 90.5 ± 1.4 | 72.3 ± 2.6 |
| Quasimetric RL | 76.8 ± 2.3 | 80.1 ± 1.3 | 76.5 ± 2.1 |
| CPC (CRL) | 79.8 ± 1.6 | 77.6 ± 2.8 | 72.6 ± 2.9 |
| GCBC | 65.4 ± 8.7 | 60.9 ± 6.2 | 58.1 ± 7.2 |
6. Theoretical Properties and Guarantees
CSF is grounded in several theoretical findings:
- The contrastive representation loss maximizes a variational lower bound on mutual information between transitions and skill variables (Prop. 2 in (Zheng et al., 2024)).
- The induced quasimetric temporal distance satisfies non-negativity, identity of indiscernibles, and the triangle inequality (see Lemma 4.1 and following in (Myers et al., 2024)).
- The information bottleneck linkage (Prop. 3 in (Zheng et al., 2024)) shows the intrinsic reward structure drives policy learning to maximize a lower bound on $I^\pi(S, S'; Z)$, penalized by information captured in the representation $\phi(S') - \phi(S)$.
- KKT analysis demonstrates that (in the METRA construction) the Wasserstein constraint is saturated, and the unconstrained contrastive loss of CSF achieves similar regularization implicitly.
Together, these results give CSF a theoretical basis for skill diversity, transferability, and planning via the induced state-space structure.
7. Extensions, Practical Considerations, and Open Questions
Several limitations and natural extensions are noted:
- The quasimetric proofs assume discrete state spaces; full measure-theoretic generalizations to continuous domains are currently only empirical.
- In non-ergodic MDPs or if $p_\gamma^\pi(g \mid s, a) = 0$, the induced distance diverges; proper handling of unreachable goals is required.
- Estimating self-returns $p_\gamma^\pi(g \mid g)$ may become noisy in high-dimensional spaces, potentially requiring improved estimation or bootstrapping techniques.
- Integrating CSF distances into hierarchical RL or graph-based planning in the latent space is a natural extension.
- Further work could leverage distributional RL to obtain not just expected transit times but concentration inequalities and uncertainty bounds on $d_{\mathrm{SD}}$, and to address irreversible (asymmetric) dynamics more deliberately.
- Both METRA and CSF are sensitive to the dimension $d$ of the learned skills; selection must be tuned to environment complexity.
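As one illustration of the graph-based planning extension and of handling unreachable goals, the sketch below runs Dijkstra's algorithm over a set of landmark states connected by learned distances. It is a hypothetical construction, not a method from the cited papers: the function name `plan_waypoints`, the edge threshold `max_edge`, and the toy distance matrix are all assumptions. Non-finite distances are treated as missing edges, so an unreachable goal yields `None` instead of a divergent cost.

```python
import heapq

import numpy as np


def plan_waypoints(D, start, goal, max_edge):
    """Dijkstra over landmark states; D[i, j] is a learned distance from i to j.

    Edges that are longer than max_edge or non-finite (unreachable) are dropped.
    Returns the landmark indices from start to goal, or None if no path exists.
    """
    n = D.shape[0]
    best = {start: 0.0}
    parent = {}
    heap = [(0.0, start)]
    while heap:
        cost, u = heapq.heappop(heap)
        if u == goal:
            path = [u]
            while path[-1] != start:
                path.append(parent[path[-1]])
            return path[::-1]
        if cost > best.get(u, np.inf):
            continue  # stale heap entry
        for v in range(n):
            w = D[u, v]
            if v == u or not np.isfinite(w) or w > max_edge:
                continue
            if cost + w < best.get(v, np.inf):
                best[v] = cost + w
                parent[v] = u
                heapq.heappush(heap, (cost + w, v))
    return None


# Toy asymmetric distances on a one-way line of four landmarks:
# moving forward costs the index gap, moving backward is impossible.
idx = np.arange(4)
D = (idx[None, :] - idx[:, None]).astype(float)
D[D < 0] = np.inf

forward = plan_waypoints(D, 0, 3, max_edge=1.5)   # only short hops are trusted
backward = plan_waypoints(D, 3, 0, max_edge=1.5)  # goal is unreachable
```

Restricting edges to short distances reflects the common assumption that learned distances are most reliable locally; the asymmetric matrix shows why a quasimetric, rather than a symmetric metric, is the appropriate edge weight for irreversible dynamics.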
A plausible implication is that bridging CSF with scalable planning methods and hierarchical abstraction will further enhance combinatorial generalization and control in large-scale RL problems. CSF thus offers a minimal, theoretically principled, and empirically validated approach to self-supervised skill learning, exploration, and planning (Zheng et al., 2024; Myers et al., 2024).