Successor Features in RL
- Successor Features are a reinforcement learning abstraction that decomposes the action-value function by separating environment dynamics from the reward structure.
- They enable principled skill transfer and modular policy composition through a linear factorization of future outcomes, offering provable performance guarantees.
- Deep and modular architectures implement SFs using techniques like L2 normalization and joint optimization to achieve robust, sample-efficient learning in high-dimensional settings.
Successor Features (SFs) are a foundational abstraction in reinforcement learning (RL) that decomposes the action-value function into environment dynamics and reward structure, enabling principled skill transfer, modular policy composition, and sample-efficient adaptation to downstream tasks. Underpinning much of the contemporary research on RL transfer, SFs isolate transferable aspects of agent behavior, offering a linear factorization of future outcomes and facilitating algorithms with provable performance guarantees and state-of-the-art practical performance.
1. Formal Definition and Theoretical Basis
Given an MDP $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$ and a feature map $\phi : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^d$, SFs are defined for a fixed policy $\pi$ as the expected discounted sum of future features:

$$\psi^{\pi}(s, a) = \mathbb{E}^{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, \phi(s_t, a_t, s_{t+1}) \;\middle|\; s_0 = s,\ a_0 = a \right].$$

Assuming the reward can be expressed linearly in the features, $r(s, a, s') = \phi(s, a, s')^{\top} \mathbf{w}$, the action-value decomposes as:

$$Q^{\pi}(s, a) = \psi^{\pi}(s, a)^{\top} \mathbf{w}.$$

The Bellman equation for SFs is:

$$\psi^{\pi}(s, a) = \mathbb{E}\!\left[ \phi(s, a, s') + \gamma\, \psi^{\pi}(s', \pi(s')) \right].$$

This factorization enables the decoupling of transition statistics (captured by $\psi^{\pi}$) from reward parameterization ($\mathbf{w}$), so that adapting to novel rewards amounts to a single inner product without re-solving the MDP (Barreto et al., 2016, Barreto et al., 2019, Kim et al., 2024).
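As a minimal illustration of this decoupling, the tabular Python/NumPy sketch below iterates the SF Bellman equation for a fixed policy and then evaluates a new task with a single inner product. It assumes features that depend only on $(s, a)$ and a deterministic policy; all names are illustrative.

```python
import numpy as np

def successor_features(P, phi, pi, gamma, n_iters=500):
    """Tabular SF evaluation for a fixed deterministic policy.

    P   : (S, A, S) transition tensor, P[s, a, s'] = p(s' | s, a)
    phi : (S, A, D) feature map restricted to (s, a) pairs
    pi  : (S,)      deterministic policy, pi[s] = action taken in s
    Returns psi of shape (S, A, D) with psi[s, a] ~= E[sum_t gamma^t phi_t].
    """
    S, A, D = phi.shape
    psi = np.zeros((S, A, D))
    for _ in range(n_iters):
        # SF Bellman backup: psi(s, a) = phi(s, a) + gamma * E_{s'}[ psi(s', pi(s')) ]
        psi_next = psi[np.arange(S), pi]                      # (S, D)
        psi = phi + gamma * np.einsum("sax,xd->sad", P, psi_next)
    return psi

def q_from_sf(psi, w):
    # Adapting to a new reward weight w is a single inner product: Q(s, a) = psi(s, a) . w
    return psi @ w
```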
2. Generalized Policy Improvement (GPI) and Transfer Guarantees
The SF framework provides the basis for Generalized Policy Improvement (GPI): given a collection of policies $\pi_1, \dots, \pi_n$ and their SFs $\psi^{\pi_1}, \dots, \psi^{\pi_n}$, for any new task weight $\mathbf{w}$, an improved policy is defined via:

$$\pi(s) \in \arg\max_{a} \max_{i} \psi^{\pi_i}(s, a)^{\top} \mathbf{w}.$$

GPI guarantees that the induced policy's value is at least as large as that of the best base policy, minus a bound that scales with the SF and reward-parameter approximation errors:

$$Q^{\pi}(s, a) \;\geq\; \max_{i} Q^{\pi_i}(s, a) \;-\; \frac{2}{1 - \gamma}\,\epsilon,$$

where $\epsilon$ bounds the action-value error induced by the approximate SFs and reward weights, i.e. $|\tilde{\psi}^{\pi_i}(s, a)^{\top} \tilde{\mathbf{w}} - Q^{\pi_i}(s, a)| \leq \epsilon$ for all $i, s, a$.
When reward functions fall outside the linear span of the features, transfer performance degrades gracefully: the suboptimality of the GPI policy is bounded by the minimal reward-approximation error over the feature span plus any SF estimation error (Barreto et al., 2016, Barreto et al., 2019, Feng et al., 2022).
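A minimal sketch of GPI action selection over a library of tabular SFs (illustrative names, building on the sketch above):

```python
import numpy as np

def gpi_action(psis, w, s):
    """GPI action selection: argmax_a max_i psi^{pi_i}(s, a) . w.

    psis : (n_policies, S, A, D) SFs of the base policies
    w    : (D,) weight vector of the new task
    """
    q = psis[:, s] @ w                     # (n_policies, A): Q^{pi_i}(s, a)
    return int(np.argmax(q.max(axis=0)))   # max over policies, then argmax over actions
```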
3. Deep and Modular Architectures for SF Learning
SFs have been implemented in deep RL via parameterized function approximators. Architectures typically consist of a shared encoder (processing images or states), a feature head for $\phi$, and separate MLPs for each base task's SF predictions (Barreto et al., 2019, Chua et al., 2024). Special attention is required to avoid representation collapse when learning from high-dimensional observations; techniques include:
- L2 normalization of the features, a stop-gradient in the reward-prediction loss, and joint optimization of the TD and feature-prediction losses (Chua et al., 2024); a sketch of these ingredients follows the summary table below.
- Modular SF architectures (MSFA) leveraging per-module cumulant and SF heads to discover task-composable representations and enable robust generalization without manual feature engineering (Carvalho et al., 2023).
- Categorical Successor Feature Approximators (CSFA) for stable learning of SFs and task-encodings in large-scale 3D environments (Carvalho et al., 2023).
A summary table of main architectural approaches:
| Approach/Reference | Key Feature Learning Strategy | Transfer Properties |
|---|---|---|
| Deep SF + TD/Reward Loss (Chua et al., 2024) | L2-normalized $\phi$, stop-gradient reward fit | Fast, robust transfer from pixels; avoids collapse |
| Modular SF (MSFA) (Carvalho et al., 2023) | Task discovery via per-module cumulants | Out-of-distribution generalization, zero-shot |
| CSFA (Carvalho et al., 2023) | Categorical distribution over SFs | Large-scale transfer with discovered features |
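As a concrete illustration of these ingredients, the PyTorch sketch below shows a shared encoder with an L2-normalized feature head $\phi$, per-task SF heads, a reward-fit loss, and an SF TD loss. The module names and the exact placement of the stop-gradient are illustrative assumptions rather than a reproduction of any specific published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSF(nn.Module):
    """Sketch of a deep SF model: shared encoder, L2-normalized phi head,
    and one SF head per base task (illustrative, not a specific published net)."""

    def __init__(self, obs_dim: int, n_actions: int, feat_dim: int, n_tasks: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.phi_head = nn.Linear(256, feat_dim)
        # One SF head per base task; each predicts psi(s, a) for every action.
        self.sf_heads = nn.ModuleList(
            [nn.Linear(256, n_actions * feat_dim) for _ in range(n_tasks)]
        )
        self.n_actions, self.feat_dim = n_actions, feat_dim
        # Learned reward weights w per base task, so that r ~= phi . w.
        self.w = nn.Parameter(torch.zeros(n_tasks, feat_dim))

    def forward(self, obs):
        z = self.encoder(obs)
        phi = F.normalize(self.phi_head(z), dim=-1)  # L2 normalization guards against collapse
        psi = torch.stack(
            [h(z).view(-1, self.n_actions, self.feat_dim) for h in self.sf_heads], dim=1
        )  # (batch, n_tasks, n_actions, feat_dim)
        return phi, psi

def reward_loss(phi, w, r):
    # Reward-prediction fit; the stop-gradient placement here (detaching phi so only
    # w is trained by this loss) is an illustrative choice.
    return F.mse_loss(phi.detach() @ w, r)

def sf_td_loss(psi_sa, phi, psi_next_sa, gamma, done):
    # TD backup psi(s, a) <- phi(s, a) + gamma * psi(s', a'); targets are detached.
    target = (phi + gamma * (1.0 - done).unsqueeze(-1) * psi_next_sa).detach()
    return F.mse_loss(psi_sa, target)
```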
4. Extensions: Policy Composition, Safety, and Risk
Recent work generalizes SFs to more expressive settings:
Concurrent Policy Composition: SFs support direct composition, where new policies are synthesized online from combinations of the primitives' SFs via max, sum, or other coordinatewise rules, and action selection is implemented via multiplicative mixtures or GPI-like strategies in continuous action spaces. These techniques yield substantial transfer gains and real-time policy synthesis (Liu et al., 2023).
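The discrete-action toy sketch below conveys the coordinatewise composition rules; the cited work operates in continuous action spaces with multiplicative policy mixtures, so this is an illustrative simplification:

```python
import numpy as np

def compose_sfs(psis, rule="sum"):
    """Coordinatewise composition of primitive SFs.

    psis : (n_primitives, S, A, D). Returns a composed SF table of shape (S, A, D).
    """
    if rule == "sum":
        return psis.sum(axis=0)
    if rule == "max":
        return psis.max(axis=0)
    raise ValueError(f"unknown composition rule: {rule}")

def act_composed(psis, w, s, rule="sum"):
    # Greedy action under the composed SF, evaluated on the new task weight w.
    q = compose_sfs(psis, rule)[s] @ w     # (A,)
    return int(np.argmax(q))
```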
Safe RL and Risk Awareness: SFs can be extended to Constrained MDPs (CMDPs) by separately decomposing both reward and cost functions in a common feature space, enabling Lagrangian-based constrained GPI with theoretical performance bounds (Feng et al., 2022). Risk-aware SFs (RaSF) further augment the representation to capture return variances, enabling transfer under entropic or mean-variance utility objectives with explicit risk/reward trade-offs (Gimelfarb et al., 2021).
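A minimal sketch of the constrained-GPI idea, assuming reward and cost are both linear in the shared features ($r = \phi^{\top}\mathbf{w}_r$, $c = \phi^{\top}\mathbf{w}_c$), so that for a fixed Lagrange multiplier $\lambda$ the Lagrangian objective is itself a linear task with weight $\mathbf{w}_r - \lambda \mathbf{w}_c$; the multiplier update is omitted and names are illustrative:

```python
import numpy as np

def constrained_gpi_action(psis, w_r, w_c, lam, s):
    """GPI on the Lagrangian task weight w_r - lam * w_c.

    Because reward and cost share the feature space, the Lagrangian objective
    r - lam * c is linear in phi and the stored SFs transfer unchanged.
    """
    w_lagrangian = w_r - lam * w_c
    q = psis[:, s] @ w_lagrangian           # (n_policies, A)
    return int(np.argmax(q.max(axis=0)))
```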
5. Feature Selection and Universality
A central question for universal SFs is which base feature set maximizes downstream performance across a broad family of tasks. The optimal base features are not, in general, Laplacian eigenfunctions, but rather the top eigenvectors of a task-family-dependent “advantage-kernel” operator, which admits an explicit form in deterministic MDPs (Ollivier, 15 Feb 2025). This choice ensures maximal zero-shot adaptation under KL-regularized natural policy gradient, extending universality even to downstream tasks that are not in the linear span of the features.
6. Relaxing Linearity: Successor Feature Representations and Nonlinear Transfer
The restriction to linearly parameterized rewards limits standard SFs; recent proposals define Successor Feature Representations (SFR), where the agent learns the discounted distribution (density) over future features, supporting arbitrary reward functionals:

$$\xi^{\pi}(s, a, \mathbf{f}) = \mathbb{E}^{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, \delta\!\bigl(\phi(s_t, a_t, s_{t+1}) - \mathbf{f}\bigr) \;\middle|\; s_0 = s,\ a_0 = a \right], \qquad Q^{\pi}(s, a) = \int \xi^{\pi}(s, a, \mathbf{f})\, R(\mathbf{f})\, d\mathbf{f},$$

with Bellman update

$$\xi^{\pi}(s, a, \mathbf{f}) = \mathbb{E}\!\left[ \delta\!\bigl(\phi(s, a, s') - \mathbf{f}\bigr) + \gamma\, \xi^{\pi}(s', \pi(s'), \mathbf{f}) \right].$$
SFRs provably converge and outperform classical SFs in nonlinear transfer settings, as demonstrated empirically (Reinke et al., 2021). This extension aligns SFs with fully predictive state-representation frameworks.
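A toy sketch of the SFR idea under the simplifying assumption that feature space is discretized into $K$ bins (the cited work learns continuous densities); names are illustrative:

```python
import numpy as np

def sfr_q_value(xi, reward_fn, feature_atoms, s, a):
    """Q under an arbitrary (possibly nonlinear) reward from a discretized SFR.

    xi            : (S, A, K) discounted density over K feature bins
    reward_fn     : callable mapping a feature vector to a scalar reward
    feature_atoms : (K, D) representative feature vector of each bin
    """
    r_atoms = np.array([reward_fn(f) for f in feature_atoms])   # (K,)
    return float(xi[s, a] @ r_atoms)        # Q(s, a) = sum_k xi(s, a, k) R(f_k)

def sfr_bellman_backup(xi, P, pi, phi_bin, gamma):
    """One SFR Bellman backup for a fixed deterministic policy (tabular).

    phi_bin : (S, A) integer bin index of the immediate feature phi(s, a).
    """
    S, A, K = xi.shape
    delta = np.eye(K)[phi_bin]                          # (S, A, K) one-hot at observed feature
    xi_next = xi[np.arange(S), pi]                      # (S, K)
    return delta + gamma * np.einsum("sax,xk->sak", P, xi_next)
```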
7. Empirical Impact, Robustness, and Open Questions
SF-based algorithms yield substantial gains in sample efficiency, transfer speed, and policy reuse across discrete/continuous, observation-rich, and noisy domains (Barreto et al., 2016, Chua et al., 2024, Lee, 2023). SFs demonstrate superior noise resilience relative to alternatives and maintain efficiency in high-dimensional robotics and visual navigation benchmarks.
Open research challenges and current frontiers include:
- Extending universal SFs to real-world, temporally extended, or language-conditioned tasks (beyond linear reward maps) (Carvalho et al., 2023, Carvalho et al., 2023).
- Understanding the tradeoffs in base feature selection and optimizing representations for diverse, possibly nonlinear or safety-critical, downstream objectives (Ollivier, 15 Feb 2025, Feng et al., 2022).
- Scalable, robust SF learning from raw observations and in the presence of partial observability or non-i.i.d. data streams (Dzhivelikian et al., 2023, Hoang et al., 2021).
- Integrating contrastive and other auxiliary objectives for richer, non-collapsing state representations (Chua et al., 2024).
Together, Successor Features constitute a mathematically principled and empirically validated approach to knowledge transfer in RL, providing a backbone for skill reuse, modular policy synthesis, and continual adaptation in both classic and deep RL settings.