Successor Feature Representations in RL

Updated 12 May 2026

Successor Feature Representations (SFR) are a formalism in reinforcement learning that decomposes the action-value function into dynamics-dependent successor features and reward weights.
They enable efficient transfer and zero-shot adaptation across tasks by separating invariant environmental dynamics from task-specific rewards.
SFR leverages algorithmic advances such as temporal-difference learning, deep neural networks, and risk-aware methods to enhance scalability and stability in complex domains.

Successor Feature Representations (SFR) are a central formalism in reinforcement learning (RL) for constructing value-predictive, transferable representations by decomposing expected returns into dynamics-dependent and reward-dependent components. SFR generalizes the notions of Successor Representations (SR) and Successor Features (SF), enhancing transferability and compositionality across tasks with varying reward functions or even environmental dynamics. This article reviews the foundational principles, mathematical structures, algorithmic instantiations, theoretical properties, empirical observations, and research challenges associated with SFR.

1. Mathematical Foundations and Formal Definition

The core principle of SFR is the decomposition of the action-value function $Q^\pi(s,a)$ for any policy $\pi$ in a Markov Decision Process (MDP) $M = \langle \mathcal{S}, \mathcal{A}, p, r, \gamma \rangle$ into two parts: a "successor feature" $\psi^\pi(s,a)$ that captures the discounted occupancy of certain features under the policy, and a "reward weight" $w$, which encodes the task-specific reward signal.

When the reward is linear in a feature map $\phi: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^d$,

$r(s,a,s') = \phi(s,a,s')^\top w,$

the successor feature is defined as

$\psi^\pi(s,a) = \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t \phi(s_t, a_t, s_{t+1}) \mid s_0 = s, a_0 = a \right].$

The action-value function then factorizes as

$Q^\pi(s,a) = \psi^\pi(s,a)^\top w.$

Because the discounted sum is recursive, $\psi^\pi$ satisfies a Bellman-style equation:

$\psi^\pi(s,a) = \mathbb{E}_{s' \sim p(\cdot|s,a)}\left[ \phi(s,a,s') + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot|s')}[\psi^\pi(s',a')] \right].$

In the more general SFR formulation, one can represent $Q^\pi(s,a)$ for arbitrary, potentially nonlinear reward functions $R(\phi)$ as

$Q^\pi(s,a) = \int_{\Phi} \xi^\pi(s,a,\phi)\, R(\phi)\, d\phi,$

where $\xi^\pi(s,a,\phi)$ is the discounted occupancy probability of feature $\phi$ under $\pi$ (Reinke et al., 2021).
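The linear factorization can be checked numerically on a small example. The sketch below is illustrative only: the two-state MDP, the one-hot feature map, the policy `PI`, and all function names are assumptions made for this example, not taken from any cited paper. It iterates the Bellman-style recursion for the successor features once, then compares $\psi^\pi(s,a)^\top w$ against ordinary policy evaluation for two different weight vectors.

```python
GAMMA = 0.9
STATES, ACTIONS, D = (0, 1), (0, 1), 2
PI = {0: [0.5, 0.5], 1: [0.2, 0.8]}  # fixed stochastic policy PI[s][a] (hypothetical)


def step(s, a):
    """Deterministic toy dynamics: action 0 stays, action 1 switches state."""
    return s if a == 0 else 1 - s


def phi(s, a, s_next):
    """One-hot feature of the landing state."""
    return [1.0 if k == s_next else 0.0 for k in range(D)]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def successor_features(n_iters=500):
    """Iterate psi <- phi + gamma * E_{a' ~ pi}[psi(s', a')] to its fixed point."""
    psi = {(s, a): [0.0] * D for s in STATES for a in ACTIONS}
    for _ in range(n_iters):
        new_psi = {}
        for (s, a) in psi:
            s2 = step(s, a)
            f = phi(s, a, s2)
            new_psi[(s, a)] = [
                f[k] + GAMMA * sum(PI[s2][a2] * psi[(s2, a2)][k] for a2 in ACTIONS)
                for k in range(D)
            ]
        psi = new_psi
    return psi


def q_direct(w, n_iters=500):
    """Ordinary policy evaluation on r = phi^T w, used only as a cross-check."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(n_iters):
        new_q = {}
        for (s, a) in q:
            s2 = step(s, a)
            r = dot(phi(s, a, s2), w)
            new_q[(s, a)] = r + GAMMA * sum(PI[s2][a2] * q[(s2, a2)] for a2 in ACTIONS)
        q = new_q
    return q


psi = successor_features()
# psi is computed once; each new task only changes w.
for w in ([1.0, 0.0], [-0.5, 2.0]):
    q = q_direct(w)
    gap = max(abs(dot(psi[k], w) - q[k]) for k in psi)
    print(w, "max |psi^T w - Q| =", gap)
```

Because the features here are one-hot, the components of each $\psi^\pi(s,a)$ sum to $1/(1-\gamma)$, which is a convenient sanity check on the recursion.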

2. Algorithmic Implementations and Variants

A variety of algorithms have been proposed to estimate SFR from data, leveraging temporal-difference (TD) style learning, deep neural network function approximation, and convex-analytic constructs.

Tabular and Model-free Algorithms

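Jointly minimize a reward regression loss $L_r(w) = \left( r - \phi(s,a,s')^\top w \right)^2$ and a successor-feature TD loss:

$L_\psi = \left\| \phi(s,a,s') + \gamma\, \psi^\pi(s',a') - \psi^\pi(s,a) \right\|_2^2$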

(Lehnert et al., 2017).

In modern deep RL, the policy-dependent successor features $\psi^\pi$ and the feature map $\phi$ are learned via backpropagation; the reward weights $w$ are typically fit by least squares.
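The joint update of reward weights and successor features can be sketched in tabular form. The example below is a minimal illustration under stated assumptions, not the procedure of Lehnert et al.: the three-state ring environment, the always-move-right policy, the step sizes, and the hidden weights `W_TRUE` are all hypothetical. Each sampled transition drives one SGD step on the reward regression loss and one TD(0) step on the successor features.

```python
import random

GAMMA, ALPHA_PSI, ALPHA_W = 0.9, 0.1, 0.1
STATES, ACTIONS, D = (0, 1, 2), (0, 1), 3
W_TRUE = [0.0, 1.0, -1.0]  # hidden reward weights of a hypothetical task


def step(s, a):
    """Ring of 3 states: action 0 moves left, action 1 moves right."""
    return (s + (1 if a == 1 else -1)) % 3


def phi(s, a, s_next):
    """One-hot encoding of the landing state."""
    return [1.0 if k == s_next else 0.0 for k in range(D)]


def policy(s):
    """Deterministic policy being evaluated: always move right."""
    return 1


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def train(n_steps=20000, seed=0):
    rng = random.Random(seed)
    psi = {(s, a): [0.0] * D for s in STATES for a in ACTIONS}
    w = [0.0] * D
    for _ in range(n_steps):
        s, a = rng.choice(STATES), rng.choice(ACTIONS)  # exploring starts
        s2 = step(s, a)
        f = phi(s, a, s2)
        r = dot(f, W_TRUE)  # observed reward
        # Reward regression: one SGD step on (phi^T w - r)^2 / 2.
        err = dot(f, w) - r
        w = [w[k] - ALPHA_W * err * f[k] for k in range(D)]
        # SF TD(0): move psi(s, a) toward phi + gamma * psi(s', pi(s')).
        nxt = psi[(s2, policy(s2))]
        old = psi[(s, a)]
        psi[(s, a)] = [
            old[k] + ALPHA_PSI * (f[k] + GAMMA * nxt[k] - old[k]) for k in range(D)
        ]
    return psi, w


psi, w = train()
print("learned w:", [round(x, 3) for x in w])
print("Q(0, right) =", round(dot(psi[(0, 1)], w), 3))
```

The dynamics and the evaluated policy are deterministic here so that the updates carry no sampling noise; with stochastic transitions the same updates apply but converge only in expectation under suitably decaying step sizes.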

Full-Gradient and Distributional Approaches

Semi-gradient TD methods compute updates only through predictions, not targets, but can be unstable under function approximation. Full-gradient SFR algorithms (FG-SFRQL) optimize the full Mean Squared Bellman Error (MSBE) by back-propagating through both prediction and target, yielding convergence guarantees in multi-task and non-linear settings (Shrirao et al., 1 Apr 2026).
Distributional SFR approximators, such as the Categorical Successor Feature Approximator (CSFA), model each SF dimension as a distribution over bins and use cross-entropy losses, improving stability in long-horizon and high-variance domains (Carvalho et al., 2023).
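The difference between semi-gradient and full-gradient updates is easiest to see for a single transition and a linear approximator. The following sketch is a generic residual-gradient illustration, not the FG-SFRQL algorithm itself; the linear form $\psi_\theta(x) = \theta^\top x$, the single scalar SF dimension, and all numbers are assumptions. It computes both gradients of $\tfrac{1}{2}\delta^2$ and confirms by finite differences that only the full gradient is the true gradient of the squared Bellman error.

```python
GAMMA = 0.9


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def td_error(theta, x, x_next, cumulant):
    """delta = cumulant + gamma * psi(x') - psi(x), with linear psi(x) = theta . x."""
    return cumulant + GAMMA * dot(theta, x_next) - dot(theta, x)


def semi_gradient(theta, x, x_next, cumulant):
    """Gradient of 0.5 * delta^2 with the bootstrap target held constant."""
    d = td_error(theta, x, x_next, cumulant)
    return [-d * xi for xi in x]


def full_gradient(theta, x, x_next, cumulant):
    """Gradient of 0.5 * delta^2 through both the prediction and the target."""
    d = td_error(theta, x, x_next, cumulant)
    return [d * (GAMMA * xn - xi) for xi, xn in zip(x, x_next)]


def numeric_gradient(theta, x, x_next, cumulant, eps=1e-6):
    """Central finite differences of the loss 0.5 * delta^2 with respect to theta."""
    grads = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        loss_up = 0.5 * td_error(up, x, x_next, cumulant) ** 2
        loss_down = 0.5 * td_error(down, x, x_next, cumulant) ** 2
        grads.append((loss_up - loss_down) / (2 * eps))
    return grads


# Hypothetical single transition with 2-dimensional state features.
theta, x, x_next, cumulant = [0.5, -1.0], [1.0, 2.0], [0.0, 1.0], 1.0
print("semi:", semi_gradient(theta, x, x_next, cumulant))
print("full:", full_gradient(theta, x, x_next, cumulant))
print("numeric:", numeric_gradient(theta, x, x_next, cumulant))
```

With stochastic transitions, an unbiased estimate of the full MSBE gradient generally needs two independent next-state samples per update, which this deterministic single-sample sketch does not address.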

Modularity and Compositionality

Modular Successor Feature Approximators (MSFA) and the Successor Features Keyboard (SFK) architectures partition $\phi$ and $\psi^\pi$ into loosely coupled modules or combine them dynamically for zero-shot compositional transfer (Carvalho et al., 2023, Carvalho et al., 2023).
Temporal Representation Alignment (TRA) leverages a contrastive alignment loss between "instantaneous" and "future" feature representations to elicit compositional structure for robotics instruction following (Myers et al., 8 Feb 2025).

Generalization Beyond Linear Rewards

The SFR density approach (Reinke et al., 2021) extends classic SF to general reward functions $R(\phi)$ by learning the full discounted feature occupancy $\xi^\pi(s,a,\phi)$, allowing transfer and evaluation for arbitrary reward mappings on features.
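For a finite feature set, the density formulation reduces to a table of discounted feature-occupancy masses, and the integral becomes a sum. The sketch below is a discrete toy analogue rather than the method of Reinke et al.: the ring environment, the scalar feature values in `FEATURES`, the policy, and the reward functions are all hypothetical. It learns the occupancy once and then evaluates a reward that is nonlinear in the feature, which a single linear weight on that scalar feature could not express.

```python
GAMMA = 0.9
STATES, ACTIONS = (0, 1, 2), (0, 1)
FEATURES = (-1.0, 0.0, 2.0)  # scalar feature observed on landing in state 0, 1, 2


def step(s, a):
    """Ring of 3 states: action 0 moves left, action 1 moves right."""
    return (s + (1 if a == 1 else -1)) % 3


def policy(s):
    """Deterministic policy being evaluated: always move right."""
    return 1


def sfr_density(n_iters=500):
    """xi[(s, a)][j]: discounted probability mass of observing FEATURES[j]."""
    xi = {(s, a): [0.0] * len(FEATURES) for s in STATES for a in ACTIONS}
    for _ in range(n_iters):
        new_xi = {}
        for (s, a) in xi:
            s2 = step(s, a)
            nxt = xi[(s2, policy(s2))]
            new_xi[(s, a)] = [
                (1.0 if j == s2 else 0.0) + GAMMA * nxt[j]
                for j in range(len(FEATURES))
            ]
        xi = new_xi
    return xi


def q_from_density(xi, reward_fn):
    """Discrete analogue of Q = integral of xi(s, a, phi) * R(phi) d phi."""
    return {
        k: sum(m * reward_fn(f) for m, f in zip(mass, FEATURES))
        for k, mass in xi.items()
    }


def q_direct(reward_fn, n_iters=500):
    """Ordinary policy evaluation on r = R(phi), used only as a cross-check."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(n_iters):
        new_q = {}
        for (s, a) in q:
            s2 = step(s, a)
            new_q[(s, a)] = reward_fn(FEATURES[s2]) + GAMMA * q[(s2, policy(s2))]
        q = new_q
    return q


def goal_reward(f):
    """Nonlinear in the scalar feature: largest at f = 0."""
    return -(f ** 2)


xi = sfr_density()
q_sfr = q_from_density(xi, goal_reward)
q_ref = q_direct(goal_reward)
print("max gap:", max(abs(q_sfr[k] - q_ref[k]) for k in q_sfr))
```

The same `xi` table can be reused with any other `reward_fn` over the feature values, which is the transfer property the density formulation is meant to provide.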

3. Theoretical Properties and Guarantees

SFR yields several key theoretical properties directly traceable to the structure of RL value decompositions:

Transferability: Once $\psi^\pi$ is learned, optimizing for a new reward $r$ requires only solving a regression for, or updating, $w$, leading to immediate or rapid adaptation (Barreto et al., 2016, Lehnert et al., 2017).
Generalized Policy Improvement (GPI): Given a set of source policies $\{\pi_i\}_{i=1}^n$, the GPI policy maximizes over policies $i$ and actions $a$:

$\pi_{\mathrm{GPI}}(s) \in \arg\max_{a} \max_{i} \psi^{\pi_i}(s,a)^\top w,$

with theoretical performance lower bounds related to the minimal feature-space distance between the target $w$ and source $w_i$ (Barreto et al., 2016).

Convergence and Fixed-point Theorems: Both classic SF and generalized SFR policies are shown to converge under appropriate contractive Bellman or Bellman-like operators (Reinke et al., 2021).
Bisimulation and Reward-Predictive Representations: Feature maps that enable accurate one-step reward and transition prediction inherently satisfy bisimulation properties, enabling exact value transfer when underlying state abstractions are preserved (Lehnert et al., 2018, Lehnert et al., 2019).
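The GPI rule listed above amounts to a nested maximization over source policies and actions. A minimal sketch follows, assuming the successor features of each source policy are already available as lookup tables; the table contents, the function name `gpi_action`, and the task weights are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gpi_action(psi_by_policy, state, actions, w):
    """Return (argmax_a max_i psi_i(state, a)^T w, attained value).

    psi_by_policy: list of dicts mapping (state, action) -> SF vector,
    one dict per source policy. Ties keep the earliest action.
    """
    best_action, best_value = None, float("-inf")
    for a in actions:
        value = max(dot(psi[(state, a)], w) for psi in psi_by_policy)
        if value > best_value:
            best_action, best_value = a, value
    return best_action, best_value


# Hypothetical successor features of two source policies at state 0.
psi_1 = {(0, 0): [5.0, 1.0], (0, 1): [1.0, 2.0]}
psi_2 = {(0, 0): [0.5, 3.0], (0, 1): [2.0, 4.0]}

for w in ([1.0, 0.0], [0.0, 1.0]):
    print(w, gpi_action([psi_1, psi_2], 0, (0, 1), w))
```

With these numbers, a task that rewards only the first feature selects action 0 using the first policy's successor features, while a task that rewards only the second feature selects action 1 using the second policy's, so the chosen action changes with $w$ without any relearning.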

4. Transfer, Compositionality, and Risk-Aware Learning

SFR-based methods are especially effective for zero-shot and efficient transfer across tasks:

Structure for Transfer: By decoupling $\psi^\pi$ (dynamics-related) from $w$ (reward), new tasks (new $w$) exploit existing predictions without re-estimating $\psi^\pi$ or exploring from scratch. Empirically, this gives near "zero-shot" transfer when new rewards are within the span of $\phi$ (Lehnert et al., 2017, Barreto et al., 2016, Carvalho et al., 2023).
Risk and Uncertainty: Risk-aware SFRs expand the representation to include second-moment/covariance components, supporting transfer and policy improvement under risk-sensitive objectives via entropic utility (Gimelfarb et al., 2021).
Compositionality: Temporal alignment or modular SFR architectures yield representations that can be composed for generalization to new, compound or multi-step tasks, even in the absence of explicit hierarchical planning mechanisms (Myers et al., 8 Feb 2025, Carvalho et al., 2023, Carvalho et al., 2023).
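The risk-aware variant mentioned above can be illustrated with a mean-variance score. The sketch relies on the standard second-order expansion of the entropic utility, $\frac{1}{\beta}\log \mathbb{E}[e^{\beta X}] \approx \mathbb{E}[X] + \frac{\beta}{2}\mathrm{Var}[X]$, applied to a return of the form $X = \Psi^\top w$ with mean $\psi^\top w$ and variance $w^\top \Sigma w$. It is an assumption-laden illustration and not necessarily the exact objective used by Gimelfarb et al.; the candidate actions, covariances, and function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def quad_form(w, cov):
    """w^T cov w for a square matrix given as nested lists."""
    n = len(w)
    return sum(w[i] * cov[i][j] * w[j] for i in range(n) for j in range(n))


def risk_adjusted_value(psi, cov, w, beta):
    """Mean-variance score: psi^T w + (beta / 2) * w^T cov w.

    beta < 0 penalizes return variance (risk-averse); beta = 0 is risk-neutral.
    """
    return dot(psi, w) + 0.5 * beta * quad_form(w, cov)


def pick_action(candidates, w, beta):
    """candidates maps action -> (SF mean vector, SF covariance matrix)."""
    return max(
        candidates,
        key=lambda a: risk_adjusted_value(candidates[a][0], candidates[a][1], w, beta),
    )


# Hypothetical first and second moments of the successor features for two actions.
CANDIDATES = {
    "risky": ([1.0, 1.0], [[4.0, 0.0], [0.0, 4.0]]),
    "safe": ([0.9, 0.9], [[0.1, 0.0], [0.0, 0.1]]),
}
W = [1.0, 1.0]

print("risk-neutral choice:", pick_action(CANDIDATES, W, beta=0.0))
print("risk-averse choice:", pick_action(CANDIDATES, W, beta=-0.5))
```

Because both the mean and the variance terms are functions of $w$ applied to stored moments, a new task changes only $w$, so the risk-adjusted score transfers in the same way as the risk-neutral value.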

Table: Key SFR Capabilities Across Research Lines

Property	SFR/SF Classical	SFR Density/Distributional	Modular/Compositional
Linear reward transfer	✓	✓	✓
Nonlinear reward transfer	×	✓	✓ (if features suffice)
Policy composition	Limited (GPI)	✓ (density SFR)	✓ (MSFA, TRA)
Risk-awareness	×	✓ (RaSF)	×

5. Empirical Observations and Applications

Key empirical findings from the literature demonstrate that:

SFR-based methods generally match or surpass baseline approaches for transfer between related tasks in gridworld navigation, continuous robotic control, and high-dimensional domains, especially when reward functions change (Lehnert et al., 2017, Chua et al., 2024, Shrirao et al., 1 Apr 2026).
Modular SFR and distributional SFR learning are required for scalability, stability, and effective transfer in complex environments such as 3D manipulation or multi-stage tasks (Carvalho et al., 2023, Carvalho et al., 2023).
SFR enables rapid adaptation in continual learning and domain switch scenarios, with lower sample complexity and less catastrophic forgetting than conventional RL methods (Chua et al., 2024).
Neural-network-approximated SFRs in both spatial and non-spatial domains yield internal representations reminiscent of biological place and grid cells, aligning with findings in systems neuroscience (Stoewer et al., 2022, Vertes et al., 2019).

6. Current Limitations and Research Challenges

Despite the strengths of SFR, several limitations are identified:

Policy Dependence: $\psi^\pi$ is tied to the policy under which it was learned; major changes in optimal policy (due to $w$ or $p$ changes) undermine the quality of transfer (Lehnert et al., 2017).
Feature Expressivity: If $\phi$ does not span all relevant reward functions, transfer is bounded by the approximation error $\epsilon = \max_{s,a,s'} |r(s,a,s') - \phi(s,a,s')^\top w|$ (Reinke et al., 2021).
Scaling: In large-scale or partially observable environments, learning or representing all relevant features for SFR remains challenging. Distributional and modular techniques alleviate but do not fully resolve scalability issues (Carvalho et al., 2023, Carvalho et al., 2023, Vertes et al., 2019).
Exploration/Exploitation Decoupling: In unsupervised pre-training, unifying intrinsic rewards can interfere with the SF–reward factorization required for subsequent transfer, motivating non-monolithic architectures (Kim et al., 2024).
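The feature-expressivity limitation above can be made quantitative with a short worked bound. This is a standard argument sketched under the assumption that the linear reward model is off by at most $\epsilon$ on every transition; it is not quoted from the cited papers, and $\epsilon$ and $r_t$, $\phi_t$ (the reward and feature at step $t$) are notation introduced here.

```latex
\begin{aligned}
&\text{Assume } \left| r(s,a,s') - \phi(s,a,s')^\top w \right| \le \epsilon
  \quad \text{for all } (s,a,s'). \\
&\left| Q^\pi(s,a) - \psi^\pi(s,a)^\top w \right|
  = \left| \mathbb{E}_\pi\!\left[ \sum_{t=0}^{\infty} \gamma^t
      \left( r_t - \phi_t^\top w \right) \,\middle|\, s_0 = s,\ a_0 = a \right] \right| \\
&\qquad \le \sum_{t=0}^{\infty} \gamma^t \epsilon
  = \frac{\epsilon}{1-\gamma}.
\end{aligned}
```

For example, with $\gamma = 0.9$ a per-step reward error of $0.01$ can inflate to a value error of up to $0.1$, which is why rewards outside the span of $\phi$ degrade transfer even when $\psi^\pi$ is learned exactly.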

7. Extensions and Open Research Directions

Recent developments and open problems in SFR research invite further investigation:

Task- and Policy-Invariant Representations: SFR variants that abstract away from fixed policies—to achieve greater generality or compositional power across large policy sets—are emerging (Brantley et al., 2021, Reinke et al., 2021).
Learned Cumulants and Self-Discovery: Automatic discovery of useful cumulant and feature spaces via deep architectures is shown to be more robust and efficient than hand-designed feature maps (Carvalho et al., 2023, Carvalho et al., 2023).
Integration with Hierarchical and Option-based RL: Embedding SFR in hierarchical frameworks, including options and sub-task planning, is an open problem for scaling to more complex tasks (Lehnert et al., 2017).
Risk-sensitive, Distributional, and Continual SFR: Extending SFR to encode higher moments, full return distributions, or continual adaptation in non-stationary environments remains an active area of research (Gimelfarb et al., 2021, Chua et al., 2024).
Neural and Cognitive Modeling: SFRs provide a normative substrate for hippocampal and entorhinal representations, suggesting directions for biologically plausible models and AI architectures (Vertes et al., 2019, Stoewer et al., 2022).

In summary, Successor Feature Representations provide a mathematically principled and empirically validated architecture for abstractions in RL that support rapid transfer, compositionality, risk-sensitivity, and biological plausibility. Research continues to extend their expressivity and stability in complex, high-dimensional, and transfer-rich domains, spanning both artificial agents and neural systems.