
Successor Feature Representations in RL

Updated 12 May 2026
  • Successor Feature Representations (SFR) are a formalism in reinforcement learning that decomposes the action-value function into dynamics-dependent successor features and reward weights.
  • They enable efficient transfer and zero-shot adaptation across tasks by separating invariant environmental dynamics from task-specific rewards.
  • SFR leverages algorithmic advances such as temporal-difference learning, deep neural networks, and risk-aware methods to enhance scalability and stability in complex domains.

Successor Feature Representations (SFR) constitute a central formalism in reinforcement learning (RL) for constructing value-predictive and transferable representations by decomposing expected returns into dynamics-dependent and reward-dependent components. SFR generalizes the notions of Successor Representations (SR) and Successor Features (SF), enhancing transferability and compositionality across tasks with varying reward functions or even environmental dynamics. This article reviews the foundational principles, mathematical structures, algorithmic instantiations, theoretical properties, empirical observations, and research challenges associated with SFR.

1. Mathematical Foundations and Formal Definition

The core principle of SFR is the decomposition of the action-value function $Q^\pi(s,a)$ for any policy $\pi$ in a Markov Decision Process (MDP) $M = \langle \mathcal{S}, \mathcal{A}, p, r, \gamma \rangle$ into two parts: a "successor feature" $\psi^\pi(s,a)$ that captures the discounted occupancy of certain features under the policy and a "reward weight" $w$, which encodes the task-specific reward signal.

When the reward is linear in a feature map $\phi: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^d$,

$$r(s,a,s') = \phi(s,a,s')^\top w,$$

the successor feature is defined as

$$\psi^\pi(s,a) = \mathbb{E}_\pi \left[ \sum_{t=0}^\infty \gamma^t \phi(s_t, a_t, s_{t+1}) \mid s_0 = s, a_0 = a \right].$$

The action-value function then factorizes as

$$Q^\pi(s,a) = \psi^\pi(s,a)^\top w.$$

The definition of the successor feature is recursive, which leads to the Bellman-style equation

$$\psi^\pi(s,a) = \mathbb{E}_{s' \sim p(\cdot|s,a),\, a' \sim \pi(\cdot|s')}\left[ \phi(s,a,s') + \gamma\, \psi^\pi(s',a') \right].$$

In the more general SFR formulation, one can represent $Q^\pi(s,a)$ for arbitrary, potentially nonlinear reward functions $R(\phi)$ as

$$Q^\pi(s,a) = \int_{\Phi} \xi^\pi(s,a,\phi)\, R(\phi)\, d\phi,$$

where $\xi^\pi(s,a,\phi)$ is the discounted occupancy probability of feature $\phi$ under $\pi$ (Reinke et al., 2021).
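As a minimal sketch of the linear case, the code below iterates the Bellman-style equation to its fixed point on a toy deterministic MDP and compares $\psi^\pi(s,a)^\top w$ with ordinary policy evaluation on the scalar reward. The two-state MDP, the one-hot feature map, the "always switch" policy, and the weight vector are all illustrative assumptions and do not come from the cited papers.

```python
# Toy deterministic MDP; every number here is an illustrative assumption.
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)

def next_state(s, a):
    """Action 0 stays in place, action 1 switches to the other state."""
    return s if a == 0 else 1 - s

def phi(s, a, s_next):
    """One-hot feature on the state that is entered."""
    return [1.0 if s_next == i else 0.0 for i in STATES]

def policy(s):
    """Fixed policy to evaluate: always switch."""
    return 1

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def successor_features(n_iters=500):
    """Iterate psi <- phi + gamma * psi(s', pi(s')) until it stops changing."""
    psi = {(s, a): [0.0] * len(STATES) for s in STATES for a in ACTIONS}
    for _ in range(n_iters):
        new_psi = {}
        for (s, a) in psi:
            s2 = next_state(s, a)
            nxt = psi[(s2, policy(s2))]
            new_psi[(s, a)] = [f + GAMMA * p for f, p in zip(phi(s, a, s2), nxt)]
        psi = new_psi
    return psi

def q_direct(w, n_iters=500):
    """Ordinary policy evaluation on the scalar reward r = phi^T w."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(n_iters):
        new_q = {}
        for (s, a) in q:
            s2 = next_state(s, a)
            new_q[(s, a)] = dot(phi(s, a, s2), w) + GAMMA * q[(s2, policy(s2))]
        q = new_q
    return q

psi = successor_features()
w = [1.0, -0.5]  # hypothetical reward weights for one task
q_sf = {sa: dot(vec, w) for sa, vec in psi.items()}  # Q = psi^T w
q_ref = q_direct(w)
```

In this sketch `q_sf` and `q_ref` agree up to floating-point error, and changing `w` only requires recomputing the dot products, not `psi`.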

2. Algorithmic Implementations and Variants

A variety of algorithms have been proposed to estimate SFR from data, leveraging temporal-difference (TD) style learning, deep neural network function approximation, and convex-analytic constructs.

Tabular and Model-free Algorithms

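  • Jointly minimize a reward regression loss $\mathcal{L}_r = \left( \phi(s,a,s')^\top w - r \right)^2$ and a successor-feature TD loss:

$$\mathcal{L}_\psi = \left\| \phi(s,a,s') + \gamma\, \psi^\pi(s',a') - \psi^\pi(s,a) \right\|^2$$

(Lehnert et al., 2017).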

  • In modern deep RL, the policy-dependent $\psi^\pi$ and the features $\phi$ are learned via backpropagation; the reward model weights $w$ are typically fit by least squares.
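The tabular, model-free recipe above can be sketched as two interleaved updates: a stochastic-gradient step on the reward regression loss and a semi-gradient TD(0) step on the successor features. The toy MDP, the step sizes, the uniform sampling of state-action pairs, and the names `learn` and `W_TRUE` are assumptions made for illustration only.

```python
import random

GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)
W_TRUE = [1.0, -0.5]  # hypothetical task: reward for entering each state

def next_state(s, a):
    """Action 0 stays in place, action 1 switches to the other state."""
    return s if a == 0 else 1 - s

def phi(s_next):
    """One-hot feature on the state that is entered."""
    return [1.0 if s_next == i else 0.0 for i in STATES]

def policy(s):
    """Fixed policy whose successor features are learned: always switch."""
    return 1

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def learn(n_steps=20000, alpha_psi=0.1, alpha_w=0.1, seed=0):
    rng = random.Random(seed)
    d = len(STATES)
    psi = {(s, a): [0.0] * d for s in STATES for a in ACTIONS}
    w = [0.0] * d
    for _ in range(n_steps):
        # Sample a transition (s, a, r, s') from the toy environment.
        s, a = rng.choice(STATES), rng.choice(ACTIONS)
        s2 = next_state(s, a)
        f = phi(s2)
        r = dot(f, W_TRUE)
        # Reward regression: gradient step on 0.5 * (phi^T w - r)^2.
        err = dot(f, w) - r
        w = [wk - alpha_w * err * fk for wk, fk in zip(w, f)]
        # Semi-gradient TD(0) on psi: the bootstrapped target is held fixed.
        target = [fk + GAMMA * pk for fk, pk in zip(f, psi[(s2, policy(s2))])]
        psi[(s, a)] = [p + alpha_psi * (t - p) for p, t in zip(psi[(s, a)], target)]
    return psi, w

psi_hat, w_hat = learn()
```

Because this toy environment is deterministic, both estimates settle on their exact values; with stochastic transitions a decaying step size would normally be needed.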

Full-Gradient and Distributional Approaches

  • Semi-gradient TD methods compute updates only through predictions, not targets, but can be unstable under function approximation. Full-gradient SFR algorithms (FG-SFRQL) optimize the full Mean Squared Bellman Error (MSBE) by back-propagating through both prediction and target, yielding convergence guarantees in multi-task and non-linear settings (Shrirao et al., 1 Apr 2026).
  • Distributional SFR approximators, such as the Categorical Successor Feature Approximator (CSFA), model each SF dimension as a distribution over bins and use cross-entropy losses, improving stability in long-horizon and high-variance domains (Carvalho et al., 2023).
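To make the semi-gradient versus full-gradient distinction concrete, the sketch below uses a one-parameter linear approximator $\psi_\theta(x) = \theta x$ and a single transition. It is not the FG-SFRQL algorithm itself; the scalar setup, the function names, and the numbers are assumptions chosen so the two gradients can be checked by hand.

```python
def bellman_error(theta, x, x_next, c, gamma):
    """delta = c + gamma * psi(x') - psi(x) for the linear model psi(x) = theta * x."""
    return c + gamma * theta * x_next - theta * x

def loss(theta, x, x_next, c, gamma):
    """Squared Bellman error on one transition."""
    return 0.5 * bellman_error(theta, x, x_next, c, gamma) ** 2

def semi_gradient(theta, x, x_next, c, gamma):
    """Differentiate through the prediction only; the target is a constant."""
    return -bellman_error(theta, x, x_next, c, gamma) * x

def full_gradient(theta, x, x_next, c, gamma):
    """Differentiate through both the prediction and the bootstrapped target."""
    return bellman_error(theta, x, x_next, c, gamma) * (gamma * x_next - x)

# One hypothetical transition: cumulant c, current feature x, next feature x_next.
args = dict(theta=0.5, x=1.0, x_next=2.0, c=1.0, gamma=0.9)
g_semi = semi_gradient(**args)
g_full = full_gradient(**args)

# Central finite difference of the loss, to confirm which one is its true gradient.
h = 1e-6
shifted_up = dict(args, theta=args["theta"] + h)
shifted_down = dict(args, theta=args["theta"] - h)
g_numeric = (loss(**shifted_up) - loss(**shifted_down)) / (2 * h)
```

On this transition the two gradients even differ in sign, and only `g_full` matches the finite-difference gradient of the squared Bellman error.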

Modularity and Compositionality

  • Modular Successor Feature Approximators (MSFA) and the Successor Features Keyboard (SFK) architectures partition $\phi$ and $\psi$ into loosely coupled modules or combine them dynamically for zero-shot compositional transfer (Carvalho et al., 2023, Carvalho et al., 2023).
  • Temporal Representation Alignment (TRA) leverages a contrastive alignment loss between "instantaneous" and "future" feature representations to elicit compositional structure for robotics instruction following (Myers et al., 8 Feb 2025).

Generalization Beyond Linear Rewards

  • The SFR density approach (Reinke et al., 2021) extends classic SF to general reward functions $R(\phi)$ by learning the full discounted occupancy $\xi^\pi(s,a,\phi)$, allowing transfer and evaluation for arbitrary reward mappings on features.
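For a finite set of feature values the occupancy integral reduces to a sum, which the following sketch evaluates on a toy deterministic MDP with a reward that is nonlinear in a scalar feature. The MDP, the feature values, and the reward function are illustrative assumptions, and the tabular recursion is a simplification rather than the learning procedure of Reinke et al. (2021).

```python
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)
FEATURE_OF_STATE = {0: -1.0, 1: 2.0}  # scalar feature seen when entering a state
FEATURE_VALUES = tuple(FEATURE_OF_STATE.values())

def next_state(s, a):
    """Action 0 stays in place, action 1 switches to the other state."""
    return s if a == 0 else 1 - s

def policy(s):
    """Fixed policy to evaluate: always switch."""
    return 1

def occupancy(n_iters=500):
    """xi(s, a, f): discounted probability mass of observing feature value f."""
    xi = {(s, a): {f: 0.0 for f in FEATURE_VALUES} for s in STATES for a in ACTIONS}
    for _ in range(n_iters):
        new_xi = {}
        for (s, a) in xi:
            s2 = next_state(s, a)
            seen = FEATURE_OF_STATE[s2]
            nxt = xi[(s2, policy(s2))]
            new_xi[(s, a)] = {
                f: (1.0 if f == seen else 0.0) + GAMMA * nxt[f] for f in FEATURE_VALUES
            }
        xi = new_xi
    return xi

def q_from_occupancy(xi, reward_fn):
    """Q(s, a) = sum over feature values of xi(s, a, f) * R(f)."""
    return {sa: sum(m * reward_fn(f) for f, m in dist.items()) for sa, dist in xi.items()}

def q_direct(reward_fn, n_iters=500):
    """Ordinary policy evaluation on the scalar reward R(feature)."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(n_iters):
        new_q = {}
        for (s, a) in q:
            s2 = next_state(s, a)
            new_q[(s, a)] = reward_fn(FEATURE_OF_STATE[s2]) + GAMMA * q[(s2, policy(s2))]
        q = new_q
    return q

def square(f):
    """A reward that no single weight w can express as w * f on these features."""
    return f * f

xi = occupancy()
q_xi = q_from_occupancy(xi, square)
q_ref = q_direct(square)
```

The same `xi` can be reused with any other reward function on the features, which is the property the density formulation relies on.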

3. Theoretical Properties and Guarantees

SFR yields several key theoretical properties directly traceable to the structure of RL value decompositions:

  • Transferability: Once $\psi^\pi$ is learned, optimizing for a new reward $r'$ requires only solving a regression or updating $w$, leading to immediate or rapid adaptation (Barreto et al., 2016, Lehnert et al., 2017).
  • Generalized Policy Improvement (GPI): Given a set of source policies $\{\pi_i\}_{i=1}^n$, the GPI policy maximizes over source policies $i$ and actions $a$:

$$\pi_{\mathrm{GPI}}(s) \in \arg\max_{a} \max_{i} \psi^{\pi_i}(s,a)^\top w,$$

with theoretical performance lower bounds related to the minimal feature-space distance between the target $w$ and source $w_i$ (Barreto et al., 2016).

  • Convergence and Fixed-point Theorems: Both classic SF and generalized SFR policies are shown to converge under appropriate contractive Bellman or Bellman-like operators (Reinke et al., 2021).
  • Bisimulation and Reward-Predictive Representations: Feature maps that enable accurate one-step reward and transition prediction inherently satisfy bisimulation properties, enabling exact value transfer when underlying state abstractions are preserved (Lehnert et al., 2018, Lehnert et al., 2019).
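The GPI rule from the list above can be sketched in a few lines: compute successor features for each source policy once, then act greedily with respect to the best source value under the new task weights. The toy MDP, the two source policies, and the target weights `w_new` are illustrative assumptions.

```python
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)

def next_state(s, a):
    """Action 0 stays in place, action 1 switches to the other state."""
    return s if a == 0 else 1 - s

def phi(s_next):
    """One-hot feature on the state that is entered."""
    return [1.0 if s_next == i else 0.0 for i in STATES]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def successor_features(policy, n_iters=500):
    """Tabular successor features of a deterministic policy, by fixed-point iteration."""
    psi = {(s, a): [0.0] * len(STATES) for s in STATES for a in ACTIONS}
    for _ in range(n_iters):
        new_psi = {}
        for (s, a) in psi:
            s2 = next_state(s, a)
            nxt = psi[(s2, policy(s2))]
            new_psi[(s, a)] = [f + GAMMA * p for f, p in zip(phi(s2), nxt)]
        psi = new_psi
    return psi

def always_stay(s):
    return 0

def always_switch(s):
    return 1

def gpi_action(psis, w, s):
    """argmax over actions of the best source-policy value psi_i(s, a)^T w."""
    return max(ACTIONS, key=lambda a: max(dot(psi[(s, a)], w) for psi in psis))

# Successor features of the source policies are computed once and reused.
psis = [successor_features(always_stay), successor_features(always_switch)]
w_new = [0.0, 1.0]  # hypothetical new task: reward only for entering state 1
gpi_policy = {s: gpi_action(psis, w_new, s) for s in STATES}
```

In this toy case the resulting policy switches in state 0 and stays in state 1, which matches neither source policy, so the improvement comes purely from combining their successor features.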

4. Transfer, Compositionality, and Risk-Aware Learning

SFR-based methods are especially effective for zero-shot and efficient transfer across tasks:

  • Structure for Transfer: By decoupling $\psi^\pi$ (dynamics-related) from $w$ (reward), new tasks (new $w$) exploit existing predictions without re-estimating $\psi^\pi$ or exploring from scratch. Empirically, this gives near "zero-shot" transfer when new rewards are within the span of $\phi$ (Lehnert et al., 2017, Barreto et al., 2016, Carvalho et al., 2023).
  • Risk and Uncertainty: Risk-aware SFRs expand the representation to include second-moment/covariance components, supporting transfer and policy improvement under risk-sensitive objectives via entropic utility (Gimelfarb et al., 2021).
  • Compositionality: Temporal alignment or modular SFR architectures yield representations that can be composed for generalization to new, compound or multi-step tasks, even in the absence of explicit hierarchical planning mechanisms (Myers et al., 8 Feb 2025, Carvalho et al., 2023, Carvalho et al., 2023).
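One way to read the risk-aware item above is through the mean-variance approximation of the entropic utility, $\mathbb{E}[X] + \tfrac{\beta}{2}\mathrm{Var}[X]$, where the mean of the return is $\psi^\top w$ and its variance is $w^\top \Sigma w$ for a covariance $\Sigma$ of the discounted feature sums. The sketch below assumes this approximation and estimates both moments from hypothetical Monte Carlo samples; it is not the estimator used by Gimelfarb et al. (2021), and the sample values, `beta`, and function names are illustrative.

```python
def mean_vec(samples):
    """Empirical successor feature: mean of sampled discounted feature sums."""
    n, d = len(samples), len(samples[0])
    return [sum(x[k] for x in samples) / n for k in range(d)]

def cov_mat(samples):
    """Empirical (population) covariance of the discounted feature sums."""
    mu, n = mean_vec(samples), len(samples)
    d = len(mu)
    return [
        [sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in samples) / n for j in range(d)]
        for i in range(d)
    ]

def risk_adjusted_value(samples, w, beta):
    """Mean-variance value psi^T w + (beta / 2) * w^T Sigma w; beta < 0 is risk-averse."""
    psi, sigma = mean_vec(samples), cov_mat(samples)
    d = len(w)
    mean = sum(p * wk for p, wk in zip(psi, w))
    var = sum(w[i] * sigma[i][j] * w[j] for i in range(d) for j in range(d))
    return mean + 0.5 * beta * var

# Hypothetical sampled discounted feature sums for two actions in one state.
SAFE = [[1.0, 1.0], [1.0, 1.0], [1.2, 0.8], [0.8, 1.2]]
RISKY = [[4.0, 0.0], [0.0, 0.0], [4.0, 1.0], [0.0, 1.0]]
w = [1.0, 1.0]

def best_action(beta):
    values = {"safe": risk_adjusted_value(SAFE, w, beta),
              "risky": risk_adjusted_value(RISKY, w, beta)}
    return max(values, key=values.get)

risk_neutral_choice = best_action(beta=0.0)
risk_averse_choice = best_action(beta=-1.0)
```

With these numbers the risk-neutral agent prefers the higher-mean action while the risk-averse agent prefers the low-variance one, without re-estimating either moment when `w` or `beta` changes.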

Table: Key SFR Capabilities Across Research Lines

| Property | Classical SF/SFR | SFR Density/Distributional | Modular/Compositional |
|---|---|---|---|
| Linear reward transfer | ✓ | ✓ | ✓ |
| Nonlinear reward transfer | × | ✓ | ✓ (if features suffice) |
| Policy composition | Limited (GPI) | ✓ (density SFR) | ✓ (MSFA, TRA) |
| Risk-awareness | × | ✓ (RaSF) | × |

5. Empirical Observations and Applications

Key empirical findings from the literature demonstrate that:

6. Current Limitations and Research Challenges

Despite the strengths of SFR, several limitations are identified:

  • Policy Dependence: $\psi^\pi$ is tied to the policy under which it was learned; major changes in the optimal policy (for example, after changes in the reward or the dynamics) undermine the quality of transfer (Lehnert et al., 2017).
  • Feature Expressivity: If $\phi$ does not span all relevant reward functions, transfer is bounded by the resulting reward approximation error (Reinke et al., 2021).
  • Scaling: In large-scale or partially observable environments, learning or representing all relevant features for SFR remains challenging. Distributional and modular techniques alleviate but do not fully resolve scalability issues (Carvalho et al., 2023, Carvalho et al., 2023, Vertes et al., 2019).
  • Exploration/Exploitation Decoupling: In unsupervised pre-training, unifying intrinsic rewards can interfere with the SF–reward factorization required for subsequent transfer, motivating non-monolithic architectures (Kim et al., 2024).

7. Extensions and Open Research Directions

Recent developments and open problems in SFR research invite further investigation:

  • Task- and Policy-Invariant Representations: SFR variants that abstract away from fixed policies—to achieve greater generality or compositional power across large policy sets—are emerging (Brantley et al., 2021, Reinke et al., 2021).
  • Learned Cumulants and Self-Discovery: Automatic discovery of useful cumulant and feature spaces via deep architectures is shown to be more robust and efficient than hand-designed feature maps (Carvalho et al., 2023, Carvalho et al., 2023).
  • Integration with Hierarchical and Option-based RL: Embedding SFR in hierarchical frameworks, including options and sub-task planning, is an open problem for scaling to more complex tasks (Lehnert et al., 2017).
  • Risk-sensitive, Distributional, and Continual SFR: Extending SFR to encode higher moments, full return distributions, or continual adaptation in non-stationary environments remains an active area of research (Gimelfarb et al., 2021, Chua et al., 2024).
  • Neural and Cognitive Modeling: SFRs provide a normative substrate for hippocampal and entorhinal representations, suggesting directions for biologically plausible models and AI architectures (Vertes et al., 2019, Stoewer et al., 2022).

In summary, Successor Feature Representations provide a mathematically principled and empirically validated architecture for abstractions in RL that support rapid transfer, compositionality, risk-sensitivity, and biological plausibility. Research continues to extend their expressivity and stability in complex, high-dimensional, and transfer-rich domains, spanning both artificial agents and neural systems.
