Successor Representations in Reinforcement Learning

Updated 11 November 2025
  • Successor Representations (SRs) are defined as the discounted expected visitation counts of states under a fixed policy, serving as a predictive model in RL.
  • SRs enable rapid reward revaluation and efficient transfer across tasks by decoupling value functions from reward structures.
  • Deep and feature-based extensions of SRs scale to high-dimensional spaces, facilitating intrinsic exploration and option discovery.

Successor representations (SRs) form a predictive model of expected future state occupancy under a fixed policy, providing a middle ground between model-based and model-free approaches to reinforcement learning (RL). For each state (or feature), the SR records the discounted expected visitation counts of all possible successors under the current policy, which enables rapid adaptation to changed reward structures and supports option discovery, temporal abstraction, exploration, and transfer across goals or tasks.

1. Formal Definition and Mathematical Properties

Let $\mathcal{S}$ denote the (possibly finite) state space, $\pi$ a stationary policy, and $\gamma \in [0,1)$ a discount factor. The canonical (tabular) SR $M^\pi: \mathcal{S}\times\mathcal{S}\to\mathbb{R}$ is defined as

$$M^\pi(s, s') = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t\, \mathbb{1}\{S_t = s'\} \,\middle|\, S_0 = s\right],$$

where $S_t$ denotes the state at time $t$ under the dynamics $p$ and policy $\pi$, and $\mathbb{1}$ is the indicator function. $M^\pi(s,s')$ quantifies the expected discounted number of times state $s'$ will be visited after starting from $s$.

Writing $M^\pi$ in matrix form (with $P^\pi$ the policy-induced one-step transition matrix):

$$M^\pi = \sum_{t=0}^\infty (\gamma P^\pi)^t = (I - \gamma P^\pi)^{-1}$$

$M^\pi$ satisfies the Bellman-style fixed-point equation:

$$M^\pi = I + \gamma P^\pi M^\pi$$

This structure underpins several key uses:

  • The value function for any reward vector $r$ is $V^\pi = M^\pi\, r$.
  • Rapid reward revaluation: if $r$ changes but $P^\pi$ does not, $V^\pi$ is recomputed by a single matrix-vector multiplication (see the sketch after this list).
  • Feature-based extensions map each state $s$ to $\phi(s)\in\mathbb{R}^d$ and use successor features $\psi^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t \phi(S_t)\mid S_0 = s\right]$ directly in large or continuous spaces.
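
As a concrete illustration, the following NumPy sketch builds the tabular SR in closed form and performs reward revaluation with a single matrix-vector product. The four-state chain, its transition matrix, and the reward vectors are invented purely for this example.

```python
import numpy as np

# Hypothetical 4-state chain MDP; P_pi is the policy-induced transition matrix,
# assumed known here purely for illustration.
gamma = 0.9
P_pi = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0],   # absorbing terminal state
])

# Tabular SR in closed form: M^pi = (I - gamma * P^pi)^{-1}.
M = np.linalg.inv(np.eye(4) - gamma * P_pi)

# Value function for a reward vector r: V^pi = M^pi r.
r_old = np.array([0.0, 0.0, 0.0, 1.0])
V_old = M @ r_old

# Reward revaluation: the reward changes but the dynamics/policy do not,
# so the new values follow from one matrix-vector product, with no re-learning.
r_new = np.array([0.0, 1.0, 0.0, 0.0])
V_new = M @ r_new

print(V_old, V_new)
```

Because $M$ depends only on $P^\pi$ and $\gamma$, it is reused unchanged across both reward functions.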

2. SRs in Deep and Feature-based Architectures

SR theory has been extended to address high-dimensional or continuous state spaces with deep neural architectures. In such cases, SRs are not stored as explicit $|\mathcal{S}|\times|\mathcal{S}|$ matrices, but as feature-based predictors or networks:

  • State embeddings $\phi(s;\theta_\phi)\in\mathbb{R}^d$ are learned by a CNN or MLP from visual input.
  • A successor feature module $\psi(s;\theta_\psi)$ approximates $\mathbb{E}_\pi\left[\sum_{t=0}^\infty \gamma^t \phi(S_t)\mid S_0 = s\right]$.
  • Temporal-difference (TD) learning is employed to iteratively update $\psi$ with the loss:

$$L_\text{SR}(s,s') = \left\| \phi^-(s) + \gamma\, \psi^-(\phi^-(s')) - \psi(\phi(s)) \right\|_2^2$$

where $(\phi^-, \psi^-)$ are the parameters of slowly updated target networks.
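
A minimal sketch of this TD update with linear successor features, $\psi(s) \approx W\phi(s)$, is given below. The feature dimension, learning rate, and toy transitions are invented for illustration; deep implementations replace $W$ with a network and use target copies of both modules, and the fixed one-hot $\phi$ here plays the role of $\phi^-$ as well.

```python
import numpy as np

d = 8                       # feature dimension (illustrative)
gamma = 0.99
alpha = 0.1                 # learning rate

W = np.zeros((d, d))        # online successor-feature weights: psi(s) = W @ phi(s)
W_target = W.copy()         # slowly updated target copy (the psi^- above)

def phi(s):
    """Stand-in one-hot feature map; a CNN/MLP encoder in the deep setting."""
    v = np.zeros(d)
    v[s % d] = 1.0
    return v

def sr_td_update(s, s_next):
    """One semi-gradient step on || phi(s) + gamma * psi^-(phi(s')) - psi(phi(s)) ||^2."""
    global W
    target = phi(s) + gamma * (W_target @ phi(s_next))
    delta = target - W @ phi(s)                # TD error on successor features
    W = W + alpha * np.outer(delta, phi(s))    # gradient of the squared loss w.r.t. W
    return delta

# Toy trajectory over four states; periodically sync the target copy.
for t, (s, s_next) in enumerate([(0, 1), (1, 2), (2, 3), (3, 0)] * 50):
    sr_td_update(s, s_next)
    if t % 20 == 0:
        W_target = W.copy()
```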

This structure is used within both DQN-like value learning (Kulkarni et al., 2016) and actor-critic agents (Siriwardhana et al., 2018), often augmented with auxiliary tasks (e.g., next-frame prediction) to stabilize feature learning.

3. SRs and Option/Eigenoption Discovery

The SR exhibits deep connections to proto-value functions (PVFs) and graph Laplacians, enabling principled construction of temporally-extended options ("eigenoptions"):

  • In graph-theoretic terms, the eigenvectors of the SR matrix (corresponding to the largest eigenvalues) encode directions of diffusive information flow, coinciding with the smoothest eigenfunctions of the normalized Laplacian:

$$\mathcal{L} = D^{-1/2}(D - W) D^{-1/2}$$

where $D$ is the degree matrix and $W$ the adjacency matrix.

  • Eigenoptions are discovered by the following procedure (a sketch follows the list):
    1. Collecting SR/feature vectors $\psi(\phi(s_t))$ over a rollout.
    2. Performing an eigendecomposition to extract the leading eigenvectors $e_i$.
    3. Defining an intrinsic reward $r^{e_i}(s,s') = e_i^\top [\phi(s') - \phi(s)]$ per eigenvector.
    4. Training an option policy to maximize each $r^{e_i}$.
    5. Defining option initiation and termination sets from the $Q$-values with respect to $r^{e_i}$.
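
The sketch below illustrates steps 1-3 on a tabular SR of an invented five-state ring; with function approximation, the matrix of collected $\psi(\phi(s_t))$ vectors would take the place of $M$.

```python
import numpy as np

gamma = 0.95
n = 5                                      # illustrative 5-state ring world
P = 0.5 * np.roll(np.eye(n), 1, axis=1) + 0.5 * np.roll(np.eye(n), -1, axis=1)

# Step 1: form the SR (closed form here; with function approximation one
# instead stacks the psi(phi(s_t)) vectors collected along a rollout).
M = np.linalg.inv(np.eye(n) - gamma * P)

# Step 2: eigendecomposition; keep the leading eigenvectors.
eigvals, eigvecs = np.linalg.eig(M)
order = np.argsort(-eigvals.real)
e = eigvecs[:, order].real                 # columns sorted by decreasing eigenvalue

# Step 3: intrinsic reward for eigenvector e_i on a transition s -> s',
# using one-hot phi so that phi(s') - phi(s) is a difference of indicators.
def intrinsic_reward(i, s, s_next):
    phi = np.eye(n)
    return float(e[:, i] @ (phi[s_next] - phi[s]))

# Steps 4-5 (not shown): train one option policy per eigenvector to maximize
# its intrinsic reward, and terminate where the associated Q-values are non-positive.
print(intrinsic_reward(1, 0, 1))
```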

Empirical evidence (Machado et al., 2017) demonstrates that adding a small number of eigenoptions constructed from SRs sharply reduces diffusion times in navigation tasks and accelerates goal-reaching, even with raw pixel inputs.

4. Successor Features, Universal Successor Representations, and Task Transfer

Beyond simple SRs, the successor feature (SF) or universal successor representation (USR) framework further factorizes value functions:

  • Define features $\phi(s, a, s')\in\mathbb{R}^d$ and, for a policy $\pi$,

$$\psi^\pi(s,a) = \mathbb{E}_\pi\left[ \sum_{t=0}^{\infty} \gamma^t\, \phi(s_t, a_t, s_{t+1}) \,\middle|\, s_0=s,\ a_0=a \right]$$

  • For reward functions $r(s,a,s') = \phi(s,a,s')^\top w$ parameterized by $w$, the $Q$-function factorizes as $Q^\pi(s,a) = \psi^\pi(s,a)^\top w$ (see the sketch after this list).
  • USRs further generalize $\psi^\pi$ to be goal-conditioned, so that adaptation to new rewards or goals requires learning or updating only $w$ or $w(g)$, leaving $\psi(\cdot)$ unchanged.
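
The sketch below illustrates the factorization with invented $\psi$ and $\phi$ tables: adapting to a new task reduces to a least-squares fit of $w$ on observed rewards, after which $Q$-values follow from a single inner product per state-action pair.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, d = 6, 3, 4

# Assume successor features psi(s, a) were already learned for this environment;
# random values stand in for them here, purely for illustration.
psi = rng.normal(size=(n_states, n_actions, d))

# A new task is specified by rewards r(s, a) = phi(s, a)^T w for some unknown w.
phi = rng.normal(size=(n_states, n_actions, d))
w_true = np.array([1.0, 0.0, -0.5, 0.2])
rewards = phi.reshape(-1, d) @ w_true

# Adaptation: fit w by least squares on observed (phi, r) pairs; psi is untouched.
w_hat, *_ = np.linalg.lstsq(phi.reshape(-1, d), rewards, rcond=None)

# Q-values for the new task are a single inner product per (s, a).
Q = psi @ w_hat                  # shape (n_states, n_actions)
greedy_actions = Q.argmax(axis=1)
print(np.allclose(w_hat, w_true), greedy_actions)
```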

Transfer results:

  • Once the SR/SF $\psi$ is learned for an environment, adapting to new tasks with the same dynamics but different $w$ requires only a quick fit of $w$ (or $w(g)$).
  • Empirically, in environments such as AI2THOR, task transfer via SR adaptation reduces the required learning episodes by an order of magnitude compared to full network retraining (Zhu et al., 2017; Ma et al., 2018).
  • Theoretical bounds show that the transfer loss is proportional to $\|w - w'\|$, the distance between the old and new task weights.

5. Exploration and Count-based Intrinsic Reward via the SR

SRs can be used to incentivize exploration without resorting to explicit density models:

  • The $L_1$-norm $\|\psi(s)\|_1$ (for the tabular SR), or $\|\psi(s)\|$ under learned features, quantifies the expected cumulative visitation of state $s$.
  • The (inverse) norm can be used as a count-based exploration bonus:

$$r^+(s) = \beta \left(1 / \| \psi(s) \|_p \right)$$

where $\beta$ is a tuning parameter.

  • The substochastic SR (SSR) variant analytically relates the SR norm to empirical visitation counts, providing a justification for this bonus (Machado et al., 2018).
  • Algorithms using SR-norm-based bonuses achieve order-of-magnitude gains in sparse-reward tasks and match R-Max/E$^3$-style sample complexity.
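
A minimal sketch of this bonus for a tabular SR learned online by TD is shown below; the transition stream, $\beta$, and step size are placeholders invented for illustration.

```python
import numpy as np

n, gamma, alpha, beta = 10, 0.95, 0.1, 0.05
M = np.zeros((n, n))            # tabular SR estimate, learned online

def sr_td_step(s, s_next):
    """TD(0) update of the SR row for the visited state s: target is e_s + gamma * M[s']."""
    onehot = np.eye(n)[s]
    M[s] += alpha * (onehot + gamma * M[s_next] - M[s])

def exploration_bonus(s):
    """r^+(s) = beta / ||psi(s)||_1; a small norm indicates a rarely visited state."""
    norm = np.linalg.norm(M[s], ord=1)
    return beta / max(norm, 1e-8)   # guard against division by zero early in learning

# Usage inside an RL loop (the transitions here are placeholders):
for s, s_next in [(0, 1), (1, 2), (2, 1), (1, 0)]:
    sr_td_step(s, s_next)
    bonus = exploration_bonus(s_next)
    # reward_used_for_learning = env_reward + bonus
```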

In deep RL, SR-derived intrinsic rewards match or outperform more complex density-model approaches (e.g., PixelCNN, CTS, RND) in sparse Atari environments—particularly in the low-sample regime (Machado et al., 2018).

6. Advanced Theoretical Extensions and Empirical Results

Recent directions include:

  • Probabilistic/Uncertainty-aware SRs: Kalman Temporal Differences (KTD) for the SR give a posterior distribution (mean and covariance) over $M$, capturing uncertainty and covariances. This results in nonlocal updates and partial transition revaluation, matching human credit assignment in chains (Geerts et al., 2019).
  • Distributional/Partially Observable SRs: Distributional codes for SR enable value computation and policy derivation when state is not directly observable; learning is achieved with biologically plausible local synaptic updates (Vertes et al., 2019).
  • Active Inference and SRs: SRs offer an efficient amortization for Active Inference agents by precomputing $M^\pi$ once and enabling instantaneous value reevaluation for new priors or expected free energy objectives. This significantly reduces planning costs in large discrete state spaces (Millidge et al., 2022).
  • Temporal Abstractions (t-SR): The t-SR framework generalizes SRs to temporally extended actions (repeat-elsewhere operators), reducing policy-sampling frequency and accelerating reward revaluation in dynamic environments (Sargent et al., 2022).
  • Exploration Maximizing State Entropy: Conditioning SRs on the explicit past trajectory enables maximizing the entropy of the whole single-episode visitation distribution, systematically driving policies to explore previously unseen states (Jain et al., 2023).

Empirical observations:

  • SR-based bottleneck and option extraction regularly identifies semantically meaningful subgoals (e.g., room doorways) (Kulkarni et al., 2016).
  • Neural-network-based SRs recover place and grid-cell–like representations, supporting the link to hippocampal function and multi-modal cognitive maps (Stoewer et al., 2022, Stoewer et al., 2023).
  • In continual learning, SR-based decomposition allows new predictions (GVFs) to be learned more rapidly as only their one-step predictions need updating; this improves learning speed in both simulation and real-robot datasets (Sherstan et al., 2018).

7. Limitations, Extensions, and Open Problems

  • SRs and SFs require the reward to be (approximately) linear in features; Successor Feature Representations (SFRs) extend to general reward functions by learning a density over successor features and integrating against arbitrary reward models (Reinke et al., 2021).
  • Ensemble SRs mitigate coverage and bootstrapping issues in offline-to-online transfer, increasing robustness when offline datasets are narrow (Wang et al., 2024).
  • Exact tabular inversion scales poorly with state-space size; deep architectures required for large environments introduce new challenges (e.g., feature collapse, off-policy instability).
  • The choice of features $\phi$ is critical for transfer bounds and option expressivity.
  • Real-world transfer and sample-efficient learning using SRs remain areas of active investigation, including approaches for partial observability, hierarchical abstraction without manual option design, and continual/lifelong learning.

SRs thus provide a unifying predictive substrate in RL, supporting efficient credit assignment, hierarchical decomposition via spectral properties, rapid task transfer, and principled exploration. Their connections to representation learning, neuroscience, option theory, and transfer continue to fuel ongoing research and practical deployment.
