Successor Representations in Reinforcement Learning
- Successor Representations (SRs) are defined as the discounted expected visitation counts of states under a fixed policy, serving as a predictive model in RL.
- SRs enable rapid reward revaluation and efficient transfer across tasks by decoupling value functions from reward structures.
- Deep and feature-based extensions of SRs scale to high-dimensional spaces, facilitating intrinsic exploration and option discovery.
Successor representations (SRs) form a predictive model of the expected future occupancy of states under a fixed policy, providing a middle ground between model-based and model-free approaches to reinforcement learning (RL). The SR defines for each state (or feature) the discounted expected visitation counts of all possible successors under the current policy, enabling rapid adaptation to changed reward structures and facilitating efficient option discovery, temporal abstraction, exploration, and transfer across goals or tasks.
1. Formal Definition and Mathematical Properties
Let $\mathcal{S}$ denote the (finite) state space, $\pi$ a stationary policy, and $\gamma \in [0, 1)$ a discount factor. The canonical (tabular) SR is defined as:

$$M^\pi(s, s') = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t \, \mathbb{1}[s_t = s'] \,\middle|\, s_0 = s\right],$$

where $s_t$ denotes the state at time $t$ under the environment dynamics $P$ and policy $\pi$, and $\mathbb{1}[\cdot]$ is the indicator function. $M^\pi(s, s')$ quantifies the expected discounted number of times state $s'$ will be visited in the future after starting from $s$.
Writing this in matrix form (with $P^\pi$ as the policy-induced one-step transition matrix):

$$M^\pi = \sum_{t=0}^{\infty} (\gamma P^\pi)^t = (I - \gamma P^\pi)^{-1}.$$

$M^\pi$ satisfies the Bellman-style fixed-point equation:

$$M^\pi = I + \gamma P^\pi M^\pi.$$
This structure underpins several key uses:
- The value function for any reward vector $r$ is $V^\pi = M^\pi r$.
- Rapid reward revaluation: if $r$ changes but $\pi$ does not, $V^\pi$ is recomputed by a single matrix-vector multiplication (see the sketch below).
- Extensions to feature-based versions, where each state is mapped to features $\phi(s) \in \mathbb{R}^d$ and successor features $\psi^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \mid s_0 = s\right]$ are used directly for large or continuous spaces.
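For intuition, here is a minimal NumPy sketch of the closed form $M^\pi = (I - \gamma P^\pi)^{-1}$ and of reward revaluation; the 4-state chain, $P^\pi$, and the reward vectors are hypothetical choices for illustration.

```python
import numpy as np

gamma = 0.95
# Hypothetical policy-induced transition matrix for a 4-state chain
# (the last state is absorbing).
P_pi = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
])

# SR in closed form: expected discounted future occupancies of every state.
M = np.linalg.inv(np.eye(4) - gamma * P_pi)

# Value function is a matrix-vector product with the reward vector.
r_old = np.array([0.0, 0.0, 0.0, 1.0])
V_old = M @ r_old

# Reward revaluation: if the reward changes but the policy does not,
# only this product needs to be recomputed; M stays fixed.
r_new = np.array([0.0, 1.0, 0.0, 0.0])
V_new = M @ r_new
print(V_old, V_new)
```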
2. SRs in Deep and Feature-based Architectures
SR theory has been extended to address high-dimensional or continuous state spaces with deep neural architectures. In such cases, SRs are not stored as explicit matrices, but as feature-based predictors or networks:
- State embeddings $\phi_\theta(s)$ are learned by a CNN or MLP for visual input.
- A successor feature module approximates $\psi_\theta(s) \approx \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t) \mid s_0 = s\right]$.
- Temporal-difference (TD) learning is employed to iteratively update $\psi_\theta$ with the loss
$$\mathcal{L}(\theta) = \mathbb{E}\left[\big\|\, \phi(s_t) + \gamma\, \psi_{\theta^-}(s_{t+1}) - \psi_\theta(s_t) \,\big\|^2\right],$$
where $\theta^-$ are the parameters of slowly updated target networks.
This structure is used within both DQN-like value learning (Kulkarni et al., 2016) and actor-critic agents (Siriwardhana et al., 2018), often augmented with auxiliary tasks (e.g., next-frame prediction) to stabilize feature learning.
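A rough PyTorch-style sketch of this TD loss follows; the small networks, the Adam optimizer, and the dummy transition batch are illustrative assumptions rather than the architecture of either cited agent.

```python
import torch
import torch.nn as nn

feat_dim, emb_dim, gamma = 64, 32, 0.99

phi = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU())   # state embedding phi_theta
psi = nn.Linear(emb_dim, emb_dim)                              # successor features psi_theta
psi_target = nn.Linear(emb_dim, emb_dim)                       # slowly updated target network
psi_target.load_state_dict(psi.state_dict())

optimizer = torch.optim.Adam(list(phi.parameters()) + list(psi.parameters()), lr=1e-3)

def sf_td_loss(s, s_next):
    """TD loss for a batch of transitions (s, s')."""
    phi_s = phi(s)
    with torch.no_grad():
        # Bootstrap target: phi(s_t) + gamma * psi_target(s_{t+1}), held fixed.
        target = phi_s + gamma * psi_target(phi(s_next))
    return nn.functional.mse_loss(psi(phi_s), target)

# One update on a batch of dummy transitions.
s, s_next = torch.randn(8, feat_dim), torch.randn(8, feat_dim)
optimizer.zero_grad()
loss = sf_td_loss(s, s_next)
loss.backward()
optimizer.step()
# psi_target would be periodically synced to psi (or Polyak-averaged).
```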
3. SRs and Option/Eigenoption Discovery
The SR exhibits deep connections to proto-value functions (PVFs) and graph Laplacians, enabling principled construction of temporally-extended options ("eigenoptions"):
- In graph-theoretic terms, the eigenvectors of the SR matrix corresponding to its largest eigenvalues encode directions of diffusive information flow, coinciding with the smoothest eigenfunctions of the normalized Laplacian
$$L = D^{-1/2}(D - W)D^{-1/2},$$
where $D$ is the degree matrix and $W$ the adjacency matrix.
- Eigenoptions are discovered by:
- Collecting SR/feature vectors over a rollout.
- Performing an eigendecomposition to extract the leading eigenvectors $e_1, \dots, e_k$.
- Defining an intrinsic reward $r^{e_i}(s, s') = e_i^\top\big(\phi(s') - \phi(s)\big)$ for each eigenvector $e_i$.
- Training option policies to maximize each $r^{e_i}$.
- Defining option initiation and termination sets using $Q$-values with respect to $r^{e_i}$ (e.g., terminating where no action has a positive value).
Empirical evidence (Machado et al., 2017) demonstrates that adding a small number of eigenoptions constructed from SRs sharply reduces diffusion times in navigation tasks and accelerates goal-reaching, even with raw pixel inputs.
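A minimal sketch of the eigendecomposition and intrinsic-reward steps on a tabular SR matrix; the symmetrization, the number of eigenvectors, and the one-hot feature convention are assumptions made for illustration.

```python
import numpy as np

def eigenoption_rewards(M, k=4):
    """Top-k eigenvectors of a tabular SR matrix M and their intrinsic rewards."""
    # Symmetrize for numerical stability before the eigendecomposition.
    eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2.0)
    order = np.argsort(eigvals)[::-1]       # largest eigenvalues first
    top = eigvecs[:, order[:k]]             # columns e_1, ..., e_k

    def intrinsic_reward(i, s, s_next):
        # Tabular case: phi(s) is one-hot, so e_i^T (phi(s') - phi(s)) = e_i[s'] - e_i[s].
        return top[s_next, i] - top[s, i]

    return top, intrinsic_reward

# Each intrinsic reward defines one option policy, trained (e.g., by Q-learning)
# to maximize it; the option terminates where no action has a positive Q-value.
```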
4. Successor Features, Universal Successor Representations, and Task Transfer
Beyond simple SRs, the successor feature (SF) or universal successor representation (USR) framework further factorizes value functions:
- Define state-action features $\phi(s, a) \in \mathbb{R}^d$ and, for a policy $\pi$, the successor features $\psi^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t, a_t) \mid s_0 = s, a_0 = a\right]$.
- For reward functions $r(s, a) = \phi(s, a)^\top \mathbf{w}$ parameterized by $\mathbf{w}$, the $Q$-function is $Q^\pi(s, a) = \psi^\pi(s, a)^\top \mathbf{w}$.
- USRs further generalize $\psi$ to be goal-conditioned, so that adaptation to new rewards/goals is achieved by learning/updating only $\mathbf{w}$ or the goal embedding, leaving $\psi$ unchanged.
Transfer results:
- Once the SR/SF $\psi$ is learned for an environment, adaptation to new tasks with the same dynamics but a different reward requires only quick adaptation of $\mathbf{w}$ (or the goal embedding); see the sketch after this list.
- Empirically, in environments like AI2THOR, task transfer via SR adaptation reduces required learning episodes by an order of magnitude compared to full network retraining (Zhu et al., 2017, Ma et al., 2018).
- Theoretical bounds show that the transfer loss grows in proportion to the distance $\|\mathbf{w} - \mathbf{w}'\|$ between old and new task weights.
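A minimal sketch of this transfer pattern under the linear-reward assumption; the array shapes, the random stand-in successor features, and the least-squares fit of the new task's weights are illustrative assumptions.

```python
import numpy as np

n_states, n_actions, d = 10, 4, 8
psi = np.random.rand(n_states, n_actions, d)     # stand-in for learned successor features

def q_values(psi, w):
    """Q^pi(s, a) = psi^pi(s, a)^T w for a task with reward weights w."""
    return psi @ w

# New task with the same dynamics: fit w from observed (phi, reward) pairs,
# leaving psi untouched.
phis = np.random.rand(100, d)                    # observed feature vectors
w_true = np.array([1.0, 0, 0, 0, 0, 0, 0, 0.5])  # hypothetical new reward weights
rewards = phis @ w_true
w_new, *_ = np.linalg.lstsq(phis, rewards, rcond=None)

Q_new = q_values(psi, w_new)                     # instant revaluation for the new task
greedy_actions = Q_new.argmax(axis=1)
```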
5. Exploration and Count-based Intrinsic Reward via the SR
SRs can be used to incentivize exploration without resorting to explicit density models:
- The $\ell_1$-norm $\|M(s, \cdot)\|_1$ (for the tabular SR), or $\|\psi(s)\|_1$ under learned features, quantifies the expected discounted cumulative visitation of successor states from state $s$.
- Its inverse can be used as a count-based exploration bonus:
$$r^{\text{int}}(s) = \frac{\beta}{\|M(s, \cdot)\|_1},$$
where $\beta > 0$ is a tuning parameter.
- The substochastic SR (SSR) variant analytically relates the SR norm to empirical visitation counts, providing a justification for this bonus (Machado et al., 2018).
- Algorithms using SR-norm-based bonuses achieve order-of-magnitude gains in sparse-reward tasks and match R-Max/$E^3$-style sample complexity.
In deep RL, SR-derived intrinsic rewards match or outperform more complex density-model approaches (e.g., PixelCNN, CTS, RND) in sparse Atari environments—particularly in the low-sample regime (Machado et al., 2018).
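A tabular sketch of this bonus, maintaining a TD estimate of the SR and rewarding states whose SR row norm is still small; the identity initialization, the random stand-in dynamics, and $\beta$ are illustrative assumptions.

```python
import numpy as np

def td_update_sr(M, s, s_next, alpha=0.1, gamma=0.95):
    """One-step TD update of the tabular SR for an experienced transition (s, s')."""
    one_hot = np.zeros(M.shape[0])
    one_hot[s] = 1.0
    M[s] += alpha * (one_hot + gamma * M[s_next] - M[s])
    return M

def sr_bonus(M, s, beta=0.05):
    """Inverse SR-row-norm bonus: large for states whose row is still near its start."""
    return beta / np.linalg.norm(M[s], ord=1)

n_states = 25
M = np.eye(n_states)          # each state trivially occupies itself at initialization
s = 0
for _ in range(1000):
    s_next = np.random.choice(n_states)   # stand-in for real environment dynamics
    M = td_update_sr(M, s, s_next)
    r_int = sr_bonus(M, s_next)           # added to the extrinsic reward during training
    s = s_next
```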
6. Advanced Theoretical Extensions and Empirical Results
Recent directions include:
- Probabilistic/Uncertainty-aware SRs: Kalman Temporal Differences (KTD) applied to the SR yield a posterior distribution (mean and covariance) over the successor matrix, capturing uncertainty and covariances between its entries. This results in nonlocal updates and partial transition revaluation, matching human credit assignment in multi-step chains (Geerts et al., 2019).
- Distributional/Partially Observable SRs: Distributional codes for SR enable value computation and policy derivation when state is not directly observable; learning is achieved with biologically plausible local synaptic updates (Vertes et al., 2019).
- Active Inference and SRs: SRs offer an efficient amortization for Active Inference agents by precomputing the SR once and enabling instantaneous value re-evaluation for new priors or expected free energy objectives. This significantly reduces planning costs in large discrete state spaces (Millidge et al., 2022).
- Temporal Abstractions (t-SR): The t-SR framework generalizes SRs to temporally extended actions (repeat-elsewhere operators), reducing policy-sampling frequency and accelerating reward revaluation in dynamic environments (Sargent et al., 2022).
- Exploration Maximizing State Entropy: Conditioning SRs on the explicit past trajectory enables maximizing the entropy of the whole single-episode visitation distribution, systematically driving policies to explore previously unseen states (Jain et al., 2023).
Empirical observations:
- SR-based bottleneck and option extraction regularly identifies semantically meaningful subgoals (e.g., room doorways) (Kulkarni et al., 2016).
- Neural-network-based SRs recover place and grid-cell–like representations, supporting the link to hippocampal function and multi-modal cognitive maps (Stoewer et al., 2022, Stoewer et al., 2023).
- In continual learning, SR-based decomposition allows new predictions (GVFs) to be learned more rapidly as only their one-step predictions need updating; this improves learning speed in both simulation and real-robot datasets (Sherstan et al., 2018).
7. Limitations, Extensions, and Open Problems
- SRs and SFs require the reward to be (approximately) linear in features; Successor Feature Representations (SFRs) extend to general reward functions by learning a density over successor features and integrating against arbitrary reward models (Reinke et al., 2021).
- Ensemble SRs mitigate coverage and bootstrapping issues in offline-to-online transfer, increasing robustness when offline datasets are narrow (Wang et al., 2024).
- Exact tabular inversion scales poorly with state-space size, and the deep architectures required for large environments introduce new challenges (e.g., feature collapse, off-policy instability).
- The choice of features $\phi$ is critical for transfer bounds and option expressivity.
- Real-world transfer and sample-efficient learning using SRs remain areas of active investigation, including approaches for partial observability, hierarchical abstraction without manual option design, and continual/lifelong learning.
SRs thus provide a unifying predictive substrate in RL, supporting efficient credit assignment, hierarchical decomposition via spectral properties, rapid task transfer, and principled forms of exploration. Their connections to representation learning, neuroscience, option theory, and transfer continue to fuel ongoing research and practical deployment.