Successor Feature Decomposition

Updated 21 April 2026

Successor Feature Decomposition is a factorization method in reinforcement learning that decouples environment dynamics from task-specific rewards.
It decomposes the action-value function into reward-agnostic successor features and task-dependent reward weights to enable efficient transfer between different tasks.
This approach supports rapid policy adaptation, multi-objective optimization, meta-RL, continual learning, and effective planning under complex task specifications.

Successor Feature Decomposition provides a principled method for decoupling the dynamics and rewards in reinforcement learning (RL), enabling efficient transfer across multiple tasks that share environmental transitions but differ in reward structures. By factorizing the action-value function into a product of reward-agnostic "successor features" and task-specific reward weights, it supports rapid policy adaptation, multi-objective optimization, meta-RL, continual learning, and efficient planning under complex task specifications. The following sections offer a comprehensive account of the mathematical formalism, algorithmic foundations, theoretical properties, representative algorithms, empirical results, and practical limitations of the successor feature decomposition paradigm.

1. Mathematical Formalism and Value Function Decomposition

Define an MDP as $M = (\mathcal{S}, \mathcal{A}, p, r, \mu, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ the action space, $p(\cdot|s,a)$ the transition kernel, $r$ the reward function, $\mu$ an initial-state distribution, and $\gamma \in [0,1)$ the discount factor. The successor feature decomposition rests on the assumption that, for every task, the reward is a linear combination of features: $r_w(s,a,s') = \phi(s,a,s')^\top w$, where $\phi(s,a,s') \in \mathbb{R}^d$ encodes reward-relevant features and $w \in \mathbb{R}^d$ is a task-specific weight vector. A task is therefore identified with its weight vector $w$.

For any stationary policy $\pi$, define its successor feature (SF) map as

$$\psi^\pi(s,a) = \mathbb{E}^\pi\left[\sum_{t=0}^{\infty} \gamma^t \phi(s_t,a_t,s_{t+1}) \,\middle|\, s_0=s,\ a_0=a\right],$$

which characterizes the expected discounted cumulative features under $\pi$.

The action-value function for task $w$ factorizes as

$$Q^\pi_w(s,a) = \psi^\pi(s,a)^\top w.$$

Thus, $\psi^\pi$ encapsulates all transition and policy-dependent information, while $w$ solely encodes the task-dependent reward parameters (Barreto et al., 2016, Alegre et al., 2022).
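The factorization can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: a hypothetical three-state deterministic ring, one-hot features of the next state, and an "always advance" policy, none of which come from the cited works. It accumulates discounted features along a rollout and confirms that their inner product with a weight vector matches the discounted return computed directly from rewards.

```python
GAMMA = 0.9
N_STATES = 3
ACTIONS = (0, 1)  # hypothetical actions: 0 = stay, 1 = advance around the ring


def step(s, a):
    """Deterministic transition on the ring."""
    return (s + a) % N_STATES


def phi(s, a, s_next):
    """Assumed feature map: one-hot encoding of the next state."""
    return [1.0 if i == s_next else 0.0 for i in range(N_STATES)]


def policy(s):
    """Fixed policy being evaluated: always advance."""
    return 1


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def successor_features(s, a, horizon=200):
    """Truncated discounted sum of features after taking a in s, then following policy."""
    psi, discount = [0.0] * N_STATES, 1.0
    for _ in range(horizon):
        s_next = step(s, a)
        f = phi(s, a, s_next)
        psi = [p + discount * x for p, x in zip(psi, f)]
        discount *= GAMMA
        s, a = s_next, policy(s_next)
    return psi


def discounted_return(s, a, w, horizon=200):
    """Truncated discounted sum of rewards r = phi . w along the same rollout."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        s_next = step(s, a)
        total += discount * dot(phi(s, a, s_next), w)
        discount *= GAMMA
        s, a = s_next, policy(s_next)
    return total


w_task = (1.0, 0.0, -2.0)  # hypothetical task weights
q_from_sf = dot(successor_features(0, 1), w_task)
q_direct = discounted_return(0, 1, w_task)
```

Here `q_from_sf` and `q_direct` agree up to floating-point rounding. Changing `w_task` requires no new rollout of features, which is the property that transfer relies on.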

2. Bellman Recursion and Algorithmic Computation of Successor Features

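The SFs satisfy the Bellman recursion: $\psi^\pi(s,a) = \mathbb{E}_{s' \sim p(\cdot|s,a)}\left[\phi(s,a,s') + \gamma\, \psi^\pi(s', \pi(s'))\right]$. In control settings, $\pi(s') \in \arg\max_{a'} \psi^\pi(s',a')^\top w$ under a greedy policy.

A standard approach, generalizing TD learning, is to parameterize $\psi_\theta$ (with deep networks or in tabular form) and minimize the mean-squared Bellman error: $L_\psi(\theta) = \mathbb{E}\left[\left\| \phi(s,a,s') + \gamma\, \psi_\theta(s',a') - \psi_\theta(s,a) \right\|^2\right]$, with $a' = \pi(s')$. Coupled with a reward regression loss

$$L_r = \mathbb{E}\left[\left(r - \phi(s,a,s')^\top w\right)^2\right],$$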

this dual-objective architecture enforces both accurate feature-based reward prediction and correct SF propagation, preventing trivial solutions and representation collapse (Chua et al., 2024).

In practice, the reward loss is often computed with $\phi$ treated as constant (stop-gradient), so that its gradient flows only into $w$; this keeps feature learning from degenerating under sparse or constant rewards.
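A tabular version of the two objectives described in this section can be sketched as follows. This is a minimal illustration, not the method of any cited paper: the three-state ring, the one-hot features, the hidden weights `TRUE_W`, and the step sizes are all hypothetical. The features are fixed rather than learned, so they are trivially constant in the reward loss, which mimics the stop-gradient described above.

```python
GAMMA = 0.9
N_STATES = 3
ACTIONS = (0, 1)            # hypothetical actions: 0 = stay, 1 = advance
TRUE_W = (1.0, 0.0, -2.0)   # hidden reward weights of the (hypothetical) task


def step(s, a):
    return (s + a) % N_STATES


def phi(s_next):
    """Fixed one-hot features of the next state; never updated below."""
    return [1.0 if i == s_next else 0.0 for i in range(N_STATES)]


def policy(s):
    return 1  # evaluate the "always advance" policy


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def train(sweeps=500, alpha=0.5, lr_w=0.5):
    """Sweep over all (s, a) pairs, applying an SF TD update and a reward-regression step."""
    psi = {(s, a): [0.0] * N_STATES for s in range(N_STATES) for a in ACTIONS}
    w = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(N_STATES):
            for a in ACTIONS:
                s_next = step(s, a)
                f = phi(s_next)
                r = dot(f, TRUE_W)  # reward emitted by the environment

                # SF TD update toward phi + gamma * psi(s', pi(s')).
                nxt = psi[(s_next, policy(s_next))]
                target = [fi + GAMMA * ni for fi, ni in zip(f, nxt)]
                psi[(s, a)] = [p + alpha * (t - p) for p, t in zip(psi[(s, a)], target)]

                # Reward regression: gradient step on (r - phi . w)^2 in w only.
                err = r - dot(f, w)
                w = [wi + lr_w * err * fi for wi, fi in zip(w, f)]
    return psi, w


psi, w = train()
q_value = dot(psi[(0, 1)], w)  # action value recovered from the two learned parts
```

After training, `psi` satisfies the Bellman recursion to within numerical tolerance and `w` matches `TRUE_W`. In a deep-learning setting the table would be replaced by a network and the sweeps by sampled mini-batches.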

3. Theoretical Properties: Optimality, Transfer, and Generalization

The decoupling of $\psi^\pi$ and $w$ unlocks efficient transfer in multi-task and continual RL: after pre-training $\psi^\pi$, a new task requires only regression of $w$ using a small batch of transitions.

A cornerstone is the Generalized Policy Improvement (GPI) theorem (Barreto et al., 2016, Alegre et al., 2022, Barreto et al., 2019):

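For a set of $n$ policies $\pi_1, \dots, \pi_n$ with SFs $\psi^{\pi_1}, \dots, \psi^{\pi_n}$, the GPI policy

$$\pi_{\mathrm{GPI}}(s) \in \arg\max_{a} \max_{i \in \{1,\dots,n\}} \psi^{\pi_i}(s,a)^\top w$$

satisfies

$$Q^{\pi_{\mathrm{GPI}}}_w(s,a) \ge \max_{i} Q^{\pi_i}_w(s,a) \quad \text{for all } (s,a),$$

and, if the set of stored policies forms a convex coverage set (CCS) in feature-space, GPI recovers the true optimal policy for any $w$ (Alegre et al., 2022, Infante et al., 2024).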
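GPI action selection amounts to a maximum over stored policies followed by a maximum over actions. The sketch below illustrates this on a hypothetical five-state corridor with two stored policies (`always_left`, `always_right`) and two indicator features for the corridor ends. The environment, feature map, and all names are assumptions made for the example.

```python
GAMMA = 0.9
N_STATES = 5
ACTIONS = (-1, +1)  # hypothetical actions: left, right


def step(s, a):
    """Deterministic move, clipped at the corridor ends."""
    return min(max(s + a, 0), N_STATES - 1)


def phi(s_next):
    """Assumed features: indicators for the left end and the right end."""
    return (float(s_next == 0), float(s_next == N_STATES - 1))


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def rollout_sf(s, a, policy, horizon=300):
    """Truncated discounted feature sum after taking a in s, then following policy."""
    psi, discount = [0.0, 0.0], 1.0
    for _ in range(horizon):
        s = step(s, a)
        psi = [p + discount * x for p, x in zip(psi, phi(s))]
        discount *= GAMMA
        a = policy(s)
    return psi


def always_left(s):
    return -1


def always_right(s):
    return +1


# One SF table per stored policy, keyed by (state, action).
SF_LIBRARY = [
    {(s, a): rollout_sf(s, a, pi) for s in range(N_STATES) for a in ACTIONS}
    for pi in (always_left, always_right)
]


def gpi_action(s, w):
    """Pick the action whose best stored-policy value psi_i(s, a) . w is largest."""
    return max(ACTIONS, key=lambda a: max(dot(sf[(s, a)], w) for sf in SF_LIBRARY))


w_new = (1.0, 2.0)                # new task: the right end pays twice as much
action = gpi_action(2, w_new)     # chosen from the middle of the corridor
```

For `w_new`, the right end is worth more, so `gpi_action(2, w_new)` returns `+1`. With weights `(1.0, 0.5)` the choice from state 1 flips to `-1`. No SF is recomputed when the weights change.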

Key generalization and convergence results include:

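Task transfer bound: The performance loss for GPI in a new task $w'$ is bounded by the distance to the nearest previously seen $w_j$:

$$\left\| Q^{*}_{w'} - Q^{\pi_{\mathrm{GPI}}}_{w'} \right\|_\infty \le \frac{2}{1-\gamma}\left(\phi_{\max} \min_j \|w' - w_j\| + \epsilon\right),$$

where $\phi_{\max} = \max_{s,a,s'} \|\phi(s,a,s')\|$ and $\epsilon$ is the SF approximation error (Barreto et al., 2016, Zhang et al., 2024).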
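To get a feel for the magnitudes involved, the bound can be evaluated directly. The helper below assumes the commonly stated form $\frac{2}{1-\gamma}\left(\phi_{\max} \min_j \|w' - w_j\| + \epsilon\right)$ with a Euclidean norm. The function name, its arguments, and the example weights are hypothetical.

```python
import math


def gpi_transfer_bound(w_new, w_library, gamma, phi_max, eps=0.0):
    """Upper bound on the GPI performance loss in task w_new (assumed form, Euclidean norm)."""
    nearest = min(math.dist(w_new, w_j) for w_j in w_library)
    return 2.0 / (1.0 - gamma) * (phi_max * nearest + eps)


# Nearest stored task is (1.0, 0.5), at distance 0.5 from the new task.
bound = gpi_transfer_bound((1.0, 0.0), [(0.0, 1.0), (1.0, 0.5)], gamma=0.9, phi_max=1.0)
```

With these numbers the bound is about 10, which equals the largest achievable return, $\phi_{\max}/(1-\gamma)$ for unit-norm weights. The factor $2/(1-\gamma)$ makes the guarantee loose for discount factors near 1, so it is informative mainly when the new task lies close to a stored one.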

Provable convergence in deep function approximation: Full-gradient SF-Q-learning achieves almost-sure convergence and sample-efficient transfer, outperforming semi-gradient baselines (Shrirao et al., 1 Apr 2026).

4. Successor Feature Decomposition in Transfer, Meta-RL, and Planning

Transfer RL: SF decomposition enables "zero-shot" reuse of previously trained policies for any new linear reward specification; regression over historical transitions yields the new $w$, and GPI combines the policy ensemble optimally (Barreto et al., 2016, Barreto et al., 2019, Borsa et al., 2018, Lehnert et al., 2017, Zhang et al., 2024).

Multi-objective and complex task composition: The SF framework unifies linear reward transfer with multi-objective RL. Construction of a convex coverage set (CCS) of SFs—e.g., via Optimistic Linear Support (SFOLS)—guarantees that for any linearly-expressible reward, the policy library forms the Pareto frontier and GPI achieves the optimal blend (Alegre et al., 2022, Infante et al., 2024). Tasks specified via finite-state automata (FSAs) can be decomposed into subpolicies whose SFs support hierarchical and non-Markovian compositional planning with global optimality (Infante et al., 2024).

Meta-RL and context inference: SFs partition the information about dynamics and reward, supporting meta-RL frameworks that employ context encoders over SFs and reward weights for rapid adaptation (Han et al., 2022). This factorization outperforms transition-only trajectory encoders, provides better context disentanglement, and allows data-efficient adaptation across tasks.

Goal-conditioning and exploration: In high-dimensional, long-horizon goal-conditioned RL (GCRL), successor features underpin both exploration bonuses (via SF-based novelty metrics) and goal-conditioned control (via SF-based Q-decomposition), enabling scalable graph-based planners for complex navigation domains (Hoang et al., 2021).

5. Extensions and Implementation Variants

Feature Learning: While early SF methods assumed readily available or fixed $\phi$, contemporary approaches learn features end-to-end using auxiliary reward-prediction, contrastive, or mutual information objectives (Chua et al., 2024, Carvalho et al., 2023, Hansen et al., 2019). Categorical or universal SF approximators (e.g., CSFA, USFA) condition on explicit task codes or context embeddings, enhancing generalization and supporting large-scale flexible transfer (Borsa et al., 2018, Carvalho et al., 2023).

Handling Nonlinear Rewards: The canonical SF decomposition is restricted to linearly-parameterizable rewards. Successor Feature Representations (SFR) generalize SFs by estimating the cumulative future distribution over features, allowing policy evaluation for arbitrary reward functions $R(\phi)$ (Reinke et al., 2021).

Nonlinear Function Approximation and Stability: Standard SF learning with deep networks often relies on semi-gradient TD updates, which can be unstable. Full-gradient schemes (FG-SFRQL) jointly optimize the full Bellman residual, providing convergence guarantees and reducing instability in complex domains (Shrirao et al., 1 Apr 2026).

Transfer Across Dynamics: Extensions utilizing Gaussian Process SF models (GP-SFs) treat source-task SFs as noisy measurements for target environments, enabling sample-efficient adaptation even across transition shifts (Abdolshah et al., 2021).

Unsupervised Skill Discovery: SF-based representations are effective foundations for unsupervised pre-training, skill induction, and exploration; methods such as VISR, NMPS, and SFL explicitly leverage SFs' decoupling properties for scalable skill learning and rapid downstream adaptation (Hansen et al., 2019, Kim et al., 2024, Hoang et al., 2021).

Inverse Reinforcement Learning: Successor-feature matching enables direct policy gradient-based imitation from demonstrations, even in state-only settings, bypassing the need for adversarial reward learning (Jain et al., 2024).

6. Empirical Validation and Domains of Application

SF decomposition has been empirically validated across:

Classic RL benchmarks (Four Rooms, Deep Sea Treasure, MultiRoom, grid worlds), demonstrating efficient transfer, rapid adaptation to changing goals, and improved policy exploration (Barreto et al., 2016, Alegre et al., 2022, Hoang et al., 2021).
Continuous control domains (MuJoCo, Reacher, Half-Cheetah, Walker, Quadruped): SF-enabled algorithms achieve superior transfer efficiency and final performance compared to DQN, actor-critic, or monolithic pretraining methods (Kim et al., 2024, Zhang et al., 2024, Chua et al., 2024).
Vision-based navigation and high-dimensional control (DeepMind Lab, ViZDoom, Minigrid, Miniworld): End-to-end deep SF learning—especially via reward-as-feature and keyboard-based approaches—affords near-instantaneous skill reuse on novel composites of tasks (Borsa et al., 2018, Carvalho et al., 2023).
Non-Markovian task specifications and compositional planning: SF-based approaches enable globally optimal solution synthesis from learned policy bases (Infante et al., 2024).
Lifelong RL and continual learning: Online SF adaptation supports robust reuse, fast adjustment to task shifts, and resistance to catastrophic forgetting (Chua et al., 2024).

7. Limitations, Practical Issues, and Future Trends

Linear reward assumption: Classical SF decomposition is limited to environments where all tasks of interest admit a known or learnable linear reward structure. Extensions such as SFR (Reinke et al., 2021) or learned universal basis features mitigate this restriction.

Policy dependence: SFs are always policy-specific ($\psi^\pi$); if the optimal policy changes drastically between tasks, the previously learned SFs may be suboptimal and require recomputation (Lehnert et al., 2017, Zhang et al., 2024). Generalized Policy Improvement over a diverse library of policies partially alleviates this issue.

Representation collapse and stability: Single-term Bellman losses can lead to degenerate (constant) feature encodings; this is addressed with joint reward-prediction and SF objectives, categorical output heads, stop-gradient regularizations, and target network synchronization (Chua et al., 2024, Carvalho et al., 2023).

Challenge of feature learning: Discovering a minimal, sufficient basis $\phi$ is nontrivial in complex or visually-rich environments; recent advances leverage end-to-end contrastive learning, skill-discovery, and auxiliary loss architectures (Chua et al., 2024, Borsa et al., 2018, Carvalho et al., 2023).

Computational complexity: Construction of convex coverage sets and joint learning of CCS policy libraries can be costly in high-dimensional or highly multi-objective domains, though the online cost post-training is typically low (Alegre et al., 2022, Infante et al., 2024).

Related research directions include: generalization to nonlinear or non-parametric reward functionals, deeper integration with bisimulation-based state abstractions, improved modularity for compositional RL, robust transfer across both reward and transition function shift, and hybridization with model-based planning algorithms.

Successor Feature Decomposition thus provides an algebraic foundation for sample-efficient, modular, and transferable reinforcement learning across diverse settings, with deep theoretical guarantees and substantial empirical validation across tabular, deep, multitask, meta-learning, and continual RL scenarios (Barreto et al., 2016, Alegre et al., 2022, Borsa et al., 2018, Carvalho et al., 2023, Infante et al., 2024, Chua et al., 2024, Shrirao et al., 1 Apr 2026, Jain et al., 2024, Reinke et al., 2021, Zhang et al., 2024, Barreto et al., 2019).