Distributional Successor Measure
- Distributional Successor Measure is a generalized framework that captures the full distribution of cumulative outcomes, not just their means, under a given policy.
- It leverages methodologies like distributional successor features and measure-valued generative models to enable risk-sensitive evaluation and zero-shot transfer.
- Empirical benchmarks on AntMaze and the DM Control Suite show that these techniques outperform classical approaches by avoiding the errors introduced by expected-value approximations.
A distributional successor measure is a generalization of the successor representation framework in reinforcement learning and control that captures not only mean statistics of state occupancies under a policy, but the full distributional structure of cumulative future outcomes, occupancy measures, or features. Formally, the distributional successor measure is characterized as either a distribution over future cumulative features, a measure-valued random variable, or as a distribution over distributions (i.e., a probability law on the space of occupancy measures). This distributional perspective enables richer forms of policy evaluation, planning, risk-sensitive control, and zero-shot transfer by supporting downstream application of arbitrary functionals or loss maps, rather than being restricted to expected-value computations.
1. Formal Definitions and Theoretical Foundations
In the context of Markov Decision Processes (MDPs), the classic successor representation (SR) of a policy $\pi$ provides expected (discounted) visitation measures:

$$\Psi^\pi(s, X) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t\, \mathbb{1}\{S_t \in X\} \,\middle|\, S_0 = s\right].$$
This expected measure can be extended in three main senses in current literature:
- Distributional Successor Features (DSF): The discounted sum of feature vectors along trajectories forms a random variable, $\psi^\pi(s) = \sum_{t=0}^{\infty} \gamma^t \phi(S_t)$ with $S_0 = s$, and one models its full law $\eta^\pi(s) = \mathrm{Law}(\psi^\pi(s))$ rather than just the mean (Zhu et al., 2024).
- Distributional Successor Measure (DSM): The random occupancy measure $M^\pi(s) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \delta_{S_t}$, $S_0 = s$, is treated as a random variable in the space of measures, and its law $\mathrm{Law}(M^\pi(s))$ constitutes a distribution over occupancy measures (Wiltzer et al., 2024).
- Successor Feature Representation (SFR): Directly parameterizes the full discounted distribution over successor features as a vector- or measure-valued object (Reinke et al., 2021).
In all cases, this generalization satisfies a distributional Bellman recursion. For the DSF case,

$$\psi^\pi(s) \stackrel{D}{=} \phi(s) + \gamma\, \psi^\pi(S'), \qquad S' \sim P^\pi(\cdot \mid s),$$

and for the DSM case,

$$M^\pi(s) \stackrel{D}{=} (1-\gamma)\,\delta_s + \gamma\, M^\pi(S'), \qquad S' \sim P^\pi(\cdot \mid s),$$

with the law being the unique fixed point of a contractive operator on the space of probability measures (Wiltzer et al., 2024).
These operators are $\gamma$-contractions in the relevant probability metric (e.g., a supremal Wasserstein distance for the DSM (Wiltzer et al., 2024)); thus fixed-point iteration or parametric learning (e.g., diffusion models for the DSF law (Zhu et al., 2024)) is theoretically sound.
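The contraction argument above suggests a simple sanity check: applying the distributional Bellman update repeatedly to a particle-cloud approximation of each state's DSF converges, and the particle means recover the classical SF fixed point. Below is a minimal sketch on a toy 3-state chain; the chain, features, and particle counts are all illustrative, not from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-state Markov chain with 2-D features (illustrative values).
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])
phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.5, 0.5]])
gamma, n_particles, n_iters = 0.9, 512, 300

# Each state's DSF is a particle cloud approximating the law of
# psi(s) = sum_t gamma^t phi(S_t).  The distributional Bellman update
# psi(s) =_D phi(s) + gamma * psi(S'), S' ~ P(.|s), is applied by resampling.
particles = np.zeros((3, n_particles, 2))
for _ in range(n_iters):
    new = np.empty_like(particles)
    for s in range(3):
        succ = rng.choice(3, size=n_particles, p=P[s])     # sample S'
        idx = rng.integers(n_particles, size=n_particles)  # resample atoms
        new[s] = phi[s] + gamma * particles[succ, idx]
    particles = new

# The mean of each cloud should match the classical (mean-only) SF fixed
# point Psi = (I - gamma P)^{-1} Phi.
Psi = np.linalg.solve(np.eye(3) - gamma * P, phi)
print(np.abs(particles.mean(axis=1) - Psi).max())  # deviation from mean-only SF
```

The full clouds additionally carry the variance and tail structure of the discounted feature sums, which the mean-only SF discards.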
2. Algorithmic Realizations
Distributional successor feature learning can be instantiated using parametric deep generative models, such as conditional diffusion models (e.g., DDIM), for both the successor-feature distribution and the outcome-conditioned policy (Zhu et al., 2024).
Learning objectives:
- Score-matching loss for the DSF law $\eta^\pi$ (denoising diffusion objective).
- Policy head training via maximum-likelihood under policy-conditioned targets.
DSM learning employs a two-level particle approximation:
- The random occupancy measure is represented as an ensemble of measure-valued generative models (e.g., neural samplers or GANs), trained with a model-level Maximum Mean Discrepancy loss between the Bellman target and the current approximation, using kernels defined on measures (Wiltzer et al., 2024).
Notable learning techniques:
- n-step bootstrapping in DSMs to stabilize long-horizon learning.
- Adaptive, adversarial kernels for MMD to match evolving supports of atoms in the measure-ensemble.
- Conditional score guidance ("classifier-free guidance") for efficient conditional sampling in DSF using learned score models (Zhu et al., 2024).
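The MMD-based matching idea can be sketched generically. The estimator below uses a plain RBF kernel between particle sets; the DSM work defines kernels on measures, but the same V-statistic form applies once atoms are embedded as vectors, so treat this as an illustrative sketch rather than the papers' loss.

```python
import numpy as np

def mmd2(x, y, bandwidth=1.0):
    """Squared MMD (V-statistic) between two particle sets, RBF kernel."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

rng = np.random.default_rng(1)
target = rng.normal(0.0, 1.0, size=(256, 2))   # stand-in for Bellman-target atoms
close  = rng.normal(0.0, 1.0, size=(256, 2))   # well-matched approximation
far    = rng.normal(3.0, 1.0, size=(256, 2))   # poorly matched approximation
print(mmd2(target, close), mmd2(target, far))  # first is much smaller
```

Minimizing such a discrepancy between the Bellman target and the current model is what drives the ensemble toward the distributional fixed point; the adaptive, adversarial kernels mentioned above serve to keep the loss informative as the atoms' supports move.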
3. Planning, Policy Optimization, and Zero-Shot Transfer
Distributional successor measures enable general-purpose, reward-agnostic transfer and risk-sensitive planning.
- Zero-Shot Policy Optimization: Given a new reward, particularly one linear in features ($r = w^\top \phi$), DSF-based methods search for high-probability, high-return feature outcomes (e.g., maximizing $w^\top \psi$ over sampled outcomes $\psi$), then condition the policy on this outcome for action selection (Zhu et al., 2024).
- Risk-Sensitive Evaluation: DSMs provide full return distributions for any reward functional, admitting evaluation of arbitrary risk measures (e.g., Conditional Value-at-Risk, quantile statistics) without further learning or environment interaction (Wiltzer et al., 2024).
- Generalized Policy Improvement: In SFR, policies can be evaluated and generalized across a bank of source tasks via transfer bounds that depend only on the distance of the target reward to the closest source (Reinke et al., 2021).
These properties remove the need for explicit value iteration, planning rollouts, or per-task adaptation, yielding strong generalization for unseen tasks or reward maps with concretely bounded performance loss relative to approximation error and coverage (Zhu et al., 2024, Wiltzer et al., 2024).
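For linear rewards, zero-shot risk-sensitive evaluation reduces to pushing reward weights through samples of the cumulative feature: returns are $w^\top \psi$, so the mean, quantiles, and CVaR come directly from the sample cloud with no further learning or rollouts. A minimal sketch (the synthetic $\psi$ cloud stands in for samples from a learned DSF):

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for samples of psi drawn from a learned DSF model (illustrative).
psi_samples = rng.normal([5.0, 2.0], [2.0, 0.5], size=(10_000, 2))

def evaluate(w, alpha=0.1):
    """Zero-shot evaluation of a linear reward r = w^T phi.

    Returns under this reward are w^T psi, so the mean and the lower-tail
    CVaR_alpha are computed straight from the samples.
    """
    returns = psi_samples @ w
    mean = returns.mean()
    var_alpha = np.quantile(returns, alpha)      # Value-at-Risk at level alpha
    cvar = returns[returns <= var_alpha].mean()  # expected return in the tail
    return mean, cvar

# Rank two candidate reward weightings without any environment interaction.
for w in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    print(w, evaluate(w))
```

Swapping CVaR for any other distributional functional (quantiles, variance penalties) requires no retraining, which is the practical content of the zero-shot claims above.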
4. Connections to Related Successor Frameworks
Classical successor features (SF), which parameterize the expected cumulative feature as

$$\psi^\pi_{\mathrm{SF}}(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t \phi(S_t) \,\middle|\, S_0 = s\right],$$

are a special case recovered by taking expectations in DSF or integrating the measure in SFR (Reinke et al., 2021; Vertes et al., 2019). The distributional generalization decouples the reward entirely from the transition structure and can thus support arbitrary downstream evaluation without linearity or basis-mismatch constraints.
The Proto Successor Measure (PSM) formalism provides an alternative affine-space decomposition of the entire policy-induced visitation space (Agarwal et al., 2024). In this perspective, all possible successor measures correspond to an affine subspace (basis+bias), enabling exact zero-shot LP-based policy synthesis for arbitrary rewards.
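The affine structure behind PSM is easy to observe directly in a tiny MDP: the successor measures of all policies satisfy linear Bellman-flow constraints, so they occupy an affine subspace of lower dimension than a generic point cloud would. The sketch below (illustrative 2-state, 2-action MDP, not from the paper) enumerates the deterministic policies and checks the affine dimension of their successor measures.

```python
import itertools
import numpy as np

gamma = 0.9
# P[a] is the transition matrix under action a (2 states, 2 actions).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],      # action 0: stay put
              [[0.5, 0.5], [0.5, 0.5]]])     # action 1: jump uniformly

sms = []
for policy in itertools.product([0, 1], repeat=2):   # one action per state
    Ppi = np.array([P[policy[s], s] for s in range(2)])
    # Normalized successor measure M = (1-gamma)(I - gamma P_pi)^{-1}.
    M = (1 - gamma) * np.linalg.inv(np.eye(2) - gamma * Ppi)
    sms.append(M.ravel())
sms = np.array(sms)

# Affine dimension = rank of the differences from a reference point.
# Four generic points in R^4 would give rank 3; Bellman-flow constraints
# (each row of M sums to 1) cut this down.
diffs = sms[1:] - sms[0]
print(np.linalg.matrix_rank(diffs, tol=1e-8))  # prints 2
```

Representing this subspace explicitly as basis plus bias is what lets PSM reduce zero-shot policy synthesis for a new reward to a linear program over the affine coordinates.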
5. Empirical Results and Benchmarks
Empirical studies across continuous control (e.g., D4RL AntMaze, Franka Kitchen, Roboverse) and discrete domains demonstrate that distributional successor measure methods outperform classical model-free and model-based RL on several metrics (Zhu et al., 2024, Wiltzer et al., 2024, Agarwal et al., 2024):
- AntMaze-medium-diverse-v2: DSF achieves 631 ± 67 average return vs. 236 ± 4 (MOPO), 418 ± 16 (COMBO), 238 ± 4 (SF); guided diffusion-based planning achieves comparable or better performance with lower inference cost (Zhu et al., 2024).
- Preference AntMaze: GOM method respects complex, human-specified preference patterns, while goal-conditioned baselines fail (Zhu et al., 2024).
- Risk-Sensitive Policy Evaluation: DSMs enable accurate zero-shot ranking by CVaR and mean for diverse reward scenarios in continuous and stochastic environments, outperforming model-based rollouts and value-based approaches (Wiltzer et al., 2024).
- Zero-Shot Control: PSMs and DSFs recover optimal or near-optimal performance in both discrete and continuous domains (e.g., 100% four-room gridworld goal-reaching, FetchReach 95% success, DM Control Suite zero-shot return ≈690 on Walker) (Agarwal et al., 2024).
These results validate the central claim that distributional successor measures circumvent compounding-model error and nonlinearity-induced value misestimation endemic to classical value-based and model-based approaches.
6. Extensions: Partial Observability and Biological Plausibility
Distributional successor features have also been investigated with respect to partial observability and neural plausibility. By employing population-coded inference schemes such as the Distributed Distributional Code (DDC), the moment statistics over latent state beliefs can be propagated and combined with distributional SFs. This yields value functions and planning strategies robust to sensory noise and state uncertainty, supported by closed-form matrix-dynamics or recurrent neural implementations and compatible with local Hebbian error-correction learning (Vertes et al., 2019).
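The matrix-dynamics flavor of this construction can be sketched with a simplified stand-in: a closed-form Bayes filter over latent states combined with successor features evaluated under the belief. This is illustrative only, using expected SFs for brevity rather than the DDC population code of Vertes et al.; the transition, observation, and feature matrices are made up.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])              # latent transition matrix (illustrative)
O = np.array([[0.8, 0.2],
              [0.3, 0.7]])              # O[s, o] = p(obs o | latent state s)
phi = np.eye(2)                         # one-hot latent features
gamma = 0.9
Psi = np.linalg.solve(np.eye(2) - gamma * P, phi)   # SF per latent state
w = np.array([1.0, 0.0])                # linear reward weights

def step_belief(b, obs):
    """Closed-form belief update: predict with P, correct with O[:, obs]."""
    b = P.T @ b          # predict one step of latent dynamics
    b = b * O[:, obs]    # reweight by observation likelihood
    return b / b.sum()

b = np.array([0.5, 0.5])                # uniform prior over latent states
for obs in [0, 0, 1, 0]:
    b = step_belief(b, obs)
    value = b @ Psi @ w                 # belief-weighted SF value
    print(obs, b.round(3), round(float(value), 3))
```

Because both the belief update and the value readout are linear-algebraic, they admit the recurrent network implementations and local learning rules discussed above.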
7. Summary Table: Distributional Successor Measure Variants
| Variant | Core Object | Bellman Recursion | Applications |
|---|---|---|---|
| DSF/DiSPO (Zhu et al., 2024) | Law of $\psi^\pi = \sum_t \gamma^t \phi(S_t)$ (cumulative features) | $\psi^\pi \stackrel{D}{=} \phi + \gamma\, \psi^\pi(S')$ | Zero-shot planning, transfer |
| DSM (Wiltzer et al., 2024) | Distribution over occupancy measures | $M^\pi \stackrel{D}{=} (1-\gamma)\delta_s + \gamma\, M^\pi(S')$ (distributional contraction) | Risk-sensitive eval, transfer |
| SFR (Reinke et al., 2021) | Discounted successor-feature law | Distributional Bellman operator on feature laws | Transfer, general reward eval |
| PSM (Agarwal et al., 2024) | Affine basis of successor measures | Linear Bellman-flow constraints | Zero-shot control, convex synthesis |
References
- "Distributional Successor Features Enable Zero-Shot Policy Optimization" (Zhu et al., 2024)
- "A Distributional Analogue to the Successor Representation" (Wiltzer et al., 2024)
- "Proto Successor Measure: Representing the Behavior Space of an RL Agent" (Agarwal et al., 2024)
- "Successor Feature Representations" (Reinke et al., 2021)
- "A neurally plausible model learns successor representations in partially observable environments" (Vertes et al., 2019)