Distributional Successor Measure
- Distributional Successor Measure is a generalized framework that captures the full distribution of cumulative outcomes, not just their means, under a given policy.
- It leverages methodologies like distributional successor features and measure-valued generative models to enable risk-sensitive evaluation and zero-shot transfer.
- Empirical benchmarks on AntMaze and the DM Control Suite show that these techniques outperform classical approaches by avoiding the errors introduced by expected-value approximations.
A distributional successor measure is a generalization of the successor representation framework in reinforcement learning and control that captures not only mean statistics of state occupancies under a policy, but the full distributional structure of cumulative future outcomes, occupancy measures, or features. Formally, the distributional successor measure is characterized as either a distribution over future cumulative features, a measure-valued random variable, or as a distribution over distributions (i.e., a probability law on the space of occupancy measures). This distributional perspective enables richer forms of policy evaluation, planning, risk-sensitive control, and zero-shot transfer by supporting downstream application of arbitrary functionals or loss maps, rather than being restricted to expected-value computations.
1. Formal Definitions and Theoretical Foundations
In the context of Markov Decision Processes (MDPs), the classic successor representation (SR) of a policy $\pi$ provides expected (discounted) visitation measures:

$$\Psi^\pi(s, X) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t\, \mathbb{1}\{S_t \in X\} \,\middle|\, S_0 = s\right].$$
This expected measure can be extended in three main senses in current literature:
- Distributional Successor Features (DSF): The discounted sum of feature vectors along trajectories forms a random variable, $\psi^\pi(s) = \sum_{t=0}^{\infty} \gamma^t \phi(S_t)$ with $S_0 = s$, and one models its full law $\eta^\pi(s) = \mathrm{Law}(\psi^\pi(s))$ rather than just the mean (Zhu et al., 2024).
- Distributional Successor Measure (DSM): The random occupancy measure $M^\pi(s) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \delta_{S_t}$, $S_0 = s$, is treated as a random variable in the space of measures, and its law $\mathrm{Law}(M^\pi(s))$ constitutes a distribution over occupancy measures (Wiltzer et al., 2024).
- Successor Feature Representation (SFR): Directly parameterizes the full discounted distribution over successor features as a vector- or measure-valued object (Reinke et al., 2021).
In all cases, this generalization satisfies a distributional Bellman recursion. For the DSF case,

$$\psi^\pi(s) \stackrel{D}{=} \phi(s) + \gamma\, \psi^\pi(S'), \qquad S' \sim P^\pi(\cdot \mid s),$$

and for the DSM case,

$$M^\pi(s) \stackrel{D}{=} (1-\gamma)\,\delta_s + \gamma\, M^\pi(S'), \qquad S' \sim P^\pi(\cdot \mid s),$$

with the law being the unique fixed point of a contractive operator on the space of probability measures (Wiltzer et al., 2024).
These operators are $\gamma$-contractions in the relevant probability metric (e.g., a supremal Wasserstein distance for the DSM (Wiltzer et al., 2024)); thus fixed-point iteration or parametric learning (e.g., diffusion models for the DSF law (Zhu et al., 2024)) is theoretically sound.
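The contraction argument above suggests a simple sanity check: applying the distributional Bellman update repeatedly to a particle-cloud approximation of each state's DSF converges, and the particle means recover the classical SF fixed point. Below is a minimal sketch on a toy 3-state chain; the chain, features, and particle counts are all illustrative, not from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-state Markov chain with 2-D features (illustrative values).
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.1, 0.0, 0.9]])
phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.5, 0.5]])
gamma, n_particles, n_iters = 0.9, 512, 300

# Each state's DSF is a particle cloud approximating the law of
# psi(s) = sum_t gamma^t phi(S_t).  The distributional Bellman update
# psi(s) =_D phi(s) + gamma * psi(S'), S' ~ P(.|s), is applied by resampling.
particles = np.zeros((3, n_particles, 2))
for _ in range(n_iters):
    new = np.empty_like(particles)
    for s in range(3):
        succ = rng.choice(3, size=n_particles, p=P[s])     # sample S'
        idx = rng.integers(n_particles, size=n_particles)  # resample atoms
        new[s] = phi[s] + gamma * particles[succ, idx]
    particles = new

# The mean of each cloud should match the classical (mean-only) SF fixed
# point Psi = (I - gamma P)^{-1} Phi.
Psi = np.linalg.solve(np.eye(3) - gamma * P, phi)
print(np.abs(particles.mean(axis=1) - Psi).max())  # deviation from mean-only SF
```

The full clouds additionally carry the variance and tail structure of the discounted feature sums, which the mean-only SF discards.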
2. Algorithmic Realizations
Distributional successor feature learning can be instantiated using parametric deep generative models, such as conditional diffusion models (e.g., DDIM), for both the successor-feature distribution and the outcome-conditioned policy (Zhu et al., 2024).
Learning objectives:
- Score-matching loss for the DSF law $\eta^\pi$ (denoising diffusion objective).
- Policy head training via maximum-likelihood under policy-conditioned targets.
DSM learning employs a two-level particle approximation:
- The random occupancy measure is represented as an ensemble of measure-valued generative models (e.g., neural samplers or GANs), trained with a model-level Maximum Mean Discrepancy loss between the Bellman target and the current approximation, using kernels defined on measures (Wiltzer et al., 2024).
Notable learning techniques:
- n-step bootstrapping in DSMs to stabilize long-horizon learning.
- Adaptive, adversarial kernels for MMD to match evolving supports of atoms in the measure-ensemble.
- Conditional score guidance ("classifier-free guidance") for efficient conditional sampling in DSF using learned score models (Zhu et al., 2024).
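The MMD-based matching idea can be sketched generically. The estimator below uses a plain RBF kernel between particle sets; the DSM work defines kernels on measures, but the same V-statistic form applies once atoms are embedded as vectors, so treat this as an illustrative sketch rather than the papers' loss.

```python
import numpy as np

def mmd2(x, y, bandwidth=1.0):
    """Squared MMD (V-statistic) between two particle sets, RBF kernel."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

rng = np.random.default_rng(1)
target = rng.normal(0.0, 1.0, size=(256, 2))   # stand-in for Bellman-target atoms
close  = rng.normal(0.0, 1.0, size=(256, 2))   # well-matched approximation
far    = rng.normal(3.0, 1.0, size=(256, 2))   # poorly matched approximation
print(mmd2(target, close), mmd2(target, far))  # first is much smaller
```

Minimizing such a discrepancy between the Bellman target and the current model is what drives the ensemble toward the distributional fixed point; the adaptive, adversarial kernels mentioned above serve to keep the loss informative as the atoms' supports move.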
3. Planning, Policy Optimization, and Zero-Shot Transfer
Distributional successor measures enable general-purpose, reward-agnostic transfer and risk-sensitive planning.
- Zero-Shot Policy Optimization: Given a new reward, particularly one linear in features ($r = w^\top \phi$), DSF-based methods search for high-probability, high-return feature outcomes (e.g., maximizing $w^\top \psi$ over sampled outcomes $\psi$), then condition the policy on this outcome for action selection (Zhu et al., 2024).
- Risk-Sensitive Evaluation: DSMs provide full return distributions for any reward functional, admitting evaluation of arbitrary risk measures (e.g., Conditional Value-at-Risk, quantile statistics) without further learning or environment interaction (Wiltzer et al., 2024).
- Generalized Policy Improvement: In SFR, policies can be evaluated and generalized across a bank of source tasks via transfer bounds that depend only on the distance of the target reward to the closest source (Reinke et al., 2021).
These properties remove the need for explicit value iteration, planning rollouts, or per-task adaptation, yielding strong generalization for unseen tasks or reward maps with concretely bounded performance loss relative to approximation error and coverage (Zhu et al., 2024, Wiltzer et al., 2024).
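For linear rewards, zero-shot risk-sensitive evaluation reduces to pushing reward weights through samples of the cumulative feature: returns are $w^\top \psi$, so the mean, quantiles, and CVaR come directly from the sample cloud with no further learning or rollouts. A minimal sketch (the synthetic $\psi$ cloud stands in for samples from a learned DSF):

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for samples of psi drawn from a learned DSF model (illustrative).
psi_samples = rng.normal([5.0, 2.0], [2.0, 0.5], size=(10_000, 2))

def evaluate(w, alpha=0.1):
    """Zero-shot evaluation of a linear reward r = w^T phi.

    Returns under this reward are w^T psi, so the mean and the lower-tail
    CVaR_alpha are computed straight from the samples.
    """
    returns = psi_samples @ w
    mean = returns.mean()
    var_alpha = np.quantile(returns, alpha)      # Value-at-Risk at level alpha
    cvar = returns[returns <= var_alpha].mean()  # expected return in the tail
    return mean, cvar

# Rank two candidate reward weightings without any environment interaction.
for w in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    print(w, evaluate(w))
```

Swapping CVaR for any other distributional functional (quantiles, variance penalties) requires no retraining, which is the practical content of the zero-shot claims above.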
4. Connections to Related Successor Frameworks
Classical successor features (SF), which parameterize the expected cumulative feature as

$$\psi^\pi_{\mathrm{SF}}(s) = \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t \phi(S_t) \,\middle|\, S_0 = s\right],$$

are a special case recovered by taking expectations in DSF or integrating the measure in SFR (Reinke et al., 2021; Vertes et al., 2019). The distributional generalization decouples the reward entirely from the transition structure and can thus support arbitrary downstream evaluation without linearity or basis-mismatch constraints.
The Proto Successor Measure (PSM) formalism provides an alternative affine-space decomposition of the entire policy-induced visitation space (Agarwal et al., 2024). In this perspective, all possible successor measures correspond to an affine subspace (basis+bias), enabling exact zero-shot LP-based policy synthesis for arbitrary rewards.
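The affine structure behind PSM is easy to observe directly in a tiny MDP: the successor measures of all policies satisfy linear Bellman-flow constraints, so they occupy an affine subspace of lower dimension than a generic point cloud would. The sketch below (illustrative 2-state, 2-action MDP, not from the paper) enumerates the deterministic policies and checks the affine dimension of their successor measures.

```python
import itertools
import numpy as np

gamma = 0.9
# P[a] is the transition matrix under action a (2 states, 2 actions).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],      # action 0: stay put
              [[0.5, 0.5], [0.5, 0.5]]])     # action 1: jump uniformly

sms = []
for policy in itertools.product([0, 1], repeat=2):   # one action per state
    Ppi = np.array([P[policy[s], s] for s in range(2)])
    # Normalized successor measure M = (1-gamma)(I - gamma P_pi)^{-1}.
    M = (1 - gamma) * np.linalg.inv(np.eye(2) - gamma * Ppi)
    sms.append(M.ravel())
sms = np.array(sms)

# Affine dimension = rank of the differences from a reference point.
# Four generic points in R^4 would give rank 3; Bellman-flow constraints
# (each row of M sums to 1) cut this down.
diffs = sms[1:] - sms[0]
print(np.linalg.matrix_rank(diffs, tol=1e-8))  # prints 2
```

Representing this subspace explicitly as basis plus bias is what lets PSM reduce zero-shot policy synthesis for a new reward to a linear program over the affine coordinates.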
5. Empirical Results and Benchmarks
Empirical studies across continuous control (e.g., D4RL AntMaze, Franka Kitchen, Roboverse) and discrete domains demonstrate that distributional successor measure methods outperform classical model-free and model-based RL on several metrics (Zhu et al., 2024, Wiltzer et al., 2024, Agarwal et al., 2024):
- AntMaze-medium-diverse-v2: DSF achieves 631 ± 67 average return vs. 236 ± 4 (MOPO), 418 ± 16 (COMBO), 238 ± 4 (SF); guided diffusion-based planning achieves comparable or better performance with lower inference cost (Zhu et al., 2024).
- Preference AntMaze: GOM method respects complex, human-specified preference patterns, while goal-conditioned baselines fail (Zhu et al., 2024).
- Risk-Sensitive Policy Evaluation: DSMs enable accurate zero-shot ranking by CVaR and mean for diverse reward scenarios in continuous and stochastic environments, outperforming model-based rollouts and value-based approaches (Wiltzer et al., 2024).
- Zero-Shot Control: PSMs and DSFs recover optimal or near-optimal performance in both discrete and continuous domains (e.g., 100% four-room gridworld goal-reaching, FetchReach 95% success, DM Control Suite zero-shot return ≈690 on Walker) (Agarwal et al., 2024).
These results validate the central claim that distributional successor measures circumvent compounding-model error and nonlinearity-induced value misestimation endemic to classical value-based and model-based approaches.
6. Extensions: Partial Observability and Biological Plausibility
Distributional successor features have also been investigated with respect to partial observability and neural plausibility. By employing population-coded inference schemes such as the Distributed Distributional Code (DDC), the moment statistics over latent state beliefs can be propagated and combined with distributional SFs. This yields value functions and planning strategies robust to sensory noise and state uncertainty, supported by closed-form matrix-dynamics or recurrent neural implementations and compatible with local Hebbian error-correction learning (Vertes et al., 2019).
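The matrix-dynamics flavor of this construction can be sketched with a simplified stand-in: a closed-form Bayes filter over latent states combined with successor features evaluated under the belief. This is illustrative only, using expected SFs for brevity rather than the DDC population code of Vertes et al.; the transition, observation, and feature matrices are made up.

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])              # latent transition matrix (illustrative)
O = np.array([[0.8, 0.2],
              [0.3, 0.7]])              # O[s, o] = p(obs o | latent state s)
phi = np.eye(2)                         # one-hot latent features
gamma = 0.9
Psi = np.linalg.solve(np.eye(2) - gamma * P, phi)   # SF per latent state
w = np.array([1.0, 0.0])                # linear reward weights

def step_belief(b, obs):
    """Closed-form belief update: predict with P, correct with O[:, obs]."""
    b = P.T @ b          # predict one step of latent dynamics
    b = b * O[:, obs]    # reweight by observation likelihood
    return b / b.sum()

b = np.array([0.5, 0.5])                # uniform prior over latent states
for obs in [0, 0, 1, 0]:
    b = step_belief(b, obs)
    value = b @ Psi @ w                 # belief-weighted SF value
    print(obs, b.round(3), round(float(value), 3))
```

Because both the belief update and the value readout are linear-algebraic, they admit the recurrent network implementations and local learning rules discussed above.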
7. Summary Table: Distributional Successor Measure Variants
| Variant | Core Object | Bellman Recursion | Applications |
|---|---|---|---|
| DSF/DiSPO (Zhu et al., 2024) | Law of $\psi^\pi = \sum_t \gamma^t \phi(S_t)$ (cumulative features) | $\psi^\pi \stackrel{D}{=} \phi + \gamma\, \psi^\pi(S')$ | Zero-shot planning, transfer |
| DSM (Wiltzer et al., 2024) | Distribution over occupancy measures | $M^\pi \stackrel{D}{=} (1-\gamma)\delta_s + \gamma\, M^\pi(S')$ (distributional contraction) | Risk-sensitive eval, transfer |
| SFR (Reinke et al., 2021) | Discounted successor-feature law | Distributional Bellman operator on feature laws | Transfer, general reward eval |
| PSM (Agarwal et al., 2024) | Affine basis of successor measures | Linear Bellman-flow constraints | Zero-shot control, convex synthesis |
References
- "Distributional Successor Features Enable Zero-Shot Policy Optimization" (Zhu et al., 2024)
- "A Distributional Analogue to the Successor Representation" (Wiltzer et al., 2024)
- "Proto Successor Measure: Representing the Behavior Space of an RL Agent" (Agarwal et al., 2024)
- "Successor Feature Representations" (Reinke et al., 2021)
- "A neurally plausible model learns successor representations in partially observable environments" (Vertes et al., 2019)