Bellman-Closed Feature Transport in RL
- Bellman-Closed Feature Transport is a framework ensuring that Bellman updates preserve the representational span of state-action features.
- It employs a spectral learning objective that leverages singular value decomposition of feature covariances to enforce closure under Bellman dynamics.
- The method enhances exploration and long-horizon credit assignment, demonstrating significant gains in performance on Atari benchmarks.
Bellman-Closed Feature Transport is a principled framework in reinforcement learning (RL) for learning state-action representations such that the span of value functions is preserved—i.e., “closed”—under Bellman backups. Developed as a core component of the Spectral Bellman Method (SBM), this perspective unifies representation learning and structured exploration by leveraging the inherent spectral properties of the Bellman operator when acting on linearly parameterized function classes. The central mechanism involves enforcing or approximating “Bellman-closure” through a spectral learning objective, ensuring that value estimates remain within the learned feature space as dictated by the Bellman update dynamics (Nabati et al., 17 Jul 2025).
1. Formal Foundations: Bellman Closure and Inherent Bellman Error
Let $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ be a feature map, and consider the linear value function class $\mathcal{Q}_\phi = \{ Q_\theta = \phi^\top \theta : \theta \in \Theta \}$. The Inherent Bellman Error (IBE) quantifies the minimal residual when projecting the outcome of a Bellman update back onto $\mathcal{Q}_\phi$: $\mathcal{E}(\phi) = \sup_{\theta \in \Theta} \inf_{\theta' \in \Theta} \| Q_{\theta'} - \mathcal{T}^\pi Q_\theta \|$. Zero-IBE (the “Bellman-closed” assumption) holds if $\mathcal{E}(\phi) = 0$, i.e., the Bellman operator maps $\mathcal{Q}_\phi$ exactly into itself. For any policy $\pi$, the $n$-step Bellman extension is recursively defined: $(\mathcal{T}^\pi)^n Q = \mathcal{T}^\pi \big( (\mathcal{T}^\pi)^{n-1} Q \big)$.
Bellman-closed feature transport refers to constructing or learning a feature space for which the Bellman update of any $Q_\theta \in \mathcal{Q}_\phi$ remains in $\mathrm{span}(\phi)$, ensuring the representational stability of value iteration and policy evaluation.
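As a concrete illustration, the sketch below builds a hypothetical finite MDP with randomly generated `P` and `r` (all names are illustrative, not from the paper) and measures the residual of one Bellman backup outside `span(Phi)`: generically positive for random features, and zero for tabular features, which are trivially Bellman-closed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sa, d, gamma = 6, 3, 0.9

# Hypothetical finite MDP pieces: reward r and policy-induced transition
# matrix P over state-action pairs.
P = rng.random((n_sa, n_sa))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(n_sa)

def bellman_residual(Phi):
    """Norm of the component of one Bellman backup lying outside span(Phi)."""
    proj = Phi @ np.linalg.pinv(Phi)       # orthogonal projector onto span(Phi)
    theta = rng.random(Phi.shape[1])
    TQ = r + gamma * P @ (Phi @ theta)     # Bellman backup of Q_theta
    return np.linalg.norm(TQ - proj @ TQ)

res_generic = bellman_residual(rng.random((n_sa, d)))  # random features: not closed
res_closed = bellman_residual(np.eye(n_sa))            # tabular features: closed
```

With random low-dimensional features the backup generically leaves the span (`res_generic > 0`), whereas the full-rank tabular case gives a residual at numerical zero.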
2. Spectral Decomposition and Feature Transport under Zero-IBE
If the Bellman-closed condition is met, a powerful spectral structure emerges. For a finite state-action space and a finite parameter set, let $\Phi$ aggregate the features (one row per state-action pair) and $W$ aggregate the parameters (one column per parameter vector). Under zero-IBE, for any set of weights $W$ and sampling distributions over state-action pairs and parameters, the (weighted) Bellman transport is $D_\mu^{1/2}\, \mathcal{T}^\pi(\Phi W)\, D_\nu^{1/2}$, where $D_\mu$ and $D_\nu$ are diagonal matrices of the sampling distributions. The feature covariance $\Lambda = \Phi^\top D_\mu \Phi$ dictates the singular values of the weighted feature matrix $D_\mu^{1/2} \Phi$: its nonzero singular values are the square roots of the eigenvalues of $\Lambda$. The Bellman operator acts linearly in feature space: $\mathcal{T}^\pi(\Phi \theta) = \Phi M \theta$ for some $M \in \mathbb{R}^{d \times d}$, so that $\mathcal{T}^\pi Q_\theta \in \mathcal{Q}_\phi$ for all $\theta$. The functional consequence is that Bellman transport never leaves $\mathrm{span}(\phi)$—achieving true feature transport under Bellman closure (Nabati et al., 17 Jul 2025).
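Under closure, the backup can be carried out entirely in parameter space. A minimal sketch, using tabular features (where closure holds trivially) and folding the reward into the transport as an affine offset `c` (an illustrative construction, not the paper's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(1)
n_sa, gamma = 5, 0.9
P = rng.random((n_sa, n_sa))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(n_sa)

Phi = np.eye(n_sa)                    # tabular features: trivially Bellman-closed
# Closure lets us solve P @ Phi = Phi @ M and r = Phi @ c exactly,
# giving a transport that acts purely on parameter vectors.
M = np.linalg.lstsq(Phi, P @ Phi, rcond=None)[0]
c = np.linalg.lstsq(Phi, r, rcond=None)[0]

theta = rng.random(n_sa)
backup = r + gamma * P @ (Phi @ theta)        # Bellman backup in value space
transported = Phi @ (c + gamma * M @ theta)   # same backup via feature transport
```

The two computations agree exactly: the backup never has to leave the feature coordinates.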
3. Spectral Bellman Representation Objective
To operationalize Bellman-closed transport, SBM introduces a spectral loss function, distinct from the standard Bellman mean-squared error. The loss couples batch-averaged feature and parameter covariances with an orthogonality regularizer that enforces mutual decorrelation of features and transported parameters. This power-iteration-inspired loss embodies the alternating update structure of the singular value decomposition of the Bellman transport operator. At optimality, these losses guarantee that $\mathcal{T}^\pi Q_\theta$ remains in the representational span for all $\theta$, thus enforcing Bellman-closed transport.
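The paper's exact loss terms are not reproduced here; as a sketch of the orthogonality regularizer's role only, the following penalty vanishes exactly when the batch-averaged feature covariance is diagonal:

```python
import numpy as np

def decorrelation_penalty(F):
    """Penalty that vanishes when batch features are mutually decorrelated.

    F: (batch, d) feature matrix. The batch-averaged covariance C = F^T F / B
    is driven toward a diagonal matrix by penalizing off-diagonal mass.
    (A sketch of the regularizer's role, not the paper's exact loss.)
    """
    C = F.T @ F / F.shape[0]
    off_diag = C - np.diag(np.diag(C))
    return float(np.sum(off_diag ** 2))

rng = np.random.default_rng(2)
F_corr = rng.random((128, 4))                       # correlated random features
Q, _ = np.linalg.qr(rng.standard_normal((128, 4)))  # orthonormal columns
pen_corr = decorrelation_penalty(F_corr)            # strictly positive
pen_orth = decorrelation_penalty(Q)                 # numerically zero
```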
4. Algorithmic Integration and Computational Considerations
Integrating SBM with value-based RL methods requires minimal changes:
- After each Q-learning update of the parameters $\theta$:
  - Update the feature map $\phi$ by minimizing the spectral Bellman loss, with state-action samples and parameter samples drawn uniformly or from the replay buffer.
Structured exploration is provided by Thompson Sampling in parameter space: perturbed parameters are sampled with a covariance shaped by the learned feature covariance. Computationally, covariance estimation and regularization are performed per mini-batch, with no need for large-scale SVD. A full eigendecomposition ($O(d^3)$ for $d$-dimensional features) is only needed for the exploration variance; this cost can be amortized or approximated.
5. Exploration, Long-Horizon Credit Assignment, and Empirical Impact
Bellman-closed feature transport underpins structured exploration by encoding uncertainty directly in the feature covariance structure. The quantity $\phi(s,a)^\top \Lambda^{-1} \phi(s,a)$, where $\Lambda$ is the feature covariance, is the exploration-critical uncertainty term tractably minimized via Thompson Sampling. This mechanism is especially effective in long-horizon, hard-exploration scenarios (Nabati et al., 17 Jul 2025).
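The elliptical uncertainty term costs one linear solve per query; a minimal sketch with a regularized empirical covariance (the data and ridge term are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4
F = rng.random((256, d))                        # features visited so far
Lam = F.T @ F / F.shape[0] + 1e-3 * np.eye(d)   # regularized feature covariance

def uncertainty(phi, Lam):
    """Elliptical uncertainty phi^T Lam^{-1} phi: grows in rarely visited directions."""
    return float(phi @ np.linalg.solve(Lam, phi))

u_seen = uncertainty(F.mean(axis=0), Lam)         # a well-covered direction
u_new = uncertainty(rng.standard_normal(d), Lam)  # an arbitrary new direction
```

Both values are strictly positive since `Lam` is positive definite; directions poorly represented in `F` receive the larger scores.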
In empirical studies on Atari-57 and the “Atari Explore” suite (e.g., Montezuma’s Revenge, Pitfall!, Skiing), SBM-enhanced agents demonstrated substantial gains:
| Agent | Base (Atari-57) | SBM+TS (Atari-57) | Base (Atari Explore) | SBM+TS (Atari Explore) |
|---|---|---|---|---|
| DQN | 1.61 | 1.91 | 0.22 | 0.42 |
| R2D2 | 3.2 | 3.51 | 0.42 | 0.67 |
These improvements are most pronounced for long-horizon, sparse-reward tasks, indicating effective feature transport and credit assignment under Bellman dynamics.
6. Extensions, Generalizations, and Limitations
SBM naturally extends to multi-step Bellman operators. For the $n$-step Bellman operator $(\mathcal{T}^\pi)^n$, the $n$-step inherent Bellman error is bounded by an accumulation of the one-step IBE, so it vanishes whenever the one-step IBE does. Thus, zero-IBE for $n = 1$ implies closure for all $n$, supporting use in deep RL architectures like R2D2 and Retrace-based off-policy targets.
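The closure argument can be checked numerically: with zero one-step IBE, iterating the backup $n$ times never produces a component outside the span. A sketch with tabular features and a randomly generated MDP (illustrative names, as before):

```python
import numpy as np

rng = np.random.default_rng(5)
n_sa, gamma, n_steps = 5, 0.9, 4
P = rng.random((n_sa, n_sa))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(n_sa)

Phi = np.eye(n_sa)                   # Bellman-closed (tabular) features
proj = Phi @ np.linalg.pinv(Phi)     # projector onto span(Phi)

Q = Phi @ rng.random(n_sa)
residuals = []
for _ in range(n_steps):             # n-step operator = n one-step backups
    Q = r + gamma * P @ Q            # one Bellman backup
    residuals.append(np.linalg.norm(Q - proj @ Q))
# With zero one-step IBE, every iterate stays in span(Phi).
```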
Limitations include sensitivity to the parameter sampling variance, incomplete theory for approximate (nonzero) IBE, and empirical validation currently focused on Atari benchmarks. Generalization to continuous control, richer parametrizations, and convergence analysis under stochastic updates remain active research directions.
Bellman-Closed Feature Transport, as instantiated by the Spectral Bellman Method, provides a spectral framework ensuring representational alignment with Bellman dynamics. This approach yields theoretical guarantees of closure, empirical improvements in exploration and credit assignment, and flexible integration with value-based RL algorithms, advancing unified perspectives on representation and exploration in RL (Nabati et al., 17 Jul 2025).