Bellman-Closed Feature Transport in RL

Updated 4 February 2026
  • Bellman-Closed Feature Transport is a framework ensuring that Bellman updates preserve the representational span of state-action features.
  • It employs a spectral learning objective that leverages singular value decomposition of feature covariances to enforce closure under Bellman dynamics.
  • The method enhances exploration and long-horizon credit assignment, demonstrating significant gains in performance on Atari benchmarks.

Bellman-Closed Feature Transport is a principled framework in reinforcement learning (RL) for learning state-action representations such that the span of value functions is preserved—i.e., “closed”—under Bellman backups. Developed as a core component of the Spectral Bellman Method (SBM), this perspective unifies representation learning and structured exploration by leveraging the inherent spectral properties of the Bellman operator when acting on linearly parameterized function classes. The central mechanism involves enforcing or approximating “Bellman-closure” through a spectral learning objective, ensuring that value estimates remain within the learned feature space as dictated by the Bellman update dynamics (Nabati et al., 17 Jul 2025).

1. Formal Foundations: Bellman Closure and Inherent Bellman Error

Let $\phi: S \times A \to \mathbb{R}^d$ be a feature map, and consider the linear value function class $\mathcal{Q}_\phi = \{ Q_\theta(s, a) = \phi(s, a)^\top \theta \mid \theta \in \mathcal{B}_\phi \}$. The Inherent Bellman Error (IBE) quantifies the minimal residual when projecting the outcome of a Bellman update back onto $\mathcal{Q}_\phi$:

$$\mathrm{IBE}_\phi := \sup_{Q \in \mathcal{Q}_\phi} \inf_{\tilde{Q} \in \mathcal{Q}_\phi} \| \mathcal{T} Q - \tilde{Q} \|_\infty = \sup_{\theta \in \mathcal{B}_\phi} \inf_{\tilde{\theta} \in \mathcal{B}_\phi} \| \mathcal{T} Q_\theta - Q_{\tilde{\theta}} \|_\infty.$$

Zero IBE (the "Bellman-closed" assumption) holds if $\mathrm{IBE}_\phi = 0$, i.e., the Bellman operator $\mathcal{T}$ maps $\mathcal{Q}_\phi$ exactly into itself. For any policy $\pi$, the $k$-step Bellman extension is defined recursively: $\mathcal{T}^k Q = \mathcal{T}(\mathcal{T}^{k-1} Q)$.

Bellman-closed feature transport refers to constructing or learning a feature space $\phi$ for which Bellman updates of any $Q_\theta$ remain in the span $\{\phi_i\}_{i=1}^d$, ensuring the representational stability of value iteration and policy evaluation.
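A minimal numerical sketch of the zero-IBE condition under policy evaluation, with entirely synthetic quantities. The sup-norm infimum over $\tilde\theta$ is replaced here by the least-squares projection onto $\operatorname{span}(\Phi)$ as a tractable stand-in, so the result is a proxy for the IBE rather than its exact value:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 8, 3, 0.9   # number of (s, a) pairs, feature dim, discount

P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)  # fixed-policy transition matrix
r = rng.normal(size=n)                                     # reward vector

def ibe_proxy(Phi, num_samples=500, radius=1.0):
    """Monte-Carlo proxy for sup_theta inf_theta' ||T Q_theta - Q_theta'||_inf,
    using the L2 projection onto span(Phi) as a stand-in for the infimum."""
    Pi = Phi @ np.linalg.pinv(Phi)            # projection onto the feature span
    worst = 0.0
    for _ in range(num_samples):
        theta = rng.normal(size=Phi.shape[1])
        theta *= radius / np.linalg.norm(theta)   # sample on a parameter ball
        tq = r + gamma * P @ (Phi @ theta)        # Bellman backup of Q_theta
        worst = max(worst, np.abs(tq - Pi @ tq).max())
    return worst

Phi_rand = rng.normal(size=(n, d))   # generic features: span not closed under T
Phi_tab = np.eye(n)                  # tabular features: span = R^n, hence Bellman-closed

print(ibe_proxy(Phi_rand), ibe_proxy(Phi_tab))
```

Generic random features leave a strictly positive residual, while tabular features (whose span is all of $\mathbb{R}^n$) give a residual of zero, matching the zero-IBE definition.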

2. Spectral Decomposition and Feature Transport under Zero-IBE

If the Bellman-closed condition is met, a powerful spectral structure emerges. For finite $S \times A$ and parameter set $\{\theta_j\}_{j=1}^m$, let $\Phi \in \mathbb{R}^{n \times d}$ and $\Theta \in \mathbb{R}^{d \times m}$ aggregate features and parameters. Under zero IBE, for any set of weights $\rho(s, a)$ and $\nu(\theta)$, the Bellman transport is

$$\overline{Q} = P_{s,a} (\Phi \Theta) P_\theta = \Phi_P \tilde{\Theta}_P,$$

where $P_{s,a}$ and $P_\theta$ are diagonal matrices of the sampling distributions. The feature covariance $\Lambda = \mathbb{E}_\rho[\phi \phi^\top]$ dictates the singular values of $\overline{Q}$:

$$\overline{Q} = U \Sigma V^\top, \qquad \Sigma = \operatorname{diag}\big(\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_d}, 0, \ldots\big).$$

The nonzero singular values are the square roots of the covariance eigenvalues $\lambda_i$. The Bellman operator acts linearly in feature space: $\mathcal{T}\phi = \phi A$ for some $A \in \mathbb{R}^{d \times d}$, so that for all $\theta$, $\mathcal{T} Q_\theta(s, a) = \phi(s, a)^\top \tilde{\theta}(\theta)$. The functional consequence is that Bellman transport never leaves $\operatorname{span}\{\phi_i\}$, achieving true feature transport under Bellman closure (Nabati et al., 17 Jul 2025).
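A numerical check of the singular-value claim, under the simplifying assumption that $\nu$ is an isotropic standard normal (so the sampled parameters are whitened, $\mathbb{E}_\nu[\theta\theta^\top] \approx I$) and $\rho$ is uniform; all matrices are synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 50, 4, 20000

Phi = rng.normal(size=(n, d))               # feature matrix over all (s, a) pairs
rho = np.full(n, 1.0 / n)                   # uniform sampling distribution rho(s, a)
Theta = rng.normal(size=(d, m))             # theta_j ~ N(0, I), so E_nu[theta theta^T] ~ I

# Weighted Q matrix: rows scaled by sqrt(rho), columns by 1/sqrt(m)
Qbar = np.sqrt(rho)[:, None] * (Phi @ Theta) / np.sqrt(m)
sv = np.linalg.svd(Qbar, compute_uv=False)[:d]   # top-d (nonzero) singular values

Lambda = (Phi * rho[:, None]).T @ Phi            # feature covariance E_rho[phi phi^T]
lam = np.sort(np.linalg.eigvalsh(Lambda))[::-1]  # its eigenvalues, descending

# Nonzero singular values of Qbar track sqrt of the covariance eigenvalues
print(np.max(np.abs(sv - np.sqrt(lam))))
```

The discrepancy shrinks as $m$ grows, since the empirical second moment of the sampled parameters concentrates around the identity.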

3. Spectral Bellman Representation Objective

To operationalize Bellman-closed transport, SBM introduces a spectral loss function, distinct from the standard Bellman mean squared error:

$$\mathcal{L}_{\text{SBM}}(\phi, \tilde{\theta}; \rho, \nu) = \mathcal{L}_1(\phi) + \mathcal{L}_2(\tilde{\theta}) + \mathcal{L}_{\text{orth}}(\phi, \tilde{\theta}),$$

with

$$\mathcal{L}_1(\phi) = \mathbb{E}_{\rho, \nu}\big[ \phi(s, a)^\top \Lambda_2\, \phi(s, a) - 2 Q_\theta(s, a)\, \phi(s, a)^\top \tilde{\theta}(\theta) \big],$$

$$\mathcal{L}_2(\tilde{\theta}) = \mathbb{E}_{\rho, \nu}\big[ \tilde{\theta}(\theta)^\top \Lambda_1\, \tilde{\theta}(\theta) - 2 Q_\theta(s, a)\, \tilde{\theta}(\theta)^\top \phi(s, a) \big],$$

where $\Lambda_1$ and $\Lambda_2$ are batch-averaged feature and parameter covariances. The orthogonality regularizer $\mathcal{L}_{\text{orth}}$ enforces mutual decorrelation of features and transported parameters. This power-iteration-inspired loss embodies the alternating update structure of the singular value decomposition for the Bellman transport operator. At optimality, these losses guarantee that $\mathcal{T} Q_\theta$ remains in the representational span $\{\phi_i\}$ for all $\theta$, thus enforcing Bellman-closed transport.
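A one-batch NumPy sketch of the spectral loss, following the notation above. The exact form of the orthogonality regularizer is not spelled out in this summary, so the off-diagonal covariance penalty used here is an assumption, not the paper's definition:

```python
import numpy as np

def sbm_loss(phi, theta_tilde, q_targets):
    """Single-batch sketch of L_SBM = L1 + L2 + L_orth.
    phi:         (b, d) features phi(s, a) at sampled state-actions
    theta_tilde: (b, d) transported parameters theta_tilde(theta), one per sample
    q_targets:   (b,)   Q_theta(s, a) values for the sampled (theta, s, a)"""
    b = phi.shape[0]
    Lambda1 = phi.T @ phi / b                    # batch feature covariance
    Lambda2 = theta_tilde.T @ theta_tilde / b    # batch parameter covariance
    cross = q_targets * np.einsum('bd,bd->b', phi, theta_tilde)  # Q * phi^T theta_tilde

    l1 = np.mean(np.einsum('bd,de,be->b', phi, Lambda2, phi)) - 2 * cross.mean()
    l2 = np.mean(np.einsum('bd,de,be->b', theta_tilde, Lambda1, theta_tilde)) - 2 * cross.mean()

    # Assumed orthogonality term: penalize off-diagonal covariance entries
    off = lambda M: M - np.diag(np.diag(M))
    l_orth = np.sum(off(Lambda1) ** 2) + np.sum(off(Lambda2) ** 2)
    return l1 + l2 + l_orth

rng = np.random.default_rng(2)
b, d = 32, 4
loss = sbm_loss(rng.normal(size=(b, d)), rng.normal(size=(b, d)), rng.normal(size=b))
```

In practice both `phi` and `theta_tilde` would be network outputs and this scalar would be minimized by gradient descent, alternating with the Q-learning update.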

4. Algorithmic Integration and Computational Considerations

Integrating SBM with value-based RL methods requires minimal changes:

  • After each Q-learning update of $\theta$:

$$\theta_{t+1} = \arg\min_\theta \mathbb{E}_{(s, a) \sim \mathcal{D}}\big[(r + \gamma \max_{a'} \phi_t(s', a')^\top \theta^- - \phi_t(s, a)^\top \theta)^2\big].$$

  • Update the feature map by minimizing the spectral Bellman loss:

$$\phi_{t+1} \leftarrow \arg\min_\phi \mathcal{L}_{\text{SBM}}(\phi, \tilde{\theta}; \rho_t, \nu_t),$$

with $\nu_t = \mathcal{N}(\theta_{t+1}, \sigma_{\text{rep}}^2 I)$ and $\rho_t$ from uniform or replay-buffer sampling.

Structured exploration is provided by Thompson sampling in parameter space:

$$\theta_{\text{TS}} \sim \mathcal{N}(\theta_t, \sigma_{\text{exp}} \Sigma_t^{-1}), \qquad \Sigma_t = \lambda I + \sum_i \phi_t(s_i, a_i)\, \phi_t(s_i, a_i)^\top.$$

Computationally, covariance estimation and regularization cost $O(bd^2)$ per mini-batch, with no need for large-scale SVD. A full $d \times d$ eigendecomposition ($O(d^3)$) is needed only for the exploration variance; this cost can be amortized or approximated.
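The parameter-space Thompson sampling step can be sketched directly from these formulas; batch size, dimensions, and hyperparameter values below are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam, sigma_exp = 8, 1e-2, 0.1   # feature dim, ridge weight, exploration scale

def thompson_sample(theta_t, phi_batch):
    """Draw theta_TS ~ N(theta_t, sigma_exp * Sigma_t^{-1}),
    with Sigma_t = lam * I + sum_i phi_i phi_i^T over the batch."""
    Sigma = lam * np.eye(d) + phi_batch.T @ phi_batch   # O(b d^2) covariance build
    cov = sigma_exp * np.linalg.inv(Sigma)              # O(d^3), amortizable
    return rng.multivariate_normal(theta_t, cov)

phi_batch = rng.normal(size=(64, d))           # features from a replay batch
theta_ts = thompson_sample(np.zeros(d), phi_batch)
greedy_scores = phi_batch @ theta_ts           # act greedily w.r.t. the sampled Q
```

Directions with little feature coverage have small entries in $\Sigma_t$, hence large posterior variance, so sampled parameters deviate most along under-explored feature directions.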

5. Exploration, Long-Horizon Credit Assignment, and Empirical Impact

Bellman-closed feature transport underpins structured exploration by encoding uncertainty directly in the feature covariance structure. The quantity $\max_\pi \|\phi_\pi\|_{\Sigma_t^{-1}}$, where $\phi_\pi = \mathbb{E}_{d^\pi}[\phi(s, a)]$, is the exploration-critical uncertainty term tractably minimized via Thompson sampling. This mechanism is especially effective in long-horizon, hard-exploration scenarios (Nabati et al., 17 Jul 2025).

In empirical studies on Atari-57 and the “Atari Explore” suite (e.g., Montezuma’s Revenge, Pitfall!, Skiing), SBM-enhanced agents demonstrated substantial gains:

| Agent | Base (Atari-57) | SBM+TS (Atari-57) | Base (Atari Explore) | SBM+TS (Atari Explore) |
|-------|-----------------|-------------------|----------------------|------------------------|
| DQN   | 1.61            | 1.91              | 0.22                 | 0.42                   |
| R2D2  | 3.2             | 3.51              | 0.42                 | 0.67                   |

These improvements are most pronounced for long-horizon, sparse-reward tasks, indicating effective feature transport and credit assignment under Bellman dynamics.

6. Extensions, Generalizations, and Limitations

SBM naturally extends to multi-step Bellman operators. For the $h$-step Bellman operator:

$$\mathrm{IBE}_\phi^h := \sup_\theta \inf_{\tilde{\theta}} \|\mathcal{T}^h Q_\theta - Q_{\tilde{\theta}}\|_\infty,$$

with the bound $\mathrm{IBE}_\phi^h \le \sum_{i=0}^{h-1} \gamma^i\, \mathrm{IBE}_\phi \le (1-\gamma)^{-1}\, \mathrm{IBE}_\phi$. Thus, zero IBE for $\mathcal{T}$ implies closure for all $\mathcal{T}^h$, supporting use in deep RL architectures like R2D2 and Retrace-based off-policy targets.
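The multi-step bound follows from the $\gamma$-contraction of $\mathcal{T}$ in the sup norm; a short derivation consistent with the definitions above (choosing $Q_{\theta_{h-1}}$ as a near-minimizer for the $(h-1)$-step error):

```latex
\begin{aligned}
\mathrm{IBE}_\phi^h
  &\le \sup_\theta \Big( \big\| \mathcal{T}(\mathcal{T}^{h-1} Q_\theta)
        - \mathcal{T} Q_{\theta_{h-1}} \big\|_\infty
      + \inf_{\tilde\theta} \big\| \mathcal{T} Q_{\theta_{h-1}}
        - Q_{\tilde\theta} \big\|_\infty \Big)
      && \text{(triangle inequality)} \\
  &\le \gamma\, \mathrm{IBE}_\phi^{h-1} + \mathrm{IBE}_\phi
      && (\gamma\text{-contraction of } \mathcal{T}) \\
  &\le \sum_{i=0}^{h-1} \gamma^i\, \mathrm{IBE}_\phi
   \;\le\; (1-\gamma)^{-1}\, \mathrm{IBE}_\phi
      && \text{(unroll; geometric series).}
\end{aligned}
```

Setting $\mathrm{IBE}_\phi = 0$ collapses every term, which is exactly the claim that one-step closure implies $h$-step closure.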

Limitations include sensitivity to the parameter sampling variance $\sigma_{\text{rep}}$, incomplete theory for approximate (nonzero) IBE, and empirical validation currently focused on Atari benchmarks. Generalization to continuous control, richer $\tilde{\theta}(\theta)$ parametrizations, and convergence analysis under stochastic updates remain active research directions.


Bellman-Closed Feature Transport, as instantiated by the Spectral Bellman Method, provides a spectral framework ensuring representational alignment with Bellman dynamics. This approach yields theoretical guarantees of closure, empirical improvements in exploration and credit assignment, and flexible integration with value-based RL algorithms, advancing unified perspectives on representation and exploration in RL (Nabati et al., 17 Jul 2025).
