
Smoothed Bellman Equation

Updated 30 March 2026
  • Smoothed Bellman equations are reformulations of traditional Bellman/HJB equations that introduce analytic smoothing (via softmax, Gaussian convolution, or PDE-based methods) to achieve differentiability and facilitate gradient-based optimization.
  • They are widely applied in reinforcement learning, stochastic control with delays, and risk-sensitive settings to enhance stability, improve convergence, and offer robust sample complexity in high-dimensional systems.
  • Methods such as entropy-regularized operators, Gaussian smoothing, and exponential transforms provide a trade-off between bias and variance while ensuring unique fixed points and effective continuous-time policy evaluation.

The smoothed Bellman equation denotes a family of reformulations of the classical Bellman or Hamilton–Jacobi–Bellman (HJB) equations in reinforcement learning (RL), stochastic control, and dynamic programming. These modifications introduce smoothing—via analytic convolution, entropy regularization, exponential transforms, or PDE-based generators—in order to address non-differentiability, improve regularity, and facilitate convergence in the presence of function approximation, time-discretization, or delay. Smoothed Bellman equations have played central roles in stable RL with nonlinear approximators, risk-sensitive and robust planning, stochastic control with delays, and high-accuracy policy evaluation in continuous-time systems.

1. Classical Bellman Equations and Non-smoothness

The foundational Bellman equation for value functions in Markov Decision Processes (MDPs) or in control theory is

$$(TV)(s)=\max_{a\in\mathcal{A}}\bigl\{R(s,a)+\gamma\,\mathbb{E}_{s'}V(s')\bigr\}$$

with discount factor $\gamma\in[0,1)$, reward $R$, and transition kernel $P$. In stochastic optimal control, the corresponding continuous-time HJB equation involves a "sup" over controls and a second-order semi-linear PDE. The max or sup operator introduces piecewise linearity and non-differentiability, complicating both analysis and practical optimization. This non-smoothness can cause instability or divergence under nonlinear function approximation, and hinders the use of gradient-based methods.
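To make the hard-max operator concrete, here is a minimal value-iteration sketch on a made-up two-state, two-action MDP (all numbers are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hard-max Bellman operator on a made-up two-state, two-action MDP.
gamma = 0.9
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                      # R[s, a]
P = np.array([[[0.8, 0.2], [0.5, 0.5]],
              [[0.2, 0.8], [0.1, 0.9]]])        # P[s, a, s'], rows sum to 1

def bellman(V):
    """(TV)(s) = max_a [ R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s') ]."""
    Q = R + gamma * P @ V                       # Q[s, a]; P @ V contracts over s'
    return Q.max(axis=1)                        # the non-differentiable max

# T is a gamma-contraction, so value iteration converges to its unique fixed point.
V = np.zeros(2)
for _ in range(500):
    V = bellman(V)
assert np.allclose(V, bellman(V), atol=1e-6)
```

The `max` over actions is exactly the operation the smoothing methods below replace: it is piecewise linear in `V`, so its derivative jumps wherever the argmax changes.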

To address these shortcomings, smoothing approaches have been developed that replace the non-differentiable maximization with operations such as softmax (log-sum-exp), analytic convolution, entropy regularization, exponential transforms, or the replacement of discrete Bellman recursions with continuous PDEs.

2. Entropy-Regularized and Softmax Smoothed Bellman Operators

One prominent class of smoothed Bellman equations replaces the max with a softmax (log-sum-exp) operator, or equivalently, introduces entropy regularization. For $\lambda>0$, the $\lambda$-smoothed Bellman operator is

$$(T_\lambda V)(s)=\lambda\,\log\Bigl(\sum_{a\in\mathcal{A}}\exp\Bigl(\tfrac{1}{\lambda}\bigl(R(s,a)+\gamma\,\mathbb{E}_{s'}V(s')\bigr)\Bigr)\Bigr)$$

or, in entropy-regularized policy form,

$$V_\lambda(s)=\max_{\pi(\cdot|s)}\Bigl\{\mathbb{E}_{a\sim\pi}\bigl[R(s,a)+\gamma\,\mathbb{E}_{s'}V_\lambda(s')\bigr]+\lambda\,H(\pi(\cdot|s))\Bigr\}$$

where $H(\pi)$ is the Shannon entropy. This operator is everywhere differentiable, strictly monotone, and a Banach contraction with modulus $\gamma$ in the sup norm. The parameter $\lambda$ trades off bias (relative to the true optimal value) against smoothness: as $\lambda\to0$, $T_\lambda\to T$, but differentiability is lost; for $\lambda>0$, smoothness is gained at the expense of a bias of $O\bigl(\lambda\log|\mathcal{A}|/(1-\gamma)\bigr)$ in the optimal value.
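A minimal numerical sketch of the $\lambda$-smoothed operator on a made-up two-state MDP (hypothetical numbers), comparing its fixed point to the hard-max fixed point and checking the stated bias bound:

```python
import numpy as np

gamma, lam = 0.9, 0.1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                      # R[s, a]
P = np.array([[[0.8, 0.2], [0.5, 0.5]],
              [[0.2, 0.8], [0.1, 0.9]]])        # P[s, a, s']

def hard_bellman(V):
    return (R + gamma * P @ V).max(axis=1)

def smooth_bellman(V):
    """T_lambda via a numerically stable log-sum-exp over actions."""
    Q = R + gamma * P @ V
    m = Q.max(axis=1, keepdims=True)
    return (m + lam * np.log(np.exp((Q - m) / lam).sum(axis=1, keepdims=True))).ravel()

V_lam, V_hard = np.zeros(2), np.zeros(2)
for _ in range(800):                            # both operators are gamma-contractions
    V_lam, V_hard = smooth_bellman(V_lam), hard_bellman(V_hard)

# log-sum-exp dominates max, so the smoothed fixed point overestimates,
# by at most lam * log|A| / (1 - gamma) at the fixed point.
assert np.all(V_lam >= V_hard - 1e-8)
assert np.all(V_lam - V_hard <= lam * np.log(2) / (1 - gamma) + 1e-8)
```

Shrinking `lam` drives the two fixed points together, at the cost of an increasingly sharp (eventually non-differentiable) operator.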

This smoothing principle facilitates convergence and stable training for nonlinear function approximators. The SBEED algorithm (Smoothed Bellman Error Embedding) exploits this approach, framing the smoothed Bellman equation as a convex–concave saddle-point problem via Legendre–Fenchel duality, which enables scalable stochastic mirror-descent updates and yields statistical sample-complexity guarantees with optimal linear horizon scaling (Dai et al., 2017; Touati et al., 2020). Empirically, SBEED achieves greater stability and sample efficiency across standard RL benchmarks.

3. Analytic Smoothing: Gaussian Convolution and Policy Gradients

An analytic form of Bellman smoothing involves direct convolution in the action space. For a Gaussian smoothing kernel with covariance $\Sigma(s)$, the Gaussian-smoothed Q-function is defined as

$$Q_\Sigma(s,a)=\mathbb{E}_{a'\sim\mathcal{N}(a,\Sigma(s))}\bigl[Q^\pi(s,a')\bigr]$$

For Gaussian policies $\pi_\theta(a|s)=\mathcal{N}(\mu_\theta(s),\Sigma_\theta(s))$, the smoothed Q-function satisfies a Bellman-consistency equation

$$Q_\Sigma(s,a)=\mathbb{E}_{r,s'}\bigl[r+\gamma\,Q_\Sigma(s',\mu(s'))\bigr]$$

which is fully compatible with standard TD-learning updates, except that the next action is evaluated at the Gaussian mean (Nachum et al., 2018).

The smoothed $Q_\Sigma$ function admits efficient gradient- and Hessian-based policy updates:

  • The gradient in $a$ at $a=\mu_\theta(s)$ yields the mean-policy gradient;
  • The Hessian in $a$ at $a=\mu_\theta(s)$ yields the covariance-policy gradient.

This enables direct deterministic actor-critic methods with low-variance updates and built-in control over exploration via the learned covariance.
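A minimal sketch of the Gaussian-smoothed Q-function, using a made-up one-dimensional quadratic $Q$ so that $Q_\Sigma$ and its action gradient are known in closed form (the quadratic and its parameters are assumptions of this illustration, not from the cited work):

```python
import numpy as np

rng = np.random.default_rng(0)
a_star, sigma = 1.5, 0.3                        # hypothetical optimum and kernel width

def Q(a):
    """Made-up quadratic Q-function (state suppressed): Q(a) = -(a - a*)^2."""
    return -(a - a_star) ** 2

def Q_sigma_mc(a, n=200_000):
    """Monte Carlo estimate of Q_Sigma(a) = E_{a' ~ N(a, sigma^2)}[Q(a')]."""
    return Q(a + sigma * rng.standard_normal(n)).mean()

a = 0.5
# Closed form for this quadratic: Q_Sigma(a) = -((a - a*)^2 + sigma^2)
assert abs(Q_sigma_mc(a) + (a - a_star) ** 2 + sigma ** 2) < 1e-2

# Score-function estimate of the action gradient of Q_Sigma at a,
# matching the analytic gradient -2 (a - a*) of the smoothed quadratic.
z = rng.standard_normal(200_000)
grad_mc = (Q(a + sigma * z) * z / sigma).mean()
assert abs(grad_mc + 2 * (a - a_star)) < 5e-2
```

The same gradient, evaluated at $a=\mu_\theta(s)$, is what drives the mean-policy update; the corresponding second derivative drives the covariance update.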

4. Exponential and Entropic Smoothing for Risk-Sensitive RL

For risk-sensitive or robust RL, the exponential Bellman equation emerges from optimizing the entropic risk measure. The value criterion is

$$V^\pi(s)=\frac{1}{\beta}\log\,\mathbb{E}_\pi\Bigl[\exp\Bigl(\beta\sum_{t}r_t\Bigr)\Bigm|s_0=s\Bigr]$$

with risk-sensitivity parameter $\beta\ne0$. The recursion is then

$$\exp\bigl[\beta\,Q_h^\pi(s,a)\bigr]=\mathbb{E}_{s'}\exp\bigl\{\beta\bigl[r_h(s,a)+V_{h+1}^\pi(s')\bigr]\bigr\}$$

The log-MGF transform converts the usual pointwise maximum or expectation into a smooth, differentiable operator interpolating between max and mean. As $\beta\to+\infty$ it approaches the hard max; as $\beta\to0$ it recovers the expectation. This smoothing enables sharper regret bounds and improved stability, with a bias–variance trade-off explicitly controlled by $\beta$ (Fei et al., 2021).
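The interpolation is easy to see numerically. The sketch below applies the entropic value $\frac{1}{\beta}\log\mathbb{E}[\exp(\beta X)]$ to a made-up discrete return distribution (the values and probabilities are hypothetical):

```python
import numpy as np

x = np.array([1.0, 2.0, 5.0])                  # hypothetical return values
p = np.array([0.5, 0.3, 0.2])                  # their probabilities

def entropic(beta):
    """Stabilized (1/beta) * log E[exp(beta * X)] (log-MGF / entropic value)."""
    m = x.max()
    return m + np.log(np.dot(p, np.exp(beta * (x - m)))) / beta

mean_x, max_x = float(np.dot(p, x)), float(x.max())
assert abs(entropic(1e-6) - mean_x) < 1e-4     # beta -> 0 recovers the mean
assert abs(entropic(200.0) - max_x) < 1e-1     # beta -> +inf approaches the max
assert mean_x < entropic(1.0) < max_x          # smooth interpolation in between
```

Negative $\beta$ pushes the value below the mean instead, giving the risk-averse side of the same family.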

5. Smoothing in Stochastic Control with Delay and Infinite-Dimensional HJB

In stochastic control systems with delay—particularly delay in the control variable—the associated HJB equation becomes infinite-dimensional and may lack smoothing properties due to violation of the structure condition (i.e., the directions of the control and the noise do not align). The smoothed, or more precisely partially smoothed, Bellman approach developed by Gozzi and Masiero constructs a mild form of the infinite-dimensional HJB equation using the partial smoothing property of the associated Ornstein–Uhlenbeck transition semigroup (Gozzi et al., 2015; Gozzi et al., 2016). The central insight is that while the strong Feller (full smoothing) property may not hold, the semigroup regularizes functionals of the state in the control directions (images of the control operator $B$), under a Kalman-type rank condition.

This partial smoothing permits a contraction mapping argument in an appropriate Banach space of functions with weighted $B$-directional derivatives, resulting in the existence of classical solutions to the infinite-dimensional HJB equation and synthesis of optimal feedback controls, even in cases with pointwise-delayed controls and without backward SDE techniques.

6. PDE-based Smoothing: PhiBE and Continuous-Time Policy Evaluation

PhiBE introduces a PDE-based “smoothed” Bellman equation for continuous-time RL with dynamics governed by unknown SDEs and only discrete-time observations. Rather than recursing with potentially inaccurate discrete-time Bellman equations,

$$V_{BE}(x)\approx r(x)\,\Delta t+e^{-\gamma\Delta t}\,\mathbb{E}_{x'|x}\bigl[V(x')\bigr]$$

PhiBE reconstructs the infinitesimal generator $\mathcal{L}$ empirically, forming a second-order PDE

$$\mu_1(x)\cdot\nabla V(x)+\tfrac{1}{2}\Sigma_1(x):\nabla^2 V(x)-\gamma V(x)+r(x)=0$$

where $\mu_1(x)$ and $\Sigma_1(x)$ are empirical drift and diffusion estimates from data (Zhu, 2024). Higher-order variants and model-free Galerkin algorithms provide provably higher-order $O(\Delta t^i)$ accuracy, robust to system smoothness and discretization artifacts. The smoothing effect appears explicitly through the second-order (diffusion) term, systematically improving accuracy relative to the discrete-time Bellman recursion.
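As an illustration of the PDE-based formulation, the sketch below solves a PhiBE-style equation by central finite differences for a one-dimensional Ornstein–Uhlenbeck process with quadratic reward, where the value function is known in closed form. Here $\mu_1$ and $\Sigma_1$ are taken as exact rather than estimated from data, and the grid, parameters, and boundary handling are assumptions of this sketch:

```python
import numpy as np

# 1-D OU process dx = -theta*x dt + sigma dW, reward r(x) = x^2, discount gam.
theta, sigma, gam = 1.0, 0.5, 0.3
L, n = 6.0, 601
x = np.linspace(-L, L, n)
h = x[1] - x[0]
mu1 = -theta * x                                # drift estimate (taken exact here)
Sig1 = sigma ** 2 * np.ones(n)                  # diffusion estimate (taken exact)
r = x ** 2

# Closed-form value for comparison:
# V(x) = x^2/(gam + 2*theta) + sigma^2 / (gam * (gam + 2*theta))
V_exact = x ** 2 / (gam + 2 * theta) + sigma ** 2 / (gam * (gam + 2 * theta))

# Assemble mu1 V' + (1/2) Sig1 V'' - gam V + r = 0 with central differences,
# pinning the boundary values to the closed form (an assumption of this sketch).
A = np.zeros((n, n))
b = -r.astype(float)
A[0, 0] = A[-1, -1] = 1.0
b[0], b[-1] = V_exact[0], V_exact[-1]
for i in range(1, n - 1):
    A[i, i - 1] = -mu1[i] / (2 * h) + Sig1[i] / (2 * h ** 2)
    A[i, i] = -Sig1[i] / h ** 2 - gam
    A[i, i + 1] = mu1[i] / (2 * h) + Sig1[i] / (2 * h ** 2)
V = np.linalg.solve(A, b)
assert np.max(np.abs(V - V_exact)) < 1e-2
```

Because the true value function is quadratic, the central-difference scheme reproduces it essentially exactly; in PhiBE proper the coefficients $\mu_1,\Sigma_1$ would instead come from discrete-time trajectory data.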

7. Theoretical Guarantees, Bias–Variance Trade-offs, and Applications

Smoothing in Bellman equations enables the following:

  • Contraction and Uniqueness: The smoothed operators (entropy-regularized, log-sum-exp, exponential, Gaussian convolution, or PDE-based) often preserve—or extend to—Banach contraction properties, guaranteeing unique fixed points.
  • Differentiability and Stability: Smoothing renders the operator differentiable (often with Lipschitz gradient), enabling stable and convergent gradient-based optimization even under nonlinear approximators—critical for RL algorithms such as SBEED and those employing actor-critic architectures.
  • Bias–Variance Control: All smoothing methods introduce a bias (approximation gap to the non-smooth optimum) that vanishes as the smoothing parameter tends to zero, while reducing variance and improving sample efficiency.
  • Improved Sample Complexity: In RL with function approximation, smoothing has been shown to yield sample complexity with linear $1/(1-\gamma)$ dependence on the planning horizon and optimal $1/\sqrt{n}$ statistical rates in realizable cases (Touati et al., 2020). In risk-sensitive RL, exponential smoothing tightens regret bounds by reducing horizon blow-up factors (Fei et al., 2021). In continuous-time policy evaluation, PhiBE achieves higher-order accuracy in $\Delta t$ relative to standard discrete Bellman methods (Zhu, 2024).
  • Applicability: Smoothed Bellman equations are foundational in continuous-control RL with Gaussian policies (Nachum et al., 2018), in risk-sensitive and robust decision-making, for PDE-based evaluation in continuous-time control, and in stochastic control with delay.

Key Smoothed Bellman Equations and Methods: Overview Table

Smoothing Type | Canonical Operator / Equation | Notable Contexts
--- | --- | ---
Entropy / Softmax | $T_\lambda V(s)=\lambda\log\sum_a\exp\bigl(\tfrac{1}{\lambda}[R+\gamma V]\bigr)$ | SBEED, general RL (Dai et al., 2017; Touati et al., 2020)
Gaussian Convolution | $Q_\Sigma(s,a)=\mathbb{E}_{a'\sim\mathcal{N}(a,\Sigma)}[Q^\pi(s,a')]$ | Actor-critic RL (Nachum et al., 2018)
Exponential Transform | $T_{\exp}Q(s,a)=\tfrac{1}{\beta}\log\mathbb{E}_{s'}e^{\beta(r+V(s'))}$ | Risk-sensitive RL (Fei et al., 2021)
PDE / Generator-based | $\mu(x)\cdot\nabla V+\tfrac{1}{2}\Sigma(x):\nabla^2 V-\gamma V+r=0$ | Continuous-time RL (Zhu, 2024)
Partial Smoothing | Ornstein–Uhlenbeck semigroup, mild solutions in $C_b^{0,1;B}$ | HJB with delay (Gozzi et al., 2015; Gozzi et al., 2016)

Each formulation is tightly linked to the statistical and analytic requirements of its application domain.
