Soft Deterministic Policy Gradient with Gaussian Smoothing

Published 7 May 2026 in cs.LG and cs.AI | (2605.06228v1)

Abstract: Deterministic policy gradient (DPG) is widely utilized for continuous control; however, it inherently relies on the differentiability of the critic with respect to the action during policy updates. This assumption is violated in practical control problems involving sparse or discrete rewards, leading to ill-defined policy gradients and unstable learning. To address these challenges, we propose a principled alternative based on a smoothed Bellman equation formulated via Gaussian smoothing. Specifically, we define a novel action-value function based on a smoothed Bellman equation and derive the soft deterministic policy gradient (Soft-DPG). Our formulation eliminates explicit dependence on critic action-gradients and ensures that the gradient remains well-defined even for non-smooth Q-functions. We instantiate this framework into a deep reinforcement learning algorithm, which we call soft deep deterministic policy gradient (Soft DDPG). Empirical evaluations on standard continuous control benchmarks and their discretized-reward variants show that Soft DDPG remains competitive in dense-reward settings and provides clear gains in most discretized-reward environments, where standard DDPG is more sensitive to irregular critic landscapes.


Authors (2)

Summary

The paper introduces Soft-DPG, leveraging Gaussian smoothing at the Bellman level to eliminate explicit action-gradient dependency for stable policy optimization.
It formulates Soft-DDPG, which employs Monte Carlo sampling of perturbed actions to robustly update policies, showing significant gains over standard DDPG in discrete reward environments.
Empirical results on MuJoCo and Gym benchmarks indicate that Soft-DDPG enhances stability and improves final returns, particularly under discretized, non-smooth reward conditions.

Soft Deterministic Policy Gradient with Gaussian Smoothing: A Technical Essay

Motivation and Theoretical Foundations

Deterministic Policy Gradient (DPG) algorithms are widely used for continuous control in RL because of their efficient, low-variance gradient estimation. However, DPG assumes that the critic is differentiable with respect to the action, and this assumption fails in practical scenarios involving sparse or discretized rewards, which often produce irregular, non-smooth action-value surfaces. The result is ill-defined gradients and instability during actor updates. The paper introduces Soft Deterministic Policy Gradient (Soft-DPG), which applies Gaussian Smoothing (GS) at the Bellman operator level to obtain a policy gradient formulation with no explicit dependence on critic action-gradients.

The central theoretical contribution is a $\sigma$-smoothed Bellman equation. It defines an action-value function, $Q_\sigma^\pi$, directly as the fixed point of a smoothed Bellman expectation operator, so that gradients remain well-behaved and stable even when the underlying MDP produces non-smooth Q-functions. Analytical upper bounds are established for the bias between $Q^\pi$ and $Q_\sigma^\pi$; the bias is controlled by the smoothing parameter $\sigma$ and depends on the Lipschitz constants of the reward and transition dynamics.
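To give a feel for why a bias bound of this kind scales with $\sigma$, here is the standard argument for a generic function, offered as an illustration only. The symbols $f$, $L$, and $d$ are assumptions introduced for this sketch; the paper's actual bound on $Q^\pi - Q_\sigma^\pi$ also involves the transition dynamics and is not reproduced here. Let $f:\mathbb{R}^d \to \mathbb{R}$ be $L$-Lipschitz and define its Gaussian smoothing $f_\sigma(a) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I_d)}[f(a + \sigma\epsilon)]$. Then

$$
|f_\sigma(a) - f(a)| \;\le\; \mathbb{E}_{\epsilon}\big[\,|f(a + \sigma\epsilon) - f(a)|\,\big] \;\le\; L\,\sigma\,\mathbb{E}_{\epsilon}\big[\|\epsilon\|\big] \;\le\; L\,\sigma\,\sqrt{d},
$$

where the last step uses Jensen's inequality, $\mathbb{E}\|\epsilon\| \le \sqrt{\mathbb{E}\|\epsilon\|^2} = \sqrt{d}$. The bias therefore vanishes linearly as $\sigma \to 0$ and grows with the Lipschitz constant, which matches the qualitative dependence described above.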

Algorithmic Instantiation: Soft-DDPG

The theory is instantiated into a practical algorithm, Soft Deep Deterministic Policy Gradient (Soft-DDPG). Standard DDPG estimates actor gradients using $\nabla_a Q(s,a)$. Soft-DDPG instead generates policy updates via Monte Carlo sampling of Gaussian-perturbed actions, evaluated on the learned $Q_\sigma^\pi$. Actor updates rely only on evaluations of $Q_\sigma^\pi$ at perturbed actions, so no explicit action-gradient is needed. The critic is trained on targets sampled from the smoothed Bellman operator, which keeps it consistent with the theoretical fixed point.
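The idea of updating from critic evaluations at perturbed actions can be sketched with the generic Gaussian-smoothing gradient identity. This is a minimal toy sketch, not the paper's estimator: the scalar action, the step-shaped `step_critic`, and the function name `smoothed_action_gradient` are all hypothetical, and the baseline subtraction is a common variance-reduction choice assumed here rather than taken from the source.

```python
import math
import random


def step_critic(action: float) -> float:
    """Toy non-smooth critic (hypothetical): jumps from 0 to 1 at action = 0.3."""
    return 1.0 if action > 0.3 else 0.0


def smoothed_action_gradient(q_fn, action, sigma, num_samples, rng):
    """Monte Carlo estimate of d/da E[q_fn(a + sigma * eps)], eps ~ N(0, 1).

    Relies on the Gaussian-smoothing identity
        d/da E[q(a + sigma*eps)] = E[(q(a + sigma*eps) - q(a)) * eps] / sigma.
    Subtracting q(a) leaves the expectation unchanged (E[eps] = 0) but
    usually lowers the variance. Only evaluations of q_fn are needed.
    """
    baseline = q_fn(action)
    total = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        total += (q_fn(action + sigma * eps) - baseline) * eps
    return total / (num_samples * sigma)


rng = random.Random(0)
action, sigma = 0.2, 0.2
estimate = smoothed_action_gradient(step_critic, action, sigma, 20000, rng)

# For this step critic the smoothed value is Phi((a - 0.3) / sigma), so the
# exact gradient is the standard normal density at (a - 0.3) / sigma,
# divided by sigma.
z = (action - 0.3) / sigma
exact = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) / sigma
print(f"estimate={estimate:.3f} exact={exact:.3f}")
```

The ordinary derivative of `step_critic` is zero everywhere except at the jump, where it is undefined, so a gradient-based actor update would receive no signal at `action = 0.2`. The smoothed estimate is non-zero and points toward the higher-value region.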

This architecture directly addresses the instability observed when operating in environments with discrete rewards, as demonstrated by visualizations showing the pathologies of critic action-gradients under DDPG. The landscape produced by Soft-DDPG is markedly smoother and yields more informative gradient signals for policy improvement.

Empirical Validation and Numerical Analysis

Empirical evaluations are performed on standard MuJoCo benchmarks and their discretized-reward variants. Results indicate that Soft-DDPG maintains competitive performance in dense-reward settings and outperforms standard DDPG in most environments with discretized rewards. In discrete-reward tasks such as Ant, HalfCheetah, Hopper, Walker2d, Inverted Pendulum, and Inverted Double Pendulum, Soft-DDPG generally shows better stability and final returns. For example, in the Ant (Discrete) environment, Soft-DDPG improves on the DDPG mean score of $190.56 \pm 60.23$.
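To make "discretized-reward variant" concrete, here is one simple way a dense reward could be discretized. This summary does not state the scheme the paper uses, so the flooring rule, the function name `discretize_reward`, and the `bin_width` parameter are all assumptions made for illustration.

```python
import math


def discretize_reward(reward: float, bin_width: float = 1.0) -> float:
    """Snap a dense reward down to a multiple of bin_width (hypothetical scheme)."""
    return math.floor(reward / bin_width) * bin_width


# Dense rewards that differ only slightly collapse onto the same level, so the
# reward becomes piecewise constant in whatever produced it.
dense = [0.05, 0.49, 0.51, 0.99, 1.02]
discrete = [discretize_reward(r, bin_width=0.5) for r in dense]
print(discrete)
```

Under a rule like this, small changes in the action usually leave the reward unchanged and occasionally cause a jump, which is the kind of irregular critic landscape the discretized benchmarks are meant to probe.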

Figure 1: Comparative performance of DDPG and Soft-DDPG across continuous and discrete reward variants, showing robust gains in non-smooth environments.

Sensitivity analysis identifies a best-performing number of Monte Carlo samples and a best-performing Gaussian smoothing scale $\sigma$. Both parameters are critical: too small a $\sigma$ reintroduces instability, while too large a $\sigma$ causes excessive smoothing bias.
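The two effects can be seen on a toy problem. The sketch below is an assumption-laden illustration, not a reproduction of the paper's sensitivity study: it uses a hypothetical scalar step critic, measures how the spread of a Gaussian-perturbation gradient estimate shrinks as the sample count grows, and computes in closed form how far the smoothed value drifts from the unsmoothed one as $\sigma$ grows.

```python
import math
import random
import statistics


def step_critic(action: float) -> float:
    """Toy non-smooth critic (hypothetical): jumps from 0 to 1 at action = 0.3."""
    return 1.0 if action > 0.3 else 0.0


def gradient_estimate(action, sigma, num_samples, rng):
    """Gaussian-perturbation estimate of the smoothed critic's action-gradient."""
    baseline = step_critic(action)
    total = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        total += (step_critic(action + sigma * eps) - baseline) * eps
    return total / (num_samples * sigma)


def estimator_std(num_samples, sigma=0.2, action=0.2, repeats=300, seed=0):
    """Empirical standard deviation of the estimate over repeated draws."""
    rng = random.Random(seed)
    estimates = [
        gradient_estimate(action, sigma, num_samples, rng) for _ in range(repeats)
    ]
    return statistics.stdev(estimates)


def smoothing_bias(sigma, action=0.2):
    """Smoothed value minus unsmoothed value for the step critic (closed form)."""
    z = (action - 0.3) / sigma
    smoothed = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    return smoothed - step_critic(action)


spread = {n: estimator_std(n) for n in (4, 16, 64)}
bias = {s: smoothing_bias(s) for s in (0.05, 0.2, 1.0)}
print("std of gradient estimate by sample count:", spread)
print("smoothing bias by sigma:", bias)
```

More samples reduce the noise in each actor update at a higher compute cost per step, while a larger $\sigma$ moves the smoothed critic further from the original one. The right balance on a real task has to be found empirically.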

Figure 2: Sensitivity to the number of Monte Carlo samples; the optimal trade-off avoids excessive variance.

Additional discrete-reward experiments across Gym environments (BipedalWalker, LunarLander, Pendulum, MountainCar) support the robustness of Soft-DDPG, which outperforms DDPG under these challenging reward structures. In continuous-reward settings, standard DDPG generally performs better, as expected given the bias that smoothing introduces.

Figure 3: Learning curves across Gym continuous and discrete benchmarks; Soft-DDPG achieves improved sample efficiency and stability in discrete cases.

Practical and Theoretical Implications

The primary theoretical implication is that GS at the Bellman operator level produces a Bellman-consistent surrogate Q-function, which is provably Lipschitz and approximates the original Q-function within an explicit bias bound. Practically, Soft-DDPG offers a principled option for actor-critic RL in settings where reward surfaces are non-smooth, discrete, or sparse, a regime where standard DPG and its deep variants are prone to ill-defined gradients and unstable learning. The algorithm is potentially applicable in robotics, autonomous driving, and industrial control, where such reward structures are common and the critic landscape is often irregular.

The smoothing parameter $\sigma$ introduces a bias-variance trade-off. With proper tuning, Soft-DDPG remains competitive in dense-reward environments and is clearly advantageous in discrete or sparse reward settings. However, excessive smoothing can degrade performance, and the right value of $\sigma$ is environment-dependent.

Limitations and Future Directions

While Soft-DDPG addresses the deficiencies of DPG under non-smooth rewards, it inherits the typical limitations of deep actor-critic methods, including sensitivity to hyperparameters and lack of convergence guarantees. Tuning $\sigma$ and the number of smoothing samples requires environment-specific exploration and is non-trivial. Further, while GS provides theoretical smoothness, it may introduce optimization bias in inherently smooth reward landscapes.

Future work could extend the smoothing paradigm to alternative RL formulations (e.g., stochastic policy gradients, multi-agent RL), investigate adaptive smoothing schedules, and integrate smoothing with other regularization techniques for improved sample efficiency.

Conclusion

Soft-DPG, instantiated via Soft-DDPG, applies Gaussian smoothing at the Bellman operator level to produce stable, well-defined policy gradients in actor-critic RL, especially under non-smooth reward landscapes. The approach eliminates explicit action-gradient dependence and performs well empirically in most of the discretized-reward environments tested. Analytical error bounds show that the smoothing-induced bias remains controllable through $\sigma$. Soft-DDPG is a theoretically grounded and empirically evaluated variant that broadens the applicability of deterministic policy gradient RL to control tasks with irregular rewards (2605.06228).
