Evolution Strategies Smoothing Estimators

Updated 16 April 2026

Evolution Strategies smoothing estimators are derivative-free methods that use random perturbations and smoothing kernels to approximate gradients for black-box optimization.
They employ diverse kernels like Gaussian and triangular distributions, enhanced with antithetic sampling and quadrature rules to reduce bias and variance.
These techniques integrate smoothly into optimization pipelines for reinforcement learning, high-dimensional search, and meta-optimization, improving convergence rates and robustness.

Evolution Strategies (ES) smoothing estimators are a class of derivative-free optimization techniques based on random perturbation and smoothing of the objective function. They generate search directions by perturbing parameters, evaluating objective responses, and aggregating these results via finite-difference-like or kernel-based estimators. While classical ES typically use Gaussian-distributed perturbations, modern ES smoothing estimators employ a range of perturbation laws, estimator constructions, and variance-reduction mechanisms—shaping the landscape of black-box optimization, policy search in reinforcement learning, and, increasingly, variational inference and meta-optimization.

1. Smoothing Kernels and Perturbation Laws

The smoothing kernel in ES determines the probability distribution from which parameter perturbations are sampled. The canonical approach uses an isotropic Gaussian kernel, with the search distribution $q(u) = (2\pi \sigma^2)^{-d/2} \exp(-\|u\|^2/(2\sigma^2))$, smoothing the objective over $\mathbb{R}^d$ (Lehman et al., 2017). Writing each perturbation as $\sigma u$ with $u \sim \mathcal N(0, I)$, the gradient of the smoothed objective $J_\sigma(\theta) = \mathbb{E}[f(\theta + \sigma u)]$ with respect to parameters $\theta$ is estimated via

$\nabla_\theta J_\sigma(\theta) = \frac{1}{\sigma} \mathbb{E}_{u \sim \mathcal N(0, I)} \bigl[ f(\theta + \sigma u) \cdot u \bigr].$

This construction underlies both classic Monte Carlo estimators and more advanced structured variants.
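A minimal NumPy sketch of the plain Monte Carlo version of this estimator follows. The function name `es_gradient` and its default arguments are illustrative choices, not taken from the cited works, and the objective `f` is assumed to map a 1-D array to a scalar.

```python
import numpy as np

def es_gradient(f, theta, sigma=0.1, n_samples=1000, rng=None):
    """Monte Carlo estimate of (1/sigma) * E[f(theta + sigma*u) * u], u ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    # One standard-normal perturbation per row.
    u = rng.standard_normal((n_samples, theta.size))
    # Black-box evaluations: only function values are used, never derivatives.
    values = np.array([f(theta + sigma * ui) for ui in u])
    # Average f(theta + sigma*u) * u over samples, then rescale by 1/sigma.
    return (values[:, None] * u).mean(axis=0) / sigma
```

For a linear objective such as `f = lambda x: a @ x` the estimate converges to `a` as `n_samples` grows. A large constant offset in `f` inflates the variance of this plain estimator, which is one motivation for the variance-reduction techniques discussed in the next section.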

Recent work introduces kernels of bounded support, most notably the symmetric triangular distribution. For example, the Triangular-Distribution ES (TD-ES) defines the per-coordinate kernel as

$q_\triangle(\varepsilon_k) = \begin{cases} \frac{1}{\sigma_{\mathrm{ES}}} \left(1 - \frac{|\varepsilon_k|}{\sigma_{\mathrm{ES}}}\right), & |\varepsilon_k| \leq \sigma_{\mathrm{ES}} \\ 0, & \text{otherwise} \end{cases}$

with $\varepsilon = (\varepsilon_1, ..., \varepsilon_d)$ sampled independently per dimension (Hirschowitz et al., 13 Nov 2025). This yields perturbations strictly within a “soft trust region,” controlling parameter excursions and concentrating sample mass near zero.
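Sampling from this kernel can be sketched with NumPy's built-in triangular generator. The helper name `sample_triangular` is hypothetical. A symmetric triangular law on $[-\sigma_{\mathrm{ES}}, \sigma_{\mathrm{ES}}]$ with mode zero has peak density $1/\sigma_{\mathrm{ES}}$, matching the kernel above, and variance $\sigma_{\mathrm{ES}}^2/6$.

```python
import numpy as np

def sample_triangular(sigma_es, shape, rng=None):
    """Draw i.i.d. perturbations from the symmetric triangular law on [-sigma_es, sigma_es]."""
    rng = np.random.default_rng() if rng is None else rng
    # left = -sigma_es, mode = 0, right = +sigma_es
    return rng.triangular(-sigma_es, 0.0, sigma_es, size=shape)

eps = sample_triangular(0.5, (4, 3), rng=np.random.default_rng(0))
print(np.abs(eps).max() <= 0.5)  # every coordinate stays inside the bounded support
```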

Some ES variants employ directional or axis-aligned kernels: Directional Gaussian Smoothing (DGS-ES) constructs one-dimensional smoothing operators along each coordinate or an orthogonal basis, enabling nonlocal exploration and facilitating analytic quadrature (Zhang et al., 2020, Zhang et al., 2020).

2. Monte Carlo, Antithetic, and Quadrature-Based Gradient Estimation

The classical ES estimator forms a Monte Carlo finite-difference average using i.i.d. perturbations. With $N$ samples,

$\widehat{g} = \frac{1}{N \sigma} \sum_{i=1}^N f(\theta + \sigma \varepsilon_i) \varepsilon_i,$

where $\varepsilon_i \sim q$. To reduce estimator variance, antithetic sampling evaluates both $\theta+\sigma\varepsilon_i$ and $\theta-\sigma\varepsilon_i$, exploiting the oddness in the objective’s Taylor expansion:

$\widehat{g}_{\mathrm{anti}} = \frac{1}{2 N \sigma} \sum_{i=1}^N \bigl[ f(\theta + \sigma \varepsilon_i) - f(\theta - \sigma \varepsilon_i) \bigr] \varepsilon_i.$

This halves variance by canceling even-order noise terms (Hirschowitz et al., 13 Nov 2025, Vicol et al., 2023, Meier et al., 2019).
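The antithetic construction can be sketched as follows, assuming Gaussian perturbations and a scalar-valued objective `f`. The name `antithetic_es_gradient` and its defaults are illustrative. Each perturbation is evaluated at both signs and the difference of the two function values replaces the single evaluation of the plain estimator.

```python
import numpy as np

def antithetic_es_gradient(f, theta, sigma=0.1, n_pairs=500, rng=None):
    """Antithetic estimate: average of [f(theta+sigma*e) - f(theta-sigma*e)] * e / (2*sigma)."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n_pairs, theta.size))
    # Paired evaluations: constant offsets and other even-order terms cancel here.
    diffs = np.array([f(theta + sigma * e) - f(theta - sigma * e) for e in eps])
    return (diffs[:, None] * eps).sum(axis=0) / (2.0 * n_pairs * sigma)
```

Because the constant part of `f` cancels inside each pair, adding a large offset to the objective leaves the estimate unchanged, unlike the single-sided estimator.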

For non-Gaussian or directional smoothing, high-order quadrature rules, especially Gauss-Hermite, yield deterministic, low-variance gradient estimates. In DGS-ES, each one-dimensional integral along coordinate $k$ is approximated by (Zhang et al., 2020, Zhang et al., 2020):

$\frac{1}{\sqrt{\pi}\,\sigma} \sum_{m=1}^{M} w_m \sqrt{2}\, v_m\, f(\theta + \sqrt{2}\,\sigma v_m e_k),$

where $e_k$ is the $k$-th coordinate (or basis) direction and $(v_m, w_m)$ are the nodes and weights of the $M$-point Gauss-Hermite rule, the nodes being the roots of the Hermite polynomial of order $M$.
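A sketch of the quadrature idea along the coordinate axes is given below. Each partial derivative of the one-dimensional Gaussian-smoothed objective is an expectation of the form $\frac{1}{\sigma}\mathbb{E}_{v \sim \mathcal N(0,1)}[v\, f(\theta + \sigma v e_k)]$, which a Gauss-Hermite rule approximates after the change of variable $v = \sqrt{2}\,t$. The function name `dgs_gradient` and the default order are assumptions for illustration; the cited method additionally supports general orthonormal bases, which this sketch omits.

```python
import numpy as np

def dgs_gradient(f, theta, sigma=0.5, order=7):
    """Directional Gaussian smoothing gradient along coordinate axes via Gauss-Hermite quadrature."""
    # Nodes and weights for the weight function exp(-t**2) (physicists' convention).
    nodes, weights = np.polynomial.hermite.hermgauss(order)
    grad = np.zeros_like(theta, dtype=float)
    for k in range(theta.size):
        e_k = np.zeros(theta.size)
        e_k[k] = 1.0
        # Deterministic evaluations along direction e_k at the scaled quadrature nodes.
        vals = np.array([f(theta + np.sqrt(2.0) * sigma * v * e_k) for v in nodes])
        grad[k] = np.sum(weights * np.sqrt(2.0) * nodes * vals) / (np.sqrt(np.pi) * sigma)
    return grad
```

The cost is `order` evaluations per direction, and the result is deterministic for a noise-free objective. For a quadratic such as `f = lambda x: np.sum(x**2)` the rule is exact and returns `2 * theta`.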

Centering and normalization of reward signals—such as centered-rank transforms—further reduce estimator variance and improve shift- and scale-invariance (Hirschowitz et al., 13 Nov 2025).
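One common form of the centered-rank transform maps the raw returns of a population to evenly spaced values in $[-0.5, 0.5]$ according to their rank. The sketch below is a generic version of that idea. The exact shaping used in the cited work may differ, and the name `centered_ranks` is illustrative.

```python
import numpy as np

def centered_ranks(values):
    """Replace each value by its rank, rescaled to the interval [-0.5, 0.5]."""
    values = np.asarray(values, dtype=float)
    ranks = np.empty(values.size)
    # argsort gives the order of the values; inverting it gives each value's rank.
    ranks[np.argsort(values)] = np.arange(values.size)
    # Requires at least two values; ties are broken by argsort order.
    return ranks / (values.size - 1) - 0.5

print(centered_ranks([10.0, -3.0, 7.0]))  # [ 0.5 -0.5  0. ]
```

Since only the ordering of the returns matters, any positive rescaling or constant shift of the objective produces the same shaped values.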

3. Bias, Variance, and Locality Properties

The bias-variance properties of ES smoothing estimators stem from the smoothing kernel’s support and the estimation protocol:

Bias: All smoothers convolve $f$ with their kernel, introducing a bias of $O(\sigma^2)$ under standard smoothness assumptions on $f$ (Lehman et al., 2017, Hirschowitz et al., 13 Nov 2025, Zhang et al., 2020). Directional or bounded-support kernels can exert additional bias by localizing sampling to near-linear regions and enforcing maximum per-parameter changes.
Variance: ES estimators based on Gaussian MC samples have variance scaling inversely with the number of samples but growing with problem dimension. Bounded-support kernels (e.g., triangular) and antithetic sampling yield substantial variance reductions—TD-ES achieves an 83.1% reduction in empirical studies compared to standard Gaussian ES (Hirschowitz et al., 13 Nov 2025). In DGS-ES, quadrature renders estimator variance negligible except for observation noise (Zhang et al., 2020).
Locality: Bounded kernels enforce hard exploration limits, effectively instituting a “soft trust region” per update. The triangular law's mode at zero accentuates local, low-variance update steps, leading to robust convergence in the late, high-precision stage of policy refinement.

4. Algorithmic Scheduling and Practical Integration

Smoothing estimators are often positioned within broader optimization pipelines, enabling efficient exploration, robust refinement, and compatibility with gradient-based methods.

A representative schedule involves two stages (Hirschowitz et al., 13 Nov 2025):

Baseline Pre-Training: Use an on-policy gradient method (e.g., PPO) until performance plateaus or a preset threshold is met.
ES-Based Refinement: Switch to TD-ES (or similar ES smoothing estimator), inheriting parameters from PPO and allocating the remainder of environment steps to variance-reduced, local exploration. Parameters are updated by

$\theta_{t+1} = \theta_t + \alpha_t \, \widehat{g}_t$

with $\widehat{g}_t$ the TD-ES (antithetic, centered-rank) update and the step size $\alpha_t$ decayed geometrically or adaptively.
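The refinement stage can be sketched as a short loop, under several simplifying assumptions: the pre-trained parameters are passed in as `theta_init` (the PPO stage is not shown), the objective `f` is a noise-free scalar function to be maximized, centered-rank shaping is omitted, and the estimate is normalized by the triangular kernel's variance $\sigma_{\mathrm{ES}}^2/6$ so that it recovers the gradient of a linear objective. The name `es_refine` and all default values are hypothetical.

```python
import numpy as np

def es_refine(f, theta_init, sigma_es=0.2, alpha0=0.1, decay=0.99,
              n_pairs=32, n_iters=200, rng=None):
    """Refine theta_init by ascent on f using antithetic triangular perturbations."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array(theta_init, dtype=float)
    kernel_var = sigma_es ** 2 / 6.0  # variance of the symmetric triangular law
    for t in range(n_iters):
        # Bounded perturbations: each coordinate moves by at most sigma_es.
        eps = rng.triangular(-sigma_es, 0.0, sigma_es, size=(n_pairs, theta.size))
        diffs = np.array([f(theta + e) - f(theta - e) for e in eps])
        g_hat = (diffs[:, None] * eps).mean(axis=0) / (2.0 * kernel_var)
        alpha_t = alpha0 * decay ** t  # geometric step-size decay
        theta = theta + alpha_t * g_hat
    return theta
```

On a simple concave test objective such as `f = lambda x: -np.sum((x - target)**2)` the loop moves `theta` toward `target`. In a reinforcement-learning setting each call to `f` would be one or more environment rollouts, so `n_pairs` directly sets the per-update sample budget.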

Practical implementations select the smoothing scale, step size, and, when using directional kernels, orthonormal bases and quadrature orders for an appropriate efficiency-accuracy trade-off (Zhang et al., 2020, Hirschowitz et al., 13 Nov 2025).

5. Theoretical Foundations and Convergence Guarantees

The foundations of ES smoothing estimators rely on kernel convolution, Monte Carlo estimation, and control-variate analysis. Under mild regularity, the smoothed gradient approximates the true gradient with a second-order error in the smoothing radius. For symmetric, finite-moment kernels (e.g., triangular or Gaussian), Taylor expansion guarantees bias is $O(\sigma^2)$ (Hirschowitz et al., 13 Nov 2025, Lehman et al., 2017, Vicol et al., 2023).
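As a worked check of the second-order claim, consider the Gaussian kernel with $J_\sigma(\theta) = \mathbb{E}_{u \sim \mathcal N(0, I)}[f(\theta + \sigma u)]$ and assume $f$ has bounded derivatives up to fourth order (an assumption made for this sketch, not a condition quoted from the cited works). Stein's identity rewrites the estimator as an average of gradients, and a Taylor expansion of $\nabla f$ around $\theta$ gives

$\nabla_\theta J_\sigma(\theta) = \mathbb{E}_{u \sim \mathcal N(0, I)} \bigl[ \nabla f(\theta + \sigma u) \bigr] = \nabla f(\theta) + \frac{\sigma^2}{2} \nabla \bigl( \Delta f \bigr)(\theta) + O(\sigma^4),$

since terms with odd powers of $u$ have zero mean and $\mathbb{E}[u u^\top] = I$. In the one-dimensional example $f(x) = x^3$, the smoothed objective is $x^3 + 3 \sigma^2 x$, so its derivative $3x^2 + 3\sigma^2$ differs from $f'(x) = 3x^2$ by exactly $3\sigma^2$.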

Variance reduction is supported theoretically by multiple mechanisms:

Antithetic sampling eliminates leading stochastic noise terms.
Centered ranking acts like a control variate, ensuring invariance to baseline shifts in the objective.
Quadrature-based integration achieves spectral accuracy in smooth integrands, making bias arbitrarily small with sufficient quadrature order (Zhang et al., 2020, Zhang et al., 2020).
Surrogate-direction projection exploits correlation in descent directions and yields provably improved cosine alignment with the true gradient in both linear and non-linear settings (Meier et al., 2019).

For learning-rate decay and step size adaptation, a diminishing smoothing scale $\sigma_t$ and step size sequence $\alpha_t$ ensure convergence to local optima, with convergence rates governed by the estimator’s variance and the problem’s dimensionality (Hirschowitz et al., 13 Nov 2025).

6. Empirical Benchmarks and Domain Impact

ES smoothing estimators, particularly those with advanced smoothing kernels and variance-reduction schemes, deliver marked improvements in both synthetic and real-world tasks:

Robotic Manipulation: PPO→TD-ES pipelines improve success rates by up to 26.5% over PPO alone, with systematic variance reduction and robust final-stage refinement. In ablations, triangular smoothing achieves the highest mean success and lowest variance, especially in precision tasks (Hirschowitz et al., 13 Nov 2025).
Reinforcement Learning Benchmarks: DGS-ES converges much faster than vanilla ES or other baselines on MountainCarContinuous, Pendulum, Hopper, and others. Large smoothing radii prove essential for escaping local minima (Zhang et al., 2020).
High-dimensional Black-box Optimization: Directional quadrature-based ES drastically improves convergence and estimator stability for high-dimensional nonconvex benchmark problems (e.g., Ackley, Rastrigin), and in engineering PDE-constrained applications (Zhang et al., 2020).
Meta-optimization and Long-horizon Learning: In unrolled optimization graphs, the ES-Single approach exhibits variance independent of unroll length, in contrast to Persistent ES—enabling stable meta-learning in regimes where classical ES, BPTT, or PES struggle (Vicol et al., 2023).

Hybrid approaches such as SV-ES, which blends ES smoothing with Stein variational updates, extend ES smoothing estimators to inference over unnormalized densities, bringing together the derivative-free and variational inference toolkits (Braun et al., 2024).

7. Extensions, Control Variates, and Variance-Reduction Enhancements

Advanced smoothing estimators exploit structural properties of the problem or data to further reduce variance:

Structured control variates combine standard ES and reparameterization-based estimators, optimally blending them via per-coordinate adaptive weights to minimize estimator variance while maintaining unbiasedness (Tang et al., 2019).
Surrogate directions and memory: Reuse of past update directions or surrogates, orthogonalization, and projection in high-dimensional settings accelerates alignment with the true gradient and hedges against noisy objective evaluations (Meier et al., 2019).

Such methodological advances extend the practical reach of ES smoothing estimators to increasingly high-dimensional, stochastic, and ill-posed optimization tasks—achieving superior efficiency, robustness, and reliability compared to both classical black-box search and naive finite-difference schemes.