
Evolution Strategies (ES): Scalable Optimization

Updated 13 April 2026
  • Evolution Strategies (ES) are population-based optimization methods that adapt probabilistic search distributions solely using objective function evaluations.
  • Variants such as NES and CMA-ES integrate techniques like natural gradients and covariance adaptation to efficiently tackle high-dimensional and noisy optimization challenges.
  • ES methods excel in scalability and parallelization, making them valuable for applications in reinforcement learning, robotics, and robust control.

Evolution Strategies (ES) are a class of population-based, black-box optimization algorithms that iteratively adapt a probabilistic search distribution over candidate solutions in parameter space, using only evaluations of the objective function. ES were originally conceived for real-parameter optimization in continuous domains, but over the past decade they have emerged as a scalable and competitive alternative to gradient-based reinforcement learning (RL) and black-box optimization in high-dimensional, non-differentiable, and parallel computing settings. Modern ES variants—including canonical (μ,λ)-ES, Natural Evolution Strategies (NES), Covariance Matrix Adaptation ES (CMA-ES), and their distributed implementations—have demonstrated state-of-the-art performance on deep RL benchmarks, challenging traditional policy-gradient and Q-learning methods and sparking renewed interest in their algorithmic foundations, theoretical properties, and practical enhancements (Salimans et al., 2017, Chrabaszcz et al., 2018).

1. Core Algorithmic Principles and Objective Formulation

ES treat the objective function $J(\theta)$ (e.g., the expected return of a parameterized policy $\pi_\theta$) as a black box and optimize it by stochastic search and selection in parameter space. The typical ES loop involves:

  • Sampling a set of perturbations $\{\epsilon_i\}_{i=1}^N$ from an isotropic multivariate normal $\mathcal{N}(0, I)$.
  • Evaluating the objective $J(\theta + \sigma \epsilon_i)$ for each perturbed candidate.
  • Estimating the search gradient using the score-function estimator:

$$\nabla_\theta J(\theta) \approx \frac{1}{\sigma N} \sum_{i=1}^N J(\theta + \sigma \epsilon_i)\,\epsilon_i.$$

  • Updating the search distribution mean, typically via stochastic gradient ascent:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta).$$

Variants introduce rank normalization or weighted recombination for robustness, with weights $w_j$ decreasing with the rank of each candidate’s fitness (Chrabaszcz et al., 2018, Salimans et al., 2017).
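The loop above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a tuned implementation: the toy objective, population size, learning rate, and rank-shaping scheme below are arbitrary choices for demonstration.

```python
import numpy as np

def es_step(theta, f, sigma=0.1, n_pop=50, alpha=0.05, rng=None):
    """One ES update: sample perturbations, evaluate, rank-normalize
    the returns, and ascend the score-function gradient estimate."""
    if rng is None:
        rng = np.random.default_rng()
    eps = rng.standard_normal((n_pop, theta.size))      # eps_i ~ N(0, I)
    returns = np.array([f(theta + sigma * e) for e in eps])
    ranks = returns.argsort().argsort()                 # rank 0 = worst return
    shaped = ranks / (n_pop - 1) - 0.5                  # centered rank weights
    grad = shaped @ eps / (sigma * n_pop)               # score-function estimate
    return theta + alpha * grad

# Illustrative usage: maximize f(x) = -||x - 1||^2 from a random start.
rng = np.random.default_rng(0)
theta = rng.standard_normal(5)
f = lambda x: -np.sum((x - 1.0) ** 2)
for _ in range(300):
    theta = es_step(theta, f, rng=rng)
```

Rank shaping discards the scale of the returns, which makes the update invariant to monotone transformations of the objective and robust to outlier fitness values.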

2. Algorithmic Variants: Canonical ES, NES, and CMA-ES

Canonical (μ,λ)-ES

The canonical (μ,λ)-ES maintains a population of λ offspring, selects the top μ according to fitness, and recombines them (typically via a weighted mean of the associated perturbations) to update the mean. It is highly parallelizable: all rollouts are independent and only scalar returns need to be communicated, facilitating scaling to thousands of cores (Salimans et al., 2017).
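A minimal sketch of one canonical (μ,λ)-ES step, using the log-rank recombination weights common in CMA-ES-style recombination; the objective and hyperparameters are illustrative only, not the settings of any cited paper.

```python
import numpy as np

def canonical_es_step(theta, f, sigma=0.1, lam=40, mu=10, rng=None):
    """One canonical (mu, lambda)-ES step: rank the lambda offspring and
    recombine the top-mu perturbations via a weighted mean."""
    if rng is None:
        rng = np.random.default_rng()
    eps = rng.standard_normal((lam, theta.size))
    fitness = np.array([f(theta + sigma * e) for e in eps])
    elite = np.argsort(fitness)[::-1][:mu]               # top-mu, best first
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))  # log-rank weights
    w /= w.sum()
    return theta + sigma * (w @ eps[elite])

# Illustrative usage on a shifted sphere objective.
rng = np.random.default_rng(1)
theta = rng.standard_normal(5) + 3.0
f = lambda x: -np.sum(x ** 2)
for _ in range(400):
    theta = canonical_es_step(theta, f, rng=rng)
```

Note that only the λ scalar fitness values enter the update; in a distributed setting each worker would return just its scalar, which is exactly what makes the method cheap to parallelize.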

Natural Evolution Strategies (NES)

NES optimize $J(\theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}[f(\theta + \sigma \epsilon)]$ and use natural gradients, imposing a KL trust-region constraint on the search distribution. The NES update takes the form:

$$\delta\theta \propto F_\theta^{-1} \nabla_\theta J(\theta),$$

where $F_\theta$ is the Fisher information matrix of the search distribution. In high-dimensional problems, the diagonal (separable) variant SNES is typically employed for computational efficiency (Lenc et al., 2019).
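For a diagonal Gaussian the natural gradient has a closed form, so the separable NES update can be sketched directly; the centered-rank utilities and learning rates below are illustrative defaults, not the exact choices of the cited work.

```python
import numpy as np

def snes_step(mu_vec, sigma_vec, f, n_pop=30, eta_mu=1.0, eta_sigma=None, rng=None):
    """One separable-NES step: natural-gradient updates for the mean and
    the per-coordinate standard deviations of a diagonal Gaussian."""
    if rng is None:
        rng = np.random.default_rng()
    d = mu_vec.size
    if eta_sigma is None:
        eta_sigma = (3 + np.log(d)) / (5 * np.sqrt(d))   # a common default
    s = rng.standard_normal((n_pop, d))                  # s_i ~ N(0, I)
    fitness = np.array([f(mu_vec + sigma_vec * si) for si in s])
    u = fitness.argsort().argsort() / (n_pop - 1) - 0.5  # centered rank utilities
    grad_mu = u @ s / n_pop                              # natural grad wrt mean
    grad_sigma = u @ (s ** 2 - 1) / n_pop                # natural grad wrt log-sigma
    mu_vec = mu_vec + eta_mu * sigma_vec * grad_mu
    sigma_vec = sigma_vec * np.exp(0.5 * eta_sigma * grad_sigma)
    return mu_vec, sigma_vec

# Illustrative usage on a 4-D sphere objective.
rng = np.random.default_rng(2)
mu_vec, sigma_vec = np.full(4, 5.0), np.ones(4)
f = lambda x: -np.sum(x ** 2)
for _ in range(500):
    mu_vec, sigma_vec = snes_step(mu_vec, sigma_vec, f, rng=rng)
```

The multiplicative exponential update keeps each standard deviation positive, and on the sphere objective the step sizes shrink automatically as the mean approaches the optimum.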

Covariance Matrix Adaptation Evolution Strategy (CMA-ES)

CMA-ES extends NES by explicitly adapting a full covariance matrix $C$ and a global step size $\sigma$ (often through path-length–based adaptation). This allows the algorithm to align search directions with the local geometry of the optimization landscape. Empirically, CMA-ES and related variants are the reference ES family for expensive, ill-conditioned, and high-dimensional real-parameter black-box problems (Müller et al., 2018).

3. Distinctive Properties and Robustness

ES algorithms differ fundamentally from classic finite-difference approximations. While both utilize perturbation-based gradient estimates, ES optimize the expected fitness under a fixed-variance parameter distribution:

$$\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[f(\theta + \sigma \epsilon)\right],$$

rather than the gradient at a point. This makes ES seek regions of parameter space robust to perturbations, inherently biasing toward flat maxima ("robustness-seeking") (Lehman et al., 2017). Empirical measurements show that policies optimized by ES are significantly more robust to parameter noise than those found by policy gradient methods or mutation-only GAs (e.g., retaining ~80% performance under noise, versus ~25% for TRPO) (Lehman et al., 2017).

This robustness comes at a cost: when the optimum is in a narrow peak or along a ridge, high-σ ES can stall unless σ is annealed. Conversely, in multimodal or deceptive landscapes, robustness-seeking allows ES to escape fleeting peaks and integrate over wide basins.
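A small worked example of this robustness-seeking bias: a tall, narrow peak beats a wide, lower one pointwise, but under Gaussian smoothing (the objective ES actually ascends) the wide peak wins. The particular peak shapes and smoothing scale below are arbitrary illustrations.

```python
import numpy as np

# Two 1-D maxima: a tall, narrow peak at x = 0 and a lower, wide peak at x = 4.
def f(x):
    return 1.0 * np.exp(-(x / 0.1) ** 2) + 0.6 * np.exp(-((x - 4.0) / 2.0) ** 2)

def smoothed(x, sigma=0.5, n=100_000, seed=0):
    """Monte Carlo estimate of E_{eps ~ N(0,1)}[f(x + sigma * eps)],
    the smoothed objective that ES actually ascends."""
    eps = np.random.default_rng(seed).standard_normal(n)
    return f(x + sigma * eps).mean()

# Pointwise, the narrow peak is higher ...
narrow_wins_pointwise = f(0.0) > f(4.0)            # True
# ... but under sigma = 0.5 smoothing the wide basin dominates,
# so ES drifts toward the flat maximum.
wide_wins_smoothed = smoothed(4.0) > smoothed(0.0)
```

Convolving the narrow peak with the sampling Gaussian flattens it dramatically, while the wide peak is nearly unchanged; this is the mechanism behind the flat-maxima bias discussed above.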

4. Parallelization, Scalability, and Sample Efficiency

One of the primary algorithmic strengths of ES is perfect parallelization. Each perturbation is independent, and with the "common random numbers" communication trick, only scalar fitness values need to be transmitted from workers to the master process. This design enables near-linear scaling with compute resources—solving, e.g., 3D humanoid walking in 10 minutes using 720 CPU cores (Salimans et al., 2017).
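The seed-sharing idea can be sketched as follows: every party can regenerate any perturbation from its integer seed, so workers and master exchange only (seed, scalar return) pairs. This is a single-process sketch of the communication pattern, with an added mean baseline for variance reduction; it is not the cited implementation.

```python
import numpy as np

def perturbation(seed, dim):
    """Any party can regenerate a perturbation from its integer seed."""
    return np.random.default_rng(seed).standard_normal(dim)

def worker_evaluate(theta, sigma, seed, f):
    """Runs on a worker: rebuild eps from the seed, return one scalar."""
    return f(theta + sigma * perturbation(seed, theta.size))

def master_update(theta, sigma, alpha, results):
    """Combine (seed, return) pairs; eps is rebuilt locally, never sent.
    A mean baseline is subtracted to reduce estimator variance."""
    n = len(results)
    baseline = sum(r for _, r in results) / n
    grad = sum((r - baseline) * perturbation(seed, theta.size)
               for seed, r in results)
    return theta + alpha * grad / (sigma * n)

# Single-process simulation of the master/worker exchange.
theta, sigma, alpha = np.zeros(3), 0.1, 0.01
f = lambda x: -np.sum((x - 2.0) ** 2)
for step in range(200):
    seeds = range(step * 100, step * 100 + 100)
    results = [(s, worker_evaluate(theta, sigma, s, f)) for s in seeds]
    theta = master_update(theta, sigma, alpha, results)
```

Because the full perturbation vectors are never transmitted, per-step communication is a few bytes per worker regardless of parameter dimensionality.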

However, sample efficiency remains a limitation. ES generally require more environment interactions than policy-gradient RL methods, as each update discards the sampled trajectories. Extensions such as Importance Weighted ES (IW-ES) improve sample efficiency by performing multiple parameter updates per batch using importance sampling corrected by the overlap of search distributions. With proper step-size and importance weighting, IW-ES can increase data efficiency and reduce wall-clock convergence time, though excessive updates can destabilize training (Campos et al., 2018).
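A hypothetical sketch of the importance-weighting idea, assuming isotropic Gaussian search distributions: samples drawn around the previous mean are reweighted by the density ratio under the updated mean, so the same batch can support a further gradient estimate. The function names and the self-normalized weighting are illustrative; the details of IW-ES in the cited work may differ.

```python
import numpy as np

def log_gauss(x, mean, sigma):
    """Log-density (up to a constant) of the isotropic Gaussian N(mean, sigma^2 I)."""
    return -0.5 * np.sum((x - mean) ** 2) / sigma ** 2

def iw_gradient(theta_new, theta_old, sigma, samples, returns):
    """Re-estimate the ES gradient at theta_new from samples drawn
    around theta_old, using self-normalized importance weights."""
    logw = np.array([log_gauss(x, theta_new, sigma) - log_gauss(x, theta_old, sigma)
                     for x in samples])
    w = np.exp(logw - logw.max())
    w /= w.sum()                                # self-normalized weights
    eps_new = (samples - theta_new) / sigma     # perturbations seen from theta_new
    return (w * returns) @ eps_new / sigma

# With theta_new == theta_old the weights are uniform and this reduces
# to the ordinary score-function estimator.
rng = np.random.default_rng(3)
theta = np.zeros(4)
sigma = 0.1
samples = theta + sigma * rng.standard_normal((20, 4))
returns = -np.sum(samples ** 2, axis=1)
g = iw_gradient(theta, theta, sigma, samples, returns)
```

As the updated mean drifts away from the sampling mean, the effective sample size of the weights collapses, which is one reason too many reuse steps destabilize training.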

5. Modular Extensions, Exploration, and Practical Applications

ES have been hybridized with directed exploration methods (e.g., novelty search, quality diversity) to handle sparse and deceptive RL domains. Algorithms like NS-ES, NSR-ES, and NSRA-ES modify the update rule to maximize novelty or a reward-novelty mixture, using meta-populations and archives to promote exploration and avoid local optima (Conti et al., 2017).

Structural modularization of ES is supported by extracting discrete “modules” (e.g., mirrored or orthogonal sampling, elitism, pairwise selection, population restarts), allowing automated algorithm configuration tailored to problem classes. Recent work demonstrates that such modular recombination can yield architectures outperforming classic CMA-ES variants on standardized benchmarks (Rijn et al., 2016).

Distributed ES implementations have enabled the training of massive models (with up to billions of parameters) using low-rank perturbation techniques (e.g., EGGROLL), which substantially reduce the per-layer memory and computation of per-worker updates while closely matching the accuracy of full ES updates (Sarkar et al., 20 Nov 2025).

Applications extend beyond RL: ES have been successfully applied to non-differentiable supervised learning tasks with hybrid SGD+ES algorithms for discrete parameter optimization (e.g., sparsity masks) (Lenc et al., 2019), quantum neural network training (providing evaluation-budget–agnostic, dimension-independent optimization compared to parameter-shift rules) (Friedrich et al., 2022), and policy refinement stages with bounded, antithetic perturbations for robust robotic manipulation (Hirschowitz et al., 13 Nov 2025).

6. Theoretical Properties and Convergence Analysis

Recent theoretical advances provide rigorous guarantees for the global linear convergence of ES and related step-size adaptive schemes with covariance adaptation. The (1+1)-ES, which includes the classic one-fifth success rule and comparison-based covariance updates, achieves geometric (linear-in-iteration) convergence on quadratically bounded (strongly convex up to monotonic warp) and positively homogeneous functions (Akimoto et al., 2020).

Further, rigorous convergence rates are established for the $(\mu/\mu_w, \lambda)$-ES with weighted recombination on scaling-invariant (homogeneous) function classes, linking convergence precisely to the expected log step-size increment on linear test functions. This unification clarifies the regimes where CMA-ES and related ES attain or fail to attain global optimality, and provides practical guidance for tuning population and recombination settings (Touré et al., 2021).

Mathematically, as the dimensionality $d \to \infty$, the ES gradient estimator converges to a normalized finite-difference estimator; statistical concentration properties of the Gaussian norm underpin this equivalence (Raisbeck et al., 2019). Theoretically, covariance accumulation by selection alone in (1,λ)-ES aligns the statistical covariance of winners to the principal directions of the local Hessian, laying a foundation for derandomized adaptation schemes in CMA-ES (Shir et al., 2016).

7. Challenges, Limitations, and Future Prospects

While ES are robust, parallelizable, and flexible, they face inherent sample efficiency limitations and difficulties in high-noise, high-dimensional, or memory-dependent RL environments. ES can struggle with partial observability, require large populations for reliable adaptation, and may plateau due to vanishing signal-to-noise ratios in high dimensions or under additive noise (Müller et al., 2018). Adaptive population sizing, uncertainty-aware selection, and subspace or limited-memory metrics (e.g., LM-MA-ES) are necessary to ensure practical efficacy at modern neural network scales.

Emerging directions include ES variants with structure-aware control variates for variance reduction via underlying MDP structure (Tang et al., 2019), deep generative ES using flexible search distributions parameterized by invertible neural networks (Faury et al., 2019), and modular structures that automatically configure and evolve ES variants for new domains (Rijn et al., 2016). Hybridization with gradient-based and model-based methods for sample efficiency, as well as automated uncertainty handling in massively parallel settings, remains an active area of research.

In summary, Evolution Strategies provide theoretically-justified, compute-scalable, and algorithmically flexible approaches for black-box optimization in large, noisy, or non-differentiable domains, with ongoing advances addressing sample efficiency, controlled exploration, and domain adaptation (Salimans et al., 2017, Chrabaszcz et al., 2018, Campos et al., 2018, Lehman et al., 2017, Müller et al., 2018, Akimoto et al., 2020, Sarkar et al., 20 Nov 2025).
