CMA-ES: Covariance Matrix Adaptation Evolution Strategy

Updated 31 August 2025
  • CMA-ES is a state-of-the-art optimization algorithm that uses a multivariate normal distribution to sample candidate solutions and adapt its covariance matrix.
  • It integrates rank-one and rank-μ updates with cumulative step-size adaptation to efficiently capture local landscape curvature and ensure affine invariance.
  • The method demonstrates robust performance in diverse applications, with extensions to large-scale, discrete, and multi-modal optimization challenges.

The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is a state-of-the-art stochastic, derivative-free optimization algorithm for continuous, non-linear, and potentially ill-conditioned or multi-modal objective functions. CMA-ES iteratively adapts a multivariate normal sampling distribution over the search space, exploiting estimates of the local landscape curvature by learning and updating its covariance structure. The method is widely regarded for its invariance properties, self-adaptation capabilities, and robust performance across diverse benchmark and real-world optimization tasks.

1. Algorithmic Foundation and Design

CMA-ES maintains a search distribution $N(m, \sigma^2 C)$, where $m \in \mathbb{R}^n$ is the mean (search center), $\sigma > 0$ is the global step-size, and $C \in \mathbb{R}^{n \times n}$ is the symmetric positive definite covariance matrix. At each iteration, a population of $\lambda$ candidate solutions is sampled:

$$x_k = m + \sigma \, y_k, \qquad y_k \sim N(0, C), \qquad k = 1, \ldots, \lambda$$

The top $\mu$ candidates, ranked by objective function value, are selected, and the mean is updated via weighted recombination:

$$m^{(g+1)} = \sum_{i=1}^{\mu} w_i \, x_{i:\lambda}^{(g+1)}$$
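
A minimal numpy sketch of the sampling and weighted-recombination steps above; the population size, weights, and toy objective are illustrative choices rather than the reference defaults.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, mu = 5, 12, 6                        # dimension, population size, number of parents

m = np.zeros(n)                              # mean (search center)
sigma = 0.5                                  # global step-size
C = np.eye(n)                                # covariance matrix

w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
w = w / w.sum()                              # positive, decreasing recombination weights

def f(x):                                    # toy objective (sphere function)
    return float(x @ x)

# sample x_k = m + sigma * y_k with y_k ~ N(0, C)
A = np.linalg.cholesky(C)
y = rng.standard_normal((lam, n)) @ A.T
x = m + sigma * y

# rank by objective value and recombine the best mu candidates
order = np.argsort([f(xi) for xi in x])
m_new = w @ x[order[:mu]]
```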

The covariance matrix adaptation comprises two principal components:

  • Rank-$\mu$ update: Incorporates weighted sample covariance among successful steps.
  • Rank-one update: Uses the evolution path $p_c$ to accumulate improvements in consecutive directions, encoding long-term progress in the search.

The complete covariance update is:

$$C^{(g+1)} = (1 - c_1 - c_{\mu})\, C^{(g)} + c_1\, p_c^{(g+1)} p_c^{(g+1)\top} + c_{\mu} \sum_{i=1}^{\mu} w_i\, y_{i:\lambda}^{(g+1)} y_{i:\lambda}^{(g+1)\top}$$

where $c_1$, $c_{\mu}$ are learning rates, and $y_{i:\lambda}^{(g+1)} = (x_{i:\lambda}^{(g+1)} - m^{(g)})/\sigma^{(g)}$.
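
The two covariance terms can be written down directly. In the sketch below the learning rates, the evolution path $p_c$, and the selected steps $y_{i:\lambda}$ are stand-in values rather than quantities produced by an actual run.

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu = 5, 6

C = np.eye(n)                                # current covariance matrix C^(g)
w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
w = w / w.sum()                              # recombination weights

c1, cmu = 0.02, 0.05                         # illustrative learning rates
p_c = 0.1 * rng.standard_normal(n)           # stand-in accumulated evolution path
y_sel = rng.standard_normal((mu, n))         # stand-in selected steps y_{i:lambda}

rank_one = np.outer(p_c, p_c)
rank_mu = sum(w[i] * np.outer(y_sel[i], y_sel[i]) for i in range(mu))

C_next = (1 - c1 - cmu) * C + c1 * rank_one + cmu * rank_mu
```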

Step-size adaptation is controlled via the Cumulative Step-Size Adaptation (CSA) mechanism, which maintains a second evolution path $p_{\sigma}$ to adapt $\sigma$:

$$\sigma^{(g+1)} = \sigma^{(g)} \exp\left( \frac{c_{\sigma}}{d_{\sigma}} \left( \frac{\| p_{\sigma}^{(g+1)} \|}{E_n} - 1 \right) \right)$$

where $E_n$ is the expectation of the norm of a standard $n$-dimensional Gaussian vector, and $c_{\sigma}$, $d_{\sigma}$ are step-size learning and damping rates.
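
A sketch of the CSA rule above; the expressions for $c_{\sigma}$, $d_{\sigma}$, and the approximation of $E_n$ follow formulas commonly used in CMA-ES implementations, but exact defaults differ between variants.

```python
import numpy as np

n = 5
mu_eff = 3.5                                            # variance-effective selection mass (illustrative)
c_sigma = (mu_eff + 2) / (n + mu_eff + 5)               # step-size learning rate
d_sigma = 1 + 2 * max(0.0, np.sqrt((mu_eff - 1) / (n + 1)) - 1) + c_sigma   # damping
E_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n**2))  # approximation of E||N(0, I_n)||

sigma = 0.5
p_sigma = 0.3 * np.random.default_rng(2).standard_normal(n)  # stand-in evolution path

sigma_next = sigma * np.exp((c_sigma / d_sigma) * (np.linalg.norm(p_sigma) / E_n - 1))
```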

2. Information Geometric and Natural Gradient Perspectives

CMA-ES can be rigorously interpreted through the lens of information geometry by regarding the family of multivariate normal distributions as a Riemannian manifold with the Fisher metric. The algorithm performs expected fitness maximization on this manifold, with steepest ascent given by the natural gradient:

$$\tilde{\nabla} J(\theta) = F(\theta)^{-1} \nabla J(\theta)$$

where $F(\theta)$ is the Fisher information matrix. For multivariate normals with direct parameterization, the natural gradient can be computed without explicit inversion of $F(\theta)$, yielding update rules that coincide with the rank-$\mu$-only CMA-ES. The result is that CMA-ES steps in the direction of steepest expected fitness improvement with respect to the geometry of the statistical manifold, ensuring affine invariance and efficient adaptation (Akimoto et al., 2012).
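
For concreteness, a common way to write the ranking-weighted Monte Carlo estimate of this natural gradient for the Gaussian parameters $\theta = (m, \Sigma)$ is sketched below; this follows the IGO/NES-style formulation rather than quoting the cited derivation verbatim.

```latex
% Ranking-weighted estimate of the natural gradient of expected fitness
% for theta = (m, Sigma) under N(m, Sigma) (IGO/NES-style sketch).
\begin{align}
  \widetilde{\nabla}_m J      &\approx \sum_{i=1}^{\mu} w_i \,\bigl(x_{i:\lambda} - m\bigr), \\
  \widetilde{\nabla}_\Sigma J &\approx \sum_{i=1}^{\mu} w_i \,\Bigl[\bigl(x_{i:\lambda} - m\bigr)\bigl(x_{i:\lambda} - m\bigr)^{\top} - \Sigma\Bigr].
\end{align}
```

A natural-gradient ascent step $\Sigma \leftarrow \Sigma + \eta_\Sigma \widetilde{\nabla}_\Sigma J$ with $\Sigma = \sigma^2 C$ then reproduces the rank-$\mu$ part of the covariance update above.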

This framework justifies the use of constant learning rates and provides monotonic improvement conditions by analyzing the lower bound of the expected fitness, drawing close connections to generalized expectation-maximization procedures.

3. Covariance Learning and Second-Order Landscape Recovery

In the vicinity of a quadratic optimum, the distribution of sampling points approaches:

$$\mathbb{E}[x x^\top] = \mathcal{C} = \mathcal{H}^{-1}$$

where $\mathcal{H}$ is the Hessian of the objective. Hence, CMA-ES learns an empirical approximation to the inverse Hessian through covariance adaptation. However, in high-dimensional or highly ill-conditioned landscapes, the default covariance learning rate (scaling as $1/n^2$) and strong contraction of the step-size near convergence can prevent accurate recovery of curvature information. Forced Optimal Covariance Adaptive Learning (FOCAL) mitigates this by increasing the covariance learning rate ($c_{\mathrm{cov}} \in [0.01, 0.1]$) and forcibly bounding the step-size away from zero via:

$$\sigma^{(g)} = \frac{\sigma_0}{\bigl(\lambda_{\min}^{(g)}\bigr)^{\alpha}}$$

where $\lambda_{\min}^{(g)}$ is the smallest eigenvalue of $C^{(g)}$ and $\alpha \in (0, 0.5)$. This modification maintains significant sampling in all directions near the optimum, enabling high-fidelity Hessian estimation even in high-dimensional settings (Shir et al., 2011).
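
A small numerical sketch of the curvature-recovery idea: for a quadratic objective with Hessian $H$, samples whose covariance is proportional to $H^{-1}$ allow $H$ to be recovered (up to scale) by inverting the empirical covariance; a FOCAL-style step-size floor is shown alongside with illustrative constants.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_samples = 4, 20000

A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)                  # ground-truth Hessian of f(x) = 0.5 x^T H x

L = np.linalg.cholesky(np.linalg.inv(H))
y = rng.standard_normal((n_samples, n)) @ L.T   # samples with covariance H^{-1}

C_emp = np.cov(y, rowvar=False)              # empirical "learned" covariance
H_est = np.linalg.inv(C_emp)                 # approximate Hessian recovery
print(np.max(np.abs(H_est - H)) / np.max(np.abs(H)))   # small relative deviation

# FOCAL-style bound keeps the step-size away from zero near convergence
sigma0, alpha = 0.3, 0.25                    # illustrative constants, alpha in (0, 0.5)
sigma = sigma0 / np.linalg.eigvalsh(C_emp).min() ** alpha
```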

FOCAL has been experimentally validated for both modeled landscapes and quantum control tasks, demonstrating its utility for robust Hessian recovery, sensitivity analysis, and uncovering physical mechanism signatures in experimental data.

4. Invariance Properties and Practical Robustness

CMA-ES exhibits:

  • Monotonic transformation invariance: The ranking-based selection ignores any monotonic scaling of the objective.
  • Affine invariance: Performance is unaffected by translations, rotations, or scaling transformations of the search space, provided the initial distribution is transformed accordingly.
  • Scale and rotation invariance in updates: Owing to adaptation of the full covariance structure, the method performs robustly on non-separable and rotated problems.
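
A quick illustration of the first property: because selection depends only on the ranking of objective values, any strictly increasing transformation of the objective leaves the selected parents, and hence the update, unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal((10, 3))             # candidate solutions

f = lambda v: float(v @ v)                   # original objective
g = lambda t: np.exp(3 * t) + 7.0            # strictly increasing transformation

rank_f = np.argsort([f(v) for v in x])
rank_g = np.argsort([g(f(v)) for v in x])
print(np.array_equal(rank_f, rank_g))        # True: identical ranking, identical selection
```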

Robustness to noise is further enhanced by variants such as learning rate adaptation (LRA-CMA-ES), which maintains a constant signal-to-noise ratio in the update:

$$\text{SNR} = \frac{\|\mathbb{E}[\Delta]\|^2}{\operatorname{Tr}(F \cdot \operatorname{Cov}[\Delta])} \approx \alpha \eta$$

where $\eta$ is the learning rate and $\Delta$ is the update. Online adaptation of $\eta$ enables the algorithm to avoid over-aggressive or insufficient adaptation under noisy or highly multi-modal conditions, eliminating the need for expensive manual tuning (Nomura et al., 2023; Nomura et al., 29 Jan 2024).
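
To make the quantity concrete, the snippet below computes the signal-to-noise ratio of a batch of noisy update vectors, with $F$ taken as the identity for simplicity; it illustrates only the monitored quantity, not the learning-rate adaptation rule itself.

```python
import numpy as np

rng = np.random.default_rng(6)
true_step = np.array([0.2, -0.1, 0.05])      # underlying "signal" direction

deltas = true_step + 0.3 * rng.standard_normal((64, 3))   # noisy update realizations

signal = np.linalg.norm(deltas.mean(axis=0)) ** 2          # ||E[Delta]||^2 estimate
noise = np.trace(np.cov(deltas, rowvar=False))             # Tr(Cov[Delta]), i.e. F = I
snr = signal / noise
print(snr)
```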

5. Extensions: Large-Scale, Discrete, and Multi-Modal Optimization

To address the quadratic time and space complexity in high dimensions ($\mathcal{O}(n^2)$), low-memory variants such as LM-MA-ES restrict the adaptation to a set of $m = \mathcal{O}(\log n)$ direction vectors, achieving $\mathcal{O}(n \log n)$ complexity while maintaining near-parity in solution quality (Loshchilov et al., 2017). Cholesky-based or rank-one approximation methods further reduce computational overhead without sacrificing adaptation fidelity (Li et al., 2017).
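
The memory saving can be illustrated with a generic low-rank-plus-identity sampling model: instead of an explicit $n \times n$ covariance, only $m$ direction vectors are stored and samples are drawn in $\mathcal{O}(nm)$ time. This is a simplified stand-in for the idea, not the exact LM-MA-ES update.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
m = max(4, int(np.log(n)))                   # m = O(log n) stored direction vectors

V = 0.1 * rng.standard_normal((m, n))        # stand-in learned directions

def sample(k):
    """Draw k vectors with covariance I + V^T V without forming an n x n matrix."""
    z = rng.standard_normal((k, n))          # isotropic component
    g = rng.standard_normal((k, m))          # coefficients of the low-rank component
    return z + g @ V                         # Cov = I + V^T V

y = sample(8)                                # 8 samples at O(k * n * m) cost
```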

Discrete and mixed-integer CMA-ES versions use multivariate binomial (Bernoulli) distributions, retaining the ability to model variable interactions via exponential family parameterizations and extending the method to problems where search variables are fundamentally discrete (Benhamou et al., 2018, Watanabe et al., 2023).

For multi-modal optimization, modifications incorporate niching strategies and dynamic adaptation of population size to maintain sub-populations ("niches") around multiple optima. Fitness assignment aggregates contributions to all known or hypothesized optima, and proximity-based weighting ensures population diversity, mitigating premature convergence (Karunarathne et al., 1 Jul 2024). On benchmark suites for multi-modal functions, such methods exhibit rapid convergence to all global optima, high peak ratio, and stable F1 scores across variable dimensions.

6. Applications in Machine Learning and Beyond

CMA-ES has found widespread application in black-box optimization for:

  • Hyperparameter optimization: Leveraging invariance to scalings and monotonic transformations, it excels at high-dimensional, noisy deep learning hyperparameter search, and its implementation enables efficient parallel evaluation (Loshchilov et al., 2016).
  • Neural architecture search (NAS): Encoding architectures as Euclidean vectors and updating the search distribution based on surrogate model predictions allows significant reductions in search cost while achieving competitive accuracy on benchmarks such as CIFAR-10/100 and ImageNet (Sinha et al., 2021).
  • Quality diversity ("illumination"): Hybrid algorithms combining CMA-ES adaptation with MAP-Elites archiving illuminate the behavior space by maintaining diversified archives of high-quality solutions, beating classic MAP-Elites in both solution quality and diversity metrics (Fontaine et al., 2019).
  • Experimental physics and engineering: FOCAL and classic CMA-ES have enabled landscape probing and sensitivity analysis in quantum control experiments, design optimization, and structural engineering—especially where derivative information is inaccessible.
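
As a usage illustration for the first bullet, the sketch below assumes the pycma package (`cma`) and its ask/tell interface; the batch of candidates returned by `ask` can be evaluated in parallel before being passed back via `tell`. The objective here is a stand-in for, e.g., a validation loss over hyperparameters.

```python
import cma

def objective(x):
    # stand-in for an expensive evaluation, e.g. validation loss at hyperparameters x
    return sum(xi ** 2 for xi in x)

es = cma.CMAEvolutionStrategy(5 * [0.5], 0.3)      # initial mean and initial step-size
while not es.stop():
    candidates = es.ask()                          # batch of candidate solutions
    losses = [objective(c) for c in candidates]    # embarrassingly parallel in practice
    es.tell(candidates, losses)
print(es.result.xbest)
```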

7. Comparative Analysis and Theoretical Guarantees

CMA-ES's theoretical foundation—rooted in information geometry—supports its equivalence (under appropriate conditions) with natural gradient ascent and establishes ranges for learning rates leading to monotonic expected fitness improvement (Akimoto et al., 2012). Comparative experiments against Bayesian optimization, tree-based estimators, and traditional evolutionary algorithms indicate strong performance, especially when high evaluation budgets or parallelism are available (Loshchilov et al., 2016).

Extensions with online hyperparameter self-adaptation further eliminate manual tuning overhead and dynamically adjust key learning parameters in response to problem difficulty (Loshchilov et al., 2014). For multimodal, noisy, and ill-conditioned benchmarks, learning rate and reevaluation adaptation (e.g., RA-CMA-ES) substantially improve reliability and convergence speed, with explicit mechanisms linking the number of reevaluations to the empirical correlation between stochastic update directions, thereby ensuring robustness under multiplicative noise (Uchida et al., 19 May 2024).


In summary, CMA-ES and its modern extensions constitute a mathematically principled, practically robust, and highly flexible framework for black-box optimization in real-valued, discrete, or mixed search spaces, with strong empirical performance and deep connections to information geometry, natural gradient methods, and unsupervised estimation of landscape curvature. Continued research addresses scalability, noise handling, niche maintenance in multi-modal settings, and integration with surrogate models and deep learning frameworks.