
CMA-ES: Covariance Matrix Adaptation

Updated 3 April 2026
  • CMA-ES is a stochastic, model-based evolutionary algorithm that adapts a multivariate normal distribution to efficiently explore ill-conditioned, non-separable optimization landscapes.
  • Its iterative updates use natural-gradient steps and cumulative evolution paths to adjust the mean, covariance, and step-size, ensuring invariance under rotation and scaling.
  • Extensions such as LM-CMA-ES, MF-CMA-ES, and PCA-assisted CMA-ES enhance scalability and efficiency for high-dimensional and noisy optimization problems.

Covariance Matrix Adaptation (CMA-ES) is a stochastic, model-based evolutionary algorithm for unconstrained optimization of non-linear, non-convex functions in continuous domains. The defining characteristic of CMA-ES is its adaptive search distribution—a multivariate normal whose covariance matrix is iteratively updated to capture the objective’s underlying structure, enabling efficient search in ill-conditioned, non-separable, and rugged landscapes. CMA-ES provides robust, invariance-rich performance across diverse application domains including machine learning hyperparameter optimization, engineering design, and the rapid illumination of behavioral spaces.

1. The CMA-ES Algorithmic Framework

At iteration $t$, CMA-ES maintains (i) a mean vector $m^t \in \mathbb{R}^n$, (ii) a positive-definite covariance matrix $C^t \in \mathbb{R}^{n \times n}$, and (iii) a global step-size $\sigma^t > 0$. In each generation, $\lambda$ candidate solutions are sampled as

$x_i = m^t + \sigma^t \mathcal{N}(0, C^t), \qquad i = 1, \dots, \lambda.$

After evaluating $f(x_i)$ for all $i$, the top $\mu$ solutions, sorted by fitness, form weighted steps

$y_{i:\lambda} = (x_{i:\lambda} - m^t)/\sigma^t,$

where $x_{i:\lambda}$ denotes the $i$-th best solution. The mean update uses weighted recombination:

$m^{t+1} = m^t + \sigma^t \sum_{i=1}^{\mu} w_i \, y_{i:\lambda}, \qquad w_1 \geq \dots \geq w_\mu > 0, \quad \textstyle\sum_{i=1}^{\mu} w_i = 1.$

Covariance adaptation is performed via

$C^{t+1} = (1 - c_1 - c_\mu)\, C^t + c_1\, p_c^{t+1} \bigl(p_c^{t+1}\bigr)^{\top} + c_\mu \sum_{i=1}^{\mu} w_i \, y_{i:\lambda}\, y_{i:\lambda}^{\top}.$

The “evolution path” $p_c$ cumulates successful steps to capture intergenerational correlations. Cumulative step-size adaptation (CSA) modulates $\sigma^t$ using a second path $p_\sigma$, which controls the global search radius relative to the length of random walks in isotropic normal space. Default settings for population and learning rates ensure robust performance across a wide range of problems (Hansen, 2016).
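The sampling, selection, mean, and step-size updates above can be condensed into a short NumPy sketch. This is a simplified illustration, not a reference implementation: it keeps only the rank-$\mu$ covariance update and CSA (the rank-one path term is omitted), and the sphere objective is a stand-in test function:

```python
import numpy as np

def cma_es_sketch(f, m, sigma, iters=300, seed=0):
    """Minimal CMA-ES loop: rank-mu covariance update plus cumulative
    step-size adaptation (the rank-one path term is omitted for brevity)."""
    rng = np.random.default_rng(seed)
    n = len(m)
    lam = 4 + int(3 * np.log(n))           # offspring count
    mu = lam // 2                          # parents kept after selection
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                           # positive recombination weights
    mu_eff = 1.0 / np.sum(w ** 2)
    c_sigma = (mu_eff + 2) / (n + mu_eff + 5)
    d_sigma = 1 + c_sigma                  # simplified damping
    c_mu = min(1.0, mu_eff / n ** 2)       # rank-mu learning rate
    C = np.eye(n)
    p_sigma = np.zeros(n)
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))
    for _ in range(iters):
        A = np.linalg.cholesky(C)          # A @ A.T == C
        y = rng.standard_normal((lam, n)) @ A.T   # rows ~ N(0, C)
        x = m + sigma * y
        order = np.argsort([f(xi) for xi in x])
        y_sel = y[order[:mu]]
        y_w = w @ y_sel                    # weighted mean step
        m = m + sigma * y_w
        # CSA: cumulate the whitened step (inverse Cholesky as whitener)
        p_sigma = (1 - c_sigma) * p_sigma + np.sqrt(
            c_sigma * (2 - c_sigma) * mu_eff) * (np.linalg.inv(A) @ y_w)
        sigma *= np.exp(c_sigma / d_sigma * (np.linalg.norm(p_sigma) / chi_n - 1))
        # rank-mu covariance update
        C = (1 - c_mu) * C + c_mu * sum(
            wi * np.outer(yi, yi) for wi, yi in zip(w, y_sel))
    return m, sigma

sphere = lambda x: float(np.sum(x ** 2))
m_final, sigma_final = cma_es_sketch(sphere, m=np.ones(5), sigma=0.5)
```

With the rank-one term and the full damping heuristics of (Hansen, 2016) restored, this loop becomes the standard algorithm.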

Typical parameter settings:

Parameter Default (dimension $n$) Role
$\lambda$ $4 + \lfloor 3 \ln n \rfloor$ Offspring count
$\mu$ $\lfloor \lambda/2 \rfloor$ Number recombined
$c_1$ $\approx 2/n^2$ Rank-one covariance update
$c_\mu$ $\approx \mu_{\mathrm{eff}}/n^2$ Rank-$\mu$ update
$c_c$ $\approx 4/n$ Cumulation for $p_c$
$c_\sigma$ $\approx (\mu_{\mathrm{eff}} + 2)/(n + \mu_{\mathrm{eff}} + 5)$ Step-size path
$d_\sigma$ $\approx 1 + c_\sigma$ Damping for $\sigma$

where $\mu_{\mathrm{eff}} = \bigl(\sum_{i=1}^{\mu} w_i^2\bigr)^{-1}$ (Hansen, 2016).
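These defaults can be computed mechanically from the dimension. The helper below implements the approximate textbook formulas from (Hansen, 2016); the exact constants may differ slightly between library implementations:

```python
import numpy as np

def cma_defaults(n):
    """Approximate default CMA-ES parameters as functions of dimension n."""
    lam = 4 + int(3 * np.log(n))                       # offspring count
    mu = lam // 2                                      # number recombined
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                                       # recombination weights
    mu_eff = 1.0 / np.sum(w ** 2)                      # variance-effective mass
    c_c = (4 + mu_eff / n) / (n + 4 + 2 * mu_eff / n)  # cumulation for p_c
    c_sigma = (mu_eff + 2) / (n + mu_eff + 5)          # step-size path
    c_1 = 2 / ((n + 1.3) ** 2 + mu_eff)                # rank-one rate
    c_mu = min(1 - c_1,                                # rank-mu rate
               2 * (mu_eff - 2 + 1 / mu_eff) / ((n + 2) ** 2 + mu_eff))
    d_sigma = 1 + 2 * max(0.0, np.sqrt((mu_eff - 1) / (n + 1)) - 1) + c_sigma
    return dict(lam=lam, mu=mu, mu_eff=mu_eff, c_c=c_c,
                c_sigma=c_sigma, c_1=c_1, c_mu=c_mu, d_sigma=d_sigma)

p = cma_defaults(10)
```

Note how all rates shrink roughly like $1/n$ or $1/n^2$, which is what keeps the covariance update stable as the dimension grows.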

2. Information-Geometric Foundations and Theoretical Guarantees

CMA-ES is rigorously grounded in information geometry as a stochastic natural-gradient ascent on the manifold of multivariate normal distributions equipped with the Fisher metric. Let $\theta = (m, C)$ parameterize the search distribution $p_\theta = \mathcal{N}(m, C)$. The Fisher information matrix $F(\theta)$ provides a local metric, and steepest ascent is given by the natural gradient $\tilde{\nabla} = F(\theta)^{-1} \nabla$.

Under the canonical parameterization $\theta = (m, C)$, the natural-gradient direction for each sample decomposes as

$\tilde{\nabla}_m \ln p_\theta(x) = x - m, \qquad \tilde{\nabla}_C \ln p_\theta(x) = (x - m)(x - m)^{\top} - C.$

The CMA-ES update rules for $m$ and $C$ exactly implement a gradient step along this direction with respect to rank-based or fitness-based weights. Importantly, explicit inversion of the Fisher matrix is avoided, yielding a practical, efficient algorithm.
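The decomposition can be checked by Monte Carlo: with uniform weights both natural-gradient components average to zero, while rank-based truncation weights on a quadratic objective tilt the mean component downhill. The snippet below is an illustrative check, not part of the algorithm itself:

```python
import numpy as np

# Per-sample natural-gradient contributions for N(m, C):
#   mean component:       x - m
#   covariance component: (x - m)(x - m)^T - C
rng = np.random.default_rng(1)
n, N = 3, 200_000
m, C = np.full(n, 2.0), np.eye(n)
x = rng.multivariate_normal(m, C, size=N)
d = x - m

# Uniform weights: both components average to ~0 (no selection pressure).
g_m_uniform = d.mean(axis=0)
g_C_uniform = (d[:, :, None] * d[:, None, :]).mean(axis=0) - C

# Rank-based truncation weights on f(x) = ||x||^2: the best half gets
# weight 2/N, the rest 0; the mean component now points toward the origin.
order = np.argsort((x ** 2).sum(axis=1))
w = np.zeros(N)
w[order[:N // 2]] = 2.0 / N
g_m_ranked = w @ d
```

Every coordinate of `g_m_ranked` comes out negative here because the optimum (the origin) lies at $-m$ relative to the current mean $m = (2, 2, 2)$.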

An admissible range of learning rates guaranteeing monotonic improvement of a lower bound on the expected fitness has been derived, with improvement assured for any step-size in that interval. The constant default learning rates of CMA-ES are justified because they empirically fall within this range, and the monotonicity phenomenon is robust to reparameterization of the covariance (Akimoto et al., 2012).

CMA-ES can be viewed through the lens of information-geometric expectation-maximization (EM): standard EM alternates an e-projection (“E-step”) with an m-projection (“M-step”), whereas CMA-ES replaces the full M-step with a single natural-gradient step, effecting a partial projection and thus smoothing progress (Akimoto et al., 2012, Hamano et al., 2024).

3. Algorithmic Structures, Extensions, and Scalability

CMA-ES has inspired a family of algorithmic developments addressing scalability and structure:

  • Limited-memory CMA-ES (LM-CMA-ES): Stores a small set of $m \ll n$ direction vectors to reconstruct Cholesky factors on the fly, reducing both memory and per-sample runtime from $O(n^2)$ to $O(mn)$, at the cost of a slight reduction in covariance learning fidelity for very high-dimensional problems (Loshchilov, 2014).
  • Matrix-free CMA-ES (MF-CMA-ES): Eliminates even implicit matrix representations, instead maintaining a rolling archive of past normalized steps. Offspring are sampled as randomized linear combinations of these archive vectors, guaranteeing the exact CMA-ES search distribution in theory and yielding similar or superior performance with reduced computational overhead for large-scale optimization (Arabas et al., 31 Dec 2025).
Variant Memory Typical max $n$ Main bottleneck
Full CMA-ES $O(n^2)$ $\sim 10^3$ Eigendecomposition
LM-CMA-ES $O(mn)$ $\sim 10^4$–$10^5$ Archive operations
MF-CMA-ES $O(mn)$ $\sim 10^4$–$10^5$ Archive assembly
  • Diagonal acceleration (dd-CMA-ES): Explicitly factors the covariance as $D\,C\,D$ with diagonal $D$ and adaptively updates $D$ at a higher rate, exploiting problem separability while retaining the ability to adapt full correlations. The diagonal's learning rate is adaptively damped based on $C$'s condition number, ensuring stability and an “overadditive” speedup on hybrid landscapes (Akimoto et al., 2019).
  • Limited-memory MA-ES (LM-MA-ES): Replaces the covariance matrix with a transformation matrix $M$ (of dimension $n \times n$), but samples and updates only a small set of momentum vectors, achieving $O(n \log n)$ storage and per-sample cost (Loshchilov et al., 2017).
  • PCA-assisted CMA-ES: Periodically performs principal component analysis on elite samples to project search into lower-dimensional subspaces, thus accelerating convergence and noise-filtering in high-dimensional, low-intrinsic-dimension problems (Mei et al., 2021).
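The PCA step of the last variant can be sketched as follows. This is an illustrative reconstruction, not the authors' exact procedure: the synthetic elite set and the `project` helper are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 3
# Synthetic "elite" samples whose variation is concentrated in a
# k-dimensional subspace, plus a little full-dimensional noise.
basis = np.linalg.qr(rng.standard_normal((n, k)))[0]        # n x k orthonormal
elites = (rng.standard_normal((40, k)) @ basis.T
          + 0.001 * rng.standard_normal((40, n)))

# PCA via SVD of the centered elite set; keep the top-k directions.
center = elites.mean(axis=0)
U, S, Vt = np.linalg.svd(elites - center, full_matrices=False)
components = Vt[:k]                     # k x n principal directions

def project(x):
    """Map a point into the elite subspace and back to R^n (hypothetical
    helper illustrating the subspace restriction of the search)."""
    z = components @ (x - center)       # k-dimensional coordinates
    return center + components.T @ z

x = rng.standard_normal(n)
x_proj = project(x)
```

Sampling and adaptation can then proceed in the $k$-dimensional coordinates `z`, which is where the acceleration and noise filtering come from when the intrinsic dimension is low.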

4. Hyperparameter Adaptation, Noise Handling, and Variants

While CMA-ES is “almost parameterless” by construction, research demonstrates substantial gains from adaptive control of learning rates and other hyperparameters:

  • Online adaptation (self-CMA-ES): An auxiliary CMA-ES runs in tandem to adapt the primary's learning rates $c_1$, $c_\mu$, and $c_c$ by maximizing a rank-based likelihood surrogate, yielding consistently faster covariance learning in large-population or difficult scenarios (Loshchilov et al., 2014).
  • Learning-rate adaptation (LRA-CMA-ES): The covariance and mean update strengths are modulated online to maintain a constant signal-to-noise ratio in the Fisher metric, computed using blockwise moving averages in local coordinates. This enables robust performance with the default population size $\lambda$ on multimodal and noisy landscapes, outperforming fixed-rate and population-size-adapted variants, especially under strong noise (Nomura et al., 2024, Nomura et al., 2023).
  • Noise handling: Adaptive re-evaluation strategies estimate per-generation the optimal number of resamplings to balance expected improvement against cost, outmatching fixed and static schemes in high-noise regimes (Dinu et al., 2024). Classical approaches include larger population sizes (PSA-CMA-ES), uncertainty handling via periodic reranking (UH-CMA-ES), and step-size damping.
  • Marginal and discrete extensions: Discrete CMA-ES replaces the Gaussian by exponential-family binomial or Ising-type models, using analogous natural-gradient updates on interaction parameters, supporting pairwise or higher-order interactions (Benhamou et al., 2018). The (1+1)-CMA-ES with margin addresses mixed-integer domains by margin-corrected sampling, ensuring exploration of discrete variables and preventing premature collapse (Watanabe et al., 2023).
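The benefit of re-evaluation is easy to illustrate: averaging $r$ noisy evaluations reduces the effective noise standard deviation by a factor $\sqrt{r}$, and with it the probability of misranking two close candidates. The toy snippet below uses a simple fixed-$r$ baseline for illustration, not the adaptive scheme of (Dinu et al., 2024):

```python
import numpy as np

rng = np.random.default_rng(3)

def noisy_f(x, noise=1.0):
    """Sphere function corrupted by additive Gaussian evaluation noise."""
    return float(np.sum(x ** 2)) + noise * rng.standard_normal()

def mean_eval(x, r):
    """Average of r independent noisy evaluations."""
    return np.mean([noisy_f(x) for _ in range(r)])

# Two close candidates: f(a) = 1.0 < f(b) = 1.21, gap well below noise std.
a, b = np.array([1.0, 0.0]), np.array([1.1, 0.0])

def misrank_rate(r, trials=2000):
    """Empirical probability that averaging r samples ranks b above a."""
    errs = sum(mean_eval(a, r) > mean_eval(b, r) for _ in range(trials))
    return errs / trials

rate_1, rate_25 = misrank_rate(1), misrank_rate(25)
```

With a single evaluation the two candidates are nearly indistinguishable; with $r = 25$ the misranking rate drops substantially, which is exactly the trade-off (accuracy versus evaluation budget) that adaptive re-evaluation schemes optimize per generation.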

5. Theoretical Analysis and Information-Geometric Perspective

Theoretically, the rank-$\mu$ CMA-ES update is rigorously a natural-gradient step for the expected-fitness functional over the manifold of multivariate normal distributions with the Fisher metric; the rank-one evolution-path term has been derived from a maximum-a-posteriori IGO (MAP-IGO) approach, in which a suitable normal-inverse-Wishart prior introduces a bias along the cumulative direction, embodying a type of momentum (Akimoto et al., 2012, Hamano et al., 2024).

The invariance properties—monotonic transformation, rotation, scaling—directly result from parameterization and update structure:

  • Only the ranking of fitness values influences recombination.
  • Rotations (orthogonal transformations) leave the algorithm invariant due to adaptation in the data rather than fixed axes.
  • Variable scales are handled automatically by covariance adaptation, and any strictly monotonic transformation of $f$ leaves the optimization trajectory unchanged (Loshchilov et al., 2016).
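The first invariance can be verified directly: because selection uses only ranks, any strictly increasing transformation of the fitness values yields an identical ordering of candidates:

```python
import numpy as np

rng = np.random.default_rng(4)
f_vals = rng.standard_normal(20) ** 2    # nonnegative raw fitness values
g = lambda v: np.exp(v) + v ** 3         # strictly increasing on [0, inf)

# Rank-based selection sees the same ordering before and after transforming f,
# so the selected parents (and the entire trajectory) are unchanged.
ranks_f = np.argsort(f_vals)
ranks_g = np.argsort(g(f_vals))
```

Any strictly increasing `g` would do here; the particular choice is arbitrary.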

The range of admissible natural-gradient step-sizes for monotonic improvement on the expected fitness lower bound, as established in (Akimoto et al., 2012), rationalizes the empirical success of default constant step-sizes and explains observed monotonicity on practical test functions.

6. Applications, Extensions, and Impact

CMA-ES has achieved empirical and practical dominance in diverse deployment scenarios:

  • Hyperparameter optimization: For deep neural networks, CMA-ES enables efficient, parallel search across high-dimensional, ill-conditioned spaces, outperforming state-of-the-art Bayesian optimizers (e.g., Gaussian-process-based) when parallel resources are available (Loshchilov et al., 2016).
  • Quality diversity (QD) and illumination: CMA-ES mechanisms for self-adaptation, niching, and archiving have been combined into hybrid algorithms (e.g., CMA-ME), attaining both high coverage (“illumination”) across behavior spaces and strong per-niche fitness (Fontaine et al., 2019).
  • Surrogate-assisted optimization: Integrating local, dynamically built radial basis function models (CMA-SAO) can substantially reduce the number of function evaluations while preserving convergence, especially on expensive or high-dimensional test problems (Khouzani et al., 22 May 2025).
  • Large-scale and limited-memory optimization: LM-CMA-ES, LM-MA-ES, and matrix-free approaches enable application of covariance-adapting evolution strategies to problems with very high dimension $n$, where full covariance storage and update are infeasible (Loshchilov, 2014, Loshchilov et al., 2017, Arabas et al., 31 Dec 2025).

7. Limitations, Open Theoretical Questions, and Future Directions

  • Discrete and mixed-integer optimization: While continuous CMA-ES adapts naturally to most continuous problems, extension to discrete, combinatorial, or mixed domains requires nontrivial model changes and adaptive encoding, with associated computational overhead (Benhamou et al., 2018, Watanabe et al., 2023).
  • Rank-one vs. rank-$\mu$ interpretation: Information-geometric derivations for the rank-$\mu$ update are mature, but only recently has the rank-one term been justified as a natural-gradient step under a prior, leading to novel momentum corrections in the mean update (Hamano et al., 2024).
  • Learning rate and population-size tuning: While learning-rate adaptation methods mitigate the need for expensive grid-search or manual selection, parameterization and stability near singular matrices or vanishing SNR remain delicate (Loshchilov et al., 2014, Nomura et al., 2023).
  • Scalability and constraint handling: While limited-memory and matrix-free variants address scaling, constraint-handling and noise-robustness in black-box contexts remain areas of active development.

Ongoing research in algorithmic variants, information geometry, and scalable statistics continues to extend the reach and theoretical guarantees of CMA-ES within optimization and machine learning. The unification of evolutionary adaptations with natural-gradient learning presents a powerful design pattern applicable far beyond any specific instance (Akimoto et al., 2012, Hamano et al., 2024).
