CMA-ES: Covariance Matrix Adaptation
- CMA-ES is a stochastic, model-based evolutionary algorithm that adapts a multivariate normal distribution to efficiently explore ill-conditioned, non-separable optimization landscapes.
- Its iterative updates use natural-gradient steps and cumulative evolution paths to adjust the mean, covariance, and step-size, ensuring invariance under rotation and scaling.
- Extensions such as LM-CMA-ES, MF-CMA-ES, and PCA-assisted CMA-ES enhance scalability and efficiency for high-dimensional and noisy optimization problems.
Covariance Matrix Adaptation Evolution Strategy (CMA-ES) is a stochastic, model-based evolutionary algorithm for unconstrained optimization of non-linear, non-convex functions in continuous domains. The defining characteristic of CMA-ES is its adaptive search distribution: a multivariate normal whose covariance matrix is iteratively updated to capture the objective’s underlying structure, enabling efficient search in ill-conditioned, non-separable, and rugged landscapes. CMA-ES provides robust, invariance-rich performance across diverse application domains including machine learning hyperparameter optimization, engineering design, and the rapid illumination of behavioral spaces.
1. The CMA-ES Algorithmic Framework
At iteration $t$, CMA-ES maintains (i) a mean vector $m_t \in \mathbb{R}^n$, (ii) a positive-definite covariance matrix $C_t$, and (iii) a global step-size $\sigma_t > 0$. The $\lambda$ candidate solutions are sampled as

$$x_i = m_t + \sigma_t\, y_i, \qquad y_i \sim \mathcal{N}(0, C_t), \quad i = 1, \dots, \lambda.$$
After evaluating $f(x_i)$ for all $i$, the top $\mu$ solutions, sorted by fitness, form weighted steps $y_{i:\lambda} = (x_{i:\lambda} - m_t)/\sigma_t$, where $x_{i:\lambda}$ is the $i$-th best. The mean update uses recombination:

$$m_{t+1} = m_t + \sigma_t \sum_{i=1}^{\mu} w_i\, y_{i:\lambda}.$$
Covariance adaptation is performed via

$$C_{t+1} = (1 - c_1 - c_\mu)\, C_t + c_1\, p_c\, p_c^\top + c_\mu \sum_{i=1}^{\mu} w_i\, y_{i:\lambda}\, y_{i:\lambda}^\top.$$

The “evolution path”

$$p_c \leftarrow (1 - c_c)\, p_c + \sqrt{c_c (2 - c_c)\, \mu_{\mathrm{eff}}}\; \frac{m_{t+1} - m_t}{\sigma_t}$$

cumulates successful steps to capture intergenerational correlations. Cumulative step-size adaptation (CSA) modulates $\sigma_t$ using a second path $p_\sigma$, which controls the global search radius by comparing the path’s length to that of a random walk in isotropic normal space:

$$\sigma_{t+1} = \sigma_t \exp\!\left(\frac{c_\sigma}{d_\sigma}\left(\frac{\|p_\sigma\|}{\mathbb{E}\|\mathcal{N}(0, I)\|} - 1\right)\right).$$

Default settings for population and learning rates ensure robust performance across a wide range of problems (Hansen, 2016).
Typical parameter settings:
| Parameter | Default (dimension $n$) | Role |
|---|---|---|
| $\lambda$ | $4 + \lfloor 3 \ln n \rfloor$ | Offspring count |
| $\mu$ | $\lfloor \lambda/2 \rfloor$ | Number recombined |
| $c_1$ | $\approx 2/n^2$ | Rank-one covariance update |
| $c_\mu$ | $\approx \mu_{\mathrm{eff}}/n^2$ | Rank-$\mu$ update |
| $c_c$ | $\approx 4/n$ | Cumulation for $p_c$ |
| $c_\sigma$ | $\approx \mu_{\mathrm{eff}}/n$ | Step-size path |
| $d_\sigma$ | $\approx 1$ | Damping for $\sigma$ |

where $\mu_{\mathrm{eff}} = \left(\sum_i w_i\right)^2 / \sum_i w_i^2$ (Hansen, 2016).
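The loop above can be condensed into a short NumPy sketch. This is a simplified illustration, not Hansen's reference implementation: it keeps only the rank-$\mu$ covariance update and CSA, drops the rank-one path and the usual safeguards, and uses rough learning-rate choices:

```python
import numpy as np

def cma_es_sketch(f, m, sigma, iterations=300, seed=0):
    """Simplified CMA-ES: sampling, rank-based recombination,
    rank-mu covariance update, and cumulative step-size adaptation.
    (Rank-one path and safeguards of the full algorithm omitted.)"""
    rng = np.random.default_rng(seed)
    n = len(m)
    lam = 4 + int(3 * np.log(n))               # offspring count
    mu = lam // 2                              # parents recombined
    w = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w /= w.sum()                               # positive recombination weights
    mu_eff = 1.0 / np.sum(w ** 2)
    c_sigma = (mu_eff + 2) / (n + mu_eff + 5)  # step-size path rate
    d_sigma = 1 + c_sigma                      # damping (simplified)
    c_mu = min(1.0, mu_eff / n ** 2)           # rank-mu rate (simplified)
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))  # E||N(0,I)||
    C = np.eye(n)
    p_sigma = np.zeros(n)
    for _ in range(iterations):
        A = np.linalg.cholesky(C)              # C = A A^T
        z = rng.standard_normal((lam, n))
        y = z @ A.T                            # y_i ~ N(0, C)
        x = m + sigma * y
        order = np.argsort([f(xi) for xi in x])     # minimization
        y_sel = y[order[:mu]]
        step = w @ y_sel                       # weighted mean of best steps
        m = m + sigma * step
        # CSA: whiten the mean shift and compare path length to chi_n
        p_sigma = (1 - c_sigma) * p_sigma + \
            np.sqrt(c_sigma * (2 - c_sigma) * mu_eff) * np.linalg.solve(A, step)
        sigma *= np.exp((c_sigma / d_sigma) * (np.linalg.norm(p_sigma) / chi_n - 1))
        # rank-mu covariance update: sum_i w_i y_i y_i^T
        C = (1 - c_mu) * C + c_mu * (y_sel.T * w) @ y_sel
    return m
```

On the 5-D sphere function, for example, `cma_es_sketch(lambda x: float(np.sum(x**2)), np.full(5, 3.0), 1.0)` converges to the origin well within the default budget.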
2. Information-Geometric Foundations and Theoretical Guarantees
CMA-ES is rigorously grounded in information geometry as a stochastic natural gradient ascent on the manifold of multivariate normals, equipped with the Fisher metric. Let $\theta = (m, C)$ parameterize $p_\theta = \mathcal{N}(m, C)$ and let $J(\theta)$ denote the (rank-weighted) expected fitness. The Fisher information matrix $F(\theta)$ provides a local metric, and steepest ascent is given by the natural gradient $\tilde{\nabla} J = F(\theta)^{-1} \nabla_\theta J$.
Under the canonical parameterization $(m, C)$, the natural gradient direction contributed by each sample $x$ decomposes as

$$\tilde{\nabla}_m = x - m, \qquad \tilde{\nabla}_C = (x - m)(x - m)^\top - C.$$
The CMA-ES update rules for 1 and 2 exactly implement a gradient step along this direction with respect to rank-based or fitness-based weights. Importantly, explicit inversion of the Fisher matrix is avoided, yielding practical, efficient algorithms.
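This decomposition can be checked numerically. The sketch below assumes a linear fitness $f(x) = c^\top x$ (chosen because the expected mean direction then has the closed form $Cc$, the vanilla gradient preconditioned by the covariance) and compares the Monte Carlo average of fitness-weighted per-sample directions against it:

```python
import numpy as np

# Monte Carlo illustration: for linear fitness f(x) = c @ x, the
# fitness-weighted average of the per-sample mean direction (x - m)
# approaches C @ c -- the vanilla gradient preconditioned by the
# covariance, obtained without any Fisher-matrix inversion.
rng = np.random.default_rng(1)
n, N = 3, 200_000
m = np.array([1.0, -0.5, 2.0])
B = rng.standard_normal((n, n))
C = B @ B.T + n * np.eye(n)          # a generic SPD covariance
c = np.array([0.5, -1.0, 2.0])       # linear fitness coefficients

x = rng.multivariate_normal(m, C, size=N)
estimate = np.mean((x @ c)[:, None] * (x - m), axis=0)
rel_err = np.linalg.norm(estimate - C @ c) / np.linalg.norm(C @ c)
```

With $2 \times 10^5$ samples the relative error of the estimate is typically around a percent.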
Akimoto et al. (2012) derive an explicit interval of admissible learning rates for which each natural-gradient step monotonically improves a lower bound on the expected fitness. The constant default learning rates of CMA-ES are justified by empirically falling within this interval, and the monotonicity phenomenon is robust to reparameterization of the covariance (Akimoto et al., 2012).
CMA-ES can be viewed through the lens of information-geometric expectation-maximization (EM): standard EM alternates an $e$-projection (“E-step”) and an $m$-projection (“M-step”), whereas CMA-ES replaces the full M-step with a single natural gradient step, effecting a partial projection and thus smoothing progress (Akimoto et al., 2012, Hamano et al., 2024).
3. Algorithmic Structures, Extensions, and Scalability
CMA-ES has inspired a family of algorithmic developments addressing scalability and structure:
- Limited-memory CMA-ES (LM-CMA-ES): Stores a small number $m \ll n$ of direction vectors from which products with a Cholesky factor of the covariance are reconstructed on the fly, reducing memory and per-sample runtime from $O(n^2)$ to $O(mn)$, at the cost of a slight reduction in covariance learning fidelity for very high-dimensional problems (Loshchilov, 2014).
- Matrix-free CMA-ES (MF-CMA-ES): Eliminates even implicit matrix representations, instead maintaining a rolling archive of past normalized steps. Offspring are sampled as randomized linear combinations of these archive vectors, guaranteeing the exact CMA-ES search distribution in theory and yielding similar or superior performance with reduced computational overhead for large-scale optimization (Arabas et al., 31 Dec 2025).
| Variant | Memory | Typical Max $n$ | Main Bottleneck |
|---|---|---|---|
| Full CMA-ES | $O(n^2)$ | $\sim 10^3$ | $O(n^3)$ eigendecomposition |
| LM-CMA-ES | $O(mn)$ | $\sim 10^4$–$10^6$ | Archive operations |
| MF-CMA-ES | $O(mn)$ | $\sim 10^4$–$10^6$ | Archive assembly |
- Diagonal acceleration (dd-CMA-ES): Explicitly factors the covariance as $C = D\,C_{\mathrm{corr}}\,D$ with $D$ diagonal, and adaptively updates $D$ at a higher rate, exploiting problem separability while retaining the ability to adapt full correlations. The diagonal’s learning rate is adaptively damped based on the condition number of $C_{\mathrm{corr}}$, ensuring stability and “overadditive” speedup on hybrid landscapes (Akimoto et al., 2019).
- Limited-memory MA-ES (LM-MA-ES): Replaces the covariance matrix with a transformation matrix $M$ (so that samples are drawn as $x = m_t + \sigma_t M z$), but represents and updates only a small set of momentum vectors, achieving $O(mn)$ storage and per-sample cost (Loshchilov et al., 2017).
- PCA-assisted CMA-ES: Periodically performs principal component analysis on elite samples to project search into lower-dimensional subspaces, thus accelerating convergence and noise-filtering in high-dimensional, low-intrinsic-dimension problems (Mei et al., 2021).
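The common trick behind the limited-memory variants, applying a factor of the covariance through a few stored vector pairs instead of an $n \times n$ matrix, can be sketched as follows. The coefficients `a` and `b` here are illustrative placeholders; the actual algorithms derive them from the learning rates and stored evolution paths:

```python
import numpy as np

def apply_implicit_factor(z, archive, a):
    """Apply A = M_k ... M_1 to z, where each M_j = a*I + b_j * outer(p_j, v_j)
    is represented only by its vector pair: O(m*n) time and memory,
    never forming an n x n matrix (the idea behind LM-CMA-ES sampling)."""
    y = np.array(z, dtype=float)
    for p, v, b in archive:            # oldest stored pair first
        y = a * y + b * p * (v @ y)
    return y

# Correctness check against the explicitly assembled matrix.
rng = np.random.default_rng(4)
n, n_pairs, a = 6, 3, 0.9
archive = [(rng.standard_normal(n), rng.standard_normal(n), 0.1)
           for _ in range(n_pairs)]
z = rng.standard_normal(n)

A = np.eye(n)
for p, v, b in archive:                # assemble M_k ... M_1 explicitly
    A = (a * np.eye(n) + b * np.outer(p, v)) @ A
```

`apply_implicit_factor(z, archive, a)` and `A @ z` agree to machine precision, while the implicit form costs $O(mn)$ instead of $O(n^2)$ per sample.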
4. Hyperparameter Adaptation, Noise Handling, and Variants
While CMA-ES is “almost parameterless” by construction, research demonstrates substantial gains from adaptive control of learning rates and other hyperparameters:
- Online adaptation (self-CMA-ES): An auxiliary CMA-ES runs in tandem to adapt the primary instance’s covariance and cumulation learning rates by maximizing a rank-based likelihood surrogate, yielding consistently faster covariance learning in large-population or difficult scenarios (Loshchilov et al., 2014).
- Learning-rate adaptation (LRA-CMA-ES): The covariance and mean update strengths are modulated online to maintain a constant signal-to-noise ratio under the Fisher metric, computed using blockwise moving averages in local coordinates. This enables robust performance with default population size on multimodal and noisy landscapes, outperforming fixed-rate and population-size-adapted variants, especially under strong noise (Nomura et al., 2024, Nomura et al., 2023).
- Noise handling: Adaptive re-evaluation strategies estimate per-generation the optimal number of resamplings to balance expected improvement against cost, outmatching fixed and static schemes in high-noise regimes (Dinu et al., 2024). Classical approaches include larger population sizes (PSA-CMA-ES), uncertainty handling via periodic reranking (UH-CMA-ES), and step-size damping.
- Marginal and discrete extensions: Discrete CMA-ES replaces the Gaussian by exponential-family binomial or Ising-type models, using analogous natural-gradient updates on interaction parameters, supporting pairwise or higher-order interactions (Benhamou et al., 2018). The (1+1)-CMA-ES with margin addresses mixed-integer domains by margin-corrected sampling, ensuring exploration of discrete variables and preventing premature collapse (Watanabe et al., 2023).
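Since CMA-ES consumes only rankings, re-evaluation pays off exactly insofar as it restores correct pairwise comparisons between close candidates. A toy illustration (noisy sphere with an assumed unit-variance Gaussian noise, not any specific scheme from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_f(x, k, noise_std=1.0):
    """Average k noisy evaluations; the noise std shrinks by sqrt(k)."""
    return float(np.mean(np.sum(x ** 2) + noise_std * rng.standard_normal(k)))

def correct_rank_rate(k, trials=2000):
    """Fraction of pairwise comparisons ranking two close candidates
    correctly -- the quantity rank-based selection actually depends on."""
    a, b = np.array([1.0, 0.0]), np.array([1.1, 0.0])  # f(a)=1.00 < f(b)=1.21
    hits = sum(noisy_f(a, k) < noisy_f(b, k) for _ in range(trials))
    return hits / trials
```

With a single evaluation the comparison is barely better than a coin flip; averaging 16 re-evaluations raises the correct-ranking rate substantially, which is the improvement adaptive resampling schemes trade against evaluation cost.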
5. Theoretical Analysis and Information-Geometric Perspective
Theoretically, the rank-$\mu$ CMA-ES update is rigorously a natural-gradient step for the expected fitness functional over the manifold of multivariate normal distributions equipped with the Fisher metric; the rank-one evolution-path term has been derived from a maximum-a-posteriori IGO (MAP-IGO) approach, where a suitable normal-inverse-Wishart prior introduces a bias along the cumulative direction, embodying a type of momentum (Akimoto et al., 2012, Hamano et al., 2024).
The invariance properties—monotonic transformation, rotation, scaling—directly result from parameterization and update structure:
- Only the ranking of fitness values influences recombination.
- Rotations (orthogonal transformations) leave the algorithm invariant due to adaptation in the data rather than fixed axes.
- Variable scales are handled automatically by covariance and step-size adaptation, and any monotone transformation of $f$ leaves the optimization trajectory unchanged (Loshchilov et al., 2016).
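The first invariance can be demonstrated directly: apply a strictly increasing transform to the fitness values and observe that the ranking, the only quantity entering recombination, is unchanged:

```python
import numpy as np

# Rank-based selection depends only on the ordering of fitness values,
# so any strictly increasing transform g leaves the selected parents,
# their ranks, and hence the whole CMA-ES trajectory unchanged.
rng = np.random.default_rng(3)
f_vals = rng.standard_normal(10)          # fitness of 10 candidates
g_vals = np.exp(3.0 * f_vals) - 5.0       # g(f) = exp(3f) - 5, monotone

order_f = np.argsort(f_vals)
order_g = np.argsort(g_vals)
```

The two orderings are identical, so every rank-based quantity in the update is too.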
The range of admissible natural-gradient step-sizes for monotonic improvement on the expected fitness lower bound, as established in (Akimoto et al., 2012), rationalizes the empirical success of default constant step-sizes and explains observed monotonicity on practical test functions.
6. Applications, Extensions, and Impact
CMA-ES has achieved empirical and practical dominance in diverse deployment scenarios:
- Hyperparameter optimization: For deep neural networks, CMA-ES enables efficient, parallel search across high-dimensional, ill-conditioned spaces, outperforming state-of-the-art Bayesian optimizers (e.g., Gaussian-process-based) when parallel resources are available (Loshchilov et al., 2016).
- Quality diversity (QD) and illumination: CMA-ES mechanisms for self-adaptation, niching, and archiving have been combined into hybrid algorithms (e.g., CMA-ME), attaining both high coverage (“illumination”) across behavior spaces and strong per-niche fitness (Fontaine et al., 2019).
- Surrogate-assisted optimization: Integrating local, dynamically built radial basis function models (CMA-SAO) can substantially reduce the number of function evaluations while preserving convergence, especially on expensive or high-dimensional test problems (Khouzani et al., 22 May 2025).
- Large-scale and limited-memory optimization: LM-CMA-ES, LM-MA-ES, and matrix-free approaches enable application of covariance-adapting evolution strategies to problems whose dimensionality makes full covariance storage and update infeasible (Loshchilov, 2014, Loshchilov et al., 2017, Arabas et al., 31 Dec 2025).
7. Limitations, Open Theoretical Questions, and Future Directions
- Discrete and mixed-integer optimization: While continuous CMA-ES adapts naturally to most continuous problems, extension to discrete, combinatorial, or mixed domains requires nontrivial model changes and adaptive encoding, with associated computational overhead (Benhamou et al., 2018, Watanabe et al., 2023).
- Rank-one vs rank-$\mu$ interpretation: Information-geometric derivations for the rank-$\mu$ update are mature, but only recently has the rank-one term been justified as a natural-gradient step under a prior, leading to novel momentum corrections in the mean update (Hamano et al., 2024).
- Learning rate and population-size tuning: While learning-rate adaptation methods mitigate the need for expensive grid-search or manual selection, parameterization and stability near singular matrices or vanishing SNR remain delicate (Loshchilov et al., 2014, Nomura et al., 2023).
- Scalability and constraint handling: While limited-memory and matrix-free variants address scaling, constraint-handling and noise-robustness in black-box contexts remain areas of active development.
Ongoing research in algorithmic variants, information geometry, and scalable statistics continues to extend the reach and theoretical guarantees of CMA-ES within optimization and machine learning. The unification of evolutionary adaptations with natural-gradient learning presents a powerful design pattern applicable far beyond any specific instance (Akimoto et al., 2012, Hamano et al., 2024).