Evolutionary Kernel Search for GPs

Updated 19 April 2026

The paper demonstrates that evolutionary kernel search leverages genetic programming to automatically construct complex, PSD kernels for GPs, enhancing predictive accuracy.
The methodology employs strongly-typed grammars and multi-objective strategies like NSGA-II to balance accuracy, complexity, and computational cost in kernel discovery.
Experimental studies show that evolved kernels can match or surpass fixed kernel methods in applications such as sentiment analysis and time-series forecasting using metrics like RMSE and BIC.

Evolutionary kernel search for Gaussian Processes (GPs) is a methodology that leverages evolutionary algorithms—primarily genetic programming—to automatically discover, compose, and select covariance kernels suited to particular data or tasks. Unlike standard GP modeling, which relies on selecting from a small set of fixed kernel families and performing hyperparameter optimization, evolutionary kernel search explores a vast space of kernel structures, potentially yielding models of increased expressiveness, better predictive accuracy, or reduced complexity. The approach has seen substantial development, with variations in grammar design, evolutionary strategies, and multi-objective evaluation (Roman et al., 2019, Roman et al., 2019, Kronberger et al., 2013).

1. Kernel Representation and Grammar Constraints

Modern evolutionary GP kernel discovery employs strongly-typed grammars to encode both primitive (base) kernels and kernel composition rules. Kernels are represented as expression trees whose nodes correspond to algebraic or functional operators (e.g., sum, product, exponentiation) and whose leaves are parametrized base kernels. Primitives are chosen to guarantee positive semidefiniteness (PSD), typically comprising:

Squared Exponential (SE): $k_{\mathrm{SE}}(x, x') = \theta_0^2\,\exp(-\tfrac{1}{2} r^2)$ , $r = \|x - x'\| / \theta_\ell$
Matérn (e.g., 3/2, 5/2): $k_\nu(x, x') = \theta_0^2\,P(r)\exp(-\alpha r)$
Rational Quadratic (RQ): $k_{\mathrm{RQ}}(x, x') = \theta_0^2\,\left(1 + \frac{r^2}{2\alpha}\right)^{-\alpha}$
Other: Periodic, Linear, Constant, White Noise, γ-exponential (Roman et al., 2019, Kronberger et al., 2013, Roman et al., 2019)

Grammar rules dictate compositions via addition, multiplication, scaling, masking of input dimensions, and in some frameworks, function application (e.g., exponentiation or power), while preserving type safety and PSD status. For example:

$k(x,x') = ( k_{\mathrm{SE}}(x,x') + k_{\mathrm{PER}}(x,x') ) \times k_{\mathrm{RQ}}(x,x')$

This grammar-based approach guarantees all candidate kernels are valid for GP regression (Roman et al., 2019, Kronberger et al., 2013, Roman et al., 2019).
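The expression-tree view can be made concrete with a minimal sketch. It assumes scalar inputs, and the class names (`Leaf`, `Sum`, `Product`), parameter names, and default values are illustrative choices rather than anything taken from the cited papers. The tree built at the end corresponds to the example kernel $(k_{\mathrm{SE}} + k_{\mathrm{PER}}) \times k_{\mathrm{RQ}}$.

```python
import math
from dataclasses import dataclass, field
from typing import Callable

# Base kernels on scalar inputs; parameter names and defaults are illustrative.
def se(x, y, amp=1.0, ell=1.0):
    r = abs(x - y) / ell
    return amp**2 * math.exp(-0.5 * r**2)

def periodic(x, y, amp=1.0, ell=1.0, period=1.0):
    s = math.sin(math.pi * abs(x - y) / period)
    return amp**2 * math.exp(-2.0 * s**2 / ell**2)

def rq(x, y, amp=1.0, ell=1.0, alpha=1.0):
    r = abs(x - y) / ell
    return amp**2 * (1.0 + r**2 / (2.0 * alpha)) ** (-alpha)

@dataclass
class Leaf:
    """A parametrized base kernel sitting at a leaf of the tree."""
    fn: Callable[..., float]
    params: dict = field(default_factory=dict)

    def __call__(self, x, y):
        return self.fn(x, y, **self.params)

@dataclass
class Sum:
    """Sum of two kernels; sums of PSD kernels are PSD."""
    left: Callable[[float, float], float]
    right: Callable[[float, float], float]

    def __call__(self, x, y):
        return self.left(x, y) + self.right(x, y)

@dataclass
class Product:
    """Product of two kernels; products of PSD kernels are PSD."""
    left: Callable[[float, float], float]
    right: Callable[[float, float], float]

    def __call__(self, x, y):
        return self.left(x, y) * self.right(x, y)

# (k_SE + k_PER) * k_RQ, with default hyperparameters
example = Product(Sum(Leaf(se), Leaf(periodic)), Leaf(rq))
value = example(0.0, 0.5)
```

Because every node is itself a kernel, any sum or product of subtrees stays inside the set of valid covariance functions, which is the property the grammar is designed to preserve.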

2. Evolutionary Search Procedures

Genetic programming drives the search by evolving a population of kernel trees over multiple generations, using the following operators and population-management steps:

Crossover: Subtree exchange between two parent kernels, typically under sum or product operators (preserving type and arity).
Mutation: Replacement of a subtree by a newly generated one; insertions, shrinkage, and operator replacement are possible.
Initialization: Typically via a probabilistic “grow” method constrained by grammar.
Population management: Population sizes vary (e.g., $N=38$ in (Roman et al., 2019), $N=141$ in (Roman et al., 2019)), with survivors carried over between generations based on fitness.
Selection: Multi-objective evolutionary strategies such as NSGA-II are used to select survivor subsets and maintain Pareto fronts.
Stagnation and Restart: If no relative improvement above a set threshold is achieved in any objective, populations are re-initialized to avoid premature convergence (Roman et al., 2019, Roman et al., 2019, Kronberger et al., 2013).

Typical settings involve population sizes of roughly 40–150 and 60–140 generations, with crossover and mutation rates tuned to keep the search sufficiently exploratory.
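The variation operators listed above can be sketched on a toy representation. This is a minimal illustration, not the implementation used in the cited work: trees are nested tuples, the leaf and operator sets (`LEAVES`, `OPS`) are assumed, and hyperparameters are left out. Since every subtree has the single type "kernel" here, any subtree swap is automatically type-safe.

```python
import random

LEAVES = ("SE", "PER", "RQ", "LIN")   # assumed primitive set
OPS = ("+", "*")                      # PSD-preserving binary operators

def grow(rng, max_depth, p_leaf=0.3):
    """Probabilistic 'grow' initialisation: stop early with probability p_leaf."""
    if max_depth == 0 or rng.random() < p_leaf:
        return rng.choice(LEAVES)
    return (rng.choice(OPS),
            grow(rng, max_depth - 1, p_leaf),
            grow(rng, max_depth - 1, p_leaf))

def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) to every node of the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i in (1, 2):
            yield from paths(tree[i], prefix + (i,))

def get(tree, path):
    """Return the subtree found by following path from the root."""
    for i in path:
        tree = tree[i]
    return tree

def replace(tree, path, new):
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:
        return new
    node = list(tree)
    node[path[0]] = replace(tree[path[0]], path[1:], new)
    return tuple(node)

def mutate(rng, tree, max_depth=3):
    """Subtree mutation: replace a random subtree by a freshly grown one."""
    path = rng.choice(list(paths(tree)))
    return replace(tree, path, grow(rng, max_depth))

def crossover(rng, a, b):
    """Subtree crossover: swap one random subtree between two parents."""
    pa = rng.choice(list(paths(a)))
    pb = rng.choice(list(paths(b)))
    return replace(a, pa, get(b, pb)), replace(b, pb, get(a, pa))

def size(tree):
    """Number of nodes, a simple bloat measure."""
    return sum(1 for _ in paths(tree))

def is_valid(tree):
    """Check that a tree only uses the allowed leaves and binary operators."""
    if isinstance(tree, str):
        return tree in LEAVES
    return (len(tree) == 3 and tree[0] in OPS
            and is_valid(tree[1]) and is_valid(tree[2]))

rng = random.Random(0)
parent_a, parent_b = grow(rng, 3), grow(rng, 3)
child_a, child_b = crossover(rng, parent_a, parent_b)
mutant = mutate(rng, parent_a)
```

Crossover only moves material between parents, so the combined node count of the two children equals that of the two parents. Mutation can grow or shrink a tree, which is one reason explicit size or depth objectives are used.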

3. Multi-Objective Evaluation and Complexity Control

Fitness evaluation is multi-objective, reflecting both predictive performance and model tractability:

Predictive accuracy: Metrics include log marginal likelihood (LML), negative log predictive density (NLPD), or RMSE, e.g.,

$\mathrm{LML} = -\tfrac{1}{2}\mathbf{y}^{\top} K^{-1} \mathbf{y} - \tfrac{1}{2}\log|K| - \tfrac{n}{2}\log 2\pi$

$\mathrm{NLPD} = \frac{1}{n}\sum_i\left[\frac{(f_i - \mu_i)^2}{2\sigma_i^2} + \tfrac{1}{2}\log\sigma_i^2 + \tfrac{1}{2}\log 2\pi\right]$

Model complexity: Quantified using BIC, with penalty $q\log n$ ($q$ = number of hyperparameters). Additional metrics include expression tree size and depth to control for structural bloat (Roman et al., 2019, Kronberger et al., 2013).
Computational burden: Wall-clock time for fitting kernels is directly included as a selection objective (Roman et al., 2019).
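The accuracy and complexity objectives can be computed as in the following sketch. It assumes NumPy, that $K$ already includes the noise term, and the standard form $\mathrm{BIC} = -2\,\mathrm{LML} + q\log n$; the exact conventions in the cited papers may differ. NLPD is written as the mean negative log density of a Gaussian predictive distribution, so lower is better. The function names are illustrative.

```python
import numpy as np

def log_marginal_likelihood(K, y):
    """LML of a zero-mean GP with covariance K (noise included), via Cholesky."""
    n = y.shape[0]
    L = np.linalg.cholesky(K)                       # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    half_logdet = np.sum(np.log(np.diag(L)))        # 0.5 * log|K|
    return -0.5 * y @ alpha - half_logdet - 0.5 * n * np.log(2.0 * np.pi)

def nlpd(f, mu, var):
    """Mean negative log predictive density for Gaussian predictions."""
    return np.mean(0.5 * (f - mu) ** 2 / var
                   + 0.5 * np.log(var)
                   + 0.5 * np.log(2.0 * np.pi))

def bic(lml, q, n):
    """BIC with q hyperparameters and n data points (lower is better)."""
    return -2.0 * lml + q * np.log(n)

# Tiny usage example with a hand-written 2x2 covariance matrix.
K = np.array([[2.0, 0.5], [0.5, 1.0]])
y = np.array([1.0, -1.0])
lml = log_marginal_likelihood(K, y)
score = bic(lml, q=3, n=len(y))
```

Working from the Cholesky factor avoids forming $K^{-1}$ explicitly and gives $\log|K|$ as twice the sum of the log-diagonal of $L$.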

Optimization is nested. For every kernel candidate, hyperparameters are tuned (e.g., Powell’s method with multi-start, L-BFGS), followed by fitness scoring. Final model selection is commonly performed via LML maximization on the Pareto archive (Roman et al., 2019, Roman et al., 2019).
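The final selection step can be illustrated with a small non-dominated filter followed by an LML pick. This is only a sketch of the Pareto-archive idea, not a full NSGA-II (no rank sorting or crowding distance). The `Candidate` fields and the numbers in the example population are made up for illustration.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str
    lml: float      # log marginal likelihood, higher is better
    bic: float      # lower is better
    seconds: float  # fitting time, lower is better

def objectives(c):
    """Express every objective as a quantity to minimise."""
    return (-c.lml, c.bic, c.seconds)

def dominates(a, b):
    """a dominates b if it is no worse everywhere and strictly better somewhere."""
    oa, ob = objectives(a), objectives(b)
    return (all(x <= y for x, y in zip(oa, ob))
            and any(x < y for x, y in zip(oa, ob)))

def pareto_front(population):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in population
            if not any(dominates(other, c) for other in population)]

def select_final(population):
    """Pick the archive member with the highest LML."""
    return max(pareto_front(population), key=lambda c: c.lml)

# Hypothetical population; all values are illustrative only.
population = [
    Candidate("A", lml=-120.0, bic=250.0, seconds=2.0),
    Candidate("B", lml=-100.0, bic=230.0, seconds=5.0),
    Candidate("C", lml=-130.0, bic=270.0, seconds=3.0),
    Candidate("D", lml=-125.0, bic=255.0, seconds=1.0),
]
front = pareto_front(population)
best = select_final(population)
```

Here candidate C is dominated by A on all three objectives and drops out, while A, B, and D each win on at least one objective and stay in the archive.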

4. Ensuring Positive Semidefiniteness

Maintaining valid GP kernels through arbitrary evolution is critical. Approaches include:

Restriction to PSD-preserving primitives and algebraic operators (sum, product, scaling by a positive constant, exponentiation with positive integer exponents).
Runtime screening: After genetic operations, sampled Gram matrices $K$ are assembled; kernels whose Gram matrices fail the validity checks, including any with a negative eigenvalue, are penalized or rejected (BIC = ∞) (Roman et al., 2019).
Masking enables partitioned structure in high-dimension kernels by zeroing unused inputs, further aiding interpretability and flexibility (Kronberger et al., 2013).

These constraints, combined with fast PSD screening, enable broad exploration while maintaining mathematical rigor (Roman et al., 2019, Kronberger et al., 2013).
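A runtime screen of this kind might look like the sketch below. It assumes NumPy and scalar inputs. The finiteness check, the tolerance `tol`, and the helper names are assumptions added for illustration, not details confirmed by the cited papers.

```python
import numpy as np

def gram(kernel, X):
    """Assemble the Gram matrix of a kernel on a sample of inputs."""
    return np.array([[kernel(a, b) for b in X] for a in X], dtype=float)

def passes_psd_screen(kernel, X, tol=1e-8):
    """Return True if the sampled Gram matrix is finite and numerically PSD."""
    K = gram(kernel, X)
    if not np.all(np.isfinite(K)):          # assumed extra check: no NaN/inf
        return False
    K = 0.5 * (K + K.T)                     # symmetrise against round-off
    eig_min = np.linalg.eigvalsh(K).min()
    # Allow tiny negative eigenvalues caused by floating-point error.
    return bool(eig_min >= -tol * max(1.0, np.abs(K).max()))

def screened_bic(kernel, X, bic_fn):
    """Reject invalid kernels by assigning them an infinite BIC."""
    return bic_fn(kernel) if passes_psd_screen(kernel, X) else float("inf")

X = np.linspace(0.0, 3.0, 8)
se_kernel = lambda a, b: np.exp(-0.5 * (a - b) ** 2)   # valid kernel
distance = lambda a, b: abs(a - b)                     # not PSD: zero trace
ok = passes_psd_screen(se_kernel, X)
bad = passes_psd_screen(distance, X)
```

A screen on sampled points can only reject a kernel; passing it does not prove positive semidefiniteness on all inputs, which is why grammar-level restrictions remain the stronger guarantee.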

5. Evolved Kernel Structures and Task-Specific Patterns

Evolutionary search frequently discovers hybrid kernel architectures combining smooth (RBF/SE/Matérn) components with periodic and rational-quadratic elements:

For sentiment regression: the evolved kernels incorporate both large-scale trends and finer oscillatory structure (Roman et al., 2019).
Composite kernels on time series often match or exceed hand-tuned counterparts in expressivity:
- Example from Mauna Loa CO$_2$: the evolved kernel achieves near-identical performance to expert-designed kernels (Kronberger et al., 2013).

A plausible implication is that evolutionary methods systematically exploit multiple data regularities through kernel composition, with transferability observed across related domains (e.g., an anger kernel applied to other sentiment classes) (Roman et al., 2019).

6. Experimental Validation and Quantitative Outcomes

Studies have benchmarked evolutionary kernel search against fixed-kernel GP and other composite search platforms:

Sentiment analysis: On SemEval-2007 headlines, genetically evolved kernels matched or outperformed SE, Matérn, and linear baselines in Pearson correlation coefficient (PCC) and NLPD, with MOECov ranking top or near-top in most tasks (Roman et al., 2019). Statistical tests affirm competitive or superior predictive quality.
Time-series extrapolation: On a 13-series benchmark (e.g., airline, CO$_2$, solar irradiance), EvoCov achieved a mean standardized RMSE of 1.951 (2nd-best), while using approximately half as many hyperparameters as non-evolutionary composite search (e.g., ABCD_accuracy) (Roman et al., 2019).
CO$_2$ trend modeling: Evolved kernels delivered RMSE and log-likelihood nearly identical to the widely cited manual kernel, with strong correlation on test data (Kronberger et al., 2013).

Complexity control via BIC and Pareto fronts yields compact expressions. The performance advantage is robust to alternative target kernels and extends to transfer scenarios. However, in unrestricted search spaces, tree bloat and computational cost remain limiting factors (Roman et al., 2019, Kronberger et al., 2013).

7. Relation to Fixed-Candidate Kernel Selection

While evolutionary algorithms target compositionally rich kernel spaces, fixed-candidate kernel selection strategies have also emerged:

The Automatic Kernel Search (AKS) algorithm in two-stage GPR restricts the candidate set to a small dictionary (e.g., RBF, Matérn-3/2, Matérn-1/2) and employs a statistically-motivated misspecification test (based on a model error bound) to select kernels with maximal well-specified fit probability (Zhao et al., 2024).
Unlike grammar-based genetic programming, AKS scales by combining subsampling, warm-start hyperparameter optimization, and efficient misspecification checking, achieving lower runtime at the cost of expressiveness (Zhao et al., 2024).

A plausible implication is that for high-dimensional or resource-constrained settings, dictionary-based search may be preferable, with evolutionary search reserved for cases demanding maximal flexibility and custom structural discoveries.

8. Challenges, Limitations, and Future Directions

Major challenges include the vast search space of expression trees, identification and penalization of bloat, lack of absolute PSD guarantees outside grammar-compliant spaces, and nested (costly) hyperparameter optimization (Roman et al., 2019, Kronberger et al., 2013). Structural complexity grows rapidly, necessitating explicit parsimony penalties or operator constraints. The overall computational cost, which scales as $O(N\,G\,n^3)$ for a population of size $N$, $G$ generations, and $n$ data points, currently restricts applicability to moderate-scale tasks (Kronberger et al., 2013, Roman et al., 2019).

Extensions under consideration include grammar enrichment (e.g., change-point, input partitioning, windowed kernels), multi-objective search trading off predictive error and complexity more explicitly, and adaptations for streaming or online environments (Roman et al., 2019). The efficacy of transfer and the integration of domain-specific inductive biases are also open questions.

References

Roman et al., "Sentiment analysis with genetically evolved Gaussian kernels" (Roman et al., 2019)
Roman et al., "Evolving Gaussian Process kernels from elementary mathematical expressions" (Roman et al., 2019)
Kronberger & Kommenda, "Evolution of Covariance Functions for Gaussian Process Regression using Genetic Programming" (Kronberger et al., 2013)
Zhao et al., "Efficient Two-Stage Gaussian Process Regression Via Automatic Kernel Search and Subsampling" (Zhao et al., 2024)