
GP-Based Feature Engineering

Updated 27 November 2025
  • GP-based feature engineering is a method that uses genetic programming to evolve interpretable symbolic expressions and complex features for improved predictive performance.
  • It integrates vectorial GP and segment optimization techniques to automatically tailor features to structured inputs such as time series, improving robustness and generalization.
  • Advancements like domain knowledge integration and sharpness-aware minimization further improve feature interpretability and resistance to overfitting.

GP-based feature engineering denotes the use of genetic programming (GP) for the automatic construction, selection, and optimization of features, encompassing both interpretable transformations and complex symbolic representations. Within this paradigm, GP algorithms search over the space of mathematical expressions or programs that define features, often under constraints of model complexity, generalization error, or domain-specific structure. Recent advances include vectorial GP for segment-aware feature extraction, multi-task and knowledge-sharing architectures, sharpness-aware minimization to improve generalization, and integration with external knowledge sources such as LLMs.

1. Core Representations and Architectures

GP-based feature engineering encompasses a variety of representations. The canonical approach encodes each feature as a symbolic expression (tree) over the dataset’s input variables and, optionally, constants. For instance, each constructed feature in the feature construction schemes (FCS) of (Virgolin et al., 2019) corresponds to a rooted tree assembled from function nodes—arithmetic, non-linear, and protected operators—and terminal nodes comprising raw features or ephemeral random constants. Complexity is typically regulated via constraints such as maximum tree height (e.g., $h \in \{2, 4\}$).
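
For concreteness, a minimal sketch of such a tree-encoded feature follows (hand-rolled node tuples and a protected division; this is an illustrative encoding, not the implementation of any specific cited system):

```python
import random

# Function nodes: +, -, *, and protected division; terminals: raw features
# x[i] or ephemeral random constants. Tree height is capped (e.g., h in {2, 4}).
FUNCTIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if abs(b) > 1e-6 else 1.0,  # protected division
}

def random_tree(n_inputs, max_height):
    """Grow a random feature tree of height at most max_height."""
    if max_height == 0 or random.random() < 0.3:
        if random.random() < 0.5:
            return ("x", random.randrange(n_inputs))        # raw feature
        return ("const", random.uniform(-1.0, 1.0))          # random constant
    op = random.choice(list(FUNCTIONS))
    return (op,
            random_tree(n_inputs, max_height - 1),
            random_tree(n_inputs, max_height - 1))

def evaluate(tree, x):
    """Evaluate one constructed feature on a single input vector x."""
    tag = tree[0]
    if tag == "x":
        return x[tree[1]]
    if tag == "const":
        return tree[1]
    return FUNCTIONS[tag](evaluate(tree[1], x), evaluate(tree[2], x))

# Example: one constructed feature of height <= 2 on a 5-dimensional input.
feat = random_tree(n_inputs=5, max_height=2)
print(feat, evaluate(feat, [0.1, 0.2, 0.3, 0.4, 0.5]))
```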

Vectorial GP (Vec-GP) extends this schema to natively handle both scalar and vector-valued inputs by "lifting" standard operations (e.g., $+$, $-$, $\times$, $/$) to act component-wise on inputs $v, w \in \mathbb{R}^d$: $(v \odot w)_k = v_k \odot w_k$ for $k = 1, \ldots, d$ (Fleck et al., 2023). Scalar constants and expressions are automatically broadcast across vector dimensions. Crucially, Vec-GP supports segment-based aggregation functions, enabling the conversion of a vector to a scalar via an aggregation (sum, mean, max, etc.) over a contiguous window defined by integer endpoints $(i, j)$.
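
The lifting and segment aggregation can be illustrated with NumPy broadcasting (a sketch with illustrative names, not the Vec-GP implementation):

```python
import numpy as np

# Component-wise lifting: standard operators act element-wise on vectors,
# and scalar constants broadcast automatically across vector dimensions.
v = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, 0.5, 2.0, 2.0, 1.0])
lifted_product = v * w          # (v ⊙ w)_k = v_k * w_k
scaled = 3.0 * v                # scalar broadcast over all components

def aggregate_segment(vec, i, j, agg=np.mean):
    """Convert a vector to a scalar by aggregating the window vec[i:j+1]."""
    return agg(vec[i:j + 1])

# Example: mean over the contiguous window (i, j) = (1, 3).
print(lifted_product, scaled, aggregate_segment(lifted_product, 1, 3))
```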

In multi-task or transfer learning settings, as in the KSMTGP architecture (Bi et al., 2020), individuals comprise multiple trees: a "common" tree encoding knowledge shared across tasks and task-specific trees for individualized feature learning. Features for each task are the concatenation of the common and specific outputs.
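
A schematic of such an individual (trees are stood in for by plain callables; the interface is hypothetical, not the KSMTGP code):

```python
import numpy as np

# One "common" tree shared across tasks plus one task-specific tree per task;
# the feature vector for each task concatenates both outputs.
common_tree = lambda x: np.sin(x[0]) * x[1]          # shared knowledge
task_trees = {
    "task_a": lambda x: x[2] - x[0],                  # task-specific feature
    "task_b": lambda x: x[1] * x[3],
}

def features_for_task(task, x):
    """Concatenate the common and task-specific feature outputs."""
    return np.array([common_tree(x), task_trees[task](x)])

x = np.array([0.3, 1.2, -0.7, 2.0])
print(features_for_task("task_a", x), features_for_task("task_b", x))
```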

M6GP and related multi-tree, multi-objective approaches (Batista, 27 Mar 2025) further generalize by evolving individuals as ordered sets of symbolic trees, each producing an engineered feature.

2. Segment-Based and Structured Feature Optimization

A salient innovation is the explicit optimization of segment parameters when features aggregate substructures of vector-valued signals (e.g., time-series windows). In Vec-GP, every aggregation node is parameterized by a window $(i, j)$, which must be jointly optimized with the symbolic structure and continuous constants (Fleck et al., 2023). This results in a bi-level search problem:

$$\max_{T, \theta, I} \; \mathrm{Fitness}(T, \theta, I) = -\mathrm{Error}_{\text{dataset}}\big(T(x; \theta, I)\big),$$

where $T$ is the symbolic tree structure, $\theta$ its continuous constants, and $I$ the set of aggregation windows.

Window optimization (the "Segment Optimization Problem," SOP) is challenging due to its integer, non-convex nature. Two classes of SOP strategies are used:

  • Random Sampling: Uniformly samples windows $(i, j)$ within valid indices, tracking and returning the best candidate over a fixed evaluation budget.
  • Guided Sampling: Approximates discrete gradients of the fitness landscape with respect to $i$ and $j$, performs an integer-valued "gradient step," samples from a local neighborhood, and updates the best window found (both strategies are sketched below).
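
Both strategies can be sketched on a toy SOP in which a mean-aggregated window is matched to a target value (illustrative fitness and step sizes, not the exact procedures of (Fleck et al., 2023)):

```python
import numpy as np

rng = np.random.default_rng(0)

def window_fitness(signal, target, i, j):
    """Toy SOP fitness: negative squared error of a mean-aggregated window."""
    return -(signal[i:j + 1].mean() - target) ** 2

def random_sampling(signal, target, budget=50):
    """Uniformly sample valid windows, keep the best one found."""
    best, best_fit = None, -np.inf
    for _ in range(budget):
        i, j = sorted(rng.integers(0, len(signal), size=2))
        fit = window_fitness(signal, target, i, j)
        if fit > best_fit:
            best, best_fit = (int(i), int(j)), fit
    return best, best_fit

def guided_sampling(signal, target, budget=50, step=2):
    """Approximate discrete gradients w.r.t. i and j, then step and jitter."""
    n = len(signal)
    i, j = 0, n - 1
    best, best_fit = (i, j), window_fitness(signal, target, i, j)
    for _ in range(budget):
        cur = window_fitness(signal, target, i, j)
        # Finite-difference "gradients" of fitness w.r.t. the two endpoints.
        gi = window_fitness(signal, target, min(i + 1, j), j) - cur
        gj = window_fitness(signal, target, i, max(j - 1, i)) - cur
        i = int(np.clip(i + step * np.sign(gi) + rng.integers(-1, 2), 0, j))
        j = int(np.clip(j - step * np.sign(gj) + rng.integers(-1, 2), i, n - 1))
        fit = window_fitness(signal, target, i, j)
        if fit > best_fit:
            best, best_fit = (i, j), fit
    return best, best_fit

signal = rng.normal(size=100).cumsum()
print(random_sampling(signal, target=1.0))
print(guided_sampling(signal, target=1.0))
```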

Empirical results indicate that simple random sampling generally outperforms guided/gradient-based approaches in terms of convergence speed and reliability, as the latter are prone to local optima—especially when the discrete gradient signal is weak or rounded to zero (Fleck et al., 2023).

Segment optimization is commonly embedded within specialized GP mutation operators ("MutateSegment") that focus search on aggregation window parameters during evolution.
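
Such an operator might be sketched as follows (the `aggregation_nodes` interface and the local perturbation radius are hypothetical, standing in for the operator described in (Fleck et al., 2023)):

```python
import random

def mutate_segment(individual, vector_length, local_radius=3):
    """Pick one aggregation node and perturb its window (i, j) locally.

    Assumes `individual.aggregation_nodes` exposes nodes carrying integer
    window attributes `i` and `j` (hypothetical interface).
    """
    node = random.choice(individual.aggregation_nodes)
    node.i = max(0, min(node.i + random.randint(-local_radius, local_radius),
                        vector_length - 1))
    node.j = max(node.i, min(node.j + random.randint(-local_radius, local_radius),
                             vector_length - 1))
    return individual
```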

3. Optimization Objectives and Overfitting Control

Optimization criteria in GP-based feature engineering are multifaceted. Predictive performance is typically assessed via cross-validated error, F1-score, or $R^2$. Complexity constraints are handled via explicit limits (e.g., maximum tree height) or as an additional minimization objective (node count, model size).
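
In a multi-objective setting these criteria are typically returned together, for example as a pair of values to be minimized by NSGA-II or a similar algorithm (a minimal sketch; `cv_error` and `node_count` are hypothetical helpers):

```python
def fitness(tree, X, y, cv_error):
    """Return (predictive error, complexity) for multi-objective selection."""
    error = cv_error(tree, X, y)        # e.g., cross-validated RMSE or 1 - F1
    complexity = tree.node_count()      # explicit size / parsimony objective
    return error, complexity
```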

Recent advances leverage PAC-Bayesian generalization bounds to inspire new regularization measures. Sharpness-aware minimization (SAM) penalizes feature programs whose loss landscapes are sensitive to small perturbations in semantic space, favoring flat minima and enhancing generalization, especially in the presence of limited data or label noise (Zhang et al., 11 May 2024). SAM-GP integrates this objective with traditional cross-validated error in a multi-objective EA framework (e.g., NSGA-II), often augmented with a sharpness-reduction layer that regularizes output values during inference for additional robustness.
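
A rough sketch of a sharpness estimate in this spirit (illustrative only; the SAM-GP formulation in (Zhang et al., 11 May 2024) differs in detail) perturbs the semantics of the constructed features and records the worst-case increase in loss:

```python
import numpy as np

def sharpness(semantics, y, loss, n_perturbations=10, radius=0.1, seed=0):
    """Estimate sharpness of a feature program's loss landscape.

    `semantics` is the vector of outputs of the constructed feature(s) on the
    training set; sharpness is the worst-case loss increase under small
    perturbations of that semantic vector (flat minima -> low sharpness).
    """
    rng = np.random.default_rng(seed)
    base_loss = loss(semantics, y)
    worst = 0.0
    for _ in range(n_perturbations):
        noise = rng.normal(scale=radius * semantics.std(), size=semantics.shape)
        worst = max(worst, loss(semantics + noise, y) - base_loss)
    return worst

# Example with a squared-error loss on a toy semantic vector.
mse = lambda s, y: float(np.mean((s - y) ** 2))
sem = np.linspace(0, 1, 50)
print(sharpness(sem, sem + 0.05, mse))
```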

Empirical comparisons confirm that the introduction of sharpness as a secondary objective dramatically improves generalization performance on real-world regression benchmarks, surpassing traditional complexity-based controls and matching or surpassing tuned tree-ensemble learners (Zhang et al., 11 May 2024).

4. Integration of Domain Knowledge and Automated Feature Seeding

A current frontier involves the incorporation of domain-specific knowledge to seed or guide the evolutionary process. The feature engineering pipeline in (Batista, 27 Mar 2025) uses LLMs (e.g., GPT-4o) to generate candidate feature transformations and combinations from only the feature names and task objective. These candidate features are computed and appended to the dataset, after which standard GP-based symbolic regression proceeds on the augmented feature space.
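
A schematic of the two-stage pipeline (hypothetical prompt and helper names; the LLM call is abstracted behind an `ask_llm` callable, and suggestions are treated as pandas-evaluable expressions):

```python
import pandas as pd

def seed_features(df: pd.DataFrame, target_name: str, ask_llm) -> pd.DataFrame:
    """Stage 1: ask an LLM for candidate features from metadata only,
    then append the computable ones to the dataset before GP runs."""
    prompt = (
        f"Columns: {list(df.columns)}. Task: predict '{target_name}'. "
        "Suggest feature transformations as pandas expressions, one per line."
    )
    for k, expr in enumerate(ask_llm(prompt)):      # e.g. ["bmi * age", ...]
        try:
            df[f"llm_feat_{k}"] = df.eval(expr)     # only metadata leaves the machine
        except Exception:
            pass                                    # skip non-computable suggestions
    return df  # Stage 2: standard GP-based symbolic regression on the augmented df
```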

This two-stage procedure imposes negligible incremental computational cost but frequently accelerates convergence and, in approximately one-third of experimental cases, yields statistically significant improvements in test performance. The risk of misguidance due to non-informative ("hallucinated") features is low, as GP evolution subsequently selects or ignores provided transformations as dictated by fitness.

Importantly, the approach is data-privacy preserving, as the LLM never accesses real data values, only metadata.

5. Interpretability, Compactness, and Downstream Utility

A central benefit of GP-based feature engineering is the production of explicit, human-interpretable symbolic features. Explicit bounds on tree height or node count yield compact formulas, facilitating 2D visualization and direct analysis of model behavior. Experiments with highly-constrained FCS pipelines ($K=2$ features, $h=2$ or $4$) demonstrate that even two evolved features can match or outperform the full feature set for naive Bayes or linear regression on the majority of tasks, and preserve accuracy for SVM, random forest, and XGBoost in a substantial fraction of cases (Virgolin et al., 2019).
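
As an illustration of this kind of downstream comparison (scikit-learn on a standard dataset, with two hand-picked compact features standing in for evolved ones):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# Two compact, height-limited formulas standing in for evolved features.
f1 = X[:, 0] * X[:, 7]          # (mean radius) * (mean concave points)
f2 = X[:, 1] - X[:, 21]         # (mean texture) - (worst texture)
X_constructed = np.column_stack([f1, f2])

full = cross_val_score(GaussianNB(), X, y, cv=5).mean()
compact = cross_val_score(GaussianNB(), X_constructed, y, cv=5).mean()
print(f"full feature set: {full:.3f}, two constructed features: {compact:.3f}")
```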

Compactness also mitigates overfitting and supports model explainability, essential for glass-box ML workflows. Recommendations from (Virgolin et al., 2019) advocate prioritizing interpretability (low $h$), especially for downstream models whose native inductive bias is limited; using more sophisticated search variants (GP-GOMEA, linkage-learning) when evaluation budgets allow; and always visualizing the behavior of models in the engineered feature space.

6. Extensions, Limitations, and Open Challenges

Open research areas and limitations identified include:

  • Window Optimization Scalability: Integer segment search remains computationally demanding, especially as vector dimension increases. Employing meta-heuristics, surrogate models, or integer relaxation may alleviate this bottleneck (Fleck et al., 2023).
  • Multiple or Weighted Segments: Current Vec-GP aggregates over single contiguous windows. Allowing multiple (possibly overlapping) or softly weighted windows could greatly expand representational capacity.
  • Robust Complexity Objectives for Regression: While multi-objective schemes for size and classification performance are well established, complexity-accuracy trade-offs in symbolic regression remain unsolved (Batista, 27 Mar 2025).
  • Beyond Arithmetic Function Sets: Expanding the set of permissible mathematical operators may improve exploitation of knowledge-seeded features and representational expressivity.
  • Knowledge-Sharing and Multi-task Evolution: Explicit co-evolution of shared and task-specific components in multitask contexts yields transferability and regularization (KSMTGP), yet the design of optimal sharing mechanisms remains an active topic.
  • Explainability vs. Predictive Performance: Trade-offs persist, particularly for strong nonlinear models, between maximizing informativeness and retaining interpretability.
  • Sharpness Estimation Overhead: While beneficial for generalization, sharpness-aware GP introduces additional runtime (approx. 3× over standard GP), although it remains more efficient than certain complexity-based regularization approaches (Zhang et al., 11 May 2024).

Plausible future directions include interactive co-evolution with LLMs, kernelized or deep-network generalizations, and surrogate-assisted segment optimization.

7. Empirical Evaluation and Benchmarking

Evaluation methodologies are highly protocolized: standardized train/test splitting, repeated experiments across random seeds, nested cross-validation during evolution for unbiased fitness assessment, and rigorous statistical analysis (Wilcoxon, Holm, Benjamini–Hochberg corrections) (Virgolin et al., 2019, Fleck et al., 2023, Zhang et al., 11 May 2024, Batista, 27 Mar 2025). Metrics encompass $R^2$, macro-F1, RMSE/MAE, and run-length distributions to convergence.
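
A minimal sketch of the per-dataset statistical comparison (hypothetical paired scores across repeated seeds):

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(42)
n_seeds = 30

# Hypothetical paired R^2 scores per random seed for two methods on one dataset.
scores_gp   = 0.80 + 0.03 * rng.standard_normal(n_seeds)
scores_base = 0.76 + 0.03 * rng.standard_normal(n_seeds)

stat, p = wilcoxon(scores_gp, scores_base)
print(f"Wilcoxon p-value: {p:.4f}")  # apply Holm / Benjamini-Hochberg correction
                                     # when testing across many datasets
```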

Comparative baselines span classical ML models (SVM, random forest, XGBoost), wrapper-based GP, parsimonious and complexity-regularized variants, and state-of-the-art complexity measurement frameworks (Rademacher, Tikhonov, WCRV, IODC, etc.). Benchmarks span diverse domains, including high-dimensional genomics and complex time series.

Empirical findings consistently validate the competitive or superior performance of modern GP-based feature engineering approaches, both in predictive power and in producing interpretable, compact representations (Zhang et al., 11 May 2024, Fleck et al., 2023, Virgolin et al., 2019, Batista, 27 Mar 2025). Key observations include the strong performance of random window sampling for segment selection, the surprisingly high effectiveness of complexity-constrained features, the efficacy of sharpness-aware and ensemble methods for improved generalization, and the demonstrable benefits of integrating language-model-generated domain knowledge.


References

  • (Virgolin et al., 2019) On Explaining Machine Learning Models by Evolving Crucial and Compact Features
  • (Fleck et al., 2023) Vectorial Genetic Programming – Optimizing Segments for Feature Extraction
  • (Zhang et al., 11 May 2024) Sharpness-Aware Minimization for Evolutionary Feature Construction in Regression
  • (Batista, 27 Mar 2025) Embedding Domain-Specific Knowledge from LLMs into the Feature Engineering Pipeline
  • (Bi et al., 2020) Learning and Sharing: A Multitask Genetic Programming Approach to Image Feature Learning