Knowledge-Based Feature Engineering

Updated 8 December 2025
  • Knowledge-based feature engineering is the process of integrating domain expertise into the construction of new, semantically informed features for machine learning models.
  • It leverages LLMs as knowledge oracles to propose candidate feature transformations that reflect true domain semantics, reducing the need for exhaustive random searches.
  • Coupled with evolutionary algorithms like genetic programming, the approach accelerates model convergence, enhances predictive performance, and preserves data privacy.

Knowledge-based feature engineering is the process by which domain-specific expert knowledge is operationalized within the feature construction pipeline to yield new, semantically informed features for machine learning models. The central motivation is to overcome the limitations of purely data-driven or random-evolution approaches by prescribing candidate transformations that reflect the natural structure or known theoretical principles of the domain, rather than relying on exhaustive or blind search. Modern instantiations increasingly leverage LLMs as knowledge oracles to automate the suggestion of plausible feature transformations, allowing for domain knowledge injection even when raw data cannot be shared. Downstream, evolutionary algorithms—such as genetic programming—compress, select, and optimize both the original and LLM-generated features to produce compact, high-performance models. Rigorous benchmarking shows computational advantages (faster convergence), minimal performance degradation, and superior results in domains where LLMs possess strong prior knowledge. This article provides a comprehensive technical exposition of knowledge-based feature engineering as developed in recent literature, with special emphasis on LLM-enabled pipelines (Batista, 27 Mar 2025).

1. Formalization of Knowledge-Based Feature Engineering

Let $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^n$ be a tabular dataset of $n$ samples, each with $m$ original features $x^{(i)} \in \mathbb{R}^m$ and target output $y^{(i)}$. The objective is to construct a mapping

$$\varphi: \mathbb{R}^m \rightarrow \mathbb{R}^k, \quad k \geq m,$$

where

$$\varphi(x) = [x, h_1(x), \ldots, h_p(x)]$$

and each $h_j: \mathbb{R}^m \rightarrow \mathbb{R}$ is a new feature function informed by domain knowledge. The set of candidate knowledge-derived transformations is denoted $\mathcal{H}_K = \{h_j\}_{j=1}^p$ and is explicitly prescribed (not blindly generated). For LLM-based pipelines, $\mathcal{H}_K$ is extracted from an LLM based solely on feature names and the task objective, without access to raw data, i.e.

$$\mathcal{H}_K = \mathrm{LLM}(K) = \{h_1, \ldots, h_p\}, \quad K = (\text{feature-name list}, \text{task objective}).$$

The augmented dataset becomes $X' = [X, \mathcal{H}_K(X)] \in \mathbb{R}^{n \times (m+p)}$. The downstream predictor $f: \mathbb{R}^k \rightarrow \mathcal{Y}$ is then trained on $\varphi(x)$.
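
As an illustration of this mapping, the following minimal sketch constructs $X' = [X, \mathcal{H}_K(X)]$ by appending knowledge-derived columns to the original features; the column names and the two transformations shown are hypothetical placeholders, not features taken from the source.

```python
import numpy as np
import pandas as pd

# Hypothetical knowledge-derived transformation set H_K; in an LLM-based
# pipeline these definitions would come from the model, here they are
# hard-coded placeholders over assumed column names ("age", "water", "cement").
H_K = {
    "log_age": lambda X: np.log(X["age"]),
    "water_cement_ratio": lambda X: X["water"] / X["cement"],
}

def phi(X: pd.DataFrame, transformations: dict) -> pd.DataFrame:
    """Return X' = [X, H_K(X)]: original columns plus knowledge-derived ones."""
    X_aug = X.copy()
    for name, h in transformations.items():
        X_aug[name] = h(X)   # each h_j : R^m -> R, applied row-wise
    return X_aug             # shape (n, m + p)

# Usage: X_prime = phi(X, H_K); a downstream predictor f is then fit on X_prime.
```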

This approach contrasts with standard evolutionary pipelines, which typically sample feature space randomly under finite computational budget, leading to extensive trial-and-error and slow convergence due to lack of semantic priors.

2. Motivation and Theoretical Rationale

Evolutionary computation (EC)—notably Genetic Algorithms and Genetic Programming—has established utility in feature selection and construction due to its ability to discover nonlinear interactions and mathematically interpretable models. Classical EC begins from random initialization within a search space defined by primitive functions over available features; in early generations, this often yields candidate features that have little structural correspondence to true domain semantics. As a consequence, EC pipelines require large numbers (often thousands) of fitness evaluations before identifying promising transformations, incurring high computational cost.

Practitioners in applied domains routinely mitigate this via hand-crafted, theoretically motivated feature construction (e.g., ratios, aggregates from physics or biology), which jump-starts EC by narrowing the search space. The knowledge-based approach reconciles these two paradigms: LLMs serve as “feature suggestion” oracles, using feature names/objectives to propose combinations, and EC then refines, selects, and compresses the union of original and LLM-proposed features. This injection of coarse domain knowledge into the EC loop sharply reduces redundant exploration and accelerates model convergence.

3. Two-Stage Methodology: LLM-Based Feature Construction and Genetic Programming

3.1 LLM-Driven Transformation Discovery

The LLM operates in a zero-shot setting, receiving only the feature names and the task objective:

  • Step 1: Request the most relevant features for the task.
  • Step 2: Request plausible transformations/combinations from the set identified in Step 1.

The LLM outputs symbolic feature definitions, e.g. $h_1 = \log(\text{age})$, $h_2 = \text{cementitious\_materials\_sum} = \text{cement} + \text{slag} + \text{fly\_ash}$, $h_3 = \text{water\_cement\_ratio}$, etc.

These transforms are computed over the existing train/test splits. Hallucinated or spurious features may be introduced, but will be filtered at subsequent optimization stages.
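
The two-step exchange can be sketched as follows; the `llm_complete` helper, the prompt wording, and the feature names are hypothetical placeholders standing in for whatever LLM client and prompts a given pipeline uses.

```python
# Sketch of the two-step, zero-shot prompting protocol. Only metadata
# (feature names and the task objective) is sent; no raw data leaves the pipeline.

def llm_complete(prompt: str) -> str:
    """Hypothetical stand-in for a call to an LLM API; replace with a real client."""
    return "water_cement_ratio = water / cement"  # canned reply so the sketch runs

feature_names = ["cement", "slag", "fly_ash", "water", "age"]   # illustrative
task_objective = "predict concrete compressive strength (regression)"

# Step 1: ask which of the available features are most relevant to the task.
step1_prompt = (
    f"The task is: {task_objective}.\n"
    f"Available features: {', '.join(feature_names)}.\n"
    "List the features most relevant to this task."
)
relevant_features = llm_complete(step1_prompt)

# Step 2: ask for plausible transformations/combinations of those features,
# returned as symbolic expressions that can be computed on the train/test splits.
step2_prompt = (
    f"For the task '{task_objective}', propose new features as symbolic "
    f"expressions over: {relevant_features}. One expression per line, "
    "e.g. 'water_cement_ratio = water / cement'."
)
proposed_features = llm_complete(step2_prompt)
# Downstream, each proposed expression is parsed and computed column-wise;
# spurious or hallucinated constructs are left for the GP stage to filter out.
```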

3.2 EC Pipeline Integration and Optimization

The augmented feature matrix $X'$ is input to a wrapper-based EC algorithm; past work utilized M3GP (single-objective multi-tree GP), while the present work introduces M6GP (Multiobjective Multidimensional GP).

M6GP evolves a population of individuals, each individual $T = \{t_1, \ldots, t_d\}$ being a set of symbolic expressions $t_i$ over terminals (original and LLM-proposed features) and arithmetic operators, delivering a set of derived feature constructs. Each $T$ is scored for predictive fitness using a downstream learner (e.g., Ridge Regression, Random Forest), evaluated via k-fold cross-validation:

  • For regression: $f_1(T) = \text{CV}_2\,\text{RMSE}(f \circ \varphi)$ and $f_2(T) = \text{CV}_2\,\text{MAE}(f \circ \varphi)$.
  • For classification: $f_1(T) = \text{CV}_2\,\text{WAF}(f \circ \varphi)$ and $f_2(T) = |T|$ (model size).

Selection proceeds via double-tournament on Pareto rank and crowding, with crossover/mutation/elitism. M6GP penalizes bloat by optimizing for both error and complexity, unlike M3GP which is single-objective.
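
A minimal sketch of the wrapper-style, two-objective scoring is given below, assuming a regression task, Ridge as the downstream learner, 2-fold cross-validation, and individuals represented as lists of callables; M6GP's selection and variation operators are not reproduced here.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def fitness(T, X_prime, y):
    """Return (CV RMSE, model size) for the feature set encoded by individual T.

    T is assumed to be a list of callables, each mapping the augmented feature
    matrix X_prime (original + LLM-proposed columns) to one derived feature column.
    """
    Z = np.column_stack([t(X_prime) for t in T])   # phi(x) for this individual
    rmse = -cross_val_score(Ridge(), Z, y, cv=2,
                            scoring="neg_root_mean_squared_error").mean()
    return rmse, len(T)                            # error objective, size objective

def dominates(a, b):
    """Pareto dominance for minimization: a dominates b if it is no worse on
    both objectives and strictly better on at least one."""
    return all(ai <= bi for ai, bi in zip(a, b)) and any(ai < bi for ai, bi in zip(a, b))
```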

4. Empirical Benchmarking and Quantitative Results

Experiments were conducted across 11 datasets, six regression (CSS, PM, etc.) and five classification (IM10, etc.), with feature cardinality $m \in [6, 36]$ and LLM-proposed augmentations $p \in [2, 13]$. Models were compared using Ridge, Decision Tree, and Random Forest backends, as well as GP-based feature construction (M3GP, M6GP).

Representative results for the CSS regression dataset:

| Model      | RMSE (Base) | RMSE (+LLM) | $p$-value |
|------------|-------------|-------------|-----------|
| RF         | 6.195       | 5.744       | <0.001    |
| Ridge      | 10.432      | 6.901       | <0.001    |
| M3GP-Ridge | 6.146       | 5.897       | 0.004     |
| M6GP-Ridge | 6.715       | 6.377       | <0.001    |

Convergence analysis showed that, with LLM-derived features, EC surpassed the Ridge-LLM baseline in only 13 generations, versus the 33 generations required without LLM augmentation, a more than twofold speedup.

In classification (IM10), the baseline RF WAF of $0.907$ improved to $0.921$ ($p<0.001$) with LLM features; M3GP-RF reached $0.934$.

Across all 77 test cases, LLM augmentation improved performance in 22, with only one showing degradation.

5. Advantages, Limitations, and Failure Modes

Observed strengths:

  • Strongest gains occurred in domains with well-understood semantics (e.g. concrete mix design, biomedical markers, geospatial bands) where LLMs possess robust prior knowledge.
  • Computational efficiency was improved; convergence required fewer fitness evaluations.
  • Privacy is preserved since LLMs operate with feature names/objectives only, avoiding exposure of sensitive data.

Limitations:

  • For obscure or noisy domains outside LLM knowledge, proposed features may be neutral or introduce noise; however, downstream GP selection remains robust and typically discards spurious constructs.
  • Risk of LLM hallucinations is nonzero, as is sensitivity to prompt design.
  • Sample-level augmentation is not addressed; features are constructed globally from metadata.
  • Highly specialized or private datasets may fall outside LLM training distribution.

6. Extensions and Future Research Directions

Recommended avenues for future work include:

  • Enriching LLM prompts with examples or retrieval-augmented generation (RAG) from external ontologies.
  • Dynamic, human-in-the-loop verification of LLM-proposed features.
  • Advanced multi-objective modeling that penalizes symbolic complexity or other structural metrics in regression/classification.
  • End-to-end fine-tuning of LLMs tailored specifically for feature construction and selection tasks.
  • Fusion with active and adaptive feature selection mechanisms.

7. Broader Implications

Embedding domain knowledge via LLMs prior to EC-based feature engineering constitutes a pragmatic methodology for reducing computational search space, improving predictive accuracy, and enhancing interpretability in tabular modeling pipelines. Empirical results confirm that computational resources are more efficiently allocated, and test accuracy improves—particularly where problem structure is accessible to LLMs—while privacy concerns are substantively mitigated (Batista, 27 Mar 2025).
