LLM-powered Feature Engineering
- LLM-powered feature engineering is a process that uses large language models to generate, transform, and select semantically rich features for predictive tasks.
- It employs iterative workflows with context assembly, chain-of-thought reasoning, and adaptive feedback loops to optimize model performance in diverse applications.
- Hybrid frameworks integrate LLM-driven semantic scoring, embedding enrichment, and fairness controls to enhance interpretability, accuracy, and operational efficiency.
LLM-powered feature engineering is an emerging subfield of automated machine learning (AutoML) that leverages the general reasoning and domain knowledge of LLMs to automate, optimize, and enrich the generation, selection, and transformation of features for downstream predictive tasks. Unlike classical methods, which rely on brute-force search or hand-crafted transformation rules, LLM-driven approaches integrate contextual data understanding, chain-of-thought reasoning, and adaptive feedback loops, producing semantically meaningful, interpretable, and high-quality features across diverse domains.
1. Core Principles and Conceptual Frameworks
LLM-powered feature engineering is predicated on the notion that LLMs can ingest not only raw tabular data but also natural language descriptions, metadata, and sample statistics—using these to propose novel feature transformations and selection strategies. Noteworthy frameworks such as CAAFE (Hollmann et al., 2023), FeRG-LLM (Ko et al., 30 Mar 2025), LLM-FE (Abhyankar et al., 18 Mar 2025), ELLM-FT (Gong et al., 25 May 2024), and REFeat (Han et al., 25 Jun 2025) formalize feature engineering as an iterative search or program synthesis task:
- Given a dataset $D = (X, y)$, the goal is to find a transformation $\phi$ such that the performance of a model $f$ trained on $\phi(X)$ improves over training on the raw features, subject to constraints on the transformation (e.g., a budget on the number or complexity of generated features) (Abhyankar et al., 18 Mar 2025); the objective is written out after this list.
- LLMs act as both creative generators and evolutionary optimizers, proposing candidate feature transformations evaluated with empirical downstream metrics (e.g., ROC AUC, RMSE), and iteratively refining the search space based on validation performance and prior successes.
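In generic notation (the symbols below are illustrative and not quoted from any single cited paper), the search objective referenced in the first bullet above can be stated as

$$
\phi^{*} = \arg\max_{\phi \in \Phi} \; \mathrm{Perf}\big(f, \phi(X_{\mathrm{val}}), y_{\mathrm{val}}\big) \quad \text{subject to} \quad |\phi| \le B,
$$

where $\Phi$ is the space of candidate transformation programs proposed by the LLM, $\mathrm{Perf}$ is a downstream validation metric such as ROC AUC or negative RMSE, and $B$ is a budget on the number or complexity of generated features.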
2. Iterative Feature Generation Workflows
Characteristic LLM-powered feature engineering workflows operate through sequential, feedback-driven loops:
- The process begins with context assembly—dataset description, column names, types, missing value rates, and sample rows—fed as a prompt to the LLM.
- The LLM generates executable code snippets for feature transformations, which are then applied to the dataset.
- A downstream model (e.g., TabPFN, XGBoost, Random Forest) is trained and evaluated on the transformed data, and only transformations that yield a positive validation lift ($\Delta > 0$ in the chosen metric) are retained (Hollmann et al., 2023); a minimal sketch of this generate-evaluate-retain loop follows this list.
- Frameworks such as LLM-FE (Abhyankar et al., 18 Mar 2025) and ELLM-FT (Gong et al., 25 May 2024) extend this loop with evolutionary and reinforcement learning strategies—using multi-population buffers, reward-driven selection, and in-context demos from successful prior transformations to guide LLM proposals.
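The sketch below illustrates the generate-evaluate-retain loop under simplifying assumptions: `propose_feature_code` is a hypothetical stand-in for an LLM call, the prompt format is invented for illustration, and the downstream model is a plain scikit-learn random forest rather than TabPFN.

```python
# Minimal sketch of a CAAFE-style generate-evaluate-retain loop (illustrative, not the original implementation).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def propose_feature_code(context: str) -> str:
    """Hypothetical LLM call: should return a pandas snippet that adds new columns to `df` in place."""
    raise NotImplementedError("Replace with a call to your LLM provider.")


def evaluate(df: pd.DataFrame, target: str) -> float:
    # Assumes numeric feature columns for simplicity.
    X, y = df.drop(columns=[target]), df[target]
    X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)
    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    return roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])


def feature_engineering_loop(df: pd.DataFrame, target: str, n_rounds: int = 10) -> pd.DataFrame:
    best_score = evaluate(df, target)
    for _ in range(n_rounds):
        # Context assembly: schema, dtypes, missingness, and a few sample rows.
        context = (
            f"Columns: {list(df.columns)}\n"
            f"Dtypes: {df.dtypes.to_dict()}\n"
            f"Missing rates: {df.isna().mean().round(3).to_dict()}\n"
            f"Sample rows:\n{df.head(3).to_string()}"
        )
        code = propose_feature_code(context)
        candidate = df.copy()
        try:
            exec(code, {"pd": pd}, {"df": candidate})  # apply the LLM-generated transformation
        except Exception:
            continue  # discard transformations that fail to execute
        score = evaluate(candidate, target)
        if score > best_score:  # retain only transformations with a positive validation lift
            df, best_score = candidate, score
    return df
```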
3. Structured Reasoning and Diversity: Beyond Linear Transformations
A critical advancement is the explicit harnessing of multiple reasoning paradigms (deductive, inductive, abductive, analogical, causal, counterfactual):
- REFeat (Han et al., 25 Jun 2025) demonstrates that LLMs biased toward simple or repetitive transformations can be steered using bespoke meta-prompts for each reasoning mode. An adaptive multi-armed bandit controller selects a reasoning type at each iteration, updating each mode's estimated value and concentrating on the modes that yield the largest validation gains (a minimal bandit sketch appears after the table below).
- FeRG-LLM (Ko et al., 30 Mar 2025) and LFG (Zhang et al., 4 Jun 2024) showcase chain-of-thought (CoT) and tree-of-thought (ToT) prompting, causing the LLM to articulate transparent rationales for new features before synthesizing executable code, thus increasing interpretability and the complexity/diversity of engineered features.
- Empirical studies reveal that such reasoning-driven pipelines outperform both conventional AutoFE and generic LLM prompt baselines with respect to predictive accuracy, feature complexity, and semantic diversity.
| Framework | Reasoning Paradigm | Adaptive Selector | Empirical Gain (mean) |
|---|---|---|---|
| REFeat | Deductive, inductive, etc. | Bandit | 5-5.6% acc. lift |
| FeRG-LLM | Chain-of-Thought | DPO feedback | Runner-up/top on 14 datasets |
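As noted above, the adaptive selector in such pipelines can be as simple as a UCB-style multi-armed bandit over reasoning modes. The class below is an illustrative stand-in for that idea, not REFeat's actual controller; the mode names and the reward definition (observed validation gain) are assumptions.

```python
# Illustrative UCB-style bandit over reasoning modes (a stand-in, not REFeat's controller).
import math

MODES = ["deductive", "inductive", "abductive", "analogical", "causal", "counterfactual"]


class ReasoningModeBandit:
    def __init__(self, modes=MODES, c: float = 1.0):
        self.modes = list(modes)
        self.c = c                                    # exploration strength
        self.counts = {m: 0 for m in self.modes}      # times each mode was tried
        self.values = {m: 0.0 for m in self.modes}    # running mean of validation gains
        self.t = 0

    def select(self) -> str:
        self.t += 1
        # Try every mode once before applying the UCB rule.
        for m in self.modes:
            if self.counts[m] == 0:
                return m
        return max(
            self.modes,
            key=lambda m: self.values[m] + self.c * math.sqrt(math.log(self.t) / self.counts[m]),
        )

    def update(self, mode: str, reward: float) -> None:
        # Reward = validation gain obtained by features generated under this mode's meta-prompt.
        self.counts[mode] += 1
        self.values[mode] += (reward - self.values[mode]) / self.counts[mode]
```

Each iteration, `select()` picks the reasoning mode whose meta-prompt is sent to the LLM, and the resulting validation gain is fed back through `update()`.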
4. Feature Selection: Semantic Scoring and Ensemble Methods
LLMs have exhibited the ability to perform feature selection, ranking, and importance scoring without access to training labels or feature values:
- LLM-Select (Jeong et al., 2 Jul 2024) formalizes methods including LLM-Score (direct importance scores), LLM-Rank (an ordered ranking), and LLM-Seq (sequential selection), relying on the LLM's pretrained knowledge and natural language understanding (a prompt-level sketch follows this list).
- Performance comparisons across small and large datasets show that GPT-4’s zero-shot feature scoring is competitive with classical data-driven techniques (LASSO, MRMR), even outperforming them in high-dimensional, expensive-to-collect domains (e.g., MIMIC-IV).
- The hybrid LLM4FS strategy (Li et al., 31 Mar 2025) combines LLM context-driven reasoning with traditional feature selection such as random forest or sequential selection, feeding example data into the LLM and leveraging its semantic understanding for improved AUROC and efficiency.
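A label-free scoring step of this kind can be prompted with nothing but the task description and feature names, as in the sketch below; the `query_llm` helper and the prompt wording are hypothetical, and LLM-Select's exact prompts and output parsing differ.

```python
# Sketch of zero-shot, label-free feature scoring in the spirit of LLM-Score (prompt wording is invented).
import json


def query_llm(prompt: str) -> str:
    """Hypothetical LLM call returning raw text."""
    raise NotImplementedError("Replace with a call to your LLM provider.")


def llm_score(task_description: str, feature_names: list[str]) -> dict[str, float]:
    prompt = (
        f"Prediction task: {task_description}\n"
        f"Candidate features: {feature_names}\n"
        "For each feature, assign an importance score between 0 and 1 for this task. "
        "Respond with a JSON object mapping feature names to scores, and nothing else."
    )
    raw = query_llm(prompt)
    scores = json.loads(raw)  # real systems parse and validate this far more defensively
    return {name: float(scores.get(name, 0.0)) for name in feature_names}


def select_top_k(scores: dict[str, float], k: int) -> list[str]:
    # Keep the k features with the highest LLM-assigned scores.
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Because only feature names and the task description are sent, no training labels or data values ever leave the caller.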
5. Integrations: Embedding-Based Enrichment and Automated Agents
LLM-powered engineering extends beyond classic tabular forms:
- Embedding-based enrichment (Kasneci et al., 3 Nov 2024) transforms tabular instances into textual formats and generates contextual embeddings with models such as RoBERTa or GPT-2. Principal Component Analysis (PCA) and feature-importance selection distill these embeddings to augment traditional features (a minimal pipeline sketch follows this list). Ablation studies show that ensemble models (XGBoost, CatBoost) see notable accuracy and F1-score improvements, especially on imbalanced datasets.
- Agent architectures like Agent0 (Škrlj et al., 25 Jul 2025) deploy interconnected LLM agents—sentinels for extraction, architects for prompt refinement, oracles for evaluation—within closed dynamic feedback loops that autonomously discover, extract, and optimize multi-value features from text for recommender systems, using metrics like Relative Information Gain (RIG).
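A minimal version of the embedding-enrichment pipeline, assuming a Hugging Face `roberta-base` checkpoint and a simple "column is value" serialization (both illustrative choices rather than the cited paper's exact setup), could look like:

```python
# Sketch of embedding-based enrichment: serialize rows to text, embed, reduce with PCA, concatenate.
import numpy as np
import pandas as pd
import torch
from sklearn.decomposition import PCA
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")


def row_to_text(row: pd.Series) -> str:
    # Simple serialization; real systems use richer, task-aware templates.
    return "; ".join(f"{col} is {val}" for col, val in row.items())


@torch.no_grad()
def embed_rows(df: pd.DataFrame, batch_size: int = 32) -> np.ndarray:
    texts = [row_to_text(row) for _, row in df.iterrows()]
    chunks = []
    for i in range(0, len(texts), batch_size):
        enc = tokenizer(texts[i:i + batch_size], padding=True, truncation=True, return_tensors="pt")
        out = encoder(**enc).last_hidden_state[:, 0, :]  # embedding at the first (CLS-like) position
        chunks.append(out.cpu().numpy())
    return np.vstack(chunks)


def enrich(df: pd.DataFrame, n_components: int = 16) -> pd.DataFrame:
    emb = embed_rows(df)
    reduced = PCA(n_components=n_components).fit_transform(emb)  # distill the contextual embeddings
    emb_df = pd.DataFrame(reduced, index=df.index,
                          columns=[f"emb_{i}" for i in range(n_components)])
    return pd.concat([df, emb_df], axis=1)  # augment the original tabular features
```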
6. Cross-Domain Applications, Limitations, and Fairness Controls
These pipelines have demonstrated broad utility, with documented applications in healthcare (clinical notes, diagnostic prediction), finance (credit scoring, risk analysis), venture capital (startup success prediction), and ML monitoring (interpretable reporting via cognitive architectures (Bravo-Rocca et al., 11 Jun 2025)):
- Venture capital studies (Ozince et al., 5 Jul 2024, Kumar et al., 9 Sep 2025) use LLMs to transform unstructured founder information into semantic features (categorical, continuous, textual embeddings), leveraging ensemble learning (XGBoost + Random Forest + Linear Regression) to achieve precision far above random baselines while supporting interpretable sensitivity analyses.
- Fairness-aware systems such as FairAgent (Dai et al., 5 Oct 2025) automate data preprocessing, feature transformation, and bias mitigation. The LLM semantically analyzes features, identifies sensitive attributes, proposes fairness-preserving transformations, and optimizes a joint objective of the form $\min_\theta \mathcal{L}_{\mathrm{task}}(\theta) + \lambda\,\mathcal{L}_{\mathrm{fair}}(\theta)$, where $\mathcal{L}_{\mathrm{fair}}$ may represent demographic parity, $|P(\hat{y}=1 \mid a=0) - P(\hat{y}=1 \mid a=1)|$; this automation reduces development time and technical barriers (a minimal parity-gap sketch follows this list).
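The demographic parity term referenced above can be computed directly from predictions and a sensitive attribute; the sketch below is a generic implementation of that gap and a simple penalized objective, not FairAgent's internal code.

```python
# Generic demographic parity gap |P(y_hat = 1 | a = 0) - P(y_hat = 1 | a = 1)| and a penalized objective.
import numpy as np


def demographic_parity_gap(y_pred: np.ndarray, sensitive: np.ndarray) -> float:
    y_pred = np.asarray(y_pred)
    sensitive = np.asarray(sensitive)
    rate_0 = y_pred[sensitive == 0].mean()  # positive prediction rate for group a = 0
    rate_1 = y_pred[sensitive == 1].mean()  # positive prediction rate for group a = 1
    return float(abs(rate_0 - rate_1))


def penalized_objective(task_loss: float, y_pred, sensitive, lam: float = 1.0) -> float:
    # Joint objective of the form L_task + lambda * L_fair, as sketched above.
    return task_loss + lam * demographic_parity_gap(y_pred, sensitive)
```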
Limitations documented include:
- Hallucination of non-relevant features by the LLM (mitigated by retrieval-augmented generation (Chandra, 15 Mar 2025), adaptive controller strategies (Han et al., 25 Jun 2025), downstream feedback).
- Computational scalability concerns in tree-based or iterative approaches (Zhang et al., 4 Jun 2024, Abhyankar et al., 18 Mar 2025).
- Data privacy concerns when feeding real samples to cloud-hosted LLMs (Li et al., 31 Mar 2025); approaches using only feature names and objectives avoid these issues (Batista, 27 Mar 2025).
7. Future Directions and Open Research Problems
Research is progressing in several directions:
- Multi-modal feature generation integrating text, tabular, image, and structured data (e.g., transformer-GAN hybrids) (Chandra, 15 Mar 2025).
- Design and tuning of adaptive selectors, e.g., smarter multi-armed bandits or reinforcement learning for reasoning mode exploration (Han et al., 25 Jun 2025).
- Self-improving systems with integrated feedback loops (e.g., RStar-Math, OCTree) for iterative enhancement of feature rules (Chandra, 15 Mar 2025).
- Hybrid and federated methods to address data privacy by decentralized learning or anonymized context descriptions (Li et al., 31 Mar 2025).
- Incorporation of fairness metrics, explainability annotations, and human-in-the-loop controls for robust, trustworthy deployments (Dai et al., 5 Oct 2025).
Summary Table: Key Methodological Components in LLM-Powered Feature Engineering
| Component | Description | Representative Frameworks |
|---|---|---|
| Context-aware prompting | Uses dataset + metadata for tailored transformations | CAAFE, FeRG-LLM, LLM-FE |
| Structured reasoning | Explicit deductive, analogical, causal, CoT/ToT reasoning | REFeat, LFG |
| Iterative feedback | Validates, selects, and refines features per step | LLM-FE, ELLM-FT, Agent0 |
| Embedding enrichment | Augments tabular data with LLM embedding features | (Kasneci et al., 3 Nov 2024) |
| Fairness automation | LLM-based bias detection and mitigation | FairAgent (Dai et al., 5 Oct 2025) |
LLM-powered feature engineering systematically advances automated and context-sensitive generation, transformation, and selection of features in machine learning pipelines. Empirical results indicate gains in predictive performance, interpretability, and operational efficiency across numerous domains. Ongoing research highlights the importance of adaptive reasoning, fairness controls, and integrating LLM knowledge with robust, data-driven feedback for the next generation of automated data science.