FAMOSE: LLM-Driven Feature Engineering
- FAMOSE is an LLM-driven framework for automated iterative feature discovery in tabular supervised learning, leveraging the ReAct paradigm.
- It employs a closed-loop agent architecture that integrates Python execution, feature evaluation, and autonomous self-correction to refine feature proposals.
- Empirical results across 20 classification and 7 regression tasks demonstrate FAMOSE’s robust performance, transferability, and scalability.
FAMOSE (Feature AugMentation and Optimal Selection agEnt) is an LLM-driven framework for automated, iterative feature discovery in tabular supervised learning. Building on the ReAct (Reasoning + Acting) paradigm, FAMOSE constructs, evaluates, and selects new feature transformations, addressing a longstanding bottleneck in machine learning: efficient and inventive feature engineering over combinatorially large operator spaces. It is the first application of a ReAct agentic framework to feature engineering for both regression and classification, demonstrating robust performance and transferability across diverse real-world benchmarks (Burghardt et al., 19 Feb 2026).
1. Formal Problem Definition
Given a supervised dataset $D = (X, y)$, where $X$ are the raw features and $y$ indicates class labels or regression targets, the objective is to find a set of novel feature-constructing functions
$$f_1, \dots, f_k \in \mathcal{F}$$
so that after augmenting the design matrix with these features, a downstream predictive model $M$ achieves maximal performance. The target performance metric is ROC-AUC for classification and negated root mean squared error (RMSE) for regression (so that higher is always better). The resulting optimization can be expressed as
$$\max_{f_1, \dots, f_k \in \mathcal{F}} \; \mathrm{Perf}\big(M;\ [X, f_1(X), \dots, f_k(X)],\ y\big)$$
where $\mathcal{F}$ denotes the set of all valid feature operators and $\mathrm{Perf}$ is the metric above. In practice, FAMOSE adopts a greedy strategy, iteratively seeking single-feature additions whose empirical gain meets the default improvement goal.
2. ReAct-Based Agent Architecture
FAMOSE integrates an LLM agent into a closed-loop ReAct system, where reasoning and acting are interleaved. The main architectural components are:
- Python execution sandbox: Enables compilation and validation of code for proposed feature functions.
- Feature-evaluation API: Computes the incremental performance gain on held-out data for each candidate feature.
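The validation step of the execution component can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the function `run_feature_code`, the convention that a proposal defines a function called `feature`, and the representation of the table as a dict of NumPy arrays. Plain `exec` is not a real sandbox; a production system would need process-level isolation.

```python
import numpy as np

def run_feature_code(code, columns, func_name="feature"):
    """Run proposed feature code and validate its output.

    Returns (values, None) on success or (None, error_message) on failure,
    so the message can be handed back to the agent for self-correction.
    `columns` is a dict mapping column names to 1-D NumPy arrays (assumption).
    """
    n_rows = len(next(iter(columns.values())))
    namespace = {"np": np}  # the snippet sees NumPy plus Python builtins
    try:
        exec(compile(code, "<proposal>", "exec"), namespace)
        with np.errstate(all="ignore"):  # report bad values below, not warnings
            values = np.asarray(namespace[func_name](columns), dtype=float)
    except Exception as exc:  # syntax errors, missing function, runtime errors
        return None, f"{type(exc).__name__}: {exc}"
    if values.shape != (n_rows,):
        return None, f"expected shape ({n_rows},), got {values.shape}"
    if not np.all(np.isfinite(values)):
        return None, "feature contains NaN or infinite values"
    return values, None

cols = {"a": np.array([1.0, 2.0, 4.0]), "b": np.array([2.0, 0.0, 8.0])}
good = "def feature(c):\n    return c['a'] * c['b']"
bad = "def feature(c):\n    return c['a'] / c['b']"  # divides by zero in row 2

values, err = run_feature_code(good, cols)
_, err_bad = run_feature_code(bad, cols)
```

Returning the error as text rather than raising keeps the failure inside the agent's context, which is what makes a retry informed rather than blind.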
Each discovery round (up to a maximum number of rounds, or until stagnation) proceeds as follows:
- The LLM, maintaining dynamic in-context memory, proposes a candidate feature as Python code (the "Thought" step).
- The code is validated and executed. On failure, the agent self-corrects or retries (up to 10 trials).
- The feature is scored by its incremental gain using the base learner.
- Proposals failing the improvement threshold trigger further search.
- A post-hoc checker reruns all proposals in the round, selecting the one with maximal gain.
- Selected features augment the current set, and discovery continues.
The complete pipeline is finalized by applying minimum-Redundancy Maximum-Relevance (mRMR) selection to the union of original and generated features, compacting the final feature set.
Logical Flow Summary (Pseudocode)
(Pseudocode from Burghardt et al., 19 Feb 2026; see the original for full algorithm notation.)
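The round structure described above can be sketched in Python as follows. This is not the authors' pseudocode. It is a minimal illustration under stated assumptions: the LLM is replaced by a hypothetical `propose` callable that returns a name and a function, the base learner is replaced by a least-squares fit scored with negated RMSE on a held-out split, the improvement goal is read as a relative threshold, and a round only accepts its best proposal if that proposal meets the goal.

```python
import itertools
import numpy as np

def holdout_score(X, y, n_train):
    """Negated RMSE of a least-squares fit on held-out rows (higher is better)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A[:n_train], y[:n_train], rcond=None)
    resid = y[n_train:] - A[n_train:] @ coef
    return -float(np.sqrt(np.mean(resid ** 2)))

def discover(X, y, propose, n_rounds=3, max_trials=10, goal=0.01, n_train=None):
    """Greedy discovery: each round adds at most one feature."""
    n_train = n_train or int(0.7 * len(X))
    selected, history = [], []
    for _ in range(n_rounds):
        base = holdout_score(X, y, n_train)
        threshold = goal * abs(base)  # assumption: goal is relative to the base score
        candidates = []
        for _ in range(max_trials):
            name, fn = propose(history)  # stand-in for the LLM "Thought"
            try:
                col = np.asarray(fn(X), dtype=float)  # "Action": run the code
                if col.shape != (len(X),) or not np.all(np.isfinite(col)):
                    raise ValueError("invalid feature output")
            except Exception as exc:
                history.append((name, f"error: {exc}"))  # feedback for self-correction
                continue
            gain = holdout_score(np.column_stack([X, col]), y, n_train) - base
            history.append((name, gain))  # "Observation"
            candidates.append((gain, name, col))
            if gain >= threshold:
                break  # goal met, stop searching this round
        # Post-hoc pick of the round's best proposal (stored gains are reused
        # here; the source describes rerunning the proposals).
        best = max(candidates, key=lambda c: c[0], default=None)
        if best is None or best[0] < threshold:
            break  # stagnation
        X = np.column_stack([X, best[2]])
        selected.append(best[1])
    return X, selected, history

rng = np.random.default_rng(0)
X0 = rng.normal(size=(200, 2))
y = X0[:, 0] * X0[:, 1] + 0.01 * rng.normal(size=200)
pool = itertools.cycle([
    ("bad_index", lambda X: X[:, 99]),  # fails: column does not exist
    ("x0_times_x1", lambda X: X[:, 0] * X[:, 1]),
])
X_aug, selected, history = discover(X0, y, propose=lambda history: next(pool))
```

In the toy run, the failing proposal is logged as an error, the product feature is accepted in the first round, and the second round stops because re-proposing the same feature yields no further gain.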
3. Core Algorithms and Mathematical Underpinnings
- Feature Operators: Each operator may compose elementary functions including arithmetic, powers, logarithms, exponentials, aggregations, and date/time transformations.
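- mRMR Subset Selection: Selects a subset $S$ of size $k$ maximizing
$$\frac{1}{|S|}\sum_{f \in S} I(f; y) \;-\; \frac{1}{|S|^2}\sum_{f, g \in S} I(f; g)$$
where $I(\cdot\,;\cdot)$ denotes mutual information (approximated by correlation or F-statistic). The first term rewards relevance to the target and the second penalizes redundancy among the selected features.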
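The relevance/redundancy trade-off can be made concrete with a short sketch. It uses the common greedy approximation to mRMR (add one feature at a time) and absolute Pearson correlation as the stand-in for mutual information. The function name `mrmr_select` and the toy data are assumptions for illustration, and the source does not specify which mRMR variant FAMOSE uses.

```python
import numpy as np

def mrmr_select(X, y, k):
    """Greedy mRMR with |Pearson correlation| as a proxy for mutual information.

    At each step, add the feature with the highest relevance to y minus its
    mean redundancy with the features already selected.
    Assumes no constant columns (their correlation would be undefined).
    """
    n_features = X.shape[1]
    corr = np.abs(np.corrcoef(np.column_stack([X, y]), rowvar=False))
    relevance = corr[:n_features, -1]            # |corr(feature, target)|
    redundancy = corr[:n_features, :n_features]  # |corr(feature, feature)|
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(k, n_features):
        remaining = [j for j in range(n_features) if j not in selected]
        scores = [relevance[j] - redundancy[j, selected].mean() for j in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected

rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = x0 + 0.3 * rng.normal(size=500)  # noisy near-copy of x0 (redundant)
x2 = rng.normal(size=500)             # independent, moderately relevant
y = x0 + 0.5 * x2
chosen = mrmr_select(np.column_stack([x0, x1, x2]), y, k=2)
```

Although `x1` is more correlated with the target than `x2`, its redundancy with `x0` pushes it below `x2` in the second step, which is the behavior the criterion is designed to produce.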
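- Evaluation Metrics:
  - For multi-class classification: ROC-AUC.
  - For regression: negated RMSE, $-\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, so that higher is better.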
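Both metrics are simple to compute directly, as the sketch below shows. The helper names are hypothetical. The multi-class extension shown here is a macro-averaged one-vs-rest AUC, which is one common choice and an assumption, since the averaging scheme is not stated above. Class labels are assumed to be integers `0..C-1`, each present in the data.

```python
import numpy as np

def neg_rmse(y_true, y_pred):
    """Negated root mean squared error, so that higher is better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return -float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def binary_auc(y_true, scores):
    """ROC-AUC as the probability that a random positive outranks a random
    negative, with ties counted as one half (O(n_pos * n_neg) memory)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def macro_ovr_auc(y_true, proba):
    """Macro-averaged one-vs-rest AUC over the columns of `proba` (assumption)."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    per_class = [binary_auc(y_true == c, proba[:, c]) for c in range(proba.shape[1])]
    return float(np.mean(per_class))

auc = binary_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ordered correctly
score = neg_rmse([0.0, 0.0], [1.0, -1.0])
```

Negating RMSE lets the same "higher is better" comparison drive the search loop for both task types.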
4. Experimental Evaluation and Quantitative Findings
FAMOSE was evaluated across 20 classification tasks (sample counts from 452 to 1,000,000; feature counts from 5 to 280) and 7 regression tasks (sample counts from 517 to 20,640; feature counts from 7 to 13). Representative datasets included adult, covtype, bank_marketing, bike, and housing.
- Protocol: 5-fold (stratified for classification) cross-validation; XGBoost base model (default hyperparameters, seed=42).
- Baselines: AutoFeat, OpenFE (classical); FeatLLM, CAAFE (LLM-based).
- Robustness: XGBoost-trained features were transferred to Random Forest and AutoGluon; both Sonnet 3.5 V2 and Deepseek-R1 backends were tested.
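The evaluation protocol can be approximated with scikit-learn as in the sketch below. Several choices are assumptions: `GradientBoostingClassifier` stands in for XGBoost to keep the example dependency-light, the folds are shuffled (the source only says stratified 5-fold), the data is synthetic, and the helper name `cv_auc` and the product feature are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def cv_auc(X, y, seed=42):
    """Per-fold ROC-AUC under stratified 5-fold cross-validation."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    # Stand-in for the XGBoost base model with a fixed seed.
    model = GradientBoostingClassifier(n_estimators=50, random_state=seed)
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=42)
baseline = cv_auc(X, y)

# Incremental gain of one hypothetical candidate feature (a product term).
X_aug = np.column_stack([X, X[:, 0] * X[:, 1]])
gain = cv_auc(X_aug, y).mean() - baseline.mean()
```

Fixing the seed for both the splitter and the model means the baseline and augmented runs see identical folds, so the difference in mean AUC reflects the added feature rather than resampling noise.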
Key Quantitative Results:
| Task Type | Key Result | Comparison |
|---|---|---|
| Classification | Completes 100% of tasks (baselines fail on large datasets) | Baselines incomplete |
| Classification | On a subset of tasks (condition as defined in the original): +0.23% ROC-AUC gain (p<0.05) | vs. baseline |
| Classification | Overall mean ROC-AUC change = +0.32% | Matches/exceeds CAAFE, FeatLLM |
| Regression | 100% task success (AutoFeat 71%, OpenFE 86%) | Baseline gaps |
| Regression | Mean RMSE change = -2.0% (p=0.07 vs. best baseline) | SOTA |
| Regression | Forest-fires RMSE: 92.7 → 79.5 (-14.3%) | Example improvement |
Additional findings:
- Features trained for XGBoost transferred to Random Forest (+1.2% AUC) and AutoGluon (+0.02%).
- Deepseek-R1 backend gave nearly identical performance (±0.1%).
5. Distinctive Advantages and Empirical Analysis
- Agentic ReAct Loop: The iterative, tool-augmented ReAct loop is central. The agent’s context window compounds trial history, functioning as a dynamic few-shot prompt; this yields increasingly refined, data-grounded feature proposals.
- Compilable Code Generation: All feature operators are specified explicitly as Python code, ensuring interpretability and supporting automatic error correction.
- Autonomous Self-Correction: Runtime code validation and feature evaluation (reward signal) limit hallucinations and maintain alignment with empirical data properties.
- Post-Hoc Selection: Post-processing with mRMR prevents spurious or redundant feature induction and compacts the final discovered set.
Ablation studies confirm:
- Removing the 1% improvement “goal” discourages exploration and degrades performance.
- Dropping mRMR selection harms both accuracy and compactness.
6. Limitations and Outlook
- Computational Overhead: The method is token-intensive and retrains models repeatedly within the search loop.
- Model Size Dependence: High-capacity LLMs are essential; smaller (<8B parameter) open-source models are inadequate.
- Residual Hallucination Risk: Some hallucinated feature proposals persist but are mitigated by post-processing checks.
- Application Scope: Extension to multi-label, temporal, or multimodal domains may necessitate tailored modification or retrieval-based augmentation.
A plausible implication is that the explicit, context-informed feedback loop is critical for inventive, data-centric agentic discovery, outperforming static or template-driven approaches, especially on large, complex datasets. FAMOSE’s results support the hypothesis that agentic LLMs “learn” to propose increasingly effective operator schemas during iterative interaction with real data (Burghardt et al., 19 Feb 2026).
7. Relation to Broader Research and State of the Art
FAMOSE unifies and extends prior disparate approaches to automated feature engineering. It advances beyond classical systems (AutoFeat, OpenFE) by scaling to large datasets and beyond prompt-only LLM competitors (CAAFE, FeatLLM) via its closed ReAct loop and in-context “memory.” Empirical evidence places it at or above state of the art for both regression and classification. Its interpretability and transferability, demonstrated by cross-model and cross-backend performance robustness, underscore the distinctiveness of the ReAct agent framework for iterative, data-driven feature discovery (Burghardt et al., 19 Feb 2026).