
FAMOSE: LLM-Driven Feature Engineering

Updated 2 July 2026
  • FAMOSE is an LLM-driven framework for automated iterative feature discovery in tabular supervised learning, leveraging the ReAct paradigm.
  • It employs a closed-loop agent architecture that integrates Python execution, feature evaluation, and autonomous self-correction to refine feature proposals.
  • Empirical results across 20 classification and 7 regression tasks demonstrate FAMOSE’s robust performance, transferability, and scalability.

FAMOSE (Feature AugMentation and Optimal Selection agEnt) is an LLM-driven framework for automated, iterative feature discovery in tabular supervised learning. Building upon the ReAct (Reasoning + Acting) paradigm, FAMOSE constructs, evaluates, and selects new feature transformations, addressing a longstanding bottleneck in machine learning: efficient and inventive feature engineering over combinatorially large operator spaces. It is the first application of a ReAct agentic framework to feature engineering for both regression and classification, demonstrating robust performance and transferability across diverse real-world benchmarks (Burghardt et al., 19 Feb 2026).

1. Formal Problem Definition

Given a supervised dataset $D = \{(x_i, y_i)\}_{i=1}^N$, where $x_i \in \mathbb{R}^p$ are raw features and $y_i \in \mathcal{Y}$ are class labels or regression targets, the objective is to find a set of novel feature-constructing functions

$$f_j = g_j(x) : \mathbb{R}^p \rightarrow \mathbb{R}$$

so that, after augmenting the design matrix $X \in \mathbb{R}^{N \times p}$ with these features, a downstream predictive model $M_\theta(X, F)$ achieves maximal performance. The target performance metric $E_{\text{val}}(M; D_{\text{val}})$ is ROC-AUC for classification and negated root mean squared error (RMSE) for regression (so that higher is always better). The resulting optimization can be expressed as

$$F^* = \underset{F \subseteq \mathcal{G}}{\arg\max} \; E_{\text{val}}(M_\theta([X, F]); D_{\text{val}})$$

where $\mathcal{G}$ denotes the set of all valid feature operators. In practice, FAMOSE adopts a greedy strategy, iteratively seeking single-feature additions with empirical gain $\Delta_t \geq \delta$ (default $\delta = 1\%$ improvement goal).
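The greedy rule can be illustrated with a short Python sketch. It is not the authors' implementation: `greedy_accept`, `score_with`, and the toy gains are hypothetical names and numbers, and $\delta$ is treated here as an absolute difference in the metric, which this summary does not specify.

```python
from typing import Callable, Iterable, List, Tuple


def greedy_accept(
    candidates: Iterable[str],
    score_with: Callable[[List[str]], float],
    delta: float = 0.01,
) -> Tuple[List[str], float]:
    """Add candidate features one at a time, keeping those with gain >= delta.

    `score_with(features)` is a hypothetical callback that returns a
    higher-is-better validation score (e.g. ROC-AUC or negated RMSE) for the
    raw features plus the given generated features.
    """
    selected: List[str] = []
    best = score_with(selected)  # baseline: raw features only
    for name in candidates:
        new_score = score_with(selected + [name])
        gain = new_score - best  # plays the role of Delta_t
        if gain >= delta:
            selected.append(name)
            best = new_score
    return selected, best


# Toy scorer with made-up, additive gains (illustration only).
TOY_GAINS = {"ratio": 0.03, "noise": 0.001, "log_income": 0.015}


def toy_score(features: List[str]) -> float:
    return 0.80 + sum(TOY_GAINS[f] for f in features)


chosen, final_score = greedy_accept(["ratio", "noise", "log_income"], toy_score)
print(chosen, round(final_score, 3))
```

In this toy run the `noise` candidate is rejected because its gain falls below the threshold, while the other two are kept.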

2. ReAct-Based Agent Architecture

FAMOSE integrates an LLM agent into a closed-loop ReAct system, where reasoning and acting are interleaved. The main architectural components are:

  • Python execution sandbox: Enables compilation and validation of code for proposed feature functions.
  • Feature-evaluation API: Computes the incremental gain $\Delta_t$ on held-out data for each candidate feature.

Each discovery round $t$ (up to a fixed round limit or until stagnation) proceeds as follows:

  1. The LLM, maintaining dynamic in-context memory, proposes a feature $f_t$ as Python code (the "Thought" step of round $t$).
  2. The code is validated and executed. On failure, the agent self-corrects or retries (up to 10 trials).
  3. The feature is scored ($\Delta_t$) using the base learner.
  4. Proposals failing the improvement threshold ($\Delta_t < \delta$) trigger further search.
  5. A post-hoc checker reruns all proposals in the round, selecting the one with maximal $\Delta_t$.
  6. Selected features augment the current set, and discovery continues.

The complete pipeline is finalized by applying Minimum-Redundancy Maximum-Relevance (mRMR) selection to the union of original and generated features, compacting the final set $F^*$.
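A minimal sketch of greedy mRMR selection follows, assuming absolute Pearson correlation as a stand-in for mutual information (the source notes that correlation or an F-statistic is used as an approximation). The function names and the toy data are illustrative and are not taken from the paper.

```python
import math
from typing import Dict, List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 for constant inputs."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def mrmr_select(
    features: Dict[str, Sequence[float]], target: Sequence[float], k: int
) -> List[str]:
    """Greedily pick k features maximizing relevance minus mean redundancy."""
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected: List[str] = []
    remaining = list(features)

    def score(name: str) -> float:
        if not selected:
            return relevance[name]
        redundancy = sum(
            abs(pearson(features[name], features[s])) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: `a_noisy` is almost a copy of `a`; `b` is uncorrelated with `a`.
FEATURES = {
    "a": [1.0, 1.0, -1.0, -1.0],
    "a_noisy": [1.0, 0.9, -1.0, -0.9],
    "b": [1.0, -1.0, 1.0, -1.0],
}
TARGET = [3.0, 1.0, -1.0, -3.0]  # equals 2*a + b

picked = mrmr_select(FEATURES, TARGET, k=2)
print(picked)
```

Although `a_noisy` is more relevant to the target than `b`, it is nearly redundant with `a`, so the second pick is `b`.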

Logical Flow Summary (Pseudocode)

The full pseudocode listing appears in Burghardt et al. (19 Feb 2026); see the original for the complete algorithm notation.
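As a rough illustration of one discovery round from Section 2, the Python sketch below replaces the LLM and the evaluation API with scripted stubs. Every name here (`discovery_round`, `propose`, `run_and_score`, the scripted proposals and gains) is hypothetical; the cap of 10 is applied to total attempts per round, which is one possible reading of "up to 10 trials"; and the policy when no proposal reaches $\delta$ is left to the caller because this summary does not state it.

```python
from typing import Callable, List, Optional, Tuple

Proposal = Tuple[str, float]  # (feature code, gain Delta_t)


def discovery_round(
    propose: Callable[[List[str]], str],
    run_and_score: Callable[[str], float],
    delta: float = 0.01,
    max_trials: int = 10,
) -> Optional[Proposal]:
    """Run one propose/execute/score round and return the best proposal."""
    history: List[str] = []  # stands in for the agent's in-context memory
    scored: List[str] = []
    for _ in range(max_trials):
        code = propose(history)  # step 1: the agent proposes feature code
        try:
            gain = run_and_score(code)  # steps 2-3: execute, then score
        except Exception as err:  # failure is fed back for self-correction
            history.append(f"{code} -> error: {err}")
            continue
        history.append(f"{code} -> gain {gain:+.4f}")
        scored.append(code)
        if gain >= delta:  # step 4: stop searching once the goal is met
            break
    if not scored:
        return None
    # Step 5: the post-hoc checker reruns every working proposal, keeps the best.
    rechecked = [(code, run_and_score(code)) for code in scored]
    return max(rechecked, key=lambda item: item[1])


# Scripted stand-ins for the LLM and the feature-evaluation API.
SCRIPT = ["df['a'] / df['zero']", "df['a'] + df['b']", "df['a'] * df['b']"]
GAINS = {"df['a'] + df['b']": 0.004, "df['a'] * df['b']": 0.02}


def scripted_propose(history: List[str]) -> str:
    return SCRIPT[len(history) % len(SCRIPT)]


def scripted_score(code: str) -> float:
    if code not in GAINS:
        raise ZeroDivisionError("division by zero")
    return GAINS[code]


result = discovery_round(scripted_propose, scripted_score)
print(result)
```

The first scripted proposal fails at execution, the second falls short of the goal, and the third meets it, so the round ends after three attempts.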

3. Core Algorithms and Mathematical Underpinnings

  • Feature Operators: Operators in $\mathcal{G}$ may compose elementary functions including arithmetic, powers, logarithms, exponentials, aggregations, and date/time transformations.
  • mRMR Subset Selection: Selects a subset $S$ of size $k$ maximizing the mRMR criterion, which in its standard difference form is

$$\frac{1}{|S|} \sum_{f \in S} I(f; y) \;-\; \frac{1}{|S|^2} \sum_{f, f' \in S} I(f; f')$$

where $I(\cdot\,;\cdot)$ denotes mutual information (approximated by correlation or F-statistic). This balances feature relevance and redundancy.

  • Evaluation Metrics:

    • For multi-class classification: ROC-AUC.
    • For regression: $\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$, where $\hat{y}_i$ is the model prediction; the score is negated so that higher is better.
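The two metric families can be sketched in plain Python as follows. This covers only the binary ROC-AUC case, computed by pairwise comparison of positive and negative scores; how the paper averages ROC-AUC across classes in the multi-class setting is not specified in this summary. The function names are illustrative.

```python
import math
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Binary ROC-AUC: the fraction of (positive, negative) pairs ranked
    correctly, with ties counted as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def neg_rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Negated RMSE, so that higher is better (as for ROC-AUC)."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return -math.sqrt(mse)


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
score = neg_rmse([2.0, 4.0], [1.0, 5.0])
print(auc, score)
```

Negating RMSE lets a single "gain" comparison serve both task types, since an improvement is always an increase in the score.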

4. Experimental Evaluation and Quantitative Findings

FAMOSE was evaluated across 20 classification tasks ($N$ from 452 to 1,000,000; $p$ from 5 to 280) and 7 regression tasks ($N$ from 517 to 20,640; $p$ from 7 to 13). Representative datasets included adult, covtype, bank_marketing, bike, and housing.

  • Protocol: 5-fold (stratified for classification) cross-validation; XGBoost base model (default hyperparameters, seed=42).
  • Baselines: AutoFeat, OpenFE (classical); FeatLLM, CAAFE (LLM-based).
  • Robustness: XGBoost-trained features were transferred to Random Forest and AutoGluon; both Sonnet 3.5 V2 and Deepseek-R1 backends were tested.
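To make the stratified splitting in the protocol concrete, here is a minimal pure-Python sketch of assigning row indices to 5 folds so that each fold keeps roughly the same class proportions. It is an illustration only: `stratified_folds` is a hypothetical helper, and reusing 42 as the shuffle seed is an assumption (the source states seed=42 for the XGBoost model). In practice, a library routine such as scikit-learn's `StratifiedKFold` would normally be used.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_folds(
    labels: Sequence[int], k: int = 5, seed: int = 42
) -> List[List[int]]:
    """Deal the indices of each class round-robin into k folds."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)  # randomize order within each class
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


LABELS = [0] * 10 + [1] * 5  # toy labels with a 2:1 class ratio
folds = stratified_folds(LABELS)
for held_out in folds:
    train = [i for f in folds if f is not held_out for i in f]
    print(len(train), len(held_out), sum(LABELS[i] for i in held_out))
```

Each fold serves once as the held-out set while the remaining four are used for training.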

Key Quantitative Results:

| Task Type | Key Result | Comparison |
|---|---|---|
| Classification | Completes 100% of tasks (baselines fail on large datasets) | Baselines incomplete |
| Classification | On a subset of tasks: +0.23% ROC-AUC gain (p<0.05) | vs. baseline |
| Classification | Overall mean $\Delta$ ROC-AUC = +0.32% | Matches/exceeds CAAFE, FeatLLM |
| Regression | 100% task success (AutoFeat 71%, OpenFE 86%) | Baseline gaps |
| Regression | Mean RMSE reduction = –2.0% (p=0.07 vs. best baseline) | SOTA |
| Regression | Forest-fires RMSE: 92.7 → 79.5 (–14.3%) | Example improvement |

Additional findings:

  • Features trained for XGBoost transferred to Random Forest (+1.2% AUC) and AutoGluon (+0.02%).
  • Deepseek-R1 backend gave nearly identical performance (±0.1%).

5. Distinctive Advantages and Empirical Analysis

  • Agentic ReAct Loop: The iterative, tool-augmented ReAct loop is central. The agent’s context window compounds trial history, functioning as a dynamic few-shot prompt; this yields increasingly refined, data-grounded feature proposals.
  • Compilable Code Generation: All feature operators are specified explicitly as Python code, ensuring interpretability and supporting automatic error correction.
  • Autonomous Self-Correction: Runtime code validation and feature evaluation (reward signal) limit hallucinations and maintain alignment with empirical data properties.
  • Post-Hoc Selection: Post-processing with mRMR prevents spurious or redundant feature induction and compacts the final discovered set.
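The runtime validation behind self-correction can be sketched as below, under stated assumptions: generated code is expected to define a function named `feature(cols)`, the checks (row count, finite values) are illustrative, and `validate_feature_code` is a hypothetical helper rather than FAMOSE's actual API. Note that Python's `exec` is not a security sandbox; a real system would need proper isolation.

```python
import math
from typing import Dict, List, Optional, Tuple


def validate_feature_code(
    code: str, columns: Dict[str, List[float]]
) -> Tuple[Optional[List[float]], Optional[str]]:
    """Execute generated code and return (values, error message)."""
    namespace = {"math": math}  # names made available to the generated code
    n_rows = len(next(iter(columns.values())))
    try:
        exec(code, namespace)  # NOTE: exec alone provides no isolation
        values = [float(v) for v in namespace["feature"](columns)]
    except Exception as err:
        # The message can be fed back to the agent so it can self-correct.
        return None, f"{type(err).__name__}: {err}"
    if len(values) != n_rows:
        return None, f"expected {n_rows} values, got {len(values)}"
    if not all(math.isfinite(v) for v in values):
        return None, "feature contains non-finite values"
    return values, None


COLUMNS = {
    "income": [1000.0, 2500.0],
    "household_size": [2.0, 5.0],
    "debt": [10.0, 0.0],
}
GOOD = (
    "def feature(cols):\n"
    "    return [math.log1p(a) / b"
    " for a, b in zip(cols['income'], cols['household_size'])]\n"
)
BAD = (
    "def feature(cols):\n"
    "    return [a / b for a, b in zip(cols['income'], cols['debt'])]\n"
)

ok_values, ok_error = validate_feature_code(GOOD, COLUMNS)
bad_values, bad_error = validate_feature_code(BAD, COLUMNS)
print(ok_error, bad_error)
```

Returning the error text instead of raising keeps the loop running and gives the agent a concrete signal to revise its next proposal.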

Ablation studies confirm:

  • Removing the 1% improvement “goal” discourages exploration and degrades performance.
  • Dropping mRMR selection harms both accuracy and compactness.

6. Limitations and Outlook

  • Computational Overhead: The method is token-intensive and retrains models repeatedly within the search loop.
  • Model Size Dependence: High-capacity LLMs are essential; smaller (<8B parameter) open-source models are inadequate.
  • Residual Hallucination Risk: Some hallucinated feature proposals persist but are mitigated by post-processing checks.
  • Application Scope: Extension to multi-label, temporal, or multimodal domains may necessitate tailored modification or retrieval-based augmentation.

A plausible implication is that the explicit, context-informed feedback loop is critical for inventive, data-centric agentic discovery, outperforming static or template-driven approaches especially on large, complex datasets. FAMOSE’s architecture has supported the hypothesis that agentic LLMs “learn” to propose increasingly effective operator schemas during iterative interaction with real data (Burghardt et al., 19 Feb 2026).

7. Relation to Broader Research and State of the Art

FAMOSE unifies and extends prior disparate approaches to automated feature engineering. It advances beyond classical systems (AutoFeat, OpenFE) by scaling to large datasets and beyond prompt-only LLM competitors (CAAFE, FeatLLM) via its closed ReAct loop and in-context “memory.” Empirical evidence places it at or above state of the art for both regression and classification. Its interpretability and transferability, demonstrated by cross-model and cross-backend performance robustness, underscore the distinctiveness of the ReAct agent framework for iterative, data-driven feature discovery (Burghardt et al., 19 Feb 2026).
