ReAct-Based Iterative Feature Engineering
- ReAct-based iterative feature engineering is an automated process that combines reasoning and acting to iteratively propose, generate, and validate features for machine learning models.
- It leverages LLM-powered chain-of-thought reasoning, code generation, and empirical evaluation to select transformations that improve metrics such as ROC–AUC for classification and reduce RMSE for regression.
- The approach integrates multi-agent systems, reinforcement learning, and human-in-the-loop strategies to drive feature innovation and operational efficiency.
ReAct-based iterative feature engineering refers to the integration of the ReAct (Reason + Act) paradigm into the automated, feedback-driven construction of features for machine learning models, particularly on tabular data. This approach tightly interleaves LLM-powered chain-of-thought reasoning, discrete code or transformation selection, and empirical model evaluation for feature proposal, generation, and refinement. ReAct-based systems have been developed in various forms including dialog-based agents, reinforcement learning (RL) agents, multi-agent planners, and tool-integrated pipelines, and are recognized for their capacity to autonomously learn, innovate, and validate transformation strategies with minimal human intervention.
1. Formalization of the ReAct Paradigm in Feature Engineering
The ReAct pattern decomposes each feature engineering iteration into at least two distinct steps: a reasoning phase, where the system generates hypotheses or rationales for potential new features; and an acting phase, where it executes concrete actions such as generating code to instantiate features or proposing specific transformation sequences.
This structure is explicitly realized in frameworks such as FeRG-LLM and FAMOSE. In FeRG-LLM, each loop consists of a "Reason" call, where the LLM composes high-level feature rationales, followed by an "Action" call, producing executable code for the proposed features (Ko et al., 30 Mar 2025). In FAMOSE, the agent alternates between chain-of-thought reflection (documenting which transformations to try and why) and acting by emitting, compiling, and evaluating transformation code on held-out data (Burghardt et al., 19 Feb 2026).
The ReAct decomposition creates a feedback-rich loop, operationalized in pseudocode as maintaining an evolving feature set, repeatedly augmenting it with candidate features generated by the two-stage Reason→Act process, and using downstream validation metrics such as AUC or RMSE to determine feature acceptance (Ko et al., 30 Mar 2025, Burghardt et al., 19 Feb 2026). This iterative mechanism, combined with automated tool-based or human-in-the-loop error correction in more complex multi-actor settings (Thakur et al., 15 Jan 2026), enables robust empirical search and correction across the combinatorially large feature space.
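The loop described above can be sketched as follows. The `llm_reason`, `llm_act`, and `evaluate` helpers are hypothetical stand-ins (a real system would call an LLM and retrain a downstream model), not APIs from any cited framework:

```python
def llm_reason(history):
    """Reason step: propose a rationale for the next candidate feature.
    Stand-in for an LLM call conditioned on prior proposals and deltas."""
    return f"try feature #{len(history)}"

def llm_act(rationale):
    """Act step: turn the rationale into executable feature code.
    Stand-in for LLM code generation (e.g., a pandas expression)."""
    return f"code_for({rationale})"

def evaluate(feature_set):
    """Stand-in validation metric (e.g., AUC on held-out data);
    here simply monotone in the number of accepted features."""
    return min(1.0, 0.5 + 0.05 * len(feature_set))

def react_feature_loop(n_iters=5, min_gain=1e-3):
    """Generic Reason -> Act -> Validate loop with threshold acceptance."""
    features, history = [], []
    best = evaluate(features)
    for _ in range(n_iters):
        rationale = llm_reason(history)            # Reason phase
        candidate = llm_act(rationale)             # Act phase
        score = evaluate(features + [candidate])   # empirical validation
        accepted = score - best > min_gain         # threshold acceptance
        if accepted:
            features.append(candidate)
            best = score
        history.append((rationale, score, accepted))  # feedback context
    return features, best
```

The `history` list is what gives the loop its ReAct character: each Reason call can condition on which prior proposals helped or hurt validation performance.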
2. Architectures and Algorithms
ReAct-based feature engineering has been embodied in various architectural designs:
- Two-Stage LLM Dialogues: FeRG-LLM executes an explicit Reason→Act turn-based pipeline, where user-provided context (domain, task type, column metadata) prompts rationales for feature construction, which are then programmatically transformed into executable code (e.g., pandas expressions) (Ko et al., 30 Mar 2025).
- Agentic ReAct Loops: FAMOSE integrates a ReAct agent that inspects prior feature history and validation deltas to inform novel transformation proposals. The acting phase can include code emission, compilation, and empirical evaluation within a managed runtime (Burghardt et al., 19 Feb 2026).
- RL-Based ReAct Systems: FastFT applies a cascade of RL agents, abstracting the Reason phase into performance and novelty prediction modules to prioritize transformative operations. Act steps correspond to choosing clusters and transformation operators, with state summarization over feature set statistics (He et al., 26 Mar 2025).
- Planner-Guided Multi-Agent ReAct Pipelines: In multi-agent settings, such as the planner-constrained topology described in (Thakur et al., 15 Jan 2026), the planner orchestrates calls to specialized actors (e.g., code generators, config template producers, test case writers), with formalized context-aware prompt construction and retroactive correction of upstream outputs based on downstream failures, adhering to a directed acyclic workflow graph.
In all cases, acceptance of a feature is performance-driven, determined by threshold improvement over baseline metrics. Post-loop selection steps such as mRMR (minimum Redundancy Maximum Relevance) may be employed to produce a compact final feature set and mitigate redundancy (Burghardt et al., 19 Feb 2026).
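As an illustration of the post-loop selection step, here is a greedy mRMR-style pass. Note that classic mRMR scores relevance and redundancy with mutual information; this sketch substitutes absolute Pearson correlation for brevity:

```python
def pearson(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb) if sa and sb else 0.0

def mrmr_select(features, target, k):
    """Greedy mRMR: at each step pick the feature maximizing
    relevance(feature, target) minus mean redundancy with those
    already selected. `features` maps name -> value list."""
    selected, remaining = [], dict(features)
    while remaining and len(selected) < k:
        def score(name):
            rel = abs(pearson(remaining[name], target))
            if not selected:
                return rel
            red = sum(abs(pearson(remaining[name], features[s]))
                      for s in selected) / len(selected)
            return rel - red
        best = max(remaining, key=score)
        selected.append(best)
        remaining.pop(best)
    return selected
```

For example, given a duplicate feature and a complementary one, the redundancy penalty steers the second pick away from the duplicate even though its marginal relevance is identical.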
3. Tooling, Feedback, and Preference Alignment
ReAct feature engineering tightly integrates tooling for automated code validation, empirical evaluation, and preference-based feedback:
- Evaluation Metrics: Systems universally apply target-task metrics for feature acceptance: ROC–AUC for classification, RMSE for regression, and related task-specific metrics for anomaly detection (Ko et al., 30 Mar 2025, Burghardt et al., 19 Feb 2026, He et al., 26 Mar 2025).
- Preference Optimization: FeRG-LLM introduces Direct Preference Optimization (DPO) for aligning the model's chain-of-thought with empirically successful rationales. Rationales and their associated code are evaluated, split by outcome, and used to drive DPO minimization, promoting reasoning paths leading to superior downstream scores (Ko et al., 30 Mar 2025).
- Reinforced Exploration and Replay: FastFT innovates by decoupling feature candidate evaluation via performance predictors (LSTM+MLP) and novelty detectors, reducing the frequency of expensive full downstream evaluations. Combined rewards balance predicted performance gain and novelty, with prioritized memory buffer replay based on TD error, and meta-reasoning about which transformation histories merit further exploration (He et al., 26 Mar 2025).
- Error Propagation and Human-in-the-Loop: Multi-agent graph frameworks propagate downstream errors back to earlier actors, updating their prompt context with error traces. This mechanism ensures iterative convergence even in the presence of complex dependencies or unforeseen data issues, with policy escalation to human intervention when persistent ambiguity or repeated failures occur (Thakur et al., 15 Jan 2026).
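The DPO objective used for preference alignment can be written per (winner, loser) rationale pair. The sketch below is the standard DPO loss over policy and reference log-probabilities; it may differ in detail from FeRG-LLM's actual training code:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one (winner, loser) pair:
    -log sigmoid(beta * [(logpi_w - logref_w) - (logpi_l - logref_l)]).
    Here the 'winner' is a rationale/code trajectory whose feature improved
    validation performance, and the 'loser' one that did not."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy matches the reference exactly, the margin is zero and the loss equals log 2; as the policy shifts probability mass toward empirically successful rationales, the loss decreases, which is the mechanism promoting higher-value reasoning paths.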
4. Space of Transformations and Feature Construction Strategies
ReAct-based systems formalize and algorithmically traverse a broad space of candidate transformations:
- Transform Space: FAMOSE and similar frameworks define the candidate universe as the set of features obtainable by applying an operator library to the existing columns, with the library including unary, binary, and aggregate operations, as well as multi-step compositions of these operators (Burghardt et al., 19 Feb 2026).
- Feature Formulae Examples: FeRG-LLM provides explicit formulae spanning interactions (e.g., x1 * x2), ratios (e.g., x1 / x2), polynomials (e.g., x^2), cross-centered terms (e.g., (x1 − mean(x1)) * (x2 − mean(x2))), and logical bin indicators (e.g., 1[x > threshold]), with accompanying Python code snippets (Ko et al., 30 Mar 2025).
- Operation Selection: RL-based agents in FastFT structure the action space as a cascade over clusters of features and transformation types (unary/binary), substantially extending the breadth and novelty of the search for effective transformations (He et al., 26 Mar 2025).
- Contextual Proposals: FAMOSE and FeRG-LLM leverage context inclusion in the prompt history—listing prior features and empirical deltas—to bias next-step proposal generation, analogous to few-shot prompting (Ko et al., 30 Mar 2025, Burghardt et al., 19 Feb 2026).
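The transformation space traversed by these systems can be made concrete with a toy enumerator. The operator sets below are a small illustrative subset, and real systems compile such specifications into pandas or SQL code rather than Python closures:

```python
import math

# Illustrative operator subsets; real libraries are much larger.
UNARY = {"log1p": lambda x: math.log1p(abs(x)), "square": lambda x: x * x}
BINARY = {"ratio": lambda a, b: a / b if b else 0.0,
          "product": lambda a, b: a * b}

def enumerate_candidates(columns, depth=2):
    """Enumerate named candidate features over column names up to `depth`
    levels of composition, e.g. log1p(ratio(f1,f2)). Each candidate is a
    callable mapping a row dict to a value."""
    frontier = {c: (lambda row, c=c: row[c]) for c in columns}
    for _ in range(depth):
        new = {}
        for name, fn in list(frontier.items()):
            for op_name, op in UNARY.items():           # unary compositions
                new[f"{op_name}({name})"] = (
                    lambda row, op=op, fn=fn: op(fn(row)))
        for a in columns:                               # binary over base cols
            for b in columns:
                if a != b:
                    for op_name, op in BINARY.items():
                        new[f"{op_name}({a},{b})"] = (
                            lambda row, op=op, a=a, b=b: op(row[a], row[b]))
        frontier.update(new)
    return frontier
```

Even with two columns and four operators, two levels of composition already yield dozens of named candidates, which is why guided (ReAct or RL) search rather than exhaustive enumeration is needed at realistic scale.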
5. Experimental Evaluation and Empirical Impact
The ReAct paradigm has demonstrated state-of-the-art or near–state-of-the-art results on benchmark tasks and real-world settings:
- Performance Gains: FAMOSE achieved ROC–AUC gains on classification tasks with more than 10,000 rows and reduced RMSE on regression tasks, outperforming AutoFeat, OpenFE, CAAFE, and FeatLLM (Burghardt et al., 19 Feb 2026). FastFT outperformed the best baselines by up to $0.05$ absolute across F1, 1−RAE, and AUC metrics on 23 datasets (He et al., 26 Mar 2025).
- Efficiency: FastFT’s predictor/novelty approach reduced run time versus always evaluating every candidate downstream, with a negligible drop in the main metrics. In ablations, the prioritized replay buffer and the novelty reward each contributed further performance improvements (He et al., 26 Mar 2025).
- Robustness: FAMOSE ablations established that omitting the iterative ReAct loop or feature selection degraded performance, increasing RMSE or causing premature agent convergence (Burghardt et al., 19 Feb 2026).
- Production Impact: A planner-constrained topology with graph-guided actor execution reduced end-to-end ML feature engineering from approximately three weeks (manual, five engineers) to a single day, as measured on a production-scale recommender model serving 120 million users (Thakur et al., 15 Jan 2026).
- Benchmarking: Pass@3 rates on a 10-task PySpark benchmark showed a mean of $0.833$ (stddev $0.373$) for planner-guided ReAct agents, compared to $0.600$ for sequential/manual and $0.333$ for random actor selection, corresponding to roughly 39% and 150% relative improvement, respectively (Thakur et al., 15 Jan 2026).
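For reference, the conventional way to compute such pass rates is the unbiased pass@k estimator; the cited benchmark may compute Pass@3 differently, so treat this as the standard formulation rather than the paper's exact procedure:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n attempts, of which c are
    correct, passes. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample with all failures
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this quantity over the 10 benchmark tasks yields the reported per-configuration Pass@3 means.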
6. Extensions, Design Innovations, and Significance
ReAct-based feature engineering frameworks introduce several design extensions:
- Feedback-Driven Alignment: DPO and RL-based feedback mechanisms iteratively align proposal strategies towards empirically higher-value transformations, enabling the model or agent to internalize patterns not easily accessible via static or one-shot LLM prompting (Ko et al., 30 Mar 2025, He et al., 26 Mar 2025).
- Meta-Reasoning and Tool Use: Some systems embed calls to statistical toolchains or data profilers within the Reason phase to provide auxiliary diagnostics (e.g., mutual information, correlation statistics), which then contextualize or guide further actions, echoing the ReAct paradigm’s tool-calling step (He et al., 26 Mar 2025).
- Error Correction and Human Cooperation: Planner-guided ReAct agents retroactively adjust upstream generation in response to downstream execution or testing failures, and escalate to human-in-the-loop support when ambiguity or persistent error surfaces, ensuring alignment with organizational conventions and reliability demands (Thakur et al., 15 Jan 2026).
- Innovative Features and Generalization: Retaining and leveraging a history of prior feature deltas, as in FAMOSE, induces a human-like hypothesize–test–refine cycle, resulting in more innovative, multi-step feature constructions compared to template or brute-force methods (Burghardt et al., 19 Feb 2026).
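As an example of the statistical diagnostics a Reason step might request from a toolchain, here is a minimal discrete mutual-information estimator (generic, not taken from any cited system):

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete,
    equal-length sequences: sum over (x, y) of p(x,y) * log2(p(x,y) /
    (p(x) * p(y))). The kind of auxiliary signal a Reason phase can use
    to decide whether a candidate feature carries target information."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```

A candidate feature identical to the target yields MI equal to the target's entropy, while an independent feature yields zero, giving the agent a cheap pre-filter before a full downstream evaluation.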
A plausible implication is that ReAct agents, by virtue of context retention and feedback-aligned reasoning over the combinatorially large transformation space, achieve greater inventiveness and more rapid convergence to compact, high-signal feature sets than previous generation AutoML and static LLM-based approaches.
7. Limitations, Open Challenges, and Future Work
Despite empirical progress, several challenges persist:
- Scalability and Evaluation Cost: Although predictor-based approximations and memory buffers reduce evaluation burden, full validation remains expensive in extremely high-dimensional or streaming contexts (He et al., 26 Mar 2025).
- Robustness to Hallucinations: While iterative Reason–Act–Validate approaches mitigate incorrect or inapplicable feature proposals, some residual risk of invalid code or semantically incoherent suggestions remains, particularly as feature interactions grow in complexity (Burghardt et al., 19 Feb 2026).
- Integration and Adaptivity: The constrained-topology multi-agent designs highlight obstacles in integrating ReAct-style LLM agents into heterogeneous, evolving production codebases, emphasizing the need for planner adaptivity, environment modeling, and seamless human–AI collaboration (Thakur et al., 15 Jan 2026).
- Human-in-the-Loop Balance: Determining the optimal cadence and criteria for escalating to human intervention versus continuing automated correction poses an ongoing research challenge with significant implications for deployment reliability (Thakur et al., 15 Jan 2026).
- Generalization to Unseen Domains: Though systems such as FeRG-LLM and FAMOSE demonstrate broad cross-task and cross-domain generality, explicit quantification of generalization to out-of-distribution data or novel schema remains an open research area (Ko et al., 30 Mar 2025, Burghardt et al., 19 Feb 2026).
Continued development of ensemble, planner-guided, and hybrid LLM–RL frameworks, along with more sophisticated preference modeling and interactive toolchains, is anticipated to further push the boundaries of automated feature engineering under the ReAct paradigm.
Key References:
- FeRG-LLM: Feature Engineering by Reason Generation LLMs (Ko et al., 30 Mar 2025)
- FAMOSE: A ReAct Approach to Automated Feature Discovery (Burghardt et al., 19 Feb 2026)
- FastFT: Accelerating Reinforced Feature Transformation via Advanced Exploration Strategies (He et al., 26 Mar 2025)
- Towards Reliable ML Feature Engineering via Planning in Constrained-Topology of LLM Agents (Thakur et al., 15 Jan 2026)