Feature Reasoning Agents

Updated 4 July 2026

Feature Reasoning Agents are systems that treat feature generation as an explicit, reasoning-driven process, replacing one-shot predictions with structured intermediate representations.
They employ specialized subagents in iterative loops to generate, filter, and validate features, as demonstrated by frameworks like ReFuGe and CoFEE.
Empirical studies across relational, unstructured, and multimodal tasks show improved predictive accuracy and interpretability via diagnostic intermediate reasoning.

Feature reasoning agents are agentic systems in which features are treated as objects of explicit reasoning rather than passive inputs. Recent work uses this pattern for generating predictive relational features over multi-table databases, enforcing reasoning control in LLM-based feature discovery from unstructured records, automating the discovery and explanation of internal features in LLMs, decomposing sarcasm understanding into specialized feature-producing agents, and grounding fine-grained visual cues through retrieval-coupled multimodal reasoning (Kim et al., 25 Jan 2026, Westermann et al., 23 Apr 2026, Marin-Llobet et al., 2 May 2026, Inoshita et al., 30 Dec 2025, Chen et al., 4 Mar 2026). Across these settings, the shared design is to replace one-shot prediction or one-shot feature proposal with structured intermediate representations, specialized subagents, iterative feedback, and explicit validation. Evaluation work further indicates that end-task success alone is not sufficient for assessing such systems: repository-level feature addition tasks benefit from intermediate reasoning supervision, and autonomous agents require structured reasoning provenance for population-level behavioral analytics (Liu et al., 27 Mar 2026, Vispute, 23 Mar 2026).

1. Conceptual scope

The recent literature frames feature construction, feature discovery, and feature explanation as reasoning-intensive problems. In relational databases, generating informative relational features requires reasoning over complex schemas and exploring a combinatorially large feature space, all without explicit supervision (Kim et al., 25 Jan 2026). In unstructured-data feature engineering, feature discovery is described as fundamentally a reasoning problem because useful abstractions must be predictive of a target outcome while avoiding leakage, proxies, and post-outcome signals (Westermann et al., 23 Apr 2026). In mechanistic interpretability, the problem is not only to find internal features in LLMs but also to generate falsifiable explanations of what those features detect (Marin-Llobet et al., 2 May 2026). In sarcasm understanding, the task is reformulated as a world-model-inspired reasoning process in which literal meaning, normative expectation, and speaker intention are decomposed into separate computational signals (Inoshita et al., 30 Dec 2025). In open-set fine-grained vision, feature reasoning appears as evidence-driven reasoning over category hypotheses, discriminative regions, and retrieved knowledge (Chen et al., 4 Mar 2026).

This scope is broader than classical feature engineering. The literature uses “feature” to denote predictive SQL expressions over relational data, Boolean abstractions under observability constraints, sparse autoencoder latents and raw MLP neurons, text-guided visual cues, and structured intermediate reasoning units for repository-level feature addition tasks (Kim et al., 25 Jan 2026, Westermann et al., 23 Apr 2026, Marin-Llobet et al., 2 May 2026, Chen et al., 4 Mar 2026, Liu et al., 27 Mar 2026). This suggests that the unifying notion is not a specific modality, but a workflow in which an agent explicitly proposes, tests, filters, localizes, or explains intermediate variables that mediate between raw inputs and final decisions.

2. Recurrent architectures

A consistent architectural pattern is specialization. Rather than assigning the entire search space to a single model invocation, recent systems divide the problem into subagents with narrow responsibilities and connect them through iterative loops, explicit memory, or downstream validators.

Framework	Specialized components	Core loop
ReFuGe	schema selection agent; feature generation agent; feature filtering agent	schema selection → feature generation → feature filtering until performance converges
CoFEE	Agent 1 feature discovery; Agent 2 consolidation; Agent 3 scoring	backward chaining, subgoal decomposition, verification, backtracking
WM-SAR	Literal Meaning Agent; Context Constructor Agent; Norm and Expectation Agent; Inconsistency Detector Agent; Mental State & Intention Agent	parallel agent calls + Logistic Regression aggregation
Automated interpretability	Supervisor; FeatureFinder; FeatureExplainer	feature discovery loop + explanation refinement loop
KFRA	candidate list generation; discriminative regions localisation; knowledge + region guided inference	three-stage closed reasoning loop

In ReFuGe, specialization is used to reduce the search space and cognitive burden of multi-table feature generation: the schema selection agent identifies relevant tables and columns, the feature generation agent produces diverse candidate features, and the feature filtering agent applies reasoning-based and validation-based filtering in an iterative feedback loop until no further improvement is observed (Kim et al., 25 Jan 2026). The paper explicitly states that specialization of agents reduces cognitive load on each and allows tailored prompts.
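The control flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the ReFuGe implementation: the function `feature_loop`, its four callables, and the toy stand-ins are all hypothetical names, and the stopping rule is simplified to "stop when the validation score no longer improves".

```python
from typing import Callable, List, Set


def feature_loop(
    select_schema: Callable[[Set[str]], List[str]],
    generate: Callable[[List[str]], List[str]],
    keep: Callable[[List[str], Set[str]], List[str]],
    score: Callable[[Set[str]], float],
    max_iters: int = 10,
) -> Set[str]:
    """Iterate selection -> generation -> filtering until the score stops improving."""
    accepted: Set[str] = set()
    best = score(accepted)
    for _ in range(max_iters):
        sub_schema = select_schema(accepted)   # narrow the tables/columns to consider
        candidates = generate(sub_schema)      # propose candidate features
        kept = keep(candidates, accepted)      # filter the candidate pool
        trial = accepted | set(kept)
        trial_score = score(trial)
        if trial_score <= best:                # no further improvement: stop
            break
        accepted, best = trial, trial_score
    return accepted


# Toy stand-ins for the three agents and the validation metric (all hypothetical).
def toy_select(accepted: Set[str]) -> List[str]:
    return ["orders.amount", "users.age"]


def toy_generate(columns: List[str]) -> List[str]:
    return [f"mean({c})" for c in columns]


def toy_keep(candidates: List[str], accepted: Set[str]) -> List[str]:
    return [c for c in candidates if c not in accepted]


def toy_score(features: Set[str]) -> float:
    return 0.5 + 0.1 * len(features & {"mean(orders.amount)"})


features = feature_loop(toy_select, toy_generate, toy_keep, toy_score)
print(sorted(features))
```

In a real system each callable would wrap a prompted LLM agent or a model-training routine; passing them in as parameters keeps the loop itself independent of any particular prompt or model.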

The same principle appears in other domains with different intermediate objects. CoFEE imposes four cognitive behaviors—backward chaining from outcomes, subgoal decomposition, verification against observability and leakage criteria, and explicit backtracking of rejected reasoning paths—inside a three-agent pipeline (Westermann et al., 23 Apr 2026). WM-SAR instantiates five small “cognitive” agents that run in parallel and emit low-dimensional numerical signals plus rationales, while final prediction is delegated to a lightweight Logistic Regression model rather than additional LLM calls (Inoshita et al., 30 Dec 2025). In automated interpretability, a Supervisor coordinates a FeatureFinder and a FeatureExplainer in a shared Python execution environment with persistent memory, separating unsupervised retrieval of candidate features from iterative refinement of natural-language hypotheses (Marin-Llobet et al., 2 May 2026). KFRA similarly factorizes open-set fine-grained visual understanding into candidate list generation, discriminative region localisation, and multimodal evidence fusion, with a retrieval-grounding coupling that converts retrieved knowledge into spatially grounded evidence for verification (Chen et al., 4 Mar 2026).

3. Reasoning primitives and objective signals

Feature reasoning agents typically combine free-form model outputs with explicit intermediate variables and acceptance criteria. In CoFEE, feature discovery is formalized as a constrained optimization over a feature space $\mathcal{F}$, maximizing a quality metric $Q(F)$ subject to pre-outcome observability and leakage-prevention constraints. The framework enforces backward chaining through a recursive operator $\beta$, requires each subgoal to be mapped into one of four categories, verifies that each candidate feature can be computed from data available before the outcome event, and records rejected branches in a reasoning tree (Westermann et al., 23 Apr 2026). The significance of this design is methodological: reasoning control is treated as a structured inductive bias over the candidate-feature space.
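The verification and backtracking behaviors can be illustrated with a small sketch. It assumes a simplified setting that is not taken from the paper: each candidate carries a hypothetical `observed_at` time index and a hypothetical `uses_outcome` flag, and the reasoning tree is a plain nested structure in which rejected branches are kept with their rejection reason.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One entry in the reasoning tree: a subgoal or a candidate feature."""
    label: str
    status: str = "open"                 # "open", "accepted", or "rejected"
    reason: Optional[str] = None         # why a branch was rejected, if it was
    children: List["Node"] = field(default_factory=list)


@dataclass
class Candidate:
    name: str
    observed_at: int      # hypothetical: time index at which the inputs become available
    uses_outcome: bool    # hypothetical: reads the outcome or a proxy of it


def verify(c: Candidate, outcome_time: int) -> Optional[str]:
    """Return a rejection reason, or None if the candidate passes both checks."""
    if c.observed_at >= outcome_time:
        return "not observable before the outcome"
    if c.uses_outcome:
        return "leaks the outcome"
    return None


def expand(subgoal: str, candidates: List[Candidate], outcome_time: int) -> Node:
    """Attach every candidate to the subgoal, keeping rejected branches in the tree."""
    node = Node(subgoal)
    for c in candidates:
        reason = verify(c, outcome_time)
        status = "rejected" if reason else "accepted"
        node.children.append(Node(c.name, status, reason))
    return node


tree = expand(
    "prior behavior of the party",
    [
        Candidate("prior_filings_count", observed_at=3, uses_outcome=False),
        Candidate("appeal_filed", observed_at=12, uses_outcome=False),
        Candidate("ruling_text_sentiment", observed_at=5, uses_outcome=True),
    ],
    outcome_time=10,
)
for child in tree.children:
    print(child.label, child.status, child.reason)
```

Keeping rejected nodes rather than discarding them is what makes the tree usable as an audit trail: a later reviewer can see which abstractions were considered and why they were ruled out.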

In ReFuGe, the central primitives are candidate pools, reasoning-based ranking, and empirical validation. The framework repeatedly constructs a reduced sub-schema $S \subset R$, spawns $M$ parallel LLM instances for combinatorial feature exploration, filters the candidate pool $C$ into a promising subset $R$, and retains only features that improve a sampled validation metric. Its effective feature score is

$s(f) = \Delta AUC(f) = AUC(\mathcal{T}^{\star}\cup\{f\}) - AUC(\mathcal{T}^{\star}),$

with retention defined by $V = \{r \in R \mid \Delta AUC(r) \ge \epsilon\}$ and iterative growth by $F^{(t+1)} = F^{(t)} \cup V^{(t)}$ (Kim et al., 25 Jan 2026). The framework therefore blends semantic reasoning with empirical screening rather than relying on either alone.
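The retention rule can be made concrete with a toy example. The sketch below is an assumption-laden illustration rather than the paper's procedure: `evaluate` replaces model training with a trivial scorer (the sum of the selected feature values), `auc` is a plain rank-based AUC, and the feature names and data are invented.

```python
from typing import Dict, List, Sequence, Set


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUC: chance that a positive outscores a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def evaluate(features: Set[str], table: Dict[str, List[float]], labels: List[int]) -> float:
    """Toy stand-in for training a model: score each row by the sum of its features.

    With no features every row scores 0, so all pairs tie and the AUC is 0.5.
    """
    scores = [sum(table[f][i] for f in features) for i in range(len(labels))]
    return auc(scores, labels)


def retain(promising: Set[str], current: Set[str], table: Dict[str, List[float]],
           labels: List[int], eps: float = 0.01) -> Set[str]:
    """Keep each promising feature whose marginal AUC gain is at least eps."""
    base = evaluate(current, table, labels)
    return {r for r in promising
            if evaluate(current | {r}, table, labels) - base >= eps}


labels = [0, 0, 1, 1]
table = {
    "f_good": [0.1, 0.2, 0.8, 0.9],    # separates the classes
    "f_noise": [0.5, 0.5, 0.5, 0.5],   # constant, carries no signal
    "f_bad": [0.9, 0.8, 0.2, 0.1],     # anti-correlated with the label
}
current: Set[str] = set()
validated = retain(set(table), current, table, labels)
next_features = current | validated     # the set union used for iterative growth
print(sorted(next_features))
```

Each candidate is scored against the same baseline set, so the gain measured is marginal with respect to what has already been accepted, not an absolute measure of the feature's quality.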

Other systems use analogous mixed pipelines. WM-SAR computes a literal valence, a normative expectation, a raw difference, an absolute discrepancy, a sign-flip indicator, and an intention score, then assembles these signals into a feature vector for the final Logistic Regression classifier. This design makes the final decision numerically interpretable while keeping the upstream signals agent-generated (Inoshita et al., 30 Dec 2025). KFRA computes category confidences, cue-specific alignment scores, and confidence thresholds that can trigger a self-corrective iteration (Chen et al., 4 Mar 2026). Automated interpretability applies a seven-metric evaluation function over candidate explanations (Detection F1, Fuzzing F1, Surprisal AUROC, EmbedSim, LLM-as-Judge, $p$-value, and Cohen’s $d$), then aggregates ordinal ranks, filters Pareto-dominated hypotheses, and iterates until convergence or a polysemanticity flag (Marin-Llobet et al., 2 May 2026). Across these examples, reasoning is not synonymous with verbose chain-of-thought; it is operationalized as typed intermediate structure plus numerical tests.
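To make the WM-SAR style of aggregation concrete, the sketch below derives discrepancy features from three agent scores and passes them through a logistic function. The definitions are assumptions chosen for illustration, not the paper's: scores are taken to lie in [-1, 1], the raw difference is literal minus expected, the sign flip means opposite polarity, and the interaction term is hypothetical.

```python
import math
from typing import List, Sequence


def sarcasm_features(literal: float, expected: float, intention: float) -> List[float]:
    """Turn three agent scores into a small, inspectable feature vector."""
    diff = literal - expected                            # raw difference (assumed direction)
    abs_diff = abs(diff)                                 # absolute discrepancy
    sign_flip = 1.0 if literal * expected < 0 else 0.0   # opposite polarity
    interaction = abs_diff * intention                   # hypothetical interaction feature
    return [literal, expected, diff, abs_diff, sign_flip, intention, interaction]


def logistic_score(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Logistic regression decision function: sigmoid of a weighted sum."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


# "Great, another Monday": positive surface wording, negative expectation.
x = sarcasm_features(literal=0.8, expected=-0.6, intention=0.9)
weights = [0.0, 0.0, 0.0, 1.5, 1.0, 1.0, 0.5]   # illustrative weights, not learned
print(round(logistic_score(x, weights, b=-2.0), 3))
```

Because the classifier is linear in these features, each weight can be read directly as the contribution of one agent-produced signal, which is the interpretability property the design relies on.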

4. Empirical performance

The reported results show that explicit feature reasoning can improve both predictive quality and interpretability. On seven real-world relational databases from RelBench, ReFuGe achieved 75.30% average AUC with an average rank of 1.3, outperforming the best competitor, LLM-CoT, at 70.81%. Its ablations produced average rank 2.3 for ReFuGe-SS, 3.0 for ReFuGe-FF, and 3.1 for ReFuGe-FB, with the reasoning-filter identified as the most critical component, followed by feedback. The framework ran 2.4 iterations on average, and the paper also reports the number $M$ of parallel LLM instances that gave the best trade-off (Kim et al., 25 Jan 2026).

On unstructured feature discovery, CoFEE reported a mean Success Rate Score (Top-10, held-out) of 0.250 versus 0.217 for vanilla prompting, a median of 0.227 versus 0.204, Total # Features Generated of 157 versus 222, and a lower Cost (USD) than the \$18.29 of vanilla prompting. The paper summarizes these differences as a 15.2% higher average Success Rate Score, 29% fewer features, and 53.3% lower cost (Westermann et al., 23 Apr 2026). In sarcasm detection, WM-SAR reached an average 75.0/75.0 in Accuracy / Macro-F1 across IAC-V1, IAC-V2, and SemEval, exceeding BERT-finetune, GPT-4.1-mini zero-shot, GPT+Chain-of-Contradiction, and CAF-I. Its ablation study showed average drops when intention, inconsistency, the sign flag, or interaction features were removed (Inoshita et al., 30 Dec 2025).
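The summary percentages for CoFEE follow from the reported raw figures by simple relative-change arithmetic, which the short check below reproduces. The helper name `relative_change` is introduced here for illustration only.

```python
def relative_change(new: float, old: float) -> float:
    """Signed relative change of `new` with respect to `old`, in percent."""
    return 100.0 * (new - old) / old


score_gain = relative_change(0.250, 0.217)   # mean score, CoFEE versus vanilla prompting
feature_cut = relative_change(157, 222)      # total number of generated features
print(f"{score_gain:.1f}% {feature_cut:.1f}%")   # prints "15.2% -29.3%"
```

The first value matches the reported 15.2% improvement, and the second corresponds to the reported reduction of about 29% in generated features.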

For internal model features, the automated interpretability framework improved over one-shot auto-interpretations on Gemma-2 family models and MLP neurons in weight-sparse transformers. Reported refinement win rates were 63.5 $\pm$ 2.6 versus 36.6 $\pm$ 2.6 for GPT-4o-mini over 1,600 features, 79.2 $\pm$ 0.2 versus 20.8 $\pm$ 0.2 for Gemini 2.5 Pro over 300 features, and 83.3 versus 16.7 for Claude 4 Sonnet over 60 features. On raw MLP neurons, FeatureExplainer yielded mean test accuracy 75%, with 86% of neurons above 50% accuracy (Marin-Llobet et al., 2 May 2026). In fine-grained vision, KFRA consistently surpassed both standalone large multimodal models and current agent frameworks on FGExpertBench, achieving up to 19 percent improvement in reasoning accuracy while delivering evidence-grounded interpretability in open-set fine-grained visual understanding (Chen et al., 4 Mar 2026).

An adjacent result comes from LLM-based web agents rather than explicit feature discovery. WorkForceAgent-R1 uses a rule-based R1-style reinforcement learning framework to enhance single-step reasoning and planning for business-oriented web navigation tasks and outperformed SFT baselines by 10.26-16.59% on WorkArena while achieving competitive performance relative to gpt-4o (Zhuang et al., 28 May 2025). This suggests that direct optimization of reasoning behavior may also be relevant to future feature reasoning agents, although that paper addresses workplace web navigation rather than feature generation itself.

5. Evaluation, diagnostics, and provenance

A central methodological issue is how to evaluate reasoning beyond final outcomes. RACE-bench addresses this for repository-level code agents on real-world feature addition tasks. The benchmark contains 528 real feature-addition instances from 12 open-source repositories, each paired with executable patch verification and structured intermediate reasoning ground truth covering issue understanding, file localization, implementation tasks, and step decomposition. Its dual-track framework combines patch-level metrics (Patch Apply Rate and Resolved Rate) with reasoning-level recall and over-prediction for each module (Liu et al., 27 Mar 2026). This benchmark directly targets a common misconception, namely that final test correctness is enough to judge an agent: a passing test suite does not reveal whether the agent localized the correct files, identified the right tasks, or decomposed the change into valid steps.
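A reasoning-level score of this kind can be sketched as set comparisons between what an agent predicted and the ground truth for one module. The definitions below are assumptions for illustration, since the exact formulas are not given here: recall is the fraction of ground-truth items the agent found, and over-prediction is the fraction of predicted items that are not in the ground truth.

```python
from typing import Set, Tuple


def recall_and_overprediction(predicted: Set[str], truth: Set[str]) -> Tuple[float, float]:
    """Compare predicted items against ground truth for one reasoning module."""
    hits = predicted & truth
    recall = len(hits) / len(truth) if truth else 1.0
    over = len(predicted - truth) / len(predicted) if predicted else 0.0
    return recall, over


# Hypothetical file-localization output for a single feature-addition task.
predicted_files = {"api/routes.py", "api/models.py", "docs/index.md"}
truth_files = {"api/routes.py", "api/serializers.py"}
recall, over = recall_and_overprediction(predicted_files, truth_files)
print(f"recall={recall:.2f} over-prediction={over:.2f}")
```

Reporting the two numbers together matters: an agent can raise recall simply by listing many files, and the over-prediction term is what exposes that strategy.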

The empirical findings in RACE-bench are diagnostic rather than merely comparative. On the full benchmark, AutoCodeRover reached APR = 96.2% and RR = 28.8%, TraeAgent reached APR = 79.0% and RR = 52.7%, and mini-SWE-Agent reached APR = 95.8% and RR = 70.1%. All agents obtained high Score@Goal, approximately 9.3/10, but recall dropped steadily from Files to Tasks to Steps, and Steps recall remained below 0.50 for all agents. Excluding apply failures, apply-ok/test-fail cases showed a 35.7% average decrease in recall across modules and a 94.1% average increase in over-prediction relative to fully successful cases (Liu et al., 27 Mar 2026). The stated implication is that step-level decomposition is the largest bottleneck.

Reasoning provenance work extends this diagnostic perspective from benchmarks to operational infrastructure. The Agent Execution Record introduces a structured reasoning provenance primitive that separates the computational state at each step from a per-step reasoning record with typed fields for Intent, Observation, Inference, and Plan version. The framework further records an envelope with investigation metadata and authority_chain, versioned plans with revision rationale, and a final verdict with category, summary, confidence score, evidence_chain, alternatives_rejected, and remediation (Vispute, 23 Mar 2026). The paper argues by non-identifiability that normalized, schema-stable, cross-run comparable provenance cannot in general be faithfully reconstructed from state checkpoints and raw traces alone. In preliminary deployment, it proposes behavioral-analytics queries, mock-replay regression testing, and confidence calibration; for a stylized 10-step run, cumulative checkpoint size was approximately 560 KB versus AER size of approximately 25–130 KB, and a sample of 20 incidents scored 5/5 on expressiveness questions such as “why step 3?” and “why plan changed?” (Vispute, 23 Mar 2026). For feature reasoning agents, this suggests that observability of reasoning should be designed as a first-class schema, not left to post-hoc reconstruction.
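A typed record along these lines might look like the following sketch. The field names follow the description above, but the class names, field types, nesting, and the `why_plan_changed` query are assumptions made for illustration and are not the AER specification.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class Step:
    intent: str
    observation: str
    inference: str
    plan_version: int


@dataclass
class PlanVersion:
    version: int
    revision_rationale: str


@dataclass
class Verdict:
    category: str
    summary: str
    confidence: float
    evidence_chain: List[int]            # assumed: indices of supporting steps
    alternatives_rejected: List[str]
    remediation: str


@dataclass
class ExecutionRecord:
    investigation: Dict[str, str]        # envelope metadata (assumed key/value form)
    authority_chain: List[str]
    plans: List[PlanVersion]
    steps: List[Step]
    verdict: Verdict


def why_plan_changed(record: ExecutionRecord) -> List[str]:
    """Answer 'why plan changed?' directly from the record, without replaying traces."""
    return [p.revision_rationale for p in record.plans if p.version > 1]


record = ExecutionRecord(
    investigation={"id": "inc-001", "trigger": "latency alert"},
    authority_chain=["on-call engineer", "agent"],
    plans=[PlanVersion(1, "initial plan"),
           PlanVersion(2, "database metrics ruled out the cache hypothesis")],
    steps=[Step("check cache hit rate", "hit rate normal", "cache is not the cause", 1),
           Step("check slow queries", "one query regressed", "missing index likely", 2)],
    verdict=Verdict("performance", "missing index on orders table", 0.8,
                    [1], ["cache eviction storm"], "add index"),
)
print(why_plan_changed(record))
print(json.dumps(asdict(record))[:60])
```

Because every run emits the same fields, questions such as "why plan changed?" become simple queries over records rather than reconstructions from raw traces.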

6. Research directions, limitations, and contested assumptions

Several recurring assumptions are challenged by the current literature. One is that unconstrained LLM prompting is an adequate mechanism for feature generation. CoFEE reports that cognitively induced feature generation can produce higher empirical predictability with fewer features and lower cost, indicating that reasoning control can function as a useful inductive bias rather than a mere prompt-formatting choice (Westermann et al., 23 Apr 2026). Another is that one-shot labels are sufficient for internal-feature interpretation. The automated interpretability framework reports sharper and more falsifiable explanations through iterative hypothesis testing, targeted prompt controls, and multi-metric ranking (Marin-Llobet et al., 2 May 2026). A third is that retrieval and reasoning can be treated independently in fine-grained vision; KFRA instead makes retrieval-grounding coupling a central design principle (Chen et al., 4 Mar 2026).

The papers also identify concrete limitations. ReFuGe notes that LLM API costs can be high, that the current $\Delta AUC \ge \epsilon$ retention threshold is simple, and that future work could incorporate multi-objective loss functions or uncertainty estimates; it also states that the framework is extensible to regression, survival analysis, or link-prediction tasks by adapting the validation criterion (Kim et al., 25 Jan 2026). CoFEE notes that the domain taxonomy must be hand-curated, that the effect has so far only been tested on one binary classification task, and that performance depends on LLM capabilities and prompt formulations; proposed extensions include automated discovery of subgoal categories, incorporation of domain ontologies to guide backward chaining, and joint optimization of feature sets under downstream model performance (Westermann et al., 23 Apr 2026). RACE-bench argues for structured plan-generation modules or stronger code-schema priors and mentions auxiliary losses on file recall or step alignment as possible training objectives (Liu et al., 27 Mar 2026). AER proposes extensible domain profiles whose step fields, verdict fields, and rationale taxonomies can evolve additively and be independently versioned (Vispute, 23 Mar 2026).

Taken together, these works portray feature reasoning agents as a general methodological shift: features are generated, filtered, localized, explained, and audited through explicit intermediate structure rather than opaque end-to-end prediction. The literature does not yet converge on a single canonical architecture, but it repeatedly favors specialization, iterative refinement, typed intermediate records, and hybrid scoring that combines semantic reasoning with empirical validation (Kim et al., 25 Jan 2026, Marin-Llobet et al., 2 May 2026, Vispute, 23 Mar 2026). A plausible implication is that future systems will increasingly treat feature reasoning as both a modeling problem and an infrastructure problem: a matter of how agents reason, how that reasoning is validated, and how it is preserved for diagnosis and regression testing.