
Knowledge-Informed Auto Feature Extraction

Updated 26 November 2025
  • Knowledge-informed automatic feature extraction is a paradigm that integrates expert heuristics, symbolic knowledge graphs, and foundation models into feature engineering.
  • It enhances model interpretability and predictive power by embedding semantic and structural domain insights directly into automated data transformations.
  • Applications in healthcare, finance, and physics show empirical gains with improved AUROC and MRR metrics, highlighting its scalability and robustness.

Knowledge-informed automatic feature extraction refers to the use of domain knowledge—whether encoded as expert heuristics, external databases, symbolic knowledge graphs, formal scientific priors, or foundation models—within the automated or machine-driven construction of new feature representations for downstream supervised and unsupervised learning. Unlike purely data-driven feature learning, this paradigm actively incorporates external semantic or structural information at various stages of the feature engineering pipeline, with the goals of enhancing predictive power, interpretability, and generalizability across domains such as healthcare, finance, scientific data analysis, and tabular machine learning. The following sections systematically review state-of-the-art frameworks, methodological innovations, quantitative benefits, and open challenges in knowledge-informed automatic feature extraction.

1. Fundamental Principles and Paradigms

Knowledge-informed automatic feature extraction encompasses a spectrum of frameworks that systematically operationalize domain knowledge through both explicit and implicit mechanisms:

  • Explicit domain integration: Integration of symbolic knowledge graphs, scientific ontologies, clinical decision rules, or expert-annotated risk scores directly into the transformation or selection of features (Bouadi et al., 1 Jun 2024, Björneld et al., 8 Apr 2025).
  • Implicit, foundation-model-powered approaches: Leverage LLMs or foundation models as agents capable of reasoning over open-world and domain-specific knowledge, guiding search over complex transformation spaces and producing interpretable or semantically justified features (Lin et al., 2023, Abhyankar et al., 18 Mar 2025, Bradland et al., 19 Nov 2025).
  • Physics-inspired architecture and priors: Enforce invariances (permutation, Lorentz, conservation laws) and encode problem structure directly in the data representation and model architecture, as seen in high-energy physics applications (Bhardwaj et al., 24 Apr 2024).

Key characteristics of these approaches include (i) the direct influence of domain-relevant semantics or mathematical invariants in defining permissible transformations, compositional operators, or neural network layers, and (ii) a tendency toward greater feature interpretability or human-readability versus traditional data-only methods.

2. Key Methodologies and Architectures

A variety of systematic methodologies have emerged for knowledge-informed feature extraction, varying by the nature of the domain knowledge and the automation paradigm:

a. Rule-Based and Knowledge-Graph-Driven Pipelines

  • KRAFT employs a two-stage architecture: a deep reinforcement learner proposes candidate feature transformations, while a symbolic reasoner over a domain-specific knowledge graph enforces interpretable feature construction. Transformations proceed only if their semantic composition (Description Logics concepts/roles) passes rigorous interpretable constraints, such as unit compatibility or domain concept validity. The formal FE objective is

\mathcal{F}^* = \arg\max_{F\subseteq F^r\cup F^g} \mathcal{E}\bigl(L(F,Y)\bigr) \quad \text{subject to} \quad \forall f_i\in\mathcal{F}^*: \mathcal{I}_{KG}(f_i)=1

where interpretability is determined via subsumption checking in the knowledge graph (Bouadi et al., 1 Jun 2024).
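A minimal sketch of this generate-then-validate pattern (not KRAFT itself; the unit table, operators, and predicate below are toy assumptions standing in for the knowledge graph and the RL proposer):

```python
# Toy "knowledge graph": feature units, used by the interpretability predicate.
UNITS = {"weight_kg": "kg", "height_m": "m", "age_years": "years"}

def i_kg(op, a, b):
    """Interpretability predicate I_KG: addition requires matching units;
    ratios are treated as dimensionally well-formed."""
    if op == "add":
        return UNITS[a] == UNITS[b]
    if op == "ratio":
        return True
    return False

def propose(features):
    """Enumerate candidate pairwise transformations (stand-in for the RL proposer)."""
    for a in features:
        for b in features:
            if a < b:
                yield ("add", a, b)
                yield ("ratio", a, b)

candidates = list(propose(UNITS))
accepted = [c for c in candidates if i_kg(*c)]
rejected = [c for c in candidates if not i_kg(*c)]
print(accepted)  # only the ratio candidates survive; every add fails the unit check
```

Only KG-certified features would then be passed to the downstream learner, mirroring the constraint in the objective above.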

b. LLM- and FM-Based Agent Systems

  • Rogue One features a tri-agent, decentralized LLM framework. The Scientist agent sets high-level directions using both prior cycle feedback and retrieval-augmented queries; the Extractor agent generates and justifies candidate features with code in alignment with the focus area; the Tester agent evaluates, prunes, and records quantitative and qualitative feedback. A flooding–pruning strategy ensures that feature diversity and utility are balanced, while external knowledge is injected via RAG (Bradland et al., 19 Nov 2025).
  • LLM-FE frames feature engineering as an evolutionary program search, with an LLM policy generating transformations and a tightly coupled evolutionary buffer feeding back top-performing programs as in-context exemplars to guide future LLM proposals. This structure directly integrates domain knowledge via both prompt composition (natural language descriptions, in-context code) and empirical validation scores (Abhyankar et al., 18 Mar 2025).
  • SMARTFEAT utilizes foundation models to restrict operator space and transformation candidates using explicit semantic prompts, avoiding the combinatorial blowup of traditional AFE. This includes human-interpretable descriptions and domain-typical transformations prioritized by model confidence (Lin et al., 2023).
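The evolutionary-buffer feedback loop described for LLM-FE can be sketched with a stub in place of the LLM policy; the dataset, operator set, and `llm_propose` heuristic below are all illustrative, not the paper's implementation:

```python
import random

random.seed(0)

# Toy dataset: the target y equals x1 * x2, so the useful feature is the product.
data = [(2, 3, 6.0), (1, 4, 4.0), (5, 2, 10.0), (3, 3, 9.0)]

OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b, "sub": lambda a, b: a - b}

def score(op):
    """Validation fitness: negative squared error of the candidate feature vs. y."""
    return -sum((OPS[op](x1, x2) - y) ** 2 for x1, x2, y in data)

def llm_propose(exemplars):
    """Stand-in for the LLM policy: biased toward ops from top in-context exemplars."""
    if exemplars and random.random() < 0.7:
        return random.choice(exemplars)
    return random.choice(list(OPS))

# Flood once, then iterate: top-scoring programs are fed back as exemplars.
buffer = sorted(((score(op), op) for op in OPS), reverse=True)
for _ in range(10):
    op = llm_propose([name for _, name in buffer[:2]])
    buffer = sorted(set(buffer) | {(score(op), op)}, reverse=True)[:5]

best_score, best_op = buffer[0]
print(best_op)  # "mul" recovers the ground-truth interaction
```

The buffer plays the role of LLM-FE's in-context exemplar memory: high-fitness programs shape the distribution of future proposals.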

c. Event-Based and Scientific-Domain Feature Engineering

  • aKDFE applies event- and patient-centric transformations, structuring feature extraction in EHR event streams as an automated two-step pipeline: (i) event-level numeric/categorical composite feature generation; (ii) patient-level aggregation via N-gram counts and sum statistics. Domain risk scores (e.g., Janusmed) are optionally incorporated as feature inputs, though empirical findings indicate the highest AUROC gains are from patient-centric transformation, not from off-the-shelf clinical risk factors (Björneld et al., 8 Apr 2025).
  • In scientific and physics domains, architectures such as Deep Sets and Lorentz-equivariant GNNs encode physical symmetries, event invariances, and conservation properties, allowing neural feature learning that is faithful to fundamental structural constraints (Bhardwaj et al., 24 Apr 2024).

d. Bio-Inspired Trainable Filters

  • In signal and image processing, knowledge-informed extractors such as COPE (audio) and B-COSFIRE (vision) automatically configure their parameters from single prototype examples. These exploit biological priors (e.g., cochlear energy peak patterns, orientation-selective receptive fields) as “seed” domain knowledge in the extractor structure (Strisciuglio, 2018).
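The configure-from-one-prototype idea can be illustrated with a toy 1-D sketch, loosely inspired by the COSFIRE family but not reproducing either actual algorithm: the "configuration" records where the prototype's response peaks, and new signals are scored by combining their values at those learned offsets.

```python
def configure(prototype, threshold=0.5):
    """Store (offset, value) pairs at local maxima of a single prototype signal."""
    peaks = []
    for i in range(1, len(prototype) - 1):
        if (prototype[i] > threshold
                and prototype[i] >= prototype[i - 1]
                and prototype[i] >= prototype[i + 1]):
            peaks.append((i, prototype[i]))
    return peaks

def respond(signal, peaks):
    """Geometric-mean-style combination of the signal at the learned offsets."""
    product = 1.0
    for offset, weight in peaks:
        product *= max(signal[offset], 1e-9) * weight
    return product ** (1.0 / max(len(peaks), 1))

prototype = [0.0, 0.9, 0.1, 0.8, 0.0]
peaks = configure(prototype)  # the extractor is fully parameterized by one example
matched = respond(prototype, peaks)
shifted = respond([0.0, 0.1, 0.9, 0.1, 0.0], peaks)
print(matched > shifted)  # True: the response is selective for the configured pattern
```

The key property carried over from the real methods is that no gradient training is needed; a single labeled prototype fixes the extractor's parameters.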

3. Mathematical Formulations and Theoretical Foundations

The application of domain knowledge manifests in several mathematical forms throughout these approaches:

  • Transformation selection constrained by knowledge graphs:

Dom_F = \bigcup_{i=1}^p \left\{ (f_{s_1},\dots,f_{s_i}) \mid 1\leq s_1\leq\cdots\leq s_i\leq p \right\} \times T_i

with search space explored under DL-based interpretability constraints (Bouadi et al., 1 Jun 2024).
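A toy enumeration of the Dom_F search space above makes the structure concrete: non-decreasing feature index tuples of arity i, crossed with an arity-i transformation set T_i (the operator names and sizes here are illustrative):

```python
from itertools import combinations_with_replacement

features = ["f1", "f2", "f3", "f4"]            # p = 4 raw features
T = {1: ["log", "sqrt"], 2: ["add", "ratio"]}  # hypothetical T_1 and T_2

# Non-decreasing index tuples (s_1 <= ... <= s_i) crossed with T_i.
search_space = [
    (args, t)
    for i, ops in T.items()
    for args in combinations_with_replacement(features, i)
    for t in ops
]
print(len(search_space))  # 4*2 unary + 10*2 binary = 28 candidates before pruning
```

Even this tiny example shows why interpretability constraints double as pruning: the space grows combinatorially in p and in the maximum arity.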

  • Bilevel optimization for feature and model selection:

f^* = \arg\min_{f} \mathcal{L}_f\bigl(f(\mathcal{T}(X_{\mathrm{train}})), Y_{\mathrm{train}}\bigr), \quad \mathcal{T}^* = \arg\max_{\mathcal{T}} \mathcal{E}\bigl(f^*(\mathcal{T}(X_{\mathrm{val}})), Y_{\mathrm{val}}\bigr)

as formalized in LLM-FE (Abhyankar et al., 18 Mar 2025).

  • Patient-centric aggregation for event data:

s_j(p) = \sum_{e\in E_p} f_j(e), \quad c_j(p) = \bigl|\{e \in E_p : f_j(e) \neq 0\}\bigr|

with temporal decay kernels as potential extensions (Björneld et al., 8 Apr 2025).
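The per-patient sums and nonzero counts above reduce to a short aggregation pass over the event stream; the patient IDs and feature names below are illustrative, not from aKDFE:

```python
from collections import defaultdict

# (patient_id, {feature_j: value f_j(e)}) per event e.
events = [
    ("p1", {"dose_mg": 10.0, "abnormal_lab": 1}),
    ("p1", {"dose_mg": 0.0,  "abnormal_lab": 1}),
    ("p2", {"dose_mg": 5.0,  "abnormal_lab": 0}),
]

sums = defaultdict(lambda: defaultdict(float))  # s_j(p): sum of f_j over E_p
counts = defaultdict(lambda: defaultdict(int))  # c_j(p): nonzero-event count
for patient, feats in events:
    for j, value in feats.items():
        sums[patient][j] += value
        counts[patient][j] += int(value != 0)

print(sums["p1"]["dose_mg"], counts["p1"]["abnormal_lab"])  # 10.0 2
```

A temporal-decay extension would simply replace the unit increment with a kernel weight on the event's age.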

  • Lexical scoring and qualitative assessment:

\mathrm{Score}(f) = \alpha I(f) + \beta S(f) - \gamma R(f)

where I(f) is normalized importance, S(f) is stability, and R(f) is redundancy (Bradland et al., 19 Nov 2025).
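This linear trade-off is straightforward to instantiate; the weights below are illustrative, not values from the paper:

```python
def feature_score(importance, stability, redundancy,
                  alpha=1.0, beta=0.5, gamma=0.5):
    """Toy instance of Score(f) = alpha*I(f) + beta*S(f) - gamma*R(f)."""
    return alpha * importance + beta * stability - gamma * redundancy

# A non-redundant feature outranks an equally important but redundant one.
print(feature_score(0.8, 0.9, 0.1) > feature_score(0.8, 0.9, 0.9))  # True
```

The redundancy penalty is what lets the Tester agent prune near-duplicate features rather than accumulate them.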

Physics-informed feature architectures rigorously encode symmetries (permutation, Lorentz, IRC) at the architecture or layer level, e.g., permutation-invariant functions via Deep Sets, gauge-invariant GNNs (Bhardwaj et al., 24 Apr 2024).
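Permutation invariance in the Deep Sets style follows directly from the functional form ρ(Σᵢ φ(xᵢ)); the toy φ and ρ below stand in for learned networks and are not from any cited architecture:

```python
def phi(particle):
    """Per-particle embedding (toy 2-dim map of (pt, eta))."""
    pt, eta = particle
    return (pt, pt * eta)

def rho(pooled):
    """Set-level readout applied to the summed (pooled) embedding."""
    a, b = pooled
    return a + 0.1 * b

def deep_set(particles):
    # Sum-pooling over the set makes the output order-independent by construction.
    pooled = [sum(component) for component in zip(*(phi(p) for p in particles))]
    return rho(pooled)

jet = [(30.0, 0.5), (12.0, -1.2), (7.0, 2.0)]
print(abs(deep_set(jet) - deep_set(list(reversed(jet)))) < 1e-9)  # True
```

Lorentz or gauge equivariance is enforced analogously, by restricting φ and the message-passing layers to symmetry-respecting operations rather than by data augmentation.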

4. Empirical Evidence and Quantitative Impact

Evaluation across diverse domains quantifies the substantial gains of knowledge-informed AutoFE:

  • Clinical event prediction: In aKDFE, adding the patient-centric transformation step yields a highly significant AUROC increase (mean up to 0.998 on EHR data predicting ventricular arrhythmias), well above baseline event-based features or the addition of off-the-shelf risk scores alone (p < 10⁻⁵ for patient-centric aggregation) (Björneld et al., 8 Apr 2025).
  • Tabular machine learning: Rogue One achieves mean reciprocal rank (MRR) of 0.76 (classification, 19 datasets) and 0.91 (regression, 9 datasets), outperforming all evaluated competitors (e.g., LLM-FE MRR 0.52; OpenFE 0.33). Case studies show not only improved prediction, but surfacing of novel, plausible biomarkers (e.g., the WBC/ESR ratio for heart failure) (Bradland et al., 19 Nov 2025).
  • Interpretability: KRAFT improves prediction F₁ by ≈18.5% over no-FE baselines and by 2.6–6.4% over RL- or rule-based competitors, with all retained features certified interpretable under DL rules (Bouadi et al., 1 Jun 2024).
  • Efficiency and scalability: SMARTFEAT achieves 4–13% AUC improvement over raw features in a variety of tabular settings, leveraging foundation model guidance to reduce computational overhead of feature search (Lin et al., 2023).
  • Scientific and physics settings: ParticleNet achieves ROC-AUC ∼0.985 in jet tagging; LorentzNet achieves AUC = 0.9839 using only 5% of training data, demonstrating the generalization and sample efficiency from built-in symmetries (Bhardwaj et al., 24 Apr 2024).

5. Interpretability, Scientific Utility, and Generalization

A distinguishing property of knowledge-informed AutoFE is its ability to ensure semantic interpretability and to support scientific discovery:

  • Interpretability enforcement: In KRAFT, only KG-validated features pass to downstream models. The feature construction process can reject, for example, the sum of quantities with mismatched units, or aggregations violating physical constraints (Bouadi et al., 1 Jun 2024). In agent-based frameworks, systematic justification for each generated feature (e.g., human-readable explanations aligning with the Focus Area) is integral (Bradland et al., 19 Nov 2025).
  • Scientific insight: The Rogue One system surfaced a previously unstudied composite feature (WBC/ESR) as a candidate biomarker for heart failure, with explicit encouragement for experimental validation. This demonstrates that knowledge-informed AutoFE can contribute hypothesis generation beyond pure predictive analytics (Bradland et al., 19 Nov 2025).
  • Modality transfer and generalization: Structural patterns—such as patient-centric aggregation, event-embedding via temporal kernels, or graph-based neural encodings—generalize to domains as diverse as healthcare, fraud, maintenance, customer journey analysis, and fundamental physics (Björneld et al., 8 Apr 2025, Bhardwaj et al., 24 Apr 2024).
  • Bio-inspired extractors: Methods such as COPE and B-COSFIRE yield interpretable filters that correspond directly to prototypical events or curvilinear structures, maintaining high performance under extreme data scarcity (Strisciuglio, 2018).

6. Limitations, Open Problems, and Future Directions

Despite its successes, knowledge-informed AutoFE currently faces several challenges:

  • Coverage of domain knowledge: Knowledge graph and RAG-based methods depend on the breadth and granularity of encoded domain entities. Unknown or poorly-specified feature transformations may be rejected or remain unexplored (Bouadi et al., 1 Jun 2024, Bradland et al., 19 Nov 2025).
  • Scalability to high-dimensional/transient feature spaces: While foundation models can restrict the candidate space efficiently, combinatorial explosion remains a risk without expert-guided proposals or aggressive pruning (Lin et al., 2023).
  • Validation beyond AUROC: Single-metric evaluation may inadequately capture calibration, decision-theoretic value, or biomedical impact. Future work recommends multi-metric frameworks including calibration loss and decision curve analysis (Björneld et al., 8 Apr 2025).
  • Temporal and sequential structure: Current RNN and simple feed-forward models may be inadequate for complex temporal data; future directions include transformer-based or attention-augmented encoders to capture event sequence patterns (Björneld et al., 8 Apr 2025).
  • Towards graded interpretability: Binary reasoners (interpretable or not) may be too restrictive; a graded interpretability index could support a more nuanced trade-off between accuracy and transparency (Bouadi et al., 1 Jun 2024).

A plausible implication is that integrating richer, modular domain knowledge (including foundation models as retrieval or reasoning modules), expanding symbolic coverage, and explicit multi-objective formulations (accuracy, interpretability, stability) will further broaden the impact and trustworthiness of knowledge-informed automatic feature extraction frameworks across scientific, biomedical, and commercial contexts.
