Taxonomy-Augmented Feature Engineering
- Taxonomy-augmented feature engineering is a methodology that systematically integrates domain taxonomies into machine learning pipelines to constrain feature selection and enhance model interpretability.
- It leverages techniques such as group-based feature selection, taxonomy-driven data augmentation, and hierarchical label supervision to achieve significant computational and statistical gains.
- Applications in trajectory analysis, microbiome classification, and fine-grained image/text categorization demonstrate its effectiveness in reducing search space and aligning learning with domain knowledge.
Taxonomy-augmented feature engineering is a methodology that systematically integrates domain or data-driven taxonomies into the feature construction, selection, or representation process within machine learning pipelines. By leveraging the inherent hierarchical, semantic, or structural relationships among features or entities as defined by a taxonomy, this paradigm improves statistical efficiency, interpretability, and computational tractability. Applications span trajectory analysis, microbiome data, fine-grained image and text classification, and semantic matching, where either features themselves or domain objects (e.g., taxa, labels, questions) possess natural taxonomic organization (Samarasinghage et al., 25 Jun 2025, Su et al., 2021, Gupta et al., 2021, Chaussard et al., 4 Jul 2025, Škrlj et al., 2019).
1. Fundamental Principles of Taxonomy-Augmented Features
Taxonomy-augmented feature engineering exploits structured groupings and relationships among either input features or target labels to constrain, guide, or enhance feature extraction and selection. The taxonomic organization may be derived from scientific knowledge (e.g., biological hierarchies, physical motion categories), linguistic resources (e.g., WordNet), or constructed for task-driven attributes (e.g., question type hierarchies).
Core mechanisms include:
- Grouping features or labels into taxonomically coherent sets (e.g., geometric vs. kinematic movement features (Samarasinghage et al., 25 Jun 2025); species → genus → family hierarchies for microbiome data (Chaussard et al., 4 Jul 2025)).
- Computing features at multiple taxonomic resolutions (e.g., Phylum, Order for semi-supervised learning (Su et al., 2021)).
- Engineering semantic vectors based on background taxonomies (e.g., hypernym chains in tax2vec (Škrlj et al., 2019)).
- Using taxonomy-based loss or sampling mechanisms for data augmentation and learning (e.g., hierarchical losses, tree-structured priors).
This approach mitigates feature explosion, enables groupwise reasoning, and aligns model operations with domain ontologies.
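As a concrete illustration of the grouping and multi-resolution mechanisms, the sketch below aggregates genus-level counts up to the family level using a parent map; all genus and family names here are hypothetical placeholders, not taken from the cited papers:

```python
import pandas as pd

# Hypothetical genus-level abundance table (rows = samples) and genus -> family parent map.
genus_counts = pd.DataFrame({"genus_A1": [12, 0], "genus_A2": [3, 7], "genus_B1": [5, 9]})
genus_to_family = {"genus_A1": "family_A", "genus_A2": "family_A", "genus_B1": "family_B"}

# Coarser-resolution features: sum member genera within each family.
family_counts = genus_counts.T.groupby(genus_to_family).sum().T
print(family_counts)   # columns: family_A (= A1 + A2), family_B
```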
2. Methodologies and Algorithmic Frameworks
Multiple methodological archetypes exist within taxonomy-augmented feature engineering:
A. Taxonomy-Based Feature Selection
For problems such as trajectory classification, features are partitioned by domain-informed taxonomies (e.g., geometric/kinematic; curvature, indentation, speed, acceleration). Instead of combinatorial subset selection over individual variables, the taxonomy imposes a group structure: features are selected or excluded only as entire taxonomic groups. Thus, the search space shrinks from the 2^n − 1 non-empty subsets of n individual features to 2^k − 1 group subsets, where k is the number of taxonomic groups (k = 4 in (Samarasinghage et al., 25 Jun 2025)). The group-enumeration selection procedure can be sketched as follows (assuming `groups` maps each taxonomic group name to its feature columns in `X`):
```python
import numpy as np
from sklearn.model_selection import cross_val_score

group_names = list(groups)                                # groups: {group name -> column indices in X}
best_score, best_subset = -np.inf, None
for mask in range(1, 2 ** len(group_names)):              # all 2^k - 1 non-empty group subsets
    chosen = [g for i, g in enumerate(group_names) if mask & (1 << i)]
    X_sub = np.hstack([X[:, groups[g]] for g in chosen])  # keep only the chosen groups' columns
    score = cross_val_score(model, X_sub, y, cv=5).mean()
    if score > best_score:
        best_score, best_subset = score, chosen
```
This yields at most 2^k − 1 model fits; for k = 4, only 15 possible subsets.
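For concreteness, a minimal usage sketch for the loop above; the dataset, group names, and classifier below are illustrative placeholders rather than the experimental setup of (Samarasinghage et al., 25 Jun 2025):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
groups = {                                   # hypothetical 4-group movement taxonomy
    "curvature": [0, 1], "indentation": [2, 3],
    "speed": [4, 5], "acceleration": [6, 7],
}
model = RandomForestClassifier(random_state=0)
# Running the enumeration above now requires only 2**4 - 1 = 15 cross-validated fits.
```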
B. Taxonomy-Based Data Augmentation
Within microbiome-trait classification, TaxaPLN (Chaussard et al., 4 Jul 2025) encodes taxonomic relationships via hierarchical Poisson log-normal generative models, in which latent variational transitions and conditional emission distributions are matched to a taxonomic tree. Synthetic samples are generated by sampling from a VAMP mixture prior over encodings of observed data, enforcing taxonomic and ecological coherence by design.
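A minimal sketch of drawing synthetic samples from a VAMP mixture prior, assuming a generic amortized `encoder` that returns (mean, log-variance) and a `decoder` that returns emission parameters; these names and signatures are illustrative and do not reproduce the TaxaPLN implementation:

```python
import torch

def sample_vamp(encoder, decoder, pseudo_inputs, n_samples):
    """Draw synthetic profiles from a VAMP mixture prior over learned pseudo-inputs."""
    K = pseudo_inputs.shape[0]
    idx = torch.randint(K, (n_samples,))                    # choose mixture components uniformly
    mu, logvar = encoder(pseudo_inputs[idx])                # variational posterior at pseudo-inputs
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()    # reparameterized latent draw
    return decoder(z)                                       # parameters of the emission distribution
```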
C. Taxonomy-Based Label Supervision and Consistency
In semi-supervised fine-grained classification, coarse taxonomic labels (e.g., Phylum, Order) are injected as supervision at internal network layers using hierarchical marginalization and cross-entropy losses. This can be directly integrated into state-of-the-art methods such as FixMatch, with consistency regularization enforced at both taxonomic and finest-grained levels (Su et al., 2021).
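A minimal sketch of coarse-label supervision via hierarchical marginalization; the indicator-matrix encoding of the taxonomy is an assumption for illustration, not the exact formulation in (Su et al., 2021):

```python
import torch
import torch.nn.functional as F

def hierarchical_loss(fine_logits, fine_labels, coarse_labels, fine_to_coarse):
    """Fine-level cross-entropy plus a coarse-level term obtained by marginalizing
    fine-class probabilities up the taxonomy.

    fine_to_coarse: (n_fine, n_coarse) 0/1 matrix mapping each fine class to its coarse ancestor.
    """
    fine_probs = fine_logits.softmax(dim=1)
    coarse_probs = fine_probs @ fine_to_coarse               # sum fine probabilities per coarse class
    loss_fine = F.cross_entropy(fine_logits, fine_labels)
    loss_coarse = F.nll_loss(torch.log(coarse_probs + 1e-8), coarse_labels)
    return loss_fine + loss_coarse
```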
D. Taxonomy-Induced Semantic Feature Construction
For short text classification, the tax2vec system (Škrlj et al., 2019) uses a background concept taxonomy (e.g., WordNet) to project documents into high-level semantic feature spaces. Semantic features are extracted through word-sense disambiguation, hypernym-path extraction, and TF-IDF aggregation over selected taxonomic concepts, with feature selection heuristics (betweenness, PageRank, mutual information) operating directly on the induced corpus taxonomy.
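A minimal sketch of hypernym-based feature extraction with NLTK's WordNet interface; taking the first sense per token is a crude stand-in for the word-sense disambiguation and graph-based feature selection that tax2vec itself performs:

```python
from collections import Counter
from nltk.corpus import wordnet as wn   # requires the NLTK WordNet corpus to be downloaded

def hypernym_features(tokens):
    """Count hypernym concepts reachable from a document's tokens."""
    counts = Counter()
    for tok in tokens:
        synsets = wn.synsets(tok)
        if synsets:
            for path in synsets[0].hypernym_paths():   # chains from the chosen sense up to the root
                counts.update(s.name() for s in path)
    return counts                                      # can then be TF-IDF weighted over the corpus
```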
E. Hybrid Architectures for Semantic Matching
In semantic question matching, a taxonomy of answer types (coarse/fine classes, focus) is predicted for each question and encoded as one-hot or embedding vectors. These features are concatenated with deep encoder features and used jointly for ranking/classification (Gupta et al., 2021).
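A minimal sketch of the fusion step, concatenating taxonomy-type embeddings with an encoder representation before a scoring head; the class name, dimensions, and layer layout are assumptions for illustration, not the architecture of (Gupta et al., 2021):

```python
import torch
import torch.nn as nn

class TaxonomyAugmentedMatcher(nn.Module):
    """Fuse a question encoder with coarse/fine taxonomy-type embeddings."""
    def __init__(self, encoder, n_coarse, n_fine, d_tax=16, d_enc=256):
        super().__init__()
        self.encoder = encoder                           # any module mapping a question to (B, d_enc)
        self.coarse_emb = nn.Embedding(n_coarse, d_tax)
        self.fine_emb = nn.Embedding(n_fine, d_tax)
        self.head = nn.Linear(d_enc + 2 * d_tax, 1)      # ranking / matching score

    def forward(self, question, coarse_id, fine_id):
        h = self.encoder(question)
        tax = torch.cat([self.coarse_emb(coarse_id), self.fine_emb(fine_id)], dim=-1)
        return self.head(torch.cat([h, tax], dim=-1))
```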
3. Computational Efficiency and Statistical Performance
The principal justification for taxonomy-augmented feature engineering is the drastic reduction in the model selection or feature search space, directly translating to computational and statistical gains. In trajectory feature selection (Samarasinghage et al., 25 Jun 2025):
| Approach | Model Fits Required | Typical Runtime (Largest Dataset) | Weighted-F1 (Median Best) |
|---|---|---|---|
| Forward/Backward Wrapper | ~5000 | 40 min | 0.541–0.751 |
| Taxonomy Approach | 15 | 2 min | 0.491–0.759 |
Taxonomy-based methods outperformed or matched wrapper methods in 67% of comparisons, with a roughly 90% reduction in CPU cost. Similar empirical patterns hold in text classification, where adding tax2vec features to SVMs modestly increases micro-F1, especially in few-shot settings (Škrlj et al., 2019), and in question matching, where taxonomy features add 3–5% absolute recall/MRR over deep baselines (Gupta et al., 2021).
In the context of microbiome data, taxonomy-based augmentation (TaxaPLN) yields significant mean AUPRC increases over baseline and state-of-the-art non-taxonomic methods, particularly for non-linear classifiers (+4.1% for MLP, +2.6% for XGBoost) (Chaussard et al., 4 Jul 2025).
4. Interpretability and Domain Insight
Taxonomy-augmented feature engineering enables interpretable, group-level explanations by revealing which taxonomic properties or groupings drive model decisions. For example:
- Trajectory Data: Model selection identifies which domain-level descriptors (curvature, speed, acceleration) are critical for discrimination, allowing statements such as “our model depends on curvature information” instead of referencing opaque feature indices (Samarasinghage et al., 25 Jun 2025).
- Microbiome Augmentation: TaxaPLN ensures that synthesized profiles reflect taxonomic ancestry, preserving ecological beta and alpha diversity metrics and producing human-interpretable, biologically plausible samples (Chaussard et al., 4 Jul 2025).
- Text/Question Data: Each feature directly corresponds to a human-readable concept (WordNet synset, question type), yielding fully transparent semantic input representations (Škrlj et al., 2019, Gupta et al., 2021).
Such groupwise interpretability is impossible when features are manipulated individually, as in conventional wrapper-based selection or dense deep feature extractors.
5. Applications and Empirical Outcomes
Domains with natural taxonomic or hierarchical structure benefit most clearly:
- Trajectory Analysis: Efficient, interpretable grouping of motion features into geometric and kinematic axes yields order-of-magnitude speedup in model search and enhances predictive stability across RandomForest, XGBoost, MLP, and logistic regression (Samarasinghage et al., 25 Jun 2025).
- Microbiome Classification: TaxaPLN’s synthetic data preserve community structure, sparsity, and compositionality, outperforming non-taxonomic augmenters in low- and moderate-data regimes (Chaussard et al., 4 Jul 2025).
- Fine-Grained Visual Categorization: Semi-supervised learning with taxonomy-aware losses and marginalization leverages coarse (Phylum, Order) labels to deliver up to +6% accuracy over the baseline, with further gains when combined with hierarchical FixMatch (Su et al., 2021).
- Short Text Classification and Semantic Matching: Taxonomy-informed semantic features or question types supplement learned representations, providing measurable and interpretable gains, especially when labeled data are scarce (Škrlj et al., 2019, Gupta et al., 2021).
6. Limitations, Practical Guidelines, and Extensions
Key limitations include:
- Overhead in constructing or learning the taxonomy (domain-expert effort, pre-existing ontologies).
- For generative approaches (TaxaPLN), model capacity tuning and training time (up to roughly 15 minutes of GPU time per cohort) may be non-trivial compared to naive Mixup-type methods (Chaussard et al., 4 Jul 2025).
- Diminishing returns at ultra-fine label granularity or for very large flat taxonomies (Su et al., 2021, Škrlj et al., 2019).
Practitioner guidelines emphasize:
- Always structure features or labels using existing domain taxonomies where feasible.
- For feature selection, default to group-level enumeration before considering greedy, unstructured subset search.
- In few-shot or low-label regimes, prioritize interpretable, taxonomy-structured features for stability.
- Validate that synthetic augmented data match domain-specific statistical properties (e.g., diversity metrics in microbiome data (Chaussard et al., 4 Jul 2025)); see the sketch after this list.
- Use taxonomy filtering to handle unlabeled or out-of-domain data when ranking reliability is required (Su et al., 2021).
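As one way to realize the diversity-validation guideline above, a minimal sketch comparing Shannon alpha diversity between real and augmented microbiome count matrices; the specific validation metrics used for TaxaPLN may differ:

```python
import numpy as np
from scipy.stats import entropy

def shannon_alpha(counts):
    """Shannon alpha diversity per sample from a (samples, taxa) count matrix."""
    rel = counts / counts.sum(axis=1, keepdims=True)
    return np.array([entropy(row[row > 0]) for row in rel])

def diversity_gap(real_counts, synthetic_counts):
    """Absolute difference in mean Shannon diversity between real and synthetic samples."""
    return abs(shannon_alpha(real_counts).mean() - shannon_alpha(synthetic_counts).mean())
```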
Potential extensions include learning taxonomic embeddings beyond fixed marginalization, edge-parameterization for non-tree taxonomies, more expressive conditional mechanisms (e.g., attention), and latent-space Mixup for structured generative models.
7. Broader Significance and Outlook
Taxonomy-augmented feature engineering offers a unified framework that bridges domain knowledge with statistical learning. By leveraging the structured relationships implicit in science, language, and cognition, this approach enables efficient, interpretable, and domain-aligned machine learning pipelines. It is particularly impactful in high-dimensional, structured, or data-scarce contexts where arbitrary feature selection and unstructured representations hinder inference and understanding. As taxonomies and ontologies become increasingly prevalent across domains, taxonomy-based engineering is positioned as a central methodology for robust, explainable artificial intelligence (Samarasinghage et al., 25 Jun 2025, Chaussard et al., 4 Jul 2025, Su et al., 2021, Gupta et al., 2021, Škrlj et al., 2019).