Hierarchical XGBoost Ensemble

Updated 9 December 2025
  • Hierarchical XGBoost Ensemble is a composite machine learning strategy that organizes multi-class tasks into structured decision nodes and interpretable surrogate trees.
  • It combines explicit hierarchical classifiers with visual tools like VITE to map feature usage and improve performance, achieving accuracies near or above 91%.
  • Key advantages include enhanced interpretability, efficient feature utilization, and improved regulatory compliance in complex decision-making scenarios.

A Hierarchical XGBoost Ensemble is a composite machine learning strategy that leverages the tree-based XGBoost algorithm within explicit hierarchical architectures or via post hoc “unboxing” of black-box tree ensembles. This approach has two foundational implementations: (1) explicit hierarchical multi-node classification schemes leveraging XGBoost for structured decision tasks, and (2) interpretable surrogate trees constructed to summarize and approximate the behavior of complex XGBoost ensembles with a hierarchical feature utilization view. These two lines are exemplified by Di Teodoro et al.'s VITE/MIRET framework for interpretable surrogates (Teodoro et al., 2023) and the structured fault-detection pipeline detailed in (Sami et al., 2021).

1. Explicit Hierarchical XGBoost Classification Architectures

Hierarchical XGBoost architectures segment a multi-class classification problem into a rooted decision tree of sequential XGBoost classifiers. For each decision node, an XGBoost estimator—often binary or reduced-multiclass—is trained for locally specific sub-tasks, with prediction routing determined by the output of parent nodes. This strategy decomposes tasks where output classes possess natural groupings or hierarchical structures, notably improving interpretability, error containment, and class balance.

Architectural Example: In power transformer fault identification (Sami et al., 2021), samples with 12 empirically ranked and EMD-transformed Dissolved Gas Analysis (DGA) features are routed through three XGBoost classifiers:

  • Level-1 (XGB₁): binary separation of Discharge-type (PD, D1, D2) vs Thermal-type (T1, T2, T3) faults.
  • Level-2a (XGB₂): multiclass refinement among Discharge subtypes.
  • Level-2b (XGB₃): multiclass refinement among Thermal subtypes.

Each classifier is parameter-tuned independently (e.g., learning rate, depth, subsample ratio via cross-validation), and final predictions follow a sequential path, enhancing sensitivity and accuracy (e.g., 91.2% average accuracy reported).
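
A minimal sketch of this routing logic is shown below. It assumes a NumPy feature matrix and string fault labels, and leaves hyperparameters at library defaults rather than the tuned values of Sami et al.; the class and attribute names are illustrative only.

```python
import numpy as np
from xgboost import XGBClassifier

# Assumed label grouping following the DGA fault taxonomy described above.
DISCHARGE = {"PD", "D1", "D2"}

class HierarchicalXGB:
    """Two-level hierarchy: XGB1 separates fault families, XGB2/XGB3 refine subtypes."""

    def __init__(self, **xgb_params):
        self.xgb1 = XGBClassifier(**xgb_params)  # discharge vs thermal
        self.xgb2 = XGBClassifier(**xgb_params)  # PD / D1 / D2
        self.xgb3 = XGBClassifier(**xgb_params)  # T1 / T2 / T3

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        disc = np.isin(y, list(DISCHARGE))
        self.xgb1.fit(X, disc.astype(int))  # level 1: 1 = discharge family, 0 = thermal
        # XGBClassifier expects integer labels 0..K-1, so encode each subtype set.
        self.d_classes, y_d = np.unique(y[disc], return_inverse=True)
        self.t_classes, y_t = np.unique(y[~disc], return_inverse=True)
        self.xgb2.fit(X[disc], y_d)    # level 2a
        self.xgb3.fit(X[~disc], y_t)   # level 2b
        return self

    def predict(self, X):
        X = np.asarray(X)
        disc = self.xgb1.predict(X).astype(bool)
        out = np.empty(len(X), dtype=object)
        if disc.any():
            out[disc] = self.d_classes[self.xgb2.predict(X[disc])]
        if (~disc).any():
            out[~disc] = self.t_classes[self.xgb3.predict(X[~disc])]
        return out
```

Each node's estimator can then be tuned independently with its own grid, as discussed in Section 5.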

2. Hierarchical Visualization of Tree Ensembles: VITE

Where hierarchical structure is implicit rather than explicit, interpretability is achieved by hierarchical visualization of feature usage across the ensemble. Di Teodoro et al. propose a “forest→heatmap+representative-tree” method labeled VITE (Teodoro et al., 2023), which yields granular insight into feature selection frequencies and split thresholds at each depth in the ensemble.

Level-Frequency Heatmap: The $|J|\times D$ matrix $f_{d,j}$ quantifies, for each feature $j$ and tree level $d$, the weighted frequency of selection across all estimators, correcting for imbalanced tree weights and varying pruning structures:

$$f_{d,j} \;=\; \frac{1}{\sum_{e\in E} 2^{d}} \sum_{e\in E} w^{e} \sum_{t\in B^{e}(d)} \mathbf{1}\bigl(\text{feature } j \text{ used at node } t \text{ of tree } e\bigr)$$

Visualizing $f_{d,j}$ as a heatmap reveals global versus local feature importance: root-dominant features contrast with those that surface only in deeper subtrees (cf. “thal” at depth 0 in the Cleveland data).
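
As a rough illustration, $f_{d,j}$ can be assembled from a trained Booster's JSON tree dump. The helper below assumes uniform tree weights ($w^e = 1$) and a Booster obtained via model.get_booster(); the function name is illustrative:

```python
import json
import numpy as np

def level_frequency_matrix(booster, feature_names, max_depth):
    """Sketch of the level-frequency matrix f[d, j]: frequency with which feature j
    is selected at depth d, normalised by the 2^d split slots available per level."""
    dumps = booster.get_dump(dump_format="json")   # one JSON string per tree
    D, J = max_depth, len(feature_names)
    counts = np.zeros((D, J))

    def walk(node, depth):
        if "leaf" in node or depth >= D:
            return
        name = node["split"]                       # feature name, or "f<idx>" if unnamed
        j = feature_names.index(name) if name in feature_names else int(name.lstrip("f"))
        counts[depth, j] += 1.0
        for child in node.get("children", []):
            walk(child, depth + 1)

    for dump in dumps:
        walk(json.loads(dump), 0)

    norm = len(dumps) * np.array([2.0 ** d for d in range(D)])  # ~ sum_e 2^d at each level
    return counts / norm[:, None]
```

The returned matrix can be handed to any heatmap routine (e.g., matplotlib's imshow) to reproduce the depth-versus-feature view.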

Node-Frequency Representative Tree: Simultaneously, one can annotate a full-depth binary “skeleton” tree, displaying at each node the empirical distribution of feature selections and threshold ranges, extracted from ensemble node statistics.
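
A companion sketch, under the same assumptions as above, gathers the per-slot statistics needed to annotate such a skeleton tree by indexing every split node with its (depth, position) pair:

```python
import json
from collections import Counter, defaultdict

def skeleton_node_stats(booster, max_depth):
    """For each (depth, position) slot of a full binary skeleton, collect which
    features and thresholds the ensemble's trees place at that slot."""
    features = defaultdict(Counter)    # (depth, pos) -> Counter of split features
    thresholds = defaultdict(list)     # (depth, pos) -> list of split thresholds

    def walk(node, depth, pos):
        if "leaf" in node or depth >= max_depth:
            return
        features[(depth, pos)][node["split"]] += 1
        thresholds[(depth, pos)].append(node["split_condition"])
        children = {c["nodeid"]: c for c in node.get("children", [])}
        for k, branch in enumerate(("yes", "no")):   # left slot = 2*pos, right = 2*pos + 1
            child = children.get(node[branch])
            if child is not None:
                walk(child, depth + 1, 2 * pos + k)

    for dump in booster.get_dump(dump_format="json"):
        walk(json.loads(dump), 0, 0)
    return features, thresholds
```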

3. Multivariate Surrogate Tree Construction via MILP (MIRET)

For “unboxing” the XGBoost ensemble’s function in a sparse, interpretable form, Di Teodoro et al. introduce MIRET (Teodoro et al., 2023), which formulates the construction of a surrogate oblique tree as a mixed-integer linear program. The surrogate aims to mimic the ensemble $F_{\widehat{XGB}}(x)$, maximize fidelity, preserve sample proximity clusterings, and minimize the number of active features and splits.

Variables and Routing Constraints:

  • $a_{t,j}$: continuous coefficients of the hyperplane split at node $t$; $b_t$: node threshold.
  • $s_{t,j}$: binary indicator that feature $j$ is used at node $t$.
  • $z_{i,\ell}$: assignment of sample $i$ to leaf $\ell$.
  • $q^i_L(t),\, q^i_R(t)$: left/right branching indicators.

Routing, assignment, and feature selection constraints are enforced via big-$M$ relaxations and auxiliary binary variables.

Objective:

The mixed loss balances weighted misclassification against sparsity-inducing penalties:

$$\text{minimize}\;\; \frac{1}{2}\sum_{i} p^{i}\,\widehat{y}^{\,i}\Bigl(\widehat{y}^{\,i}-\sum_{\ell} c_{\ell}\, z_{i,\ell}\Bigr) \;+\; \alpha \sum_{d}\sum_{j\in J_{\gamma}(d)} \frac{1}{f_{d,j}} \sum_{t\in B(d)} s_{t,j}$$

where $\alpha$ controls the fidelity-sparsity tradeoff and $J_\gamma(d)$ restricts the features considered at depth $d$ by hierarchical frequency thresholds.

Proximity Constraints:

Sample pairs with high ensemble-leaf proximity $m_{i,k}=\frac{1}{|E|}\sum_{e} w^{e}\,\mathbf{1}(i,k \text{ share a leaf in } e)$ must be assigned to the same leaf of the surrogate.
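
To make the big-$M$ machinery concrete, the sketch below builds a heavily simplified depth-1 analogue of such a surrogate with PuLP: one oblique split, two leaves, an L1 fidelity term in place of the paper's weighted loss, a plain $\alpha\sum_j s_j$ sparsity penalty instead of the frequency-weighted one, and proximity handled by forcing selected pairs into the same leaf. It is a toy formulation under these assumptions, not the MIRET model itself:

```python
import numpy as np
import pulp

def depth1_surrogate(X, y_hat, alpha=0.1, eps=1e-4, proximal_pairs=()):
    """One oblique split a.x <= b mimicking ensemble outputs y_hat (X assumed in [0,1])."""
    n, p = X.shape
    # Big-M large enough for both routing and leaf-value linearisation.
    M = p + 1.0 + float(np.ptp(y_hat))
    prob = pulp.LpProblem("depth1_surrogate", pulp.LpMinimize)

    a = [pulp.LpVariable(f"a_{j}", -1, 1) for j in range(p)]         # split coefficients
    s = [pulp.LpVariable(f"s_{j}", cat="Binary") for j in range(p)]  # feature-used indicators
    b = pulp.LpVariable("b", -1, 1)                                  # split threshold
    z = [pulp.LpVariable(f"z_{i}", cat="Binary") for i in range(n)]  # 1 = routed left
    c = [pulp.LpVariable(f"c_{l}", float(y_hat.min()), float(y_hat.max())) for l in range(2)]
    u = [pulp.LpVariable(f"u_{i}") for i in range(n)]                # surrogate output
    e = [pulp.LpVariable(f"e_{i}", lowBound=0) for i in range(n)]    # |y_hat - u|

    for j in range(p):                    # a_j may be nonzero only if s_j = 1
        prob += a[j] <= s[j]
        prob += -a[j] <= s[j]

    for i in range(n):
        ax = pulp.lpSum(a[j] * float(X[i, j]) for j in range(p))
        prob += ax <= b + M * (1 - z[i])           # left leaf (z=1): a.x <= b
        prob += ax >= b + eps - M * z[i]           # right leaf (z=0): a.x > b
        prob += u[i] <= c[0] + M * (1 - z[i])      # u_i = c_0 when z_i = 1 ...
        prob += u[i] >= c[0] - M * (1 - z[i])
        prob += u[i] <= c[1] + M * z[i]            # ... and c_1 when z_i = 0
        prob += u[i] >= c[1] - M * z[i]
        prob += e[i] >= float(y_hat[i]) - u[i]     # L1 fidelity via two inequalities
        prob += e[i] >= u[i] - float(y_hat[i])

    for i, k in proximal_pairs:                    # high-proximity pairs share a leaf
        prob += z[i] == z[k]

    prob += pulp.lpSum(e) + alpha * pulp.lpSum(s)  # fidelity + sparsity objective
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return np.array([v.value() for v in a]), b.value(), [c[0].value(), c[1].value()]
```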

4. Computational Implementation and Performance Benchmarks

MIRET was benchmarked with Gurobi 10.0, using up to 96 GB of RAM and a 1 h time limit per run. On ten UCI tabular datasets (up to ~1000 samples, 60 features):

  • Depths $D=2$–$4$ are attainable with full optimality for most datasets.
  • Training fidelity reaches 89%–95%; test accuracy lags XGBoost by 2–4 percentage points but still achieves 85%–98% depending on the task.
  • Proximity retention >90%: high-proximity pairs remain co-located in the surrogate’s leaves.
  • Sparsity: surrogate trees use only 2–6 features versus XGBoost/Random Forest’s 10–30, supporting regulatory reporting and white-box requirements.

5. Feature Engineering and Hyperparameter Tuning in Hierarchical XGBoost

In the explicit hierarchical classification pipeline (Sami et al., 2021), feature extraction via Empirical Mode Decomposition (EMD) and rank ordering by skewness precede classifier deployment. Validation-accuracy analysis selects the optimal feature subset (12 IMF-derived coefficients feed the model).
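
The preprocessing can be illustrated schematically. The sketch below assumes per-sample 1-D gas-concentration signals, uses the third-party PyEMD package for the decomposition, and summarizes each IMF by its energy before skewness ranking; the actual coefficients and ranking rule of Sami et al. may differ:

```python
import numpy as np
from scipy.stats import skew
from PyEMD import EMD  # assumption: PyEMD ("EMD-signal" on PyPI) provides this class

def emd_skewness_features(signals, n_keep=12):
    """Decompose each signal into IMFs, summarise every IMF by one coefficient
    (its mean energy), then keep the n_keep columns with the largest |skewness|."""
    coeffs = []
    for sig in signals:                                   # signals: iterable of 1-D arrays
        imfs = EMD().emd(np.asarray(sig, dtype=float))    # shape (n_imfs, len(sig))
        coeffs.append((imfs ** 2).mean(axis=1))           # one energy value per IMF
    # Pad to a common width, since EMD can return different IMF counts per signal.
    width = max(len(c) for c in coeffs)
    X = np.array([np.pad(c, (0, width - len(c))) for c in coeffs])
    order = np.argsort(-np.abs(skew(X, axis=0)))          # rank features by skewness
    return X[:, order[:n_keep]], order[:n_keep]
```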

Hyperparameter grids for each XGBoost node include:

  • Learning rate η ∈ {0.01, 0.05, 0.1, 0.2}
  • Number of trees ∈ {50,100,200,300}
  • Max depth ∈ {3,5,7,9}
  • Subsample ∈ {0.6,0.8,1.0}
  • Regularization γ ∈ {0, 0.1, 1} and λ ∈ {0, 1, 5}

Cross-validation on stratified train/test splits tunes these, optimizing for sensitivity and mean accuracy.
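
Per-node tuning of this kind maps directly onto a standard scikit-learn grid search. The sketch below mirrors the grid above; the scoring metric and fold count are illustrative choices, and the full grid (1,728 combinations) may warrant a randomized search in practice:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

# Grid mirroring the values listed above.
param_grid = {
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "n_estimators": [50, 100, 200, 300],
    "max_depth": [3, 5, 7, 9],
    "subsample": [0.6, 0.8, 1.0],
    "gamma": [0, 0.1, 1],
    "reg_lambda": [0, 1, 5],
}

def tune_node_classifier(X, y, scoring="balanced_accuracy", n_splits=5):
    """Grid-search one hierarchy node; repeat for each of XGB1, XGB2 and XGB3."""
    search = GridSearchCV(
        estimator=XGBClassifier(eval_metric="logloss"),
        param_grid=param_grid,
        scoring=scoring,
        cv=StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0),
        n_jobs=-1,
    )
    search.fit(X, y)
    return search.best_estimator_, search.best_params_
```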

6. Practical Guidelines and Limitations

  • Data normalization ($x_j\in[0,1]$) is recommended to control numerical instability in MILP routing.
  • Surrogate tree depth $D=2$–$3$ is optimal for interpretability; $D=4$ is possible but yields larger, slower-to-solve models.
  • Feature selection thresholds ($\gamma_d$ at the 33–50% quantile) generally suffice; proximity constraints can be relaxed for efficiency.
  • MILP scalability limits tree depth and sample count (the formulation grows as $O(|I|\cdot 2^D\cdot|J|)$).
  • Out-of-sample fidelity may drop; further objective regularization is suggested for hold-out sets.
  • Use cases include regulated domains and model audits.

7. Comparative Results and Generalizability

Hierarchical XGBoost ensembles outperform baseline ratio-based, neural, and SVM classifiers on transformer fault identification, yielding average accuracy gains of 7–28% over previous methods (Sami et al., 2021). On UCI benchmarks, VITE and MIRET together produce both visually interpretable feature usage maps and sparse surrogates that reproduce the original XGBoost decision function with roughly 90% fidelity (Teodoro et al., 2023).

Both explicit and surrogate-based hierarchical ensembles generalize to time-series, tabular, and grouped-class domains. The modular composition and interpretability advantages extend to deeper hierarchies or broader multiclass splits, especially where class imbalances or regulatory interpretability requirements prevail.
