Centroid Decision Forests (CDF)
- Centroid Decision Forests (CDF) are an ensemble framework for high-dimensional classification that partitions data using class centroids computed in a discriminatively selected feature subspace.
- CDF employs a class separability score (CSS) to select top features and compute centroids, yielding interpretable and robust splits based on nearest-distance partitioning.
- Empirical evaluations demonstrate that CDF outperforms traditional methods in accuracy and scalability, proving effective in complex applications such as biomedical data analysis.
Centroid Decision Forests (CDF) are an ensemble learning framework for high-dimensional classification tasks, distinguished by their centroid-driven partitioning and discriminative feature selection. Unlike conventional decision trees that perform threshold-based splits on single features, CDF builds each tree by systematically selecting the most discriminative features via a class separability score (CSS) and partitioning data according to proximity to class centroids. This design yields interpretable and robust splits tailored to the underlying structure of complex, high-dimensional datasets.
1. Methodological Foundations
Centroid Decision Forests are grounded in the "partition and vote" paradigm established for decision forest models (Xu et al., 2021). The framework extends conventional decision tree methodology by replacing axis-aligned, single-feature splits with centroid-based separations in a discriminative subspace. At each node in a Centroid Decision Tree (CDT), the CSS is evaluated for all candidate features. The CSS for feature $j$ is

$$\mathrm{CSS}_j = \frac{2}{C(C-1)} \sum_{c=1}^{C-1} \sum_{c'=c+1}^{C} \frac{\lvert \mu_{jc} - \mu_{jc'} \rvert}{\sigma_{jc} + \sigma_{jc'} + \epsilon},$$

where $\mu_{jc}$ and $\sigma_{jc}$ are the mean and standard deviation of feature $j$ in class $c$, $C$ is the number of classes, and $\epsilon > 0$ prevents division by zero. The features with the highest CSS values are selected, forming a subspace for centroid construction.
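As a concrete illustration, the following is a minimal NumPy sketch of the CSS computation and top-feature selection, assuming the pairwise form given above; the names `css_scores`, `top_features`, and the cutoff `n_top` are illustrative choices, not identifiers from the paper.

```python
import numpy as np

def css_scores(X, y, eps=1e-8):
    """CSS per feature: class-pair mean separation normalized by the
    summed standard deviations (eps guards against zero spread)."""
    classes = np.unique(y)
    scores = np.zeros(X.shape[1])
    for a in range(len(classes)):
        for b in range(a + 1, len(classes)):
            Xa, Xb = X[y == classes[a]], X[y == classes[b]]
            scores += np.abs(Xa.mean(0) - Xb.mean(0)) / (Xa.std(0) + Xb.std(0) + eps)
    return scores / (len(classes) * (len(classes) - 1) / 2)  # average over pairs

def top_features(X, y, n_top=10):
    """Indices of the n_top most discriminative features at a node."""
    return np.argsort(css_scores(X, y))[::-1][:n_top]
```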
Within each node, class centroids are computed as the sample mean of the selected features:

$$\mathbf{z}_c = \frac{1}{n_c} \sum_{i:\, y_i = c} \mathbf{x}_i^{(S)},$$

where $S$ denotes the selected feature subset, $\mathbf{x}_i^{(S)}$ is sample $i$ restricted to $S$, and $n_c$ is the number of class-$c$ samples at the node.
Each sample is then assigned to the partition corresponding to the nearest centroid, determined by Euclidean distance:

$$c^{*}(\mathbf{x}) = \arg\min_{c \in \{1,\dots,C\}} \left\lVert \mathbf{x}^{(S)} - \mathbf{z}_c \right\rVert_2.$$
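A minimal sketch of the centroid construction and nearest-centroid assignment, assuming `X_sub` already holds only the CSS-selected features at the node (a simplification for brevity):

```python
import numpy as np

def centroid_partition(X_sub, y):
    """Per-class centroids and nearest-centroid assignment.

    X_sub: (n_samples, n_selected) node data restricted to selected features.
    Returns centroids (n_classes, n_selected) and a partition index per sample.
    """
    classes = np.unique(y)
    centroids = np.stack([X_sub[y == c].mean(axis=0) for c in classes])
    # Euclidean distance of every sample to every centroid: (n_samples, n_classes)
    dists = np.linalg.norm(X_sub[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)
```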
This recursive centroid-based splitting mechanism departs from conventional axis-aligned protocols and seeks to minimize the within-class sum of squares, yielding region assignments that directly reflect the underlying class structure.
2. Partition and Vote: Conceptual Perspective
CDF inherits the decision forest property of ensemble voting. Inference proceeds as follows: each CDT $t$ routes a test sample $\mathbf{x}$ to a leaf, producing a class prediction $\hat{h}_t(\mathbf{x})$, so a forest of $T$ trees yields $T$ predictions. The final ensemble output is given by majority voting:

$$\hat{y}(\mathbf{x}) = \arg\max_{c \in \{1,\dots,C\}} \sum_{t=1}^{T} \mathbb{1}\!\left[\hat{h}_t(\mathbf{x}) = c\right].$$
This approach is mathematically analogous to the partition-and-vote model of (Xu et al., 2021), which frames both decision forests and deep networks as partitioning the input space into local regions and aggregating predictions within those regions. In CDF, leaves correspond to partitions defined by nearest-centroid assignments in a discriminative feature space, thus exploiting learned local geometry for robust classification.
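A minimal sketch of the voting step, assuming each fitted tree exposes a `predict` method that returns integer class labels (a hypothetical interface, for illustration only):

```python
import numpy as np

def forest_predict(trees, X):
    """Majority vote over per-tree predictions.

    Ties resolve toward the lowest class index via argmax.
    """
    votes = np.stack([t.predict(X) for t in trees])  # (T, n_samples)
    n_classes = votes.max() + 1
    # Per-sample vote counts: (n_classes, n_samples)
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
    return counts.argmax(axis=0)
```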
3. Empirical Performance in High Dimensions
CDF has been empirically evaluated across 23 benchmark datasets spanning 46 to 12,625 features and sample sizes from 23 to 250. Each dataset is split 70%/30% for training and testing, and experiments are repeated 500 times to ensure robust estimates. The primary metrics are classification accuracy and Cohen's kappa statistic; a sketch of this evaluation loop closes the section. Results demonstrate:
- Mean accuracy: 0.871 for CDF, exceeding RF (0.836) and CART (0.735)
- Mean Cohen’s kappa: 0.734 for CDF
- Superior generalization, especially in imbalanced and high-dimensional regimes
Compared classifiers include CART, Random Forest (RF), Regularized RF (RRF), Extreme Gradient Boosting (XGB), k-Nearest Neighbors (kNN), Random kNN (RkNN), and Support Vector Machines (SVM). Across these datasets, CDF exhibits consistently stronger performance in both balanced and imbalanced scenarios, substantiating its flexibility and efficacy for high-dimensional data (Ali et al., 2025).
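The repeated-split protocol can be summarized in a short scikit-learn sketch; the `RandomForestClassifier` here merely stands in for any of the compared methods, and the stratified split is an assumption about the exact protocol:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.ensemble import RandomForestClassifier  # stand-in for any compared classifier

def evaluate(X, y, n_repeats=500, test_size=0.30):
    """70/30 train/test splits repeated n_repeats times;
    returns mean accuracy and mean Cohen's kappa."""
    accs, kappas = [], []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed)
        clf = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
        pred = clf.predict(X_te)
        accs.append(accuracy_score(y_te, pred))
        kappas.append(cohen_kappa_score(y_te, pred))
    return float(np.mean(accs)), float(np.mean(kappas))
```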
4. Interpretability and Scalability
CDF is designed for both interpretability and scalability:
- Interpretability arises from explicit CSS-based feature selection and centroid-based splits. At each node, the model communicates which features drive discrimination and how partitions are defined, facilitating direct inspection of class differentiations.
- Centroids are presented as mean vectors of selected features and can be visualized for further interpretive utility.
- Scalability is addressed by building each CDT on bootstrapped data with a randomly subsampled feature set. Trees are constructed independently and in parallel, with fixed tree-depth and feature-sampling parameters, enabling deployment on very high-dimensional datasets (see the sketch after this list).
- Resource requirements scale linearly with the number of trees and selected features per node; the centroid computation avoids the combinatorial explosion of axis-aligned splitting, maintaining computational tractability.
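A minimal sketch of this parallel, bootstrapped construction, assuming a caller-supplied `build_tree` callable that grows one CDT (hypothetical; the paper's actual training routine is not shown here):

```python
import numpy as np
from functools import partial
from concurrent.futures import ProcessPoolExecutor

def _fit_one(seed, X, y, build_tree, n_feat):
    """Grow one CDT on a bootstrap sample over a random feature subset."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, len(X), size=len(X))                # bootstrap rows
    cols = rng.choice(X.shape[1], size=n_feat, replace=False)  # feature subset
    return build_tree(X[rows][:, cols], y[rows]), cols

def fit_forest(X, y, build_tree, n_trees=100, n_feat=200):
    """Trees are independent, so they can be grown in parallel processes."""
    fit = partial(_fit_one, X=X, y=y, build_tree=build_tree, n_feat=n_feat)
    with ProcessPoolExecutor() as pool:
        return list(pool.map(fit, range(n_trees)))
```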
5. Application and Implications for Scientific Data
CDF is particularly well suited to scientific and biomedical data characterized by small sample sizes, high dimensionality, and noise, an assessment supported by results in (Xu et al., 2021). Its robustness stems from aggregating ensemble decisions across simple centroid-based partitions, effectively regularizing the model and mitigating the overfitting risks prominent in deep network counterparts.
In biomedical contexts, reliable and interpretable classification is prioritized due to the high cost of misclassification. The centroid-based tree structure fosters such reliability, as splits are driven by class statistics rather than arbitrary thresholds, and partitions preserve meaningful relationships in the feature space.
6. Comparative Analysis and Hybrid Prospects
CDF exemplifies advances in feature-driven partitioning over conventional decision forests. Its methodological innovations, particularly the CSS feature selection and centroid-driven splits, position it between classical tree ensembles and prototype-based models. The partition-and-vote perspective connects CDF to deep networks, suggesting avenues for hybridization—such as deep forest architectures (Xu et al., 2021), end-to-end differentiable forests, and ensemble models leveraging complementary strengths of trees and neural networks.
A plausible implication is that future workflows may operationalize low-dimensional embeddings from deep networks as input features for CDF, exploiting hierarchical feature extraction and robust partitioning in tandem.
7. Summary
Centroid Decision Forests represent a rigorously formulated ensemble learning approach for high-dimensional, complex classification tasks. The key components are:
- Splitting strategy based on class separability score (CSS)
- Centroid construction from top discriminative features
- Partitioning by nearest-centroid assignment in informative subspaces
- Ensemble inference via majority voting across trees
Empirical evidence demonstrates superior accuracy and Cohen's kappa scores relative to conventional classifiers on challenging datasets. CDF further provides interpretability and scalability, and its underlying partition-and-vote structure enables connections to neural methods and motivates hybrid approaches. The framework is particularly effective in scientific and biomedical domains, where its robustness and transparency are critical. These methodological advances mark Centroid Decision Forests as a significant development for practical high-dimensional classification.