Mean Decrease Impurity (MDI) in Tree Models
- Mean Decrease Impurity (MDI) is a metric that quantifies the cumulative reduction in impurity from tree splits and is widely used to rank feature importance in models such as CART and random forests.
- It uses impurity measures such as variance for regression and Gini index or entropy for classification, providing both global and local insights into model behavior.
- Advanced variants like MDI-oob and MDI+ address bias and instability issues in deep trees and correlated datasets, enhancing reliability and interpretability.
Mean Decrease Impurity (MDI) is a central metric in tree-based machine learning algorithms—especially Classification and Regression Trees (CART), random forests, and their generalizations—for quantifying the importance of individual input variables. MDI is computed as the cumulative reduction in node impurity (e.g., variance in regression, Gini index or entropy in classification) attributable to each variable across all splits and trees. While MDI is frequently used for feature ranking and interpretation, modern theoretical and empirical studies provide a detailed account of both its behavior and its limitations under different modeling assumptions.
1. Definition and Formal Calculation of Mean Decrease Impurity
MDI quantifies the expected total decrease in a chosen impurity measure (variance, Gini, entropy) owing to splits performed on a given variable during tree construction. Consider a single decision tree grown by the standard CART procedure:
- For each split at node $t$ on variable $X_j$ at threshold $s$, the impurity decrease is:
$$\Delta i(t) \;=\; i(t) \;-\; \frac{N_{t_L}}{N_t}\, i(t_L) \;-\; \frac{N_{t_R}}{N_t}\, i(t_R),$$
where $i(t)$ is the impurity in node $t$, $t_L$ and $t_R$ are the left/right child nodes, and $N_t$ is the sample size in node $t$.
- The MDI for $X_j$ in tree $T$ is then:
$$\mathrm{MDI}_T(X_j) \;=\; \sum_{t \in T:\, v(t) = j} p(t)\, \Delta i(t),$$
where $p(t) = N_t / N$ is the proportion of samples reaching node $t$ and $v(t)$ is the variable split at node $t$.
- In a random forest of $M$ trees, global importance is obtained by averaging over all trees:
$$\mathrm{MDI}(X_j) \;=\; \frac{1}{M} \sum_{m=1}^{M} \mathrm{MDI}_{T_m}(X_j).$$
For classification, the impurity is typically the Gini index or entropy; for regression, the variance. MDI can also be generalized to “local” (per-instance) versions by restricting the sum to the splits along the path traversed by a specific test point (Sutera et al., 2021).
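As a concrete check on the definition above, the weighted sum of impurity decreases can be recomputed directly from a fitted scikit-learn tree and compared against the library's `feature_importances_` attribute, which reports normalized MDI. A minimal sketch (the dataset and hyperparameters are arbitrary illustrations):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=4, random_state=0)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

t = tree.tree_
n = t.weighted_n_node_samples
mdi = np.zeros(X.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf node: no split, no impurity decrease
        continue
    # p(t) * Delta i(t): weighted impurity decrease credited to the split variable
    decrease = (n[node] * t.impurity[node]
                - n[left] * t.impurity[left]
                - n[right] * t.impurity[right]) / n[0]
    mdi[t.feature[node]] += decrease

mdi /= mdi.sum()  # scikit-learn reports MDI normalized to sum to 1
```

The hand-computed `mdi` vector matches `tree.feature_importances_` exactly, confirming that the library attribute is precisely the normalized MDI of the formula above.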
2. Theoretical Properties and Interpretation
Under conditions of feature independence and absence of interactions, MDI admits a clear interpretation as an exact variance (or entropy) decomposition of the regression (or classification) function. If the true regression function is additive, $f(x) = \sum_j f_j(x_j)$, and the input variables are independent, the importance assigned to $X_j$ converges (in the population) to the marginal variance of its corresponding component:
$$\mathrm{MDI}(X_j) \;\to\; \mathrm{Var}\bigl[f_j(X_j)\bigr],$$
where $f_j$ is the additive component (Scornet, 2020). Summing MDI over all variables yields the explained variance:
$$\sum_j \mathrm{MDI}(X_j) \;=\; \mathrm{Var}(Y) \;-\; \mathbb{E}\bigl[\mathrm{Var}\bigl(Y \mid A(X)\bigr)\bigr],$$
where $A(X)$ denotes the leaf assigned to $X$. The ratio $\sum_j \mathrm{MDI}(X_j) / \mathrm{Var}(Y)$ therefore corresponds to $R^2$ in regression and a similar “explained information” ratio in classification settings.
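The additive decomposition can be observed empirically: with independent inputs and an additive target, forest MDI tracks the marginal variance of each component. A simulation sketch (model $y = 2x_0 + x_1 + \varepsilon$ is an invented example, so the component variances are $4$, $1$, and $0$ for a pure-noise feature):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 3))                    # independent standard-normal inputs
y = 2.0 * X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=800)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
imp = forest.feature_importances_                # normalized MDI, averaged over trees
```

Because $\mathrm{Var}[2 X_0] = 4 > \mathrm{Var}[X_1] = 1 > 0$, the MDI ranking recovers the ordering of the additive components, with the uninformative third feature receiving only residual importance.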
In the limit of infinitely randomized or sufficiently deep trees, global MDI coincides with the Shapley value of an associated cooperative game defined by mutual information or variance (Sutera et al., 2021):
$$\mathrm{MDI}(X_j) \;=\; \sum_{S \subseteq V \setminus \{j\}} \frac{|S|!\,\bigl(|V| - |S| - 1\bigr)!}{|V|!}\,\bigl[v(S \cup \{j\}) - v(S)\bigr],$$
where the characteristic function is $v(S) = I(X_S; Y)$, and $I$ denotes mutual information. This equivalence ensures properties such as efficiency, symmetry, and the null-player property for MDI in these settings.
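The Shapley formula above, and the efficiency and null-player properties it guarantees, can be verified exactly on a small game. The characteristic-function table below is a made-up “explained variance” game (player 2 is constructed as a null player), not derived from any real dataset:

```python
from itertools import combinations
from math import factorial

def shapley(players, v):
    """Exact Shapley values for a characteristic function v: frozenset -> float."""
    n = len(players)
    phi = {}
    for j in players:
        others = [p for p in players if p != j]
        total = 0.0
        for r in range(n):
            for S in combinations(others, r):
                S = frozenset(S)
                # weight |S|! (|V|-|S|-1)! / |V|! from the formula above
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += w * (v(S | {j}) - v(S))
        phi[j] = total
    return phi

# Hypothetical game: players 0 and 1 carry information, player 2 adds nothing.
table = {
    frozenset(): 0.0,
    frozenset({0}): 0.5, frozenset({1}): 0.3, frozenset({2}): 0.0,
    frozenset({0, 1}): 0.7, frozenset({0, 2}): 0.5, frozenset({1, 2}): 0.3,
    frozenset({0, 1, 2}): 0.7,
}
phi = shapley([0, 1, 2], table.__getitem__)
```

Efficiency means the values sum to $v(V) = 0.7$, and the null player receives exactly zero, mirroring the properties claimed for MDI in the totally randomized regime.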
3. Relationship to Tree Adaptivity, Bias, and Consistency
MDI is intimately linked to the local adaptive behavior and bias of partitioning estimators. When the regression function varies strongly with $x_j$, CART trees perform more splits along $x_j$, concentrating partitions (and thus reducing bias) in strong-signal directions (Klusowski, 2019). Formally, the probability content of a terminal node $t$ along $x_j$ is exponentially bounded by the cumulative impurity decrease $\Delta_j(t)$ credited to $X_j$ on the path to $t$, via a bound of the form
$$p_j(t) \;\le\; C_1 \exp\!\bigl(-C_2\, \Delta_j(t)\bigr),$$
with $C_1, C_2 > 0$ universal constants. As a result, strong variables with large MDI correspond to finer partitions and small node diameters in those coordinates, lowering estimator bias.
Aggregated over trees, this adaptive refinement ensures consistency in ensemble methods (e.g., random forests) under regularity conditions, even in highly multivariate or nonadditive settings (Klusowski, 2019; Blum et al., 2023).
4. Sufficient Impurity Decrease and Implications for Feature Importance
The sufficient impurity decrease (SID) condition formalizes the requirement that, for any cell, there exists an axis-aligned split reducing impurity by at least a fixed fraction $\lambda \in (0, 1]$. That is, for every cell $t$,
$$\max_{j,\,s}\; \Delta i(t;\, j, s) \;\ge\; \lambda\, i(t),$$
where $\Delta i(t; j, s)$ denotes the impurity decrease from splitting $t$ on variable $X_j$ at threshold $s$.
This ensures that greedy splitting makes systematic progress and, under this condition, theoretical error bounds for regression trees guarantee near-optimal rates for a broad class of functions—especially additive models whose univariate components satisfy a “locally reverse Poincaré” inequality (Mazumder et al., 2023). The SID condition directly supports the interpretation of high MDI: features yielding consistently large impurity decrease must be central to the reduction in prediction error and model performance.
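The fraction $\lambda$ achieved by the best split of a cell can be measured empirically by brute force. The helper below (`best_split_fraction` is an illustrative name, not a library function) scans all axis-aligned splits; for a linear target $y = x_0$ with uniform inputs, the median split removes about $3/4$ of the within-cell variance:

```python
import numpy as np

def best_split_fraction(X, y):
    """Largest fraction of within-cell variance removed by one axis-aligned split."""
    n = len(y)
    base = y.var()
    best = 0.0
    for j in range(X.shape[1]):
        ys = y[np.argsort(X[:, j])]          # labels ordered by feature j
        for k in range(1, n):                # candidate split after position k
            left, right = ys[:k], ys[k:]
            dec = base - (k * left.var() + (n - k) * right.var()) / n
            best = max(best, dec / base)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 2))
y = X[:, 0]                      # impurity driven entirely by feature 0
lam = best_split_fraction(X, y)  # ~0.75: the median split on a linear target
```

Cells on which no split achieves a decent fraction are exactly the hard cases the SID condition rules out.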
5. Bias and Limitations of MDI in High-Dimensional and Correlated Data
MDI can exhibit systematic bias, particularly in the presence of noisy or redundant features. Analytical results show that for mutually independent and purely noisy (uninformative) features, the expected cumulative MDI assigned to such features grows with tree depth and inversely with the minimum leaf size (Li et al., 2019), via a bound of the form
$$\mathbb{E}\bigl[\mathrm{MDI}(X_j)\bigr] \;=\; O\!\left(\frac{d}{m_{\min}}\right),$$
where $d$ is the tree depth and $m_{\min}$ the minimum leaf size. This inherent bias is exacerbated in fully-grown or deep trees.
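The depth effect is easy to reproduce. In the simulation sketch below (an invented setup, not the estimator analyzed by Li et al.), only feature 0 is informative; a fully grown tree nonetheless credits the four pure-noise features with a substantial share of MDI, while a depth-limited tree gives them far less:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 5))                  # feature 0 informative, 1-4 pure noise
y = X[:, 0] + rng.normal(size=n)             # signal plus label noise

deep = DecisionTreeRegressor(random_state=0).fit(X, y)             # fully grown
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

noise_deep = deep.feature_importances_[1:].sum()       # MDI share of noise features
noise_shallow = shallow.feature_importances_[1:].sum()
```

Once the true signal is exhausted, the fully grown tree keeps splitting on noise features to fit the residual label noise, which is exactly the "double-dipping" that depth limits and minimum-leaf-size constraints mitigate.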
In models with input correlations or interactions, MDI's allocation of importance can be ambiguous and tree-dependent (Scornet, 2020). Correlated predictors may “share” importance unequally; interaction effects may be attributed in a non-identifiable manner across variables. Averaging MDI over an ensemble of randomized trees stabilizes these attributions but does not remove the underlying ambiguity.
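The sharing effect is starkest with an exact duplicate. In the sketch below (a constructed example), feature 1 is a copy of feature 0; a random forest splits the importance of the single underlying signal between the two copies, so neither reflects its full marginal contribution:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
x0 = rng.normal(size=500)
X = np.column_stack([x0, x0, rng.normal(size=500)])   # feature 1 duplicates feature 0
y = x0 + 0.1 * rng.normal(size=500)

# max_features=1 forces each split to consider one random candidate feature,
# so the two identical copies are both used and split the credit between them.
imp = RandomForestRegressor(n_estimators=100, max_features=1,
                            random_state=0).fit(X, y).feature_importances_
```

Both copies receive large, roughly comparable MDI while the noise feature trails far behind; how the credit divides between the copies depends on the per-tree randomization, illustrating the non-identifiability discussed above.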
6. Debiasing and Advanced MDI Variants
To mitigate bias, several methodological advances have been introduced:
- MDI-oob: A debiased alternative using out-of-bag samples for evaluation rather than the training data used for tree construction. This decoupling reduces "double-dipping" bias, especially for deep trees, and improves identification of relevant features (Li et al., 2019).
- MDI+: An enhanced importance measure incorporating normalization and baseline correction. Each split’s impurity decrease is adjusted by a baseline (e.g., derived from a null distribution), and normalized weights ensure comparability and stability across trees and datasets (Agarwal et al., 2023).
- Deep Forest MDI with Calibration: In deep cascading forests, MDI is propagated through layers using an estimation and calibration procedure to attribute impurity reductions on derived features back to original input features, retaining interpretability in complex, multilayered models (He et al., 2023).
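The core idea behind MDI-oob, namely evaluating each split's impurity decrease on data not used to grow the tree, can be sketched with held-out samples routed through a fitted scikit-learn tree. This is a simplified stand-in for the Li et al. estimator, not their exact procedure:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = 3.0 * X[:, 0] + rng.normal(size=600)            # only feature 0 is informative
X_tr, y_tr = X[:400], y[:400]
X_ho, y_ho = X[400:], y[400:]                       # held-out evaluation set

tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
t = tree.tree_

# Indicator of which nodes each held-out sample visits on its path.
paths = tree.decision_path(X_ho).toarray().astype(bool)   # (n_ho, n_nodes)

imp = np.zeros(X.shape[1])
n_ho = len(y_ho)
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:                                  # leaf: no split to evaluate
        continue
    m, ml, mr = paths[:, node], paths[:, left], paths[:, right]
    if m.sum() < 2 or ml.sum() == 0 or mr.sum() == 0:
        continue
    # Variance decrease of this split, recomputed on held-out labels.
    dec = (m.sum() * y_ho[m].var()
           - ml.sum() * y_ho[ml].var()
           - mr.sum() * y_ho[mr].var()) / n_ho
    imp[t.feature[node]] += dec
```

Splits that merely fit training noise tend to produce near-zero (or negative) held-out decreases, so this evaluation deflates the spurious importance that in-sample MDI assigns to noise features.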
7. Extensions, Local Importances, and Connections to Shapley Values
Recent studies formalize “local” MDI importances, attributing impurity reductions along the specific path traversed by an individual instance, forming a complete local decomposition of the prediction for that instance (Sutera et al., 2021). Under regularity (e.g., totally randomized trees), these local importances correspond to instance-level Shapley values, satisfying additivity, efficiency, and symmetry.
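The path-based decomposition can be illustrated by walking a single instance from root to leaf and crediting each step's drop in node impurity to the split variable. This is a simplified single-tree sketch of the idea, not the ensemble-averaged local MDI of Sutera et al.; by telescoping, the per-feature contributions sum exactly to root impurity minus leaf impurity:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=300)

tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)
t = tree.tree_

x = X[0]                                   # instance to explain
contrib = np.zeros(X.shape[1])
node = 0
while t.children_left[node] != -1:         # walk the root-to-leaf path of x
    j = t.feature[node]
    child = (t.children_left[node] if x[j] <= t.threshold[node]
             else t.children_right[node])
    # credit the drop in node impurity along this step to the split variable
    contrib[j] += t.impurity[node] - t.impurity[child]
    node = child

# The contributions telescope: they sum to i(root) - i(leaf(x)).
total = t.impurity[0] - t.impurity[node]
```

This exact additivity at the instance level is the analogue of the efficiency property that the global MDI/Shapley correspondence provides.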
MDI, in both global and local versions, thus serves as a bridge between algorithmic feature ranking and game-theoretic explanations, supporting both model-level diagnostics and instance-level interpretability.
In summary, Mean Decrease Impurity (MDI) provides a theoretically justified, computationally efficient measure for feature importance in tree-based models, reflecting both global and local adaptivity, signal strength, and impurity reduction. While MDI is robust and interpretable under independence and additivity, users must be aware of its limitations—bias toward noisy or highly splittable features, ambiguity under correlated or interacting variables, and instability in small or fully-grown trees. Advanced debiasing methods, ensemble averaging, and careful model selection are necessary for the reliable application of MDI in modern practice.