
Mean Decrease Impurity (MDI) in Tree Models

Updated 16 September 2025
  • Mean Decrease Impurity (MDI) quantifies the cumulative reduction in impurity attributable to tree splits and is widely used to rank feature importance in models such as CART and random forests.
  • It uses impurity measures such as variance for regression and Gini index or entropy for classification, providing both global and local insights into model behavior.
  • Advanced variants like MDI-oob and MDI+ address bias and instability issues in deep trees and correlated datasets, enhancing reliability and interpretability.

Mean Decrease Impurity (MDI) is a central metric in tree-based machine learning algorithms—especially Classification and Regression Trees (CART), random forests, and their generalizations—for quantifying the importance of individual input variables. MDI is computed as the cumulative reduction in node impurity (e.g., variance in regression, Gini index or entropy in classification) attributable to each variable across all splits and trees. While MDI is frequently used for feature ranking and interpretation, modern theoretical and empirical studies provide a detailed account of both its behavior and its limitations under different modeling assumptions.

1. Definition and Formal Calculation of Mean Decrease Impurity

MDI quantifies the expected total decrease in a chosen impurity measure (variance, Gini index, entropy) owing to splits performed on a given variable during tree construction. Consider a single decision tree $T$ grown by the standard CART procedure:

  • For each split at node $t$, the impurity decrease from splitting on variable $X_j$ at threshold $s$ is:

$$\Delta(j, s; t) = i(t) - \left( \frac{|t_L|}{|t|}\, i(t_L) + \frac{|t_R|}{|t|}\, i(t_R) \right)$$

where $i(\cdot)$ is the impurity of a node, $t_L$ and $t_R$ are the left and right child nodes, and $|t|$ is the number of samples in node $t$.

  • The MDI of $X_j$ in tree $T$ is then:

$$\mathrm{MDI}_T(X_j) = \sum_{\text{all nodes } t \,:\, \text{split on } X_j} p(t)\, \Delta(j, s_t; t)$$

where $p(t)$ is the proportion of samples reaching node $t$.

  • In a random forest, global importance is obtained by averaging over all trees:

$$\mathrm{MDI}(X_j) = \frac{1}{N_T} \sum_{T=1}^{N_T} \mathrm{MDI}_T(X_j)$$

For classification, the impurity is typically the Gini index or entropy; for regression, the variance. MDI can also be generalized to “local” (per-instance) versions by restricting the sum to the splits along the path traversed by a specific test point (Sutera et al., 2021).
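
The following minimal sketch shows how these formulas map onto a fitted scikit-learn tree: it reads the public `tree_` arrays (node impurities, child pointers, weighted sample counts) and reproduces `feature_importances_`, which implements exactly this MDI, normalized to sum to one.

    # Minimal sketch: computing per-tree MDI from scikit-learn's tree_ arrays.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=500, n_features=5, random_state=0)
    tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

    t, w = tree.tree_, tree.tree_.weighted_n_node_samples
    mdi = np.zeros(X.shape[1])
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                     # leaf: no split, no contribution
            continue
        # p(t) * Delta(j, s_t; t), with p(t) = w[node] / w[root]
        mdi[t.feature[node]] += (w[node] * t.impurity[node]
                                 - w[left] * t.impurity[left]
                                 - w[right] * t.impurity[right]) / w[0]

    # sklearn's feature_importances_ is this quantity normalized to sum to one
    assert np.allclose(mdi / mdi.sum(), tree.feature_importances_)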

2. Theoretical Properties and Interpretation

Under conditions of feature independence and absence of interactions, MDI admits a clear interpretation as an exact variance (or entropy) decomposition of the regression (or classification) function. If the true regression function is additive and the input variables are independent, the importance assigned to $X_j$ converges (in the population) to the marginal variance of its corresponding component:

$$\lim_{k \to \infty} \mathrm{MDI}_{T_k}(X^{(j)}) \approx \mathrm{Var}[m_j(X^{(j)})]$$

where $m_j$ is the additive component (Scornet, 2020). Summing MDI over all variables yields the explained variance:

$$\sum_j \mathrm{MDI}_T(X^{(j)}) = \mathrm{Var}[Y] - \mathbb{E}_X\left[\mathrm{Var}[Y \mid A_T(x)]\right]$$

where $A_T(x)$ denotes the leaf assigned to $x$. The ratio $\sum_j \mathrm{MDI}_T(X^{(j)}) / \mathrm{Var}[Y]$ therefore corresponds to $R^2$ in regression and to an analogous “explained information” ratio in classification settings.
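
This identity can be checked numerically. The sketch below (variance impurity, evaluated on the training data) verifies that a tree's summed unnormalized MDI equals the drop from $\mathrm{Var}[Y]$ to the average within-leaf variance, a telescoping sum over the tree's internal nodes.

    # Sketch: summed MDI equals Var[Y] minus the mean within-leaf variance.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=1000, n_features=4, random_state=1)
    tree = DecisionTreeRegressor(max_depth=5, random_state=1).fit(X, y)

    leaves = tree.apply(X)                       # leaf id of each training point
    within = sum((leaves == leaf).mean() * y[leaves == leaf].var()
                 for leaf in np.unique(leaves))  # E_X[ Var[Y | A_T(x)] ]
    explained = y.var() - within

    t, w = tree.tree_, tree.tree_.weighted_n_node_samples
    total_mdi = sum((w[n] * t.impurity[n]
                     - w[t.children_left[n]] * t.impurity[t.children_left[n]]
                     - w[t.children_right[n]] * t.impurity[t.children_right[n]]) / w[0]
                    for n in range(t.node_count) if t.children_left[n] != -1)

    assert np.isclose(explained, total_mdi)      # the telescoping identity holds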

In the limit of infinitely randomized or sufficiently deep trees, global MDI coincides with the Shapley value of an associated cooperative game defined by mutual information or variance (Sutera et al., 2021):

$$\mathrm{MDI}^{\infty}(X_m) = \phi^{\mathrm{Sh}}(X_m)$$

where the characteristic function is $v(S) = I(Y; S)$, with $I$ denoting mutual information. This equivalence ensures that MDI satisfies efficiency, symmetry, and the null player property in these settings.

3. Relationship to Tree Adaptivity, Bias, and Consistency

MDI is intimately linked to the local adaptive behavior and bias of partitioning estimators. When the regression function $f(x)$ varies strongly with $X_j$, CART trees perform more splits along $X_j$, concentrating the partition (and thus reducing bias) in strong-signal directions (Klusowski, 2019). Formally, the probability content of a terminal node along $X_j$ is exponentially bounded by its MDI:

$$P_X\left\{ a_j(t) \leq X_j \leq b_j(t) \right\} \leq \exp\left( -\frac{\eta}{4}\, \mathrm{MDI}(X_j; t) \right)$$

with a universal constant $\eta > 0$. As a result, strong variables with large MDI correspond to finer partitions and smaller node diameters in those coordinates, lowering estimator bias.
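
This adaptivity is easy to observe empirically: on data where one coordinate carries most of the signal, a fitted CART tree concentrates its splits there. The sketch below simply counts splits per feature (the signal function is an illustrative assumption).

    # Sketch: CART places most of its splits along strong-signal coordinates.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(2000, 3))
    y = 10 * np.sin(4 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=2000)

    tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X, y)
    t = tree.tree_
    split_features = t.feature[t.children_left != -1]  # split feature per internal node
    print(np.bincount(split_features, minlength=3))
    # Most splits land on feature 0, a few on feature 1, almost none on feature 2.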

Aggregated over trees, this adaptive refinement ensures consistency in ensemble methods (e.g., random forests) under regularity conditions, even in highly multivariate or nonadditive settings (Klusowski, 2019, Blum et al., 2023).

4. Sufficient Impurity Decrease and Implications for Feature Importance

The sufficient impurity decrease (SID) condition formalizes the requirement that, for any cell, there exists an axis-aligned split reducing impurity by at least a fixed fraction $\delta$. That is, for every cell $A$,

$$\sup_{j, b} \Delta(A, j, b) \geq \delta \cdot P(X \in A) \cdot \mathrm{Var}\left(f^*(X) \mid X \in A\right)$$

where $\Delta(A, j, b)$ is the impurity decrease obtained by splitting cell $A$ on variable $j$ at threshold $b$, and $f^*$ denotes the underlying regression function.

This ensures that greedy splitting makes systematic progress and, under this condition, theoretical error bounds for regression trees guarantee near-optimal rates for a broad class of functions—especially additive models whose univariate components satisfy a “locally reverse Poincaré” inequality (Mazumder et al., 2023). The SID condition directly supports the interpretation of high MDI: features yielding consistently large impurity decrease must be central to the reduction in prediction error and model performance.
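
As a toy illustration, the sketch below scans axis-aligned splits of the root cell $A = [0,1]^2$ and checks the SID inequality empirically; the target $f^*$ and the constant $\delta$ are illustrative assumptions, not values from the cited work.

    # Toy empirical check of the SID condition on the root cell A = [0,1]^2.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(5000, 2))
    f_star = X[:, 0] ** 2 + X[:, 1]          # illustrative regression function

    best = 0.0
    for j in range(X.shape[1]):              # sup over axis-aligned splits (j, b)
        for b in np.quantile(X[:, j], np.linspace(0.05, 0.95, 19)):
            left = X[:, j] <= b
            dec = (f_star.var()
                   - left.mean() * f_star[left].var()
                   - (~left).mean() * f_star[~left].var())
            best = max(best, dec)

    delta = 0.1                              # assumed constant for illustration
    # P(X in A) = 1 for the root cell, so the right-hand side is delta * Var(f*)
    print(best >= delta * f_star.var())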

5. Bias and Limitations of MDI in High-Dimensional and Correlated Data

MDI can exhibit systematic bias, particularly in the presence of noisy or redundant features. Analytical results show that for mutually independent and purely noisy (uninformative) features, the expected cumulative MDI assigned to such features grows with the tree depth $d_n$ and inversely with the minimum leaf size $m_n$ (Li et al., 2019):

$$\mathbb{E}_{X, \varepsilon} \left\{ \sup_{T \in \mathcal{T}_n(m_n, d_n)} G_0(T) \right\} \leq C \cdot \frac{d_n \cdot \log(np)}{m_n}$$

where $G_0(T)$ denotes the total MDI assigned to the noise features in a tree $T$, $\mathcal{T}_n(m_n, d_n)$ is the class of trees with minimum leaf size $m_n$ and depth $d_n$, and $C$ is a constant.

This inherent bias is exacerbated in fully-grown or deep trees.
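
The depth dependence is visible in a small experiment: with one informative feature and twenty pure-noise features, the total MDI credited to noise grows markedly when trees are grown to full depth (the data-generating choices below are illustrative assumptions).

    # Sketch: total MDI assigned to pure-noise features grows with tree depth.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    n = 500
    signal = rng.uniform(size=(n, 1))
    X = np.hstack([signal, rng.uniform(size=(n, 20))])   # feature 0 is the signal
    y = 2 * signal[:, 0] + rng.normal(scale=0.5, size=n)

    for depth in (3, None):                  # shallow vs. fully grown trees
        rf = RandomForestRegressor(n_estimators=200, max_depth=depth,
                                   random_state=0).fit(X, y)
        print(depth, rf.feature_importances_[1:].sum())  # MDI share of the noise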

In models with input correlations or interactions, MDI's allocation of importance can be ambiguous and tree-dependent (Scornet, 2020). Correlated predictors may “share” importance unequally; interaction effects may be attributed in a non-identifiable manner across variables. Averaging MDI over an ensemble of randomized trees stabilizes these attributions but does not remove the underlying ambiguity.
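
A minimal demonstration of this sharing effect: duplicating an informative column splits its MDI roughly in half between the two identical copies (using `max_features=1` to force randomized split selection is an illustrative choice that makes the sharing visible).

    # Sketch: a duplicated informative feature shares its MDI with its copy.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    x = rng.uniform(size=(1000, 1))
    y = 3 * x[:, 0] + rng.normal(scale=0.1, size=1000)
    noise = rng.uniform(size=(1000, 1))

    for X in (np.hstack([x, noise]), np.hstack([x, x, noise])):
        rf = RandomForestRegressor(n_estimators=300, max_features=1,
                                   random_state=0).fit(X, y)
        print(np.round(rf.feature_importances_, 2))
    # The signal's importance in the first fit is divided between the two
    # identical columns in the second fit.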

6. Debiasing and Advanced MDI Variants

To mitigate bias, several methodological advances have been introduced:

  • MDI-oob: A debiased alternative that evaluates impurity decreases on out-of-bag samples rather than on the training data used for tree construction. This decoupling reduces "double-dipping" bias, especially for deep trees, and improves identification of relevant features (Li et al., 2019); a simplified sketch follows this list.
  • MDI+: An enhanced importance measure incorporating normalization and baseline correction. Each split’s impurity decrease is adjusted by a baseline $b(t)$ (e.g., derived from a null distribution), and normalized weights $\omega(t)$ ensure comparability and stability across trees and datasets (Agarwal et al., 2023).
  • Deep Forest MDI with Calibration: In deep cascading forests, MDI is propagated through layers using an estimation and calibration procedure to attribute impurity reductions on derived features back to original input features, retaining interpretability in complex, multilayered models (He et al., 2023).
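
The decoupling idea behind MDI-oob can be sketched as follows: re-evaluate each split's variance decrease using only out-of-bag samples routed through the tree. This simplified version illustrates the principle only and is not the exact estimator of Li et al. (2019).

    # Simplified MDI-oob sketch (illustrative, not the exact published estimator):
    # re-evaluate each split's variance decrease on out-of-bag samples.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def mdi_oob(X, y, n_trees=100, seed=0):
        rng = np.random.default_rng(seed)
        n, p = X.shape
        imp = np.zeros(p)
        for _ in range(n_trees):
            boot = rng.integers(0, n, n)             # bootstrap indices
            oob = np.setdiff1d(np.arange(n), boot)   # out-of-bag rows
            if len(oob) == 0:
                continue
            tree = DecisionTreeRegressor(random_state=0).fit(X[boot], y[boot])
            t = tree.tree_
            path = tree.decision_path(X[oob]).toarray().astype(bool)
            yo = y[oob]
            for node in range(t.node_count):
                left = t.children_left[node]
                if left == -1:                       # leaf: no split here
                    continue
                m, ml = path[:, node], path[:, left]
                mr = m & ~ml                         # OOB samples going right
                if ml.sum() == 0 or mr.sum() == 0:
                    continue                         # too few OOB samples
                dec = (m.sum() * yo[m].var() - ml.sum() * yo[ml].var()
                       - mr.sum() * yo[mr].var()) / len(oob)
                imp[t.feature[node]] += dec
        return imp / n_trees

    # Usage: imp = mdi_oob(X, y); pure-noise features now score near zero.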

7. Extensions, Local Importances, and Connections to Shapley Values

Recent studies formalize “local” MDI importances, attributing impurity reductions along the specific path traversed by an individual instance, forming a complete local decomposition of the prediction for that instance (Sutera et al., 2021). Under regularity (e.g., totally randomized trees), these local importances correspond to instance-level Shapley values, satisfying additivity, efficiency, and symmetry.
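
A sketch of this local decomposition for a single scikit-learn tree is given below: each split on the instance's root-to-leaf path credits the split feature with the impurity drop from the node to the child actually followed. Averaged over the training sample, these local importances recover the tree's unnormalized global MDI.

    # Sketch: local (per-instance) MDI along the path of a single instance x.
    import numpy as np

    def local_mdi(tree, x):
        t = tree.tree_
        contrib = np.zeros(tree.n_features_in_)
        node = 0
        while t.children_left[node] != -1:           # walk until a leaf
            j = t.feature[node]
            # sklearn convention: go left when x[j] <= threshold
            child = (t.children_left[node] if x[j] <= t.threshold[node]
                     else t.children_right[node])
            contrib[j] += t.impurity[node] - t.impurity[child]
            node = child
        return contrib

    # Averaging over the training set recovers the unnormalized global MDI:
    # np.mean([local_mdi(tree, xi) for xi in X], axis=0)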

MDI, in both global and local versions, thus serves as a bridge between algorithmic feature ranking and game-theoretic explanations, supporting both model-level diagnostics and instance-level interpretability.


In summary, Mean Decrease Impurity (MDI) provides a theoretically justified, computationally efficient measure of feature importance in tree-based models, reflecting both global and local adaptivity, signal strength, and impurity reduction. While MDI is robust and interpretable under independence and additivity, users must be aware of its limitations: bias toward noisy or highly splittable features, ambiguity under correlated or interacting variables, and instability in small or fully-grown trees. Advanced debiasing methods, ensemble averaging, and careful model selection are necessary for reliable application of MDI in modern practice.
