Explainable Machine Learning Methods
- Explainable machine learning methods are techniques designed to expose and elucidate the internal logic of models, enhancing transparency and trust.
- They employ a range of approaches including global surrogates, local explanations (LIME, SHAP), and visualization tools (PD/ICE) to clarify model behavior.
- These methods balance complexity and fidelity to support model debugging, regulatory compliance, and ethical AI decision-making.
Explainable machine learning methods comprise a suite of algorithmic and theoretical approaches aimed at rendering the predictions and internal logic of machine learning models transparent, interpretable, and ultimately trustworthy. These methods have evolved to meet the practical, scientific, and regulatory demands of fields where understanding model behavior is as critical as predictive performance. The landscape encompasses global surrogate modeling, partial dependence visualizations, local post-hoc attributions, game-theoretic feature allocations, and information-theoretic metrics, each with rigorous analytical underpinnings and distinct scope, fidelity, and deployment considerations.
1. Classes of Explainable Machine Learning Methods
A range of paradigms has been established for generating explanations in machine learning (a code sketch combining several of them follows this list):
- Decision Tree Surrogates: Global simplifications in which a decision tree $h_{\text{tree}}$ is trained to mimic a black-box model $g$ on input-prediction pairs $(X, \hat{y})$ with $\hat{y} = g(X)$, supporting approximate global rules and feature importance extraction. The induction process is formalized as fitting $h_{\text{tree}}(X) \approx g(X)$ using a splitting/pruning algorithm (Hall, 2018).
- Partial Dependence (PD) and Individual Conditional Expectation (ICE) Plots: Visualizations summarizing the marginal effect of particular variables on predictions. The PD of a feature subset $x_S$ is estimated by averaging predictions over the observed values of the remaining features, $\hat{f}_S(x_S) = \frac{1}{N}\sum_{i=1}^{N} \hat{f}(x_S, x_C^{(i)})$, whereas ICE plots trace $\hat{f}(x_S, x_C^{(i)})$ along changes in $x_S$ for individual instances $i$, revealing interaction heterogeneity (Hall, 2018).
- Local Interpretable Model-agnostic Explanations (LIME): Local surrogate models (typically sparse linear) fitted around a point $x$ to approximate the black-box $f$, solving $\xi(x) = \arg\min_{g \in G} L(f, g, \pi_x) + \Omega(g)$, where $L$ is a weighted loss over perturbed samples, $\pi_x$ emphasizes locality, and $\Omega(g)$ enforces simplicity (e.g., LASSO) (Hall, 2018).
- Shapley-value Explanations (SHAP): Based on cooperative game theory, each feature's attribution is computed as its average marginal contribution over all coalitions of the remaining features, $\phi_j = \sum_{S \subseteq \{1,\dots,p\} \setminus \{j\}} \frac{|S|!\,(p - |S| - 1)!}{p!}\,[f_x(S \cup \{j\}) - f_x(S)]$, where $f_x(S)$ denotes the expected model output given the features in coalition $S$. This assignment is locally accurate and globally consistent. Tree SHAP exploits tree model structures for efficient computation (Hall, 2018, Salih et al., 2023).
- Counterfactual Explanations: Derive the minimal input change $\delta$ such that the prediction changes to a desired outcome $y'$, i.e., $f(x + \delta) = y' \neq f(x)$, formalized as $\arg\min_{\delta}\, d(x, x + \delta)$ subject to $f(x + \delta) = y'$, where $d$ is a distance measuring the cost of the change.
- Influence Functions: Quantify the effect of individual training samples on predictions, e.g., $\mathcal{I}(z, z_{\text{test}}) = -\nabla_{\theta} L(z_{\text{test}}, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1} \nabla_{\theta} L(z, \hat{\theta})$, where $H_{\hat{\theta}}$ is the Hessian of the training loss at the fitted parameters $\hat{\theta}$.
- Information-theoretic and Personalized Explanation Metrics: Define explanation efficacy as the reduction in predictive uncertainty for a specific user, measured as the conditional mutual information $I(e; \hat{y} \mid u) = H(\hat{y} \mid u) - H(\hat{y} \mid e, u)$, where $e$ is the explanation, $\hat{y}$ the prediction, and $u$ the user's background knowledge (Jung et al., 2020). Explainable empirical risk minimization (EERM) incorporates a conditional-entropy term as a regularizer (Zhang et al., 2020).
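The sketch below combines several of these techniques on one synthetic tabular problem: a global tree surrogate with a fidelity score, PD/ICE curves, Tree SHAP attributions, and a LIME local explanation. It assumes the third-party `shap` and `lime` packages (plus scikit-learn and matplotlib) are installed; the data, model, and hyperparameters are illustrative choices, not configurations from the cited works.

```python
# Illustrative sketch: global surrogate, PD/ICE, SHAP, and LIME on one model.
# All settings (data, model, depths, sample counts) are arbitrary examples.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
from sklearn.inspection import PartialDependenceDisplay
import shap
from lime.lime_tabular import LimeTabularExplainer

# Synthetic data and a "black-box" model g.
X, y = make_regression(n_samples=2000, n_features=8, noise=0.1, random_state=0)
feature_names = [f"x{j}" for j in range(X.shape[1])]
g = GradientBoostingRegressor(random_state=0).fit(X, y)
y_hat = g.predict(X)

# 1) Global decision-tree surrogate trained on (X, g(X)); check its fidelity.
surrogate = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y_hat)
print("surrogate R^2 vs. black-box:", r2_score(y_hat, surrogate.predict(X)))

# 2) PD and ICE curves for one feature (kind="both" overlays ICE on the PD curve).
disp = PartialDependenceDisplay.from_estimator(g, X, features=[0], kind="both")

# 3) SHAP attributions via Tree SHAP for the first instance.
explainer = shap.TreeExplainer(g)
shap_values = explainer.shap_values(X[:1])
print("SHAP attributions:", dict(zip(feature_names, shap_values[0].round(3))))

# 4) LIME: local sparse linear surrogate around the same instance.
lime_explainer = LimeTabularExplainer(X, feature_names=feature_names, mode="regression")
lime_exp = lime_explainer.explain_instance(X[0], g.predict, num_features=4)
print("LIME weights:", lime_exp.as_list())
```

Agreement between the surrogate, SHAP, and LIME views on which features dominate is one practical signal of a trustworthy explanation, per the guidance in Section 3.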
2. Scope, Fidelity, and Theoretical Guarantees
Explainability methods must be evaluated by their scope—whether they provide global, local, or hybrid explanations—and their fidelity, i.e., how well they reflect the true behavior of the original model.
Method | Scope | Fidelity & Guarantees |
---|---|---|
Decision Tree Surrogate | Global | Approximate, low fidelity; check RMSE/R² (Hall, 2018) |
Partial Dependence | Global (PD), Local (ICE) | PD averages over heterogeneity; ICE reveals interactions (Hall, 2018) |
LIME | Local | Sparse, interpretable, but accuracy variable—requires local error checks (Hall, 2018) |
SHAP | Local/Global | Additive, locally and globally consistent; game-theoretic uniqueness (Hall, 2018) |
Counterfactuals | Local | Actionable but not always feasible/realistic (Bhatt et al., 2019) |
Influence Functions | Local/Model-global | Computationally demanding; may highlight outliers, not prototypes (Bhatt et al., 2019) |
Shapley-value explanations guarantee additivity, local exactness, symmetry, dummy, and consistency properties, and for tree-based models, Tree SHAP provides computational tractability and accuracy (Hall, 2018, Salih et al., 2023). LIME's guarantees are local and depend on the loss-regularization tradeoff and the structure of perturbed samples; fidelity is empirically evaluated using $R^2$ and RMSE (Hall, 2018). PD is justifiable when feature independence or low interaction holds; otherwise, ICE overlays can reveal when PD is misleading (Hall, 2018).
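Such a local fidelity check can be sketched without any explainer package: fit a locality-weighted linear surrogate around an instance and score it against the black-box on the perturbation neighborhood. The kernel width, perturbation scale, and model below are illustrative assumptions, not LIME's internal defaults.

```python
# Sketch: local-fidelity check for a LIME-style surrogate around x0,
# scored against the black-box on its own perturbation neighborhood.
# Kernel width, noise scale, and the model are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score, mean_squared_error

X, y = make_regression(n_samples=1000, n_features=6, noise=0.1, random_state=1)
f = RandomForestRegressor(random_state=1).fit(X, y)

x0 = X[0]
rng = np.random.default_rng(1)
Z = x0 + rng.normal(scale=0.5 * X.std(axis=0), size=(500, X.shape[1]))  # perturbed samples
fz = f.predict(Z)

# Locality kernel pi_x: closer perturbations receive larger weight.
dist = np.linalg.norm((Z - x0) / X.std(axis=0), axis=1)
weights = np.exp(-(dist ** 2) / (2 * 0.75 ** 2))

local_model = Ridge(alpha=1.0).fit(Z, fz, sample_weight=weights)
pred = local_model.predict(Z)
print("local R^2 :", r2_score(fz, pred, sample_weight=weights))
print("local RMSE:", mean_squared_error(fz, pred, sample_weight=weights) ** 0.5)
```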
Counterfactuals are grounded in constrained optimization and provide actionable recourse but may not reflect plausible or allowable changes depending on the data manifold (Bhatt et al., 2019). Recent theoretical work has highlighted the lack of robustness for many post-hoc methods—explanations may change drastically under slight input perturbations, exposing the methods to "fairwashing" or adversarial manipulation (Galinkin, 2022). Information-theoretic frameworks, as in (Jung et al., 2020) and (Zhang et al., 2020), give a principled, quantitative basis but require modeling the user's knowledge state and may not scale easily.
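The constrained-optimization view of counterfactuals can be approximated with a simple penalized, gradient-free search, since the black-box need not be differentiable. The classifier, penalty weight, and stopping criteria below are illustrative assumptions, and nothing in this sketch enforces plausibility with respect to the data manifold.

```python
# Sketch: counterfactual search by minimizing ||delta||^2 plus a hinge penalty
# on the desired class probability, using a gradient-free optimizer.
# The classifier, target probability 0.5, and penalty weight are illustrative.
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X, y)

x = X[0]                                            # instance to explain
target_class = 1 - clf.predict([x])[0]              # aim to flip the prediction
lam = 50.0                                          # weight on the class-flip penalty

def objective(delta):
    p = clf.predict_proba([x + delta])[0, target_class]
    return np.sum(delta ** 2) + lam * max(0.0, 0.5 - p) ** 2  # small change + flip

res = minimize(objective, x0=np.zeros_like(x), method="Nelder-Mead",
               options={"maxiter": 5000, "xatol": 1e-4, "fatol": 1e-6})
x_cf = x + res.x
print("original prediction    :", clf.predict([x])[0])
print("counterfactual predicts:", clf.predict([x_cf])[0])
print("change (delta)         :", np.round(res.x, 3))
```

A realistic recourse method would additionally restrict the search to feasible, actionable feature changes, which is exactly the limitation noted above.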
3. Practical Guidance and Deployment Considerations
Best practices and cautions for deploying explainable machine learning methods include:
- Combine Global and Local Techniques: Employ global models (tree surrogates, PD/ICE) for overview, and local models (LIME, SHAP) for individual decisions. Consistency across methods increases interpretability confidence (Hall, 2018).
- Monitor Fidelity: Always quantify the fidelity of surrogate or local models using error metrics such as $R^2$, RMSE, or model trust scores, especially in domains with imbalanced or skewed data (Hall, 2018, Kailkhura et al., 2019).
- Cautious Use in Regulated Domains: Explainers with theoretical guarantees (notably SHAP for monotonic or credit-scoring models) are recommended when regulator-mandated reason codes or compliance are necessary (Hall, 2018, Chen, 2023).
- Assess Real-World Usability: Many explainers are primarily used for model debugging by ML engineers rather than for external users; explanations may not be robust, actionable, or even understandable in operational settings (Bhatt et al., 2019).
- Address Feature Collinearity and Model-dependence: Both SHAP and LIME are affected by model choice and correlated features. Under high collinearity, SHAP may attribute low importance to highly predictive but collinear variables (Salih et al., 2023). Preprocessing and stability checks such as normalized movement rates are recommended; a simple stability check is sketched after this list.
- Deployment Trade-offs: Real-time applications may prefer faster explainers (e.g., LIME) at the cost of some reliability, while retrospective or regulatory settings can accept slower, more robust methods (e.g., SHAP) (Psychoula et al., 2021).
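One way to run the stability check mentioned above is to re-explain the same instance several times with different random states and measure how consistent the feature weights are. The sketch below uses rank correlation (Spearman's rho) rather than the normalized movement rate itself, and the model, data, and repeat count are illustrative assumptions.

```python
# Sketch: stability check for LIME by re-explaining one instance with
# different random states and comparing feature weights via Spearman's rho.
# Rank correlation here stands in for other stability metrics; all settings
# are illustrative.
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from lime.lime_tabular import LimeTabularExplainer

X, y = make_regression(n_samples=1000, n_features=6, noise=0.1, random_state=2)
names = [f"x{j}" for j in range(X.shape[1])]
model = RandomForestRegressor(random_state=2).fit(X, y)

def weight_dict(seed):
    """One LIME run with its own random state; returns {feature description: weight}."""
    explainer = LimeTabularExplainer(X, feature_names=names, mode="regression",
                                     random_state=seed)
    exp = explainer.explain_instance(X[0], model.predict, num_features=len(names))
    return dict(exp.as_list())

runs = [weight_dict(seed) for seed in range(5)]
keys = sorted(runs[0])                                   # align by feature description
mats = [np.array([r.get(k, 0.0) for k in keys]) for r in runs]
rhos = []
for m in mats[1:]:
    rho, _ = spearmanr(mats[0], m)
    rhos.append(rho)
print("rank correlation of repeated LIME runs vs. run 0:", np.round(rhos, 3))
```

Low or unstable correlations flag explanations that should not be handed to end users or regulators without further investigation.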
4. Domain Considerations and Extensions
The integration of domain knowledge into explainability is increasingly recognized as essential for achieving scientifically meaningful and trustworthy explanations:
- Physics- and Domain-informed Models: Embedding prior knowledge—such as conservation laws, chemical ontologies, or monotonicity constraints—can improve not only scientific plausibility but also the transparency of explanations (Roscher et al., 2019, Beckh et al., 2021).
- Personalization: Explanations tailored to the user's background or expertise maximize informativeness, as quantified by conditional mutual information (Jung et al., 2020) and by conditional-entropy regularization in EERM (Zhang et al., 2020); a toy computation follows this list.
- Monotonicity and Attribution Consistency: For monotonic models (common in credit and risk), attribution methods should align with monotonicity axioms (DIM, AIM, AWPM, ASPM). Baseline Shapley values are sufficient for individual monotonicity, while Integrated Gradients are preferable under strong pairwise monotonicity requirements (Chen, 2023).
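For differentiable models, the Integrated Gradients attribution referenced above can be approximated with a Riemann sum along the straight-line path from a baseline to the input. The hand-written logistic model, zero baseline, and step count below are illustrative assumptions; real deployments would use an autodiff framework.

```python
# Sketch: Integrated Gradients for a hand-written logistic model, approximating
# phi_j = (x_j - x'_j) * integral_0^1 df(x' + a(x - x'))/dx_j da with a midpoint
# Riemann sum. Weights, baseline, and step count are illustrative.
import numpy as np

w = np.array([1.5, -2.0, 0.5, 0.0])   # assumed model weights
b = 0.25                              # assumed intercept

def f(x):
    """Differentiable 'model': logistic regression score."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def grad_f(x):
    """Analytic gradient of f with respect to the inputs."""
    p = f(x)
    return p * (1.0 - p) * w

def integrated_gradients(x, baseline, steps=100):
    alphas = (np.arange(steps) + 0.5) / steps            # midpoints of [0, 1]
    path_grads = np.array([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * path_grads.mean(axis=0)

x = np.array([1.0, 0.5, -1.0, 2.0])
baseline = np.zeros_like(x)
ig = integrated_gradients(x, baseline)
print("attributions:", np.round(ig, 4))
# Completeness check: attributions should sum (approximately) to f(x) - f(baseline).
print("sum(ig) =", round(float(ig.sum()), 4),
      " f(x) - f(baseline) =", round(float(f(x) - f(baseline)), 4))
```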
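The personalized, information-theoretic view above can also be made concrete: given a joint probability table over prediction, explanation, and a user signal, the conditional mutual information $I(e; \hat{y} \mid u)$ measures how much the explanation reduces that user's predictive uncertainty. The joint table below is fabricated purely for illustration.

```python
# Sketch: conditional mutual information I(e; y_hat | u) on a toy joint
# distribution p[y, e, u]. The probability table is an illustrative assumption.
import numpy as np

def conditional_mutual_information(p):
    """I(Y; E | U) in bits for a joint array p[y, e, u] that sums to 1."""
    p_u = p.sum(axis=(0, 1))                 # p(u)
    p_yu = p.sum(axis=1)                     # p(y, u)
    p_eu = p.sum(axis=0)                     # p(e, u)
    cmi = 0.0
    for yy in range(p.shape[0]):
        for ee in range(p.shape[1]):
            for uu in range(p.shape[2]):
                if p[yy, ee, uu] > 0:
                    cmi += p[yy, ee, uu] * np.log2(
                        p[yy, ee, uu] * p_u[uu] / (p_yu[yy, uu] * p_eu[ee, uu]))
    return cmi

# Toy joint over binary prediction y, binary explanation e, binary user signal u.
p = np.array([[[0.20, 0.05], [0.05, 0.10]],
              [[0.05, 0.10], [0.20, 0.25]]])
print("I(e; y_hat | u) =", round(conditional_mutual_information(p), 4), "bits")
```

A larger value indicates that the explanation is more informative for that particular user state, which is the quantity the personalization criterion seeks to maximize.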
5. Evaluation, Visualization, and Emerging Challenges
Evaluating and communicating explanations introduces new requirements:
- Reproducibility and Tooling: Public software and benchmark datasets are central for reproducibility. Comprehensive software resources and example analyses are increasingly included with contemporary research (Hall, 2018, Bogdanova et al., 2022).
- Visual Analytics: Advanced visual frameworks (e.g., explAIner (Spinner et al., 2019)) and new visual encodings such as General Line Coordinates (GLC) (Kovalerchuk et al., 2020) facilitate exploration across abstraction levels but pose challenges with occlusion, clutter, and high-dimensional fidelity.
- Quality and Usability: Explanation quality is not yet rigorously defined—research emphasizes the necessity for empirically validated, domain-specific, and user-accepted representations (Kovalerchuk et al., 2020, Holmberg, 2022).
- Distributed and Federated ML: Explaining models trained on distributed data requires adapted approaches (e.g., DC-SHAP (Bogdanova et al., 2022)) to ensure consistent and privacy-preserving feature attributions.
6. Theoretical and Sociotechnical Frontiers
Theoretical analysis and the philosophy of science play roles in situating the capabilities and limits of explainable ML:
- Limits of Inductive Explanations: Explanations produced by black-box neural networks must be viewed as post-hoc evidence or "hints"—not strict causal or scientific explanations in the deductive-nomological sense (Holmberg, 2022).
- Causal Interpretability: There is an active push toward integrating causality into explanations and developing methods that not only describe associations but also expose causal mechanisms behind predictions (Galinkin, 2022).
- Human Factors and Trust: Misalignment between mathematically correct but misleading explanations and human expectations can lead to overtrust, poor contestability, or adversarial misuse (Galinkin, 2022, Holmberg, 2022). Future research emphasizes the need for benchmarking, robustness, and contestability in explanation methods.
7. Summary Table of Selected Explainability Methods
Method | Mathematical Principle | Key Properties & Use Cases |
---|---|---|
Tree Surrogate | Global decision tree fitted to black-box predictions | Global overview, feature importance, low fidelity; error metrics required (Hall, 2018) |
PD/ICE | Marginal (PD) and per-instance (ICE) prediction curves over a feature grid | Average and instance-level effects, identifies interaction heterogeneity (Hall, 2018) |
LIME | Local sparse surrogate fit by minimizing a locality-weighted loss | Local explanation, high interpretability, fidelity can be low; error must be assessed per instance (Hall, 2018, Salih et al., 2023) |
SHAP | Shapley values: average marginal contribution over feature coalitions | Additive, locally exact, globally consistent, theoretically unique (Hall, 2018, Salih et al., 2023) |
Counterfactual | Optimization to find minimal input change for different output | Actionable, but may lack plausibility or feasibility (Bhatt et al., 2019) |
Integrated Gradients | Path integral of gradients | Suitable for monotonic/pairwise-ordered attributions (Chen, 2023) |
Influence Function | Model parameter sensitivity to train points | Training data audit, computationally demanding, often flags outliers (Bhatt et al., 2019) |
Explainable machine learning methods are evolving to meet the dual requirements of predictive accuracy and model transparency. Contemporary research continues to expand the repertoire of theoretically grounded, practically robust methods tailored to scientific, industrial, and ethical deployments, while recognizing emerging challenges around robustness, personalization, fairness, and domain alignment.