This paper introduces the Clinical Explainable AI Guidelines, a set of five criteria for designing and evaluating clinical XAI (Explainable Artificial Intelligence) techniques for medical image analysis. These guidelines aim to ensure that explanation techniques are both technically sound and clinically useful. The authors evaluate 16 commonly used heatmap XAI techniques against these guidelines, revealing their limitations for clinical application.
The Clinical XAI Guidelines consist of the following five criteria:
- G1 Understandability: Explanations should be easily understood by clinical users without requiring technical expertise.
- G2 Clinical relevance: Explanations should align with physicians' clinical decision-making processes and support their clinical reasoning.
- G3 Truthfulness: Explanations should accurately reflect the AI model's decision-making process.
- G4 Informative plausibility: User assessment of explanation plausibility should provide insights into AI decision quality, including potential flaws or biases.
- G5 Computational efficiency: Explanations should be generated within a clinically acceptable timeframe.
The authors conducted a systematic evaluation of 16 heatmap techniques, assessing their adherence to the proposed guidelines across two clinical tasks. The evaluation revealed that while existing heatmap methods generally meet G1 and partially meet G2, they often fail to meet G3 and G4, indicating their inadequacy for clinical use.
The paper also addresses the clinically relevant but technically underexplored problem of multi-modal medical image explanation. To facilitate this, the authors introduce a novel metric called modality-specific feature importance (MSFI) to quantify and automate physicians' assessment of explanation plausibility.
The authors evaluate 16 post-hoc XAI algorithms, which are grouped into gradient-based and perturbation-based methods:
- Gradient-based methods: Gradient, Guided BackProp, GradCAM, Guided GradCAM, DeepLift, InputGradient, Integrated Gradients, Gradient Shap, Deconvolution, Smooth Grad.
- Perturbation-based methods: Occlusion, Feature Ablation, Shapley Value Sampling, Kernel Shap, Feature Permutation, Lime.
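Most of these methods are available in attribution libraries such as Captum for PyTorch. The snippet below is a minimal sketch, not the authors' code, of how one gradient-based and one perturbation-based heatmap could be generated; the `model`, `x`, and `target` inputs are illustrative assumptions.

```python
import torch
from captum.attr import IntegratedGradients, Occlusion

def heatmaps(model, x, target):
    """Generate one gradient-based and one perturbation-based attribution map.

    model: trained PyTorch classifier (hypothetical)
    x: multi-modal input tensor, e.g. shape (1, 4, 128, 128) with modalities as channels
    target: class index to explain
    """
    model.eval()

    # Gradient-based: Integrated Gradients
    ig = IntegratedGradients(model)
    ig_attr = ig.attribute(x, target=target, n_steps=50)

    # Perturbation-based: Occlusion, sliding a patch over one modality at a time
    occ = Occlusion(model)
    occ_attr = occ.attribute(
        x,
        target=target,
        sliding_window_shapes=(1, 8, 8),
        strides=(1, 4, 4),
    )
    return ig_attr, occ_attr
```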
Key findings include:
- Heatmap explanations only partially fulfill G2 because they localize important features without describing their pathology.
- The evaluated heatmap methods did not reliably exhibit G3 Truthfulness across multiple models on the two clinical tasks.
- The examined XAI methods did not meet G4 Informative plausibility on either the glioma or knee task.
The authors used the following metrics in their evaluation:
$\varphi_m(v) = \sum_{c \subseteq \mathcal{M} \backslash \{m\}} \frac{|c|!\,(M-|c|-1)!}{M!}\left(v(c \cup \{m\}) - v(c)\right)$
where:
- $\varphi_m(v)$ is the modality Shapley value, which is the modality importance score for a modality $m$.
- $v$ is the modality-specific performance metric (accuracy for the glioma task and AUC (Area Under the Curve) for the knee task).
- $\mathcal{M} \backslash \{m\}$ denotes the modalities other than $m$, so the sum ranges over all modality subsets not including $m$.
- $c$ represents a modality subset.
- $M$ is the number of modalities.
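Since the number of modalities is small, the modality Shapley value can be computed exactly by enumerating all subsets. The sketch below assumes a user-supplied value function `v` (e.g., validation accuracy or AUC when only the given modalities are available); it is an illustration, not the authors' implementation.

```python
from itertools import combinations
from math import factorial

def modality_shapley(modalities, v):
    """Exact modality Shapley values.

    modalities: list of modality names, e.g. ["T1", "T1c", "T2", "FLAIR"]
    v: callable mapping a frozenset of modalities to a performance score
       (accuracy for the glioma task, AUC for the knee task).
    """
    M = len(modalities)
    phi = {}
    for m in modalities:
        others = [x for x in modalities if x != m]
        total = 0.0
        for k in range(len(others) + 1):
            for c in combinations(others, k):
                # Shapley weight |c|!(M-|c|-1)!/M! for each subset not containing m
                weight = factorial(len(c)) * factorial(M - len(c) - 1) / factorial(M)
                total += weight * (v(frozenset(c) | {m}) - v(frozenset(c)))
        phi[m] = total
    return phi
```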
The authors also define $\Delta\text{AUPC}$ as:
$\Delta\text{AUPC}(\mathcal{H}) = \text{AUPC}(\mathcal{H}) - \text{AUPC}(\mathcal{H}_0),$
where:
- $\Delta\text{AUPC}(\mathcal{H})$ is the difference of the area under the feature perturbation curve.
- $\text{AUPC}(\mathcal{H})$ is the area under the feature perturbation curve of an XAI method $\mathcal{H}$.
- $\text{AUPC}(\mathcal{H}_0)$ is the area under the feature perturbation curve of the corresponding baseline $\mathcal{H}_0$.
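A minimal sketch of how AUPC and its difference could be computed, assuming `scores_xai` and `scores_baseline` record the model's performance as increasing fractions of the most important heatmap features are perturbed (variable names are illustrative, not from the paper):

```python
import numpy as np

def aupc(fractions, scores):
    """Area under the feature perturbation curve (trapezoidal rule).

    fractions: increasing fractions of features perturbed, e.g. [0.0, 0.1, ..., 1.0]
    scores: model performance measured at each perturbation level.
    """
    return np.trapz(scores, fractions)

def delta_aupc(fractions, scores_xai, scores_baseline):
    """Difference between the XAI method's AUPC and the baseline's AUPC."""
    return aupc(fractions, scores_xai) - aupc(fractions, scores_baseline)
```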
The authors formulate the MSFI metric as:
$\widehat{\text{MSFI}} = \sum_{m} \varphi_m \frac{\sum_i \mathbbm{1}(L_m^i > 0) \odot S_m^i}{\sum_i S_m^i},$
$\text{MSFI} = \frac{\widehat{\text{MSFI}}}{\sum_{m} \varphi_m}$
where:
- $\widehat{\text{MSFI}}$ is the unnormalized MSFI (Modality-Specific Feature Importance) score.
- $\text{MSFI}$ is the normalized MSFI score.
- $\varphi_m$ is the normalized modality importance value for modality $m$.
- $L_m^i$ is the human-annotated feature mask for modality $m$ at spatial location $i$.
- $S_m^i$ is the heatmap value for modality $m$ at spatial location $i$.
- $\mathbbm{1}(L_m^i > 0)$ is an indicator function that selects heatmap values inside the feature mask.
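Putting the pieces together, a hedged sketch of the MSFI computation, assuming per-modality heatmaps `S` (non-negative values), binary feature masks `L`, and normalized modality importance values `phi`; the names and data layout are illustrative:

```python
import numpy as np

def msfi(S, L, phi):
    """Modality-Specific Feature Importance.

    S: dict mapping modality name -> heatmap array (non-negative values assumed)
    L: dict mapping modality name -> binary feature-mask array of the same shape
    phi: dict mapping modality name -> normalized modality importance value
    """
    msfi_hat = 0.0
    for m in S:
        inside = np.sum((L[m] > 0) * S[m])  # heatmap mass inside the annotated mask
        total = np.sum(S[m])                # total heatmap mass for this modality
        portion = inside / total if total > 0 else 0.0
        msfi_hat += phi[m] * portion
    return msfi_hat / sum(phi.values())     # normalize by the summed modality importance
```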