This paper introduces the Clinical Explainable AI Guidelines, a set of five criteria for designing and evaluating clinical XAI (Explainable Artificial Intelligence) techniques for medical image analysis. These guidelines aim to ensure that explanation techniques are both technically sound and clinically useful. The authors evaluate 16 commonly used heatmap XAI techniques against these guidelines, revealing their limitations for clinical application.
The Clinical XAI Guidelines consist of the following five criteria:
- G1 Understandability: Explanations should be easily understood by clinical users without requiring technical expertise.
- G2 Clinical relevance: Explanations should align with physicians' clinical decision-making processes and support their clinical reasoning.
- G3 Truthfulness: Explanations should accurately reflect the AI model's decision-making process.
- G4 Informative plausibility: User assessment of explanation plausibility should provide insights into AI decision quality, including potential flaws or biases.
- G5 Computational efficiency: Explanations should be generated within a clinically acceptable timeframe.
The authors conducted a systematic evaluation of 16 heatmap techniques, assessing their adherence to the proposed guidelines across two clinical tasks. The evaluation revealed that while existing heatmap methods generally meet G1 and partially meet G2, they often fail to meet G3 and G4, indicating their inadequacy for clinical use.
The paper also addresses the clinically relevant but technically underexplored problem of multi-modal medical image explanation. To facilitate this, the authors introduce a novel metric called modality-specific feature importance (MSFI) to quantify and automate physicians' assessment of explanation plausibility.
The authors evaluate 16 post-hoc XAI algorithms, which are grouped into gradient-based and perturbation-based methods:
- Gradient-based methods: Gradient, Guided BackProp, GradCAM, Guided GradCAM, DeepLift, InputGradient, Integrated Gradients, Gradient Shap, Deconvolution, Smooth Grad.
- Perturbation-based methods: Occlusion, Feature Ablation, Shapley Value Sampling, Kernel Shap, Feature Permutation, Lime.
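Most of these methods are available in attribution libraries such as Captum for PyTorch. The snippet below is a minimal sketch, not the authors' code, of how one gradient-based and one perturbation-based heatmap could be generated; the `model`, `x`, and `target` inputs are illustrative assumptions.

```python
import torch
from captum.attr import IntegratedGradients, Occlusion

def heatmaps(model, x, target):
    """Generate one gradient-based and one perturbation-based attribution map.

    model: trained PyTorch classifier (hypothetical)
    x: multi-modal input tensor, e.g. shape (1, 4, 128, 128) with modalities as channels
    target: class index to explain
    """
    model.eval()

    # Gradient-based: Integrated Gradients
    ig = IntegratedGradients(model)
    ig_attr = ig.attribute(x, target=target, n_steps=50)

    # Perturbation-based: Occlusion, sliding a patch over one modality at a time
    occ = Occlusion(model)
    occ_attr = occ.attribute(
        x,
        target=target,
        sliding_window_shapes=(1, 8, 8),
        strides=(1, 4, 4),
    )
    return ig_attr, occ_attr
```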
Key findings include:
- Heatmap explanations only partially fulfill G2 because they localize important features without describing their pathology.
- The evaluated heatmap methods did not reliably exhibit G3 Truthfulness across multiple models on the two clinical tasks.
- The examined XAI methods did not meet G4 Informative plausibility on either the glioma or knee task.
The authors used the following metrics in their evaluation:
$\varphi_m(v) = \sum_{c \subseteq \mathcal{M} \backslash \{m\}} \frac{|c|!\,(M-|c|-1)!}{M!}\left(v(c \cup \{m\}) - v(c)\right)$
where:
- $\varphi_m(v)$ is the modality Shapley value, which is the modality importance score for a modality $m$.
- $v$ is the modality-specific performance metric (accuracy for the glioma task and AUC (Area Under the Curve) for the knee task).
- $\mathcal{M} \backslash \{m\}$ denotes the modalities other than $m$, so the sum ranges over all modality subsets not including $m$.
- $c$ represents a modality subset.
- $M$ is the number of modalities.
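Since the number of modalities is small, the modality Shapley value can be computed exactly by enumerating all subsets. The sketch below assumes a user-supplied value function `v` (e.g., validation accuracy or AUC when only the given modalities are available); it is an illustration, not the authors' implementation.

```python
from itertools import combinations
from math import factorial

def modality_shapley(modalities, v):
    """Exact modality Shapley values.

    modalities: list of modality names, e.g. ["T1", "T1c", "T2", "FLAIR"]
    v: callable mapping a frozenset of modalities to a performance score
       (accuracy for the glioma task, AUC for the knee task).
    """
    M = len(modalities)
    phi = {}
    for m in modalities:
        others = [x for x in modalities if x != m]
        total = 0.0
        for k in range(len(others) + 1):
            for c in combinations(others, k):
                # Shapley weight |c|!(M-|c|-1)!/M! for each subset not containing m
                weight = factorial(len(c)) * factorial(M - len(c) - 1) / factorial(M)
                total += weight * (v(frozenset(c) | {m}) - v(frozenset(c)))
        phi[m] = total
    return phi
```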
The authors also define $\Delta\text{AUPC}$ as:
$\Delta\text{AUPC}(\mathcal{H}) = \text{AUPC}(\mathcal{H}) - \text{AUPC}(\mathcal{H}_0),$
where:
- $\Delta\text{AUPC}(\mathcal{H})$ is the difference of the area under the feature perturbation curve.
- $\text{AUPC}(\mathcal{H})$ is the area under the feature perturbation curve of an XAI method $\mathcal{H}$.
- $\text{AUPC}(\mathcal{H}_0)$ is the area under the feature perturbation curve of the corresponding baseline $\mathcal{H}_0$.
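A minimal sketch of how AUPC and its difference could be computed, assuming `scores_xai` and `scores_baseline` record the model's performance as increasing fractions of the most important heatmap features are perturbed (variable names are illustrative, not from the paper):

```python
import numpy as np

def aupc(fractions, scores):
    """Area under the feature perturbation curve (trapezoidal rule).

    fractions: increasing fractions of features perturbed, e.g. [0.0, 0.1, ..., 1.0]
    scores: model performance measured at each perturbation level.
    """
    return np.trapz(scores, fractions)

def delta_aupc(fractions, scores_xai, scores_baseline):
    """Difference between the XAI method's AUPC and the baseline's AUPC."""
    return aupc(fractions, scores_xai) - aupc(fractions, scores_baseline)
```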
The authors formulate the MSFI metric as:
$\widehat{\text{MSFI}} = \sum_{m} \varphi_m \frac{\sum_i \mathbbm{1}(L_m^i > 0) \odot S_m^i}{\sum_i S_m^i},$
$\text{MSFI} = \frac{\widehat{\text{MSFI}}}{\sum_{m} \varphi_m}$
where:
- $\widehat{\text{MSFI}}$ is the unnormalized MSFI (Modality-Specific Feature Importance) score.
- $\text{MSFI}$ is the normalized MSFI score.
- $\varphi_m$ is the normalized modality importance value for modality $m$.
- $L_m^i$ is the human-annotated feature mask for modality $m$ at spatial location $i$.
- $S_m^i$ is the heatmap value for modality $m$ at spatial location $i$.
- $\mathbbm{1}(L_m^i > 0)$ is an indicator function that selects heatmap values inside the feature mask.
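Putting the pieces together, a hedged sketch of the MSFI computation, assuming per-modality heatmaps `S` (non-negative values), binary feature masks `L`, and normalized modality importance values `phi`; the names and data layout are illustrative:

```python
import numpy as np

def msfi(S, L, phi):
    """Modality-Specific Feature Importance.

    S: dict mapping modality name -> heatmap array (non-negative values assumed)
    L: dict mapping modality name -> binary feature-mask array of the same shape
    phi: dict mapping modality name -> normalized modality importance value
    """
    msfi_hat = 0.0
    for m in S:
        inside = np.sum((L[m] > 0) * S[m])  # heatmap mass inside the annotated mask
        total = np.sum(S[m])                # total heatmap mass for this modality
        portion = inside / total if total > 0 else 0.0
        msfi_hat += phi[m] * portion
    return msfi_hat / sum(phi.values())     # normalize by the summed modality importance
```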