Fool Me Once? Contrasting Textual and Visual Explanations in a Clinical Decision-Support Setting (2410.12284v2)

Published 16 Oct 2024 in cs.HC, cs.CL, and cs.CV

Abstract: The growing capabilities of AI models are leading to their wider use, including in safety-critical domains. Explainable AI (XAI) aims to make these models safer to use by making their inference process more transparent. However, current explainability methods are seldom evaluated in the way they are intended to be used: by real-world end users. To address this, we conducted a large-scale user study with 85 healthcare practitioners in the context of human-AI collaborative chest X-ray analysis. We evaluated three types of explanations: visual explanations (saliency maps), natural language explanations, and a combination of both modalities. We specifically examined how different explanation types influence users depending on whether the AI advice and explanations are factually correct. We find that text-based explanations lead to significant over-reliance, which is alleviated by combining them with saliency maps. We also observe that the quality of explanations, that is, how much factually correct information they entail, and how much this aligns with AI correctness, significantly impacts the usefulness of the different explanation types.

Authors (10)
  1. Maxime Kayser (5 papers)
  2. Bayar Menzat (2 papers)
  3. Cornelius Emde (7 papers)
  4. Bogdan Bercean (1 paper)
  5. Alex Novak (1 paper)
  6. Abdala Espinosa (1 paper)
  7. Bartlomiej W. Papiez (102 papers)
  8. Susanne Gaube (1 paper)
  9. Thomas Lukasiewicz (125 papers)
  10. Oana-Maria Camburu (29 papers)

Summary

Overview of "Fool Me Once? Contrasting Textual and Visual Explanations in a Clinical Decision-Support Setting"

This paper investigates the efficacy of Explainable AI (XAI) in the context of clinical decision-support systems (CDSS), focusing on human-AI collaboration in chest X-ray analysis. The authors conducted a comprehensive user study with 85 healthcare practitioners to evaluate three types of explanations: visual explanations (saliency maps), natural language explanations (NLEs), and their combination. The aim was to assess how these explanations affect user reliance and decision-making accuracy when the AI advice is factually correct versus incorrect.
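The summary does not specify which saliency-map technique the study used, so the following is only a minimal sketch, assuming a plain input-gradient method and a stand-in DenseNet-121 classifier, of how a pixel-level visual explanation of this kind can be produced:

```python
# Minimal sketch of a gradient-based saliency map for an image classifier.
# Assumption: the actual saliency method and model used in the study are not
# stated in this summary; DenseNet-121 and input gradients are stand-ins.
import torch
import torchvision.models as models

model = models.densenet121(weights=None)  # hypothetical chest X-ray classifier
model.eval()

def saliency_map(image: torch.Tensor, target_class: int) -> torch.Tensor:
    """Return a per-pixel importance map for `target_class`.

    `image` is a (1, 3, H, W) tensor; the absolute gradient of the class
    logit with respect to each input pixel serves as the visual explanation.
    """
    image = image.detach().clone().requires_grad_(True)
    logits = model(image)
    logits[0, target_class].backward()
    # Collapse the channel dimension to obtain a single heatmap.
    return image.grad.abs().max(dim=1).values.squeeze(0)

# Example usage with a dummy image.
heatmap = saliency_map(torch.rand(1, 3, 224, 224), target_class=0)
```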

Key Findings

The paper provides several insights into the interaction between explanation types and user responses:

  1. Overreliance on Textual Explanations: The results indicate a concerning overreliance on NLEs: practitioners were more inclined to trust AI predictions accompanied by text-based explanations, even when the AI advice was incorrect. This suggests a persuasive element in language-based interfaces and aligns with prior findings that such interfaces can humanize AI systems (one common way to quantify such overreliance is sketched after this list).
  2. Combination of Visual and Textual Explanations: When NLEs were paired with saliency maps, users could more accurately discern whether the AI was correct. This combination was most effective at improving user performance when both the AI advice and the explanations were factually correct.
  3. Critical Alignment of Explanation Correctness: A significant determinant of usefulness is the alignment between explanation correctness and AI prediction accuracy. Explanations misaligned with AI correctness—either providing incorrect justifications for correct predictions or vice versa—were detrimental to user decisions.
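The summary does not give the exact measures used, but over- and under-reliance in studies of this kind are typically derived from agreement rates conditioned on AI correctness. A minimal sketch, assuming hypothetical trial-level data fields:

```python
# Hypothetical operationalization of reliance metrics; the paper summary does
# not specify the authors' definitions, so the field names and formulas below
# are illustrative assumptions only.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Trial:
    ai_correct: bool        # was the AI advice factually correct?
    user_followed_ai: bool  # did the participant's final answer match the AI?

def reliance_rates(trials: List[Trial]) -> Dict[str, float]:
    """Over-reliance: following the AI when it is wrong.
    Under-reliance: rejecting the AI when it is right."""
    wrong = [t for t in trials if not t.ai_correct]
    right = [t for t in trials if t.ai_correct]
    over = sum(t.user_followed_ai for t in wrong) / max(len(wrong), 1)
    under = sum(not t.user_followed_ai for t in right) / max(len(right), 1)
    return {"over_reliance": over, "under_reliance": under}

# Comparing these rates across the three explanation conditions (saliency,
# NLE, combined) would surface effects like the overreliance reported above.
print(reliance_rates([Trial(False, True), Trial(True, True)]))
```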

Implications and Future Directions

The research highlights the nuanced roles of different explanation modalities in XAI, especially in safety-critical domains like healthcare. The findings raise concerns about uncritical reliance on language-based explanations, driven by their persuasiveness, and call for AI models that balance human-like interaction with robust, verifiable explanations.

Future directions should consider:

  • Continuous Improvement in XAI Evaluation: The paper underscores the importance of evaluating XAI tools not merely by model transparency but by actual improvements in human-AI team performance. More robust metrics could further inform the development of reliable explanation generation.
  • Reducing Overreliance: Future models might incorporate adaptive systems that assess user confidence and adjust the assertiveness of explanations dynamically, potentially using feedback loops.
  • Real-World Application Testing: Longitudinal studies in real clinical settings may provide deeper insights into how practitioners interact with AI over time and how explanation modalities might be optimized.

Conclusion

This work sheds light on the complex dynamics of explainability in AI systems in healthcare, advocating for a balanced approach to employing both textual and visual explanations. The research is a crucial step towards safer, more effective integration of AI in clinical environments, emphasizing the need for careful consideration of how explanations can guide user trust and decision-making.
