Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating Explainable AI Systems (2001.08298v1)

Published 22 Jan 2020 in cs.AI and cs.HC

Abstract: Explainable artificially intelligent (XAI) systems form part of sociotechnical systems, e.g., human+AI teams tasked with making decisions. Yet, current XAI systems are rarely evaluated by measuring the performance of human+AI teams on actual decision-making tasks. We conducted two online experiments and one in-person think-aloud study to evaluate two currently common techniques for evaluating XAI systems: (1) using proxy, artificial tasks such as how well humans predict the AI's decision from the given explanations, and (2) using subjective measures of trust and preference as predictors of actual performance. The results of our experiments demonstrate that evaluations with proxy tasks did not predict the results of the evaluations with the actual decision-making tasks. Further, the subjective measures on evaluations with actual decision-making tasks did not predict the objective performance on those same tasks. Our results suggest that by employing misleading evaluation methods, our field may be inadvertently slowing its progress toward developing human+AI teams that can reliably perform better than humans or AIs alone.

Evaluating Explainable AI Systems Through Reliable Measures

The paper entitled "Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating Explainable AI Systems" by Buçinca et al. examines the efficacy of current evaluation methodologies for Explainable AI (XAI) systems within the context of sociotechnical systems, specifically human+AI teams tasked with decision-making. The authors argue that the prevalent use of proxy tasks and subjective evaluations may not faithfully reflect the actual performance of these systems in real-world scenarios, potentially stalling progress in the field.

Key Insights and Experimental Design

The paper is built around two critical insights:

  1. Proxy Tasks Fallacy: Proxy tasks that assess how well humans can predict an AI's decisions are often used to evaluate the interpretability of XAI systems. However, these tasks compel users to engage with the AI's explanations more rigorously than they would in genuine decision-making situations. Consequently, results from such tasks can misrepresent a system's actual effectiveness in practice; a sketch of how this mismatch can be quantified follows this list.
  2. Subjective Measures Insufficiency: Relying on subjective measures such as trust or preference to evaluate XAI systems overlooks an important aspect of performance assessment. Users' preferences do not always translate into improved decision-making performance, which calls into question the validity of these measures as standalone indicators of system effectiveness.
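
To make the first point concrete, one way to check whether a proxy-task evaluation predicts an actual-task evaluation is to ask whether the two would even rank the same explanation conditions in the same order. The sketch below is hypothetical: the condition names and scores are placeholders rather than the paper's data, and Spearman rank correlation is just one reasonable agreement measure.

```python
# Hypothetical sketch: does a proxy-task evaluation rank explanation
# conditions the same way an actual decision-making evaluation does?
# Condition names and scores are placeholders, not data from the paper.
from scipy.stats import spearmanr

# Mean score per explanation condition under each evaluation method.
proxy_task_scores = {          # e.g., accuracy at predicting the AI's output
    "inductive": 0.81,
    "deductive": 0.74,
    "no_explanation": 0.62,
}
actual_task_scores = {         # e.g., human+AI decision accuracy
    "inductive": 0.70,
    "deductive": 0.73,
    "no_explanation": 0.69,
}

conditions = sorted(proxy_task_scores)
proxy = [proxy_task_scores[c] for c in conditions]
actual = [actual_task_scores[c] for c in conditions]

rho, p_value = spearmanr(proxy, actual)
print(f"Rank agreement between the two evaluations: rho={rho:.2f} (p={p_value:.2f})")
# A low or negative rho means the proxy task would pick a different
# "best" explanation design than the actual decision-making task.
```

A strongly positive correlation would be the minimum one could expect if proxy tasks were a valid surrogate; the paper's results suggest this assumption does not hold.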

To support their position, the authors conducted two online experiments and an in-person think-aloud study involving a nutrition-related AI decision-support system. The experiments contrasted outcomes obtained from proxy tasks with those from actual decision-making tasks and examined the relationship between subjective assessments and observed performance.

Experimental Findings

  • When using proxy tasks, participants demonstrated higher trust in and preference for inductive (example-based) explanations compared to deductive (rule-based) explanations. However, this did not correspond to improved performance in the actual decision-making tasks.
  • In tasks requiring genuine decision-making, users were better at recognizing AI errors with inductive explanations, despite expressing more trust in deductive explanations. This dichotomy between preference and performance underscores the limitations of subjective measures (see the sketch after this list).
  • The in-person think-aloud study, while providing valuable qualitative insights, suggested that verbalizing thought processes may artificially inflate cognitive engagement, bringing the results closer to those observed with proxy tasks.
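
The gap between what participants say and how they perform can be surfaced by aggregating the two outcome types separately and checking whether they favour the same condition. The sketch below uses made-up records: the condition labels, 1-7 trust ratings, and error-detection rates are illustrative assumptions, not the study's dataset.

```python
# Hypothetical sketch: compare self-reported trust with objective
# error-detection performance per explanation condition.
# All values are illustrative placeholders, not the study's data.
from collections import defaultdict
from statistics import mean

# Each record: (condition, trust rating on a 1-7 scale,
#               fraction of the AI's wrong recommendations the participant rejected)
records = [
    ("deductive", 6, 0.40), ("deductive", 5, 0.35), ("deductive", 6, 0.45),
    ("inductive", 4, 0.60), ("inductive", 5, 0.55), ("inductive", 4, 0.65),
]

trust, performance = defaultdict(list), defaultdict(list)
for condition, rating, caught in records:
    trust[condition].append(rating)
    performance[condition].append(caught)

most_trusted = max(trust, key=lambda c: mean(trust[c]))
best_performing = max(performance, key=lambda c: mean(performance[c]))
print(f"Most trusted condition:    {most_trusted}")
print(f"Best performing condition: {best_performing}")
# When these two labels differ, an evaluation based on trust alone
# would favour a design that does not actually improve decisions.
```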

Implications and Future Directions

The findings underscore the importance of aligning evaluation methodologies with real-world use and expectations. As XAI systems increasingly become integral to high-stakes decision-making environments, such as healthcare and finance, ensuring these evaluations reflect true utility and effectiveness becomes paramount.

For AI researchers and developers, the paper advocates for:

  • Revisiting Evaluation Metrics: Moving beyond proxy tasks and incorporating assessments that accurately capture performance in authentic decision contexts. Task design should enable the measurement of actual human+AI collaborative performance, as in the sketch following this list.
  • Balanced Use of Subjective Measures: While user feedback on system transparency can inform improvements, these metrics should be supplementary to objective performance data.
  • Attention to Cognitive Load: Understanding how different explanation types impact cognitive effort can guide the design of explanations that strike the optimal balance between informativeness and user engagement.
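
In the spirit of the first recommendation, a minimal evaluation harness would score the final human+AI decisions on the real task and compare them against human-alone and AI-alone baselines, since the stated goal is teams that reliably outperform both. The field names and trial data below are assumptions for illustration, not an interface or results from the paper.

```python
# Minimal sketch of an actual-task evaluation: compare human+AI team
# accuracy against human-alone and AI-alone baselines on the same items.
# Field names and data are illustrative assumptions, not from the paper.
from dataclasses import dataclass

@dataclass
class Trial:
    ground_truth: str        # correct decision for the task item
    ai_recommendation: str   # what the AI recommended
    human_alone: str         # decision made without AI support
    team_decision: str       # final decision made with AI + explanation

def accuracy(trials, pick):
    # Fraction of trials where the chosen decision matches the ground truth.
    return sum(pick(t) == t.ground_truth for t in trials) / len(trials)

trials = [
    Trial("option_a", "option_a", "option_b", "option_a"),
    Trial("option_b", "option_a", "option_b", "option_b"),
    Trial("option_a", "option_a", "option_a", "option_a"),
]

print("AI alone:     ", accuracy(trials, lambda t: t.ai_recommendation))
print("Human alone:  ", accuracy(trials, lambda t: t.human_alone))
print("Human+AI team:", accuracy(trials, lambda t: t.team_decision))
# The explanation design "works" on this measure only if the team score
# exceeds both baselines, not merely if users say they prefer it.
```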

To advance the efficacy of explainable AI, future studies should focus on developing reliable evaluation frameworks that integrate insights from cognitive science and human-computer interaction. Furthermore, such frameworks should account for the diversity of real-world application settings, acknowledging that different tasks may necessitate distinct interaction modalities and explanation styles.

Authors (4)
  1. Zana Buçinca
  2. Phoebe Lin
  3. Krzysztof Z. Gajos
  4. Elena L. Glassman
Citations (244)