The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models (2404.03189v2)

Published 4 Apr 2024 in cs.CL and cs.AI

Abstract: In order to oversee advanced AI systems, it is important to understand their underlying decision-making process. When prompted, LLMs can provide natural language explanations or reasoning traces that sound plausible and receive high ratings from human annotators. However, it is unclear to what extent these explanations are faithful, i.e., truly capture the factors responsible for the model's predictions. In this work, we introduce Correlational Explanatory Faithfulness (CEF), a metric that can be used in faithfulness tests based on input interventions. Previous metrics used in such tests take into account only binary changes in the predictions. Our metric accounts for the total shift in the model's predicted label distribution, more accurately reflecting the explanations' faithfulness. We then introduce the Correlational Counterfactual Test (CCT) by instantiating CEF on the Counterfactual Test (CT) from Atanasova et al. (2023). We evaluate the faithfulness of free-text explanations generated by few-shot-prompted LLMs from the Llama2 family on three NLP tasks. We find that our metric measures aspects of faithfulness which the CT misses.

An Analysis of Correlational Explanatory Faithfulness in LLMs

The paper "The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in LLMs" by Noah Y. Siegel et al. introduces a novel approach to evaluating the faithfulness of free-text explanations generated by LLMs. This research seeks greater insight into the decision-making processes of such models, a crucial requirement for their deployment in high-stakes environments. The paper critiques existing faithfulness metrics and proposes Correlational Explanatory Faithfulness (CEF) and the Correlational Counterfactual Test (CCT) as more nuanced and informative alternatives.

Key Contributions

  1. Critique of Binary Metrics: The authors identify a significant limitation in prevailing faithfulness metrics, which depend on binary indicators of whether the predicted label changes after an input intervention. This binary treatment lacks granularity: an intervention that substantially shifts the model's predicted probabilities without flipping the top label registers as having no effect at all.
  2. Introduction of CEF: The paper proposes Correlational Explanatory Faithfulness (CEF), a metric that evaluates faithfulness based on the correlation between prediction impact and explanatory mentions. CEF provides a more continuous measure of faithfulness than existing metrics. It considers both the degree of impact of input interventions and the frequency of explanatory mentions of impactful factors, thereby addressing the need for explanations that highlight significant factors over trivial ones.
  3. Correlational Counterfactual Test (CCT): Building on the Counterfactual Test (CT) of Atanasova et al. (2023), the authors propose CCT as a refined tool for measuring explanatory faithfulness. CCT quantifies the shift in the predicted label distribution after an intervention using a statistical distance, specifically the Total Variation Distance (TVD), i.e., half the L1 distance between the pre- and post-intervention distributions. This offers a more comprehensive view of model behavior than binary label-change detection; a minimal sketch of this computation appears after this list.
  4. Empirical Evaluation: The authors apply CCT to evaluate free-text explanations generated by few-shot-prompted LLMs from the Llama2 family across three NLP tasks: e-SNLI, ComVE, and ECQA. The CCT reveals faithfulness trends that previous metrics may have missed, indicating gaps in prior interpretation frameworks.
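
To make the CEF and CCT computations concrete, here is a minimal Python sketch, not the authors' released code: intervention impact is measured as the TVD between the model's predicted label distributions before and after an intervention, and the faithfulness score is the correlation between those impacts and binary indicators of whether the explanation mentions the intervened-on factor. The function names, the use of Pearson correlation, and the toy numbers below are illustrative assumptions.

```python
import numpy as np

def total_variation_distance(p, q):
    """TVD between two predicted label distributions: half the L1 distance."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

def cef_score(impacts, mentioned):
    """Correlational faithfulness: correlation between intervention impact and
    a binary 'mentioned in explanation' indicator (Pearson used here as an
    illustrative choice)."""
    impacts = np.asarray(impacts, dtype=float)
    mentioned = np.asarray(mentioned, dtype=float)
    if impacts.std() == 0.0 or mentioned.std() == 0.0:
        return float("nan")  # correlation is undefined without variation
    return float(np.corrcoef(impacts, mentioned)[0, 1])

# Toy data: predicted label distributions before/after four interventions,
# plus whether each intervened-on factor was mentioned in the explanation.
before = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.3, 0.1], [0.4, 0.4, 0.2]]
after = [[0.2, 0.6, 0.2], [0.5, 0.3, 0.2], [0.3, 0.5, 0.2], [0.4, 0.35, 0.25]]
mentioned = [1, 0, 1, 0]

impacts = [total_variation_distance(b, a) for b, a in zip(before, after)]
print("TVD impacts:", impacts)  # approx. [0.5, 0.0, 0.3, 0.05]
print("CCT-style score:", cef_score(impacts, mentioned))
```

In this toy example, interventions with large TVD impact are the ones mentioned in the explanations, so the score is positive; a score near zero would indicate that explanations mention impactful and trivial factors indiscriminately.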

Implications and Future Directions

The introduction of CEF and CCT signals a significant shift toward more thorough assessments of explanation faithfulness in LLMs. This advancement has practical ramifications for domains such as healthcare and criminal justice, where understanding AI model reasoning is pivotal. By offering a more granular, correlation-based analysis, CCT could enhance oversight mechanisms for AI, fostering greater trust in AI's deployment in sensitive sectors.

Theoretically, this work underscores the importance of capturing non-binary dynamics in AI interpretability research. Continued refinement of metrics like CCT could yield deeper insights into the underlying mechanisms of LLMs and improve the generalizability of interpretability findings across diverse AI systems.

Looking forward, the research opens pathways for the evaluation of instruction-tuned models, as well as the exploration of explanation generation strategies such as question decomposition. A continued focus on the interaction between prediction impact and explanation structure could further strengthen the link between transparency and model reliability. Enhanced metrics like CCT may catalyze broader efforts to standardize interpretability assessments, potentially setting benchmarks for future contributions in the field.

In conclusion, the development and application of CEF and CCT signify a substantial contribution to AI interpretability. This research provides a more faithful reflection of model reasoning processes, challenging researchers to reconceive the metrics guiding explainability in AI. As models evolve, so too must the methods by which we assess their transparency and trustworthiness, a challenge this paper addresses with commendable depth.

References (31)
  1. Sanity checks for saliency maps. In Neural Information Processing Systems.
  2. Explanations for CommonsenseQA: New Dataset and Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3050–3065, Online. Association for Computational Linguistics.
  3. Faithfulness tests for natural language explanations. ACL.
  4. The struggles of feature-based explanations: Shapley values vs. minimal sufficient subsets. In AAAI 2021 Workshop on Explainable Agency in Artificial Intelligence.
  5. e-SNLI: Natural language inference with natural language explanations. NeurIPS.
  6. Interpretable by design: Learning predictors by composing interpretable queries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7430–7443.
  7. Antonia Creswell and Murray Shanahan. 2022. Faithful reasoning using large language models.
  8. Selection-inference: Exploiting large language models for interpretable logical reasoning. ICLR.
  9. ERASER: A benchmark to evaluate rationalized NLP models. In Annual Meeting of the Association for Computational Linguistics.
  10. On interpretability of artificial neural networks: A survey. IEEE Transactions on Radiation and Plasma Medical Sciences, 5:741–760.
  11. Christiane Fellbaum. 2010. WordNet. In Theory and applications of ontology: computer applications, pages 231–243. Springer.
  12. Alon Jacovi and Yoav Goldberg. 2020. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Annual Meeting of the Association for Computational Linguistics.
  13. Explaining chest x-ray pathologies in natural language. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2022, pages 701–713, Cham. Springer Nature Switzerland.
  14. Tamera Lanham. 2022. Externalized reasoning oversight: a research direction for language model alignment.
  15. Measuring faithfulness in chain-of-thought reasoning.
  16. The alignment problem from a deep learning perspective.
  17. HuSpaCy: An industrial-strength Hungarian natural language processing toolkit. arXiv preprint arXiv:2201.01956.
  18. Martin F Porter. 2001. Snowball: A language for stemming algorithms.
  19. Question decomposition improves the faithfulness of model-generated reasoning.
  20. Explain yourself! leveraging language models for commonsense reasoning.
  21. Fabien Roger and Ryan Greenblatt. 2023. Preventing language models from hiding their reasoning.
  22. Cynthia Rudin. 2018. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1:206–215.
  23. Goal misgeneralization: Why correct specifications aren’t enough for correct goals.
  24. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.
  25. Llama 2: Open foundation and fine-tuned chat models.
  26. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. ArXiv, abs/2305.04388.
  27. SemEval-2020 task 4: Commonsense validation and explanation. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 307–321, Barcelona (online). International Committee for Computational Linguistics.
  28. Honesty is the best policy: Defining and mitigating AI deception.
  29. Chain-of-thought prompting elicits reasoning in large language models.
  30. Sarah Wiegreffe and Ana Marasović. 2021. Teach me to explain: A review of datasets for explainable natural language processing. In NeurIPS Datasets and Benchmarks.
  31. Measuring association between labels and free-text rationales. In Conference on Empirical Methods in Natural Language Processing.
Authors (4)
  1. Noah Y. Siegel (7 papers)
  2. Oana-Maria Camburu (29 papers)
  3. Nicolas Heess (139 papers)
  4. Maria Perez-Ortiz (92 papers)
Citations (5)