The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models (2404.03189v2)

Published 4 Apr 2024 in cs.CL and cs.AI

Abstract: In order to oversee advanced AI systems, it is important to understand their underlying decision-making process. When prompted, LLMs can provide natural language explanations or reasoning traces that sound plausible and receive high ratings from human annotators. However, it is unclear to what extent these explanations are faithful, i.e., truly capture the factors responsible for the model's predictions. In this work, we introduce Correlational Explanatory Faithfulness (CEF), a metric that can be used in faithfulness tests based on input interventions. Previous metrics used in such tests take into account only binary changes in the predictions. Our metric accounts for the total shift in the model's predicted label distribution, more accurately reflecting the explanations' faithfulness. We then introduce the Correlational Counterfactual Test (CCT) by instantiating CEF on the Counterfactual Test (CT) from Atanasova et al. (2023). We evaluate the faithfulness of free-text explanations generated by few-shot-prompted LLMs from the Llama2 family on three NLP tasks. We find that our metric measures aspects of faithfulness which the CT misses.

An Analysis of Correlational Explanatory Faithfulness in LLMs

The paper "The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in LLMs" by Noah Y. Siegel et al. introduces a novel approach to evaluating the faithfulness of free-text explanations generated by LLMs. This research seeks greater insight into the decision-making processes of such models, a crucial requirement for their deployment in high-stakes environments. The paper critiques existing faithfulness metrics and proposes Correlational Explanatory Faithfulness (CEF) and the Correlational Counterfactual Test (CCT) as more nuanced and informative alternatives.

Key Contributions

  1. Critique of Binary Metrics: The authors identify a significant limitation in prevailing faithfulness metrics, which depend on binary indicators of whether the predicted label changes after an input intervention. This binary treatment lacks granularity: an intervention that substantially shifts the model's predicted probabilities without flipping the top label registers as having no effect at all.
  2. Introduction of CEF: The paper proposes Correlational Explanatory Faithfulness (CEF), a metric that evaluates faithfulness based on the correlation between prediction impact and explanatory mentions. CEF provides a more continuous measure of faithfulness than existing metrics. It considers both the degree of impact of input interventions and the frequency of explanatory mentions of impactful factors, thereby addressing the need for explanations that highlight significant factors over trivial ones.
  3. Correlational Counterfactual Test (CCT): Building on the Counterfactual Test (CT) of Atanasova et al. (2023), the authors propose CCT as a refined tool for measuring explanatory faithfulness. CCT quantifies the shift in the predicted label distribution after an intervention using a statistical distance, specifically the Total Variation Distance (TVD), i.e., half the L1 distance between the pre- and post-intervention distributions. This offers a more comprehensive view of model behavior than binary label-change detection; a minimal sketch of this computation appears after this list.
  4. Empirical Evaluation: The authors apply CCT to evaluate free-text explanations generated by few-shot-prompted LLMs from the Llama2 family across three NLP tasks: e-SNLI, ComVE, and ECQA. The CCT reveals faithfulness trends that previous metrics may have missed, indicating gaps in prior interpretation frameworks.
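
To make the CEF and CCT computations concrete, here is a minimal Python sketch, not the authors' released code: intervention impact is measured as the TVD between the model's predicted label distributions before and after an intervention, and the faithfulness score is the correlation between those impacts and binary indicators of whether the explanation mentions the intervened-on factor. The function names, the use of Pearson correlation, and the toy numbers below are illustrative assumptions.

```python
import numpy as np

def total_variation_distance(p, q):
    """TVD between two predicted label distributions: half the L1 distance."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

def cef_score(impacts, mentioned):
    """Correlational faithfulness: correlation between intervention impact and
    a binary 'mentioned in explanation' indicator (Pearson used here as an
    illustrative choice)."""
    impacts = np.asarray(impacts, dtype=float)
    mentioned = np.asarray(mentioned, dtype=float)
    if impacts.std() == 0.0 or mentioned.std() == 0.0:
        return float("nan")  # correlation is undefined without variation
    return float(np.corrcoef(impacts, mentioned)[0, 1])

# Toy data: predicted label distributions before/after four interventions,
# plus whether each intervened-on factor was mentioned in the explanation.
before = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.3, 0.1], [0.4, 0.4, 0.2]]
after = [[0.2, 0.6, 0.2], [0.5, 0.3, 0.2], [0.3, 0.5, 0.2], [0.4, 0.35, 0.25]]
mentioned = [1, 0, 1, 0]

impacts = [total_variation_distance(b, a) for b, a in zip(before, after)]
print("TVD impacts:", impacts)  # approx. [0.5, 0.0, 0.3, 0.05]
print("CCT-style score:", cef_score(impacts, mentioned))
```

In this toy example, interventions with large TVD impact are the ones mentioned in the explanations, so the score is positive; a score near zero would indicate that explanations mention impactful and trivial factors indiscriminately.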

Implications and Future Directions

The introduction of CEF and CCT signals a significant shift toward more thorough assessments of explanation faithfulness in LLMs. This advancement has practical ramifications for domains such as healthcare and criminal justice, where understanding AI model reasoning is pivotal. By offering a more granular, correlation-based analysis, CCT could enhance oversight mechanisms for AI, fostering greater trust in AI's deployment in sensitive sectors.

Theoretically, this work underscores the importance of capturing non-binary dynamics in AI interpretability research. Continued refinement of metrics like CCT could yield deeper insights into the underlying mechanisms of LLMs and improve the generalizability of interpretability findings across diverse AI systems.

Looking forward, the research opens pathways for the evaluation of instruction-tuned models, as well as the exploration of explanation generation strategies such as question decomposition. A continued focus on the interaction between prediction impact and explanation structure could further strengthen the link between transparency and model reliability. Enhanced metrics like CCT may catalyze broader efforts to standardize interpretability assessments, potentially setting benchmarks for future contributions in the field.

In conclusion, the development and application of CEF and CCT signify a substantial contribution to AI interpretability. This research provides a more faithful reflection of model reasoning processes, challenging researchers to reconceive the metrics guiding explainability in AI. As models evolve, so too must the methods by which we assess their transparency and trustworthiness, a challenge this paper addresses with commendable depth.

References (31)
  1. Sanity checks for saliency maps. In Neural Information Processing Systems.
  2. Explanations for CommonsenseQA: New Dataset and Models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3050–3065, Online. Association for Computational Linguistics.
  3. Faithfulness tests for natural language explanations. ACL.
  4. The struggles of feature-based explanations: Shapley values vs. minimal sufficient subsets. In AAAI 2021 Workshop on Explainable Agency in Artificial Intelligence.
  5. e-SNLI: Natural language inference with natural language explanations. NeurIPS.
  6. Interpretable by design: Learning predictors by composing interpretable queries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7430–7443.
  7. Antonia Creswell and Murray Shanahan. 2022. Faithful reasoning using large language models.
  8. Selection-inference: Exploiting large language models for interpretable logical reasoning. ICLR.
  9. ERASER: A benchmark to evaluate rationalized NLP models. In Annual Meeting of the Association for Computational Linguistics.
  10. On interpretability of artificial neural networks: A survey. IEEE Transactions on Radiation and Plasma Medical Sciences, 5:741–760.
  11. Christiane Fellbaum. 2010. WordNet. In Theory and applications of ontology: computer applications, pages 231–243. Springer.
  12. Alon Jacovi and Yoav Goldberg. 2020. Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness? In Annual Meeting of the Association for Computational Linguistics.
  13. Explaining chest x-ray pathologies in natural language. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2022, pages 701–713, Cham. Springer Nature Switzerland.
  14. Tamera Lanham. 2022. Externalized reasoning oversight: a research direction for language model alignment.
  15. Measuring faithfulness in chain-of-thought reasoning.
  16. The alignment problem from a deep learning perspective.
  17. HuSpaCy: An industrial-strength Hungarian natural language processing toolkit. arXiv preprint arXiv:2201.01956.
  18. Martin F Porter. 2001. Snowball: A language for stemming algorithms.
  19. Question decomposition improves the faithfulness of model-generated reasoning.
  20. Explain yourself! leveraging language models for commonsense reasoning.
  21. Fabien Roger and Ryan Greenblatt. 2023. Preventing language models from hiding their reasoning.
  22. Cynthia Rudin. 2018. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1:206–215.
  23. Goal misgeneralization: Why correct specifications aren’t enough for correct goals.
  24. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.
  25. Llama 2: Open foundation and fine-tuned chat models.
  26. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. ArXiv, abs/2305.04388.
  27. SemEval-2020 task 4: Commonsense validation and explanation. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 307–321, Barcelona (online). International Committee for Computational Linguistics.
  28. Honesty is the best policy: Defining and mitigating AI deception.
  29. Chain-of-thought prompting elicits reasoning in large language models.
  30. Sarah Wiegreffe and Ana Marasović. 2021. Teach me to explain: A review of datasets for explainable natural language processing. In NeurIPS Datasets and Benchmarks.
  31. Measuring association between labels and free-text rationales. In Conference on Empirical Methods in Natural Language Processing.
Authors (4)
  1. Noah Y. Siegel (7 papers)
  2. Oana-Maria Camburu (29 papers)
  3. Nicolas Heess (139 papers)
  4. Maria Perez-Ortiz (92 papers)
Citations (5)