Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models (2402.04614v3)

Published 7 Feb 2024 in cs.CL

Abstract: LLMs are deployed as powerful tools for several NLP applications. Recent works show that modern LLMs can generate self-explanations (SEs), which elicit their intermediate reasoning steps for explaining their behavior. Self-explanations have seen widespread adoption owing to their conversational and plausible nature. However, there is little to no understanding of their faithfulness. In this work, we discuss the dichotomy between faithfulness and plausibility in SEs generated by LLMs. We argue that while LLMs are adept at generating plausible explanations -- seemingly logical and coherent to human users -- these explanations do not necessarily align with the reasoning processes of the LLMs, raising concerns about their faithfulness. We highlight that the current trend towards increasing the plausibility of explanations, primarily driven by the demand for user-friendly interfaces, may come at the cost of diminishing their faithfulness. We assert that the faithfulness of explanations is critical in LLMs employed for high-stakes decision-making. Moreover, we emphasize the need for a systematic characterization of the faithfulness-plausibility requirements of different real-world applications and for ensuring that explanations meet those needs. While there are several approaches to improving plausibility, improving faithfulness is an open challenge. We call upon the community to develop novel methods to enhance the faithfulness of self-explanations, thereby enabling transparent deployment of LLMs in diverse high-stakes settings.

Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from LLMs

The paper "Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from LLMs" presents an analytical discourse on the nuanced interplay between faithfulness and plausibility in the context of self-explanations generated by LLMs. The research focuses on the critical examination of the reliability of these self-generated explanations, which are increasingly utilized to elucidate the decision-making processes of LLMs in various applications.

Key Insights and Findings

The authors argue that LLMs are proficient at producing plausible explanations: accounts that appear coherent, contextually relevant, and convincingly logical to human users, and that therefore make interactions feel natural. This apparent advantage poses a fundamental challenge, because plausibility does not equate to faithfulness. An explanation is plausible if it seems coherent and logical to human evaluators; it is faithful only if it accurately reflects the reasoning and internal processes that actually produced the model's output. The crux of the argument is that LLM-generated explanations, even when plausible, do not necessarily reveal the true computational rationale behind the model's outputs, calling their reliability into question.
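
To make the distinction concrete, the sketch below shows one generic way to probe faithfulness behaviourally: ablate the evidence a self-explanation cites and check whether the model's answer changes. The `query_model` helper and the sentiment-classification framing are assumptions for illustration; this is not the paper's own protocol, and a plausible explanation can easily pass human review while failing such a check.

    # Illustrative faithfulness probe via evidence ablation. This is a hedged
    # sketch of a generic perturbation check, not the paper's own method.
    # `query_model` is a hypothetical stand-in for any LLM API call.

    def query_model(prompt: str) -> str:
        """Hypothetical helper: send a prompt to an LLM and return its text answer."""
        raise NotImplementedError("plug in your preferred LLM client here")


    def faithfulness_probe(text: str, cited_phrases: list[str]) -> dict:
        """Remove the phrases the model's self-explanation cited as evidence and
        check whether its answer changes. An unchanged answer suggests the cited
        evidence did not actually drive the prediction (a faithfulness red flag)."""
        original = query_model(f"Classify the sentiment of: {text}\nAnswer in one word.")

        ablated_text = text
        for phrase in cited_phrases:
            ablated_text = ablated_text.replace(phrase, "[REMOVED]")
        ablated = query_model(f"Classify the sentiment of: {ablated_text}\nAnswer in one word.")

        return {
            "original_answer": original,
            "ablated_answer": ablated,
            # True when removing the cited evidence flips the answer, i.e. the
            # explanation is at least behaviourally consistent with the model.
            "consistent": original != ablated,
        }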

The paper emphasizes that the growing trend of prioritizing plausible explanations, driven by the demand for more user-friendly AI interfaces, may undermine the critical requirement for faithfulness, especially in high-stakes decision-making scenarios such as healthcare, finance, and legal applications. In these fields, incorrect reasoning or deceptive explanations can lead to adverse outcomes.

Implications and Future Directions

The dichotomy between plausibility and faithfulness has significant implications for both the practical deployment and theoretical development of LLMs. Practically, when deploying LLMs in sensitive areas, ensuring the faithfulness of explanations is paramount. Users must be able to trust that the rationale given aligns with the model’s internal decision pathways, avoiding misplaced confidence in the AI's outputs. Theoretically, this research suggests a need for novel methodologies that focus explicitly on enhancing the faithfulness of LLM self-explanations.

The paper calls for the AI research community to develop systematic frameworks and benchmarks that can rigorously assess the faithfulness of explanations, beyond mere surface-level plausibility. It underlines the necessity for interdisciplinary research efforts aimed at integrating robust interpretability mechanisms that can dissect and reveal the genuine decision-making processes within LLMs.
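
As a rough illustration of what such a benchmark might report, the hedged sketch below aggregates the consistency flags from the probe above into a single faithfulness rate and keeps it strictly separate from human plausibility ratings; the function names and the 1-5 plausibility scale are illustrative assumptions, not an existing framework.

    # Hedged sketch of how a benchmark could report the two axes separately.
    # All names are illustrative assumptions, not an established benchmark API.
    from statistics import mean


    def faithfulness_rate(probe_results: list[dict]) -> float:
        """Fraction of probed examples whose answers changed when the cited
        evidence was ablated (the `consistent` flag from the probe above)."""
        return mean(1.0 if r["consistent"] else 0.0 for r in probe_results)


    def report(probe_results: list[dict], plausibility_ratings: list[float]) -> None:
        """Print faithfulness and plausibility as distinct numbers: a high
        plausibility rating says nothing about faithfulness."""
        print(f"faithfulness rate:       {faithfulness_rate(probe_results):.2f}")
        print(f"mean plausibility (1-5): {mean(plausibility_ratings):.2f}")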

Conclusion

In conclusion, the paper's examination of faithfulness versus plausibility underscores a vital concern in AI: the need to balance human-friendly interaction with truthful model transparency. As LLMs permeate deeper into critical sectors, ensuring that their explanations are not just superficially appealing but fundamentally truthful is both a challenge and a necessity. Future research is urged to create LLM systems whose explanations are reliable and interpretable, fostering applications that are innovative as well as dependable. This entails a concerted effort to bridge the gap between what LLMs say and how they actually process information, a task central to advancing trustworthy AI.

Authors (3)
  1. Chirag Agarwal
  2. Sree Harsha Tanneru
  3. Himabindu Lakkaraju