Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models (2402.04614v3)

Published 7 Feb 2024 in cs.CL

Abstract: LLMs are deployed as powerful tools for several NLP applications. Recent works show that modern LLMs can generate self-explanations (SEs), which elicit their intermediate reasoning steps for explaining their behavior. Self-explanations have seen widespread adoption owing to their conversational and plausible nature. However, there is little to no understanding of their faithfulness. In this work, we discuss the dichotomy between faithfulness and plausibility in SEs generated by LLMs. We argue that while LLMs are adept at generating plausible explanations -- seemingly logical and coherent to human users -- these explanations do not necessarily align with the reasoning processes of the LLMs, raising concerns about their faithfulness. We highlight that the current trend towards increasing the plausibility of explanations, primarily driven by the demand for user-friendly interfaces, may come at the cost of diminishing their faithfulness. We assert that the faithfulness of explanations is critical in LLMs employed for high-stakes decision-making. Moreover, we emphasize the need for a systematic characterization of the faithfulness-plausibility requirements of different real-world applications and for ensuring that explanations meet those needs. While there are several approaches to improving plausibility, improving faithfulness is an open challenge. We call upon the community to develop novel methods to enhance the faithfulness of self-explanations, thereby enabling transparent deployment of LLMs in diverse high-stakes settings.

Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from LLMs

The paper "Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from LLMs" presents an analytical discourse on the nuanced interplay between faithfulness and plausibility in the context of self-explanations generated by LLMs. The research focuses on the critical examination of the reliability of these self-generated explanations, which are increasingly utilized to elucidate the decision-making processes of LLMs in various applications.

Key Insights and Findings

The authors argue that LLMs are proficient at producing plausible explanations: accounts that appear coherent, contextually relevant, and convincingly logical to human users, and that therefore make interactions feel natural. This apparent advantage poses a fundamental challenge, because plausibility does not equate to faithfulness. An explanation is plausible if it seems coherent and logical to human evaluators; it is faithful only if it accurately reflects the reasoning and internal processes that actually produced the model's output. The crux of the argument is that LLM-generated explanations, even when plausible, do not necessarily reveal the true computational rationale behind the model's outputs, calling their reliability into question.
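
To make the distinction concrete, the sketch below shows one generic way to probe faithfulness behaviourally: ablate the evidence a self-explanation cites and check whether the model's answer changes. The `query_model` helper and the sentiment-classification framing are assumptions for illustration; this is not the paper's own protocol, and a plausible explanation can easily pass human review while failing such a check.

    # Illustrative faithfulness probe via evidence ablation. This is a hedged
    # sketch of a generic perturbation check, not the paper's own method.
    # `query_model` is a hypothetical stand-in for any LLM API call.

    def query_model(prompt: str) -> str:
        """Hypothetical helper: send a prompt to an LLM and return its text answer."""
        raise NotImplementedError("plug in your preferred LLM client here")


    def faithfulness_probe(text: str, cited_phrases: list[str]) -> dict:
        """Remove the phrases the model's self-explanation cited as evidence and
        check whether its answer changes. An unchanged answer suggests the cited
        evidence did not actually drive the prediction (a faithfulness red flag)."""
        original = query_model(f"Classify the sentiment of: {text}\nAnswer in one word.")

        ablated_text = text
        for phrase in cited_phrases:
            ablated_text = ablated_text.replace(phrase, "[REMOVED]")
        ablated = query_model(f"Classify the sentiment of: {ablated_text}\nAnswer in one word.")

        return {
            "original_answer": original,
            "ablated_answer": ablated,
            # True when removing the cited evidence flips the answer, i.e. the
            # explanation is at least behaviourally consistent with the model.
            "consistent": original != ablated,
        }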

The paper emphasizes that the growing trend of prioritizing plausible explanations, driven by the demand for more user-friendly AI interfaces, may undermine the critical requirement for faithfulness, especially in high-stakes decision-making scenarios such as healthcare, finance, and legal applications. In these fields, incorrect reasoning or deceptive explanations can lead to adverse outcomes.

Implications and Future Directions

The dichotomy between plausibility and faithfulness has significant implications for both the practical deployment and theoretical development of LLMs. Practically, when deploying LLMs in sensitive areas, ensuring the faithfulness of explanations is paramount. Users must be able to trust that the rationale given aligns with the model’s internal decision pathways, avoiding misplaced confidence in the AI's outputs. Theoretically, this research suggests a need for novel methodologies that focus explicitly on enhancing the faithfulness of LLM self-explanations.

The paper calls for the AI research community to develop systematic frameworks and benchmarks that can rigorously assess the faithfulness of explanations, beyond mere surface-level plausibility. It underlines the necessity for interdisciplinary research efforts aimed at integrating robust interpretability mechanisms that can dissect and reveal the genuine decision-making processes within LLMs.
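
As a rough illustration of what such a benchmark might report, the hedged sketch below aggregates the consistency flags from the probe above into a single faithfulness rate and keeps it strictly separate from human plausibility ratings; the function names and the 1-5 plausibility scale are illustrative assumptions, not an existing framework.

    # Hedged sketch of how a benchmark could report the two axes separately.
    # All names are illustrative assumptions, not an established benchmark API.
    from statistics import mean


    def faithfulness_rate(probe_results: list[dict]) -> float:
        """Fraction of probed examples whose answers changed when the cited
        evidence was ablated (the `consistent` flag from the probe above)."""
        return mean(1.0 if r["consistent"] else 0.0 for r in probe_results)


    def report(probe_results: list[dict], plausibility_ratings: list[float]) -> None:
        """Print faithfulness and plausibility as distinct numbers: a high
        plausibility rating says nothing about faithfulness."""
        print(f"faithfulness rate:       {faithfulness_rate(probe_results):.2f}")
        print(f"mean plausibility (1-5): {mean(plausibility_ratings):.2f}")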

Conclusion

In conclusion, the paper's examination of faithfulness versus plausibility underscores a vital concern in AI: the need to balance human-friendly interaction with truthful model transparency. As LLMs permeate deeper into critical sectors, ensuring that their explanations are not just superficially appealing but fundamentally truthful is both a challenge and a necessity. Future research is urged to create LLM systems whose explanations are reliable and interpretable, fostering applications that are innovative as well as dependable. This entails a concerted effort to bridge the gap between what LLMs say and how they actually process information, a task central to advancing trustworthy AI.

Authors (3)
  1. Chirag Agarwal
  2. Sree Harsha Tanneru
  3. Himabindu Lakkaraju