
Are self-explanations from Large Language Models faithful?

Published 15 Jan 2024 in cs.CL, cs.AI, and cs.LG | (2401.07927v4)

Abstract: Instruction-tuned LLMs excel at many tasks and will even explain their reasoning, so-called self-explanations. However, convincing and wrong self-explanations can lead to unsupported confidence in LLMs, thus increasing risk. Therefore, it's important to measure if self-explanations truly reflect the model's behavior. Such a measure is called interpretability-faithfulness and is challenging to perform since the ground truth is inaccessible, and many LLMs only have an inference API. To address this, we propose employing self-consistency checks to measure faithfulness. For example, if an LLM says a set of words is important for making a prediction, then it should not be able to make its prediction without these words. While self-consistency checks are a common approach to faithfulness, they have not previously been successfully applied to LLM self-explanations for counterfactual, feature attribution, and redaction explanations. Our results demonstrate that faithfulness is explanation, model, and task-dependent, showing self-explanations should not be trusted in general. For example, with sentiment classification, counterfactuals are more faithful for Llama2, feature attribution for Mistral, and redaction for Falcon 40B.

Citations (14)

Summary

  • The paper reveals that self-explanation faithfulness varies across different tasks and LLM models, affecting the reliability of generated explanations.
  • The paper employs self-consistency checks and methods like counterfactuals, importance measures, and redactions to assess interpretability-faithfulness.
  • The paper underscores the need for further research to refine self-explanation techniques and enhance user trust in LLM outputs.

Introduction

LLMs have become increasingly sophisticated and accessible to the general population, enhancing their utility across numerous applications. To complement their impressive predictive abilities, these models often generate self-explanations for their outputs, aiming to give users insight into their reasoning processes. Because these explanations can influence user trust in and reliance on LLMs, evaluating their accuracy—their "interpretability-faithfulness"—is critical. This paper asks whether LLMs such as Llama2, Falcon, and Mistral can credibly explain themselves.

Self-explanations and their Evaluation

LLMs can generate several types of self-explanations: counterfactuals, importance measures, and redactions. Counterfactual explanations minimally change the input to flip the prediction; importance measures pinpoint the words critical to a prediction; redactions remove all words considered relevant to a prediction. The paper measures the interpretability-faithfulness of these explanations with self-consistency checks—a common approach to faithfulness, applied here to LLM self-explanations for the first time. Each check tests whether removing or altering the words the model deems significant actually changes its prediction, thereby verifying the explanation's accuracy.
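The self-consistency check for importance measures can be sketched as follows. This is a minimal illustration, not the paper's implementation: `classify` is a hypothetical stand-in for an LLM inference call, here implemented as a toy heuristic so the sketch is runnable.

```python
# Self-consistency check for an importance-measure self-explanation:
# if the model claims certain words are important for its prediction,
# masking those words should change the prediction.

def classify(text: str) -> str:
    """Stand-in for an LLM inference call; returns a predicted label.
    Toy heuristic: positive iff the word 'great' appears."""
    return "positive" if "great" in text.lower() else "negative"

def mask_words(text: str, important_words: set[str], mask: str = "[REDACTED]") -> str:
    """Replace each word the model claimed was important with a mask token."""
    return " ".join(
        mask if w.lower().strip(".,!?") in important_words else w
        for w in text.split()
    )

def is_faithful(text: str, important_words: set[str]) -> bool:
    """Faithful (by this check) iff masking the claimed-important
    words changes the model's original prediction."""
    original = classify(text)
    masked = classify(mask_words(text, important_words))
    return masked != original

review = "The movie was great and the acting superb."
# Suppose the model self-explains that "great" drove its prediction:
print(is_faithful(review, {"great"}))   # masking flips the label
# An explanation naming an irrelevant word fails the check:
print(is_faithful(review, {"movie"}))   # label is unchanged
```

The same pattern extends to the other explanation types: for counterfactuals, the check is that the edited input actually receives the opposite label; for redactions, that the prediction cannot be recovered from the redacted text.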

Findings on Task and Model Dependence

The research reveals that self-explanation faithfulness depends on both the task at hand and the model used. For example, counterfactual explanations were faithful for IMDB sentiment classification while importance-measure explanations were not, with contrasting results on other tasks such as bAbI-1. Notably, the variance in faithfulness across different prompt templates was insignificant, implying that certain LLMs consistently produce unfaithful explanations regardless of the prompt's specific wording.

Conclusion

The study's findings are significant, demonstrating that the faithfulness of an LLM's self-explanations is influenced not only by the task being performed but also by inherent model-dependent factors. Consequently, LLM explanations should be treated with cautious scrutiny, especially given their implications for user trust and model understanding. The paper concludes by encouraging further research into fine-tuning LLMs for improved interpretability-faithfulness and into additional self-explanation approaches.
