Are self-explanations from Large Language Models faithful? (2401.07927v4)

Published 15 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Instruction-tuned LLMs excel at many tasks and will even explain their reasoning, so-called self-explanations. However, convincing and wrong self-explanations can lead to unsupported confidence in LLMs, thus increasing risk. Therefore, it's important to measure if self-explanations truly reflect the model's behavior. Such a measure is called interpretability-faithfulness and is challenging to perform since the ground truth is inaccessible, and many LLMs only have an inference API. To address this, we propose employing self-consistency checks to measure faithfulness. For example, if an LLM says a set of words is important for making a prediction, then it should not be able to make its prediction without these words. While self-consistency checks are a common approach to faithfulness, they have not previously been successfully applied to LLM self-explanations for counterfactual, feature attribution, and redaction explanations. Our results demonstrate that faithfulness is explanation, model, and task-dependent, showing self-explanations should not be trusted in general. For example, with sentiment classification, counterfactuals are more faithful for Llama2, feature attribution for Mistral, and redaction for Falcon 40B.


Introduction

LLMs have become more sophisticated and more accessible to the general public, which has broadened their use across many applications. Alongside their predictions, these models often generate self-explanations that aim to give users insight into their reasoning. Because these explanations can influence how much users trust and rely on LLMs, it is critical to evaluate whether they reflect the model's actual behavior, a property the paper calls "interpretability-faithfulness". The paper examines whether LLMs such as Llama2, Falcon, and Mistral explain themselves faithfully.

Self-explanations and their Evaluation

The paper studies three types of self-explanation: counterfactuals, importance measures (feature attribution), and redactions. A counterfactual explanation changes the input minimally so that the prediction flips to the opposite label; an importance measure identifies the words most critical to the prediction; a redaction removes all words relevant to the prediction. To measure the interpretability-faithfulness of these explanations, the paper proposes using self-consistency checks, a common approach to faithfulness that had not previously been applied successfully to LLM self-explanations. The check feeds the edited input back to the same model: if the words the model identified as important are removed or altered, the model should no longer be able to make its original prediction. If it still can, the explanation is not faithful.
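The self-consistency idea can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `toy_classify` is a hypothetical keyword-based stand-in for an LLM inference call, the word lists and the `[REDACTED]` token are assumptions, and the explanations (the redacted and counterfactual texts) are written by hand here rather than generated by a model.

```python
from typing import Callable

# Hypothetical sentiment vocabulary for the toy stand-in model (assumption).
POSITIVE = {"great", "wonderful", "loved"}
NEGATIVE = {"terrible", "boring", "hated"}


def toy_classify(text: str) -> str:
    """Stand-in for an LLM classification call; returns a label string."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "unknown"


def redaction_is_faithful(classify: Callable[[str], str], redacted_text: str) -> bool:
    """A redaction passes the check if the same model can no longer
    classify the redacted input."""
    return classify(redacted_text) == "unknown"


def counterfactual_is_faithful(
    classify: Callable[[str], str], counterfactual_text: str, target_label: str
) -> bool:
    """A counterfactual passes the check if the same model predicts the
    requested opposite label on the edited input."""
    return classify(counterfactual_text) == target_label


review = "The acting was wonderful and I loved the ending."

# Hand-written explanations standing in for model-generated ones (assumption).
good_redaction = "The acting was [REDACTED] and I [REDACTED] the ending."
bad_redaction = "The [REDACTED] was wonderful and I loved the ending."
counterfactual = "The acting was terrible and I hated the ending."

print(toy_classify(review))
print(redaction_is_faithful(toy_classify, good_redaction))
print(redaction_is_faithful(toy_classify, bad_redaction))
print(counterfactual_is_faithful(toy_classify, counterfactual, "negative"))
```

With a real model, `classify` would wrap an inference API call, and the redacted or counterfactual text would come from prompting that same model for an explanation. The check needs only model outputs, which is why it also works for models exposed solely through an inference API.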

Findings on Task and Model Dependence

The research shows that self-explanation faithfulness depends on the explanation type, the task, and the model. For example, on IMDB sentiment classification, counterfactual explanations were found to be faithful while importance measure explanations were not, with contrasting results on other tasks such as bAbI-1. The abstract adds a per-model view for sentiment classification: counterfactuals are more faithful for Llama2, feature attribution for Mistral, and redaction for Falcon 40B. Notably, faithfulness varied little across prompt templates, which suggests that certain LLMs consistently give non-faithful explanations regardless of the prompt's specific wording.

Conclusion

The findings show that the faithfulness of an LLM's self-explanations depends on the explanation type, the task, and the model. Consequently, self-explanations should not be trusted in general, particularly given their influence on user trust and model understanding. The paper concludes by encouraging further research on fine-tuning LLMs for improved interpretability-faithfulness and on additional self-explanation approaches.