Introduction
LLMs have become increasingly sophisticated and accessible to the general population, enhancing their utility across numerous applications. To complement their impressive predictive abilities, these models are often asked to generate self-explanations for their outputs, aiming to give users insight into their reasoning processes. Because these explanations can influence how much users trust and rely on LLMs, evaluating their accuracy, or "interpretability-faithfulness", is critical. This paper examines whether LLMs such as Llama2, Falcon, and Mistral can explain themselves credibly.
Self-explanations and their Evaluation
LLMs can generate several types of self-explanations: counterfactuals, importance measures, and redactions. A counterfactual explanation minimally edits the input so that the prediction flips to the opposite label; an importance measure pinpoints the words most critical to the prediction; a redaction removes every word the model considers informative about the prediction. To measure the interpretability-faithfulness of these explanations, the paper proposes 'self-consistency checks': removing or altering the words the model itself deems significant and testing whether the model's prediction changes as the explanation implies, thereby verifying the explanation's accuracy.
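To make the self-consistency idea concrete, the following Python sketch shows how such checks could be implemented for counterfactual and importance-measure explanations on a sentiment task. It is a minimal illustration under stated assumptions, not the paper's code: the `query_llm` callable, the prompt wording, and the positive/negative label set are all hypothetical stand-ins for whatever model endpoint and templates are actually used.

```python
from typing import Callable

# Hypothetical interface: a callable that sends a prompt to an LLM
# (e.g. a local Llama2, Falcon, or Mistral endpoint) and returns its text answer.
QueryFn = Callable[[str], str]


def classify_sentiment(query_llm: QueryFn, text: str) -> str:
    """Ask the model to label a review as 'positive' or 'negative'."""
    answer = query_llm(
        f"Review: {text}\n"
        "What is the sentiment of this review? "
        "Answer with only 'positive' or 'negative'."
    )
    return answer.strip().lower()


def counterfactual_is_faithful(query_llm: QueryFn, text: str) -> bool:
    """Self-consistency check for counterfactuals: the model's own minimal
    edit should receive the opposite label when classified again."""
    label = classify_sentiment(query_llm, text)
    opposite = "negative" if label == "positive" else "positive"
    edited = query_llm(
        f"Review: {text}\n"
        f"The sentiment is {label}. Edit the review as little as possible so "
        "that the sentiment becomes the opposite. Return only the edited review."
    )
    return classify_sentiment(query_llm, edited) == opposite


def importance_is_faithful(query_llm: QueryFn, text: str) -> bool:
    """Self-consistency check for importance measures: redacting the words
    the model itself calls important should change its prediction."""
    label = classify_sentiment(query_llm, text)
    listed = query_llm(
        f"Review: {text}\n"
        "List, comma-separated, the words most important for its sentiment."
    )
    important = {w.strip().lower() for w in listed.split(",") if w.strip()}
    redacted = " ".join(
        "[REDACTED]" if w.strip(".,!?").lower() in important else w
        for w in text.split()
    )
    return classify_sentiment(query_llm, redacted) != label
```

Note that the same model both produces the explanation and re-classifies the modified input, so the check measures consistency with the model's own behaviour rather than agreement with any external ground truth.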
Findings on Task and Model Dependence
The research reveals that self-explanation faithfulness varies with both the task at hand and the model used. For example, on the IMDB sentiment classification task counterfactual explanations were found to be faithful while importance-measure explanations were not, and the pattern differed for other tasks such as bAbI-1. Notably, the variance in faithfulness across different prompt templates was negligible, implying that certain LLMs consistently provide non-faithful explanations regardless of the prompt's specific wording.
Conclusion
The paper's findings are significant, demonstrating that the faithfulness of an LLM's self-explanations is influenced not only by the task being performed but also by inherent, model-dependent factors. Consequently, the reliability of LLM explanations should be scrutinized with caution, especially given their implications for user trust and model understanding. The paper concludes by encouraging further research into fine-tuning LLMs for improved interpretability-faithfulness and by suggesting the exploration of additional self-explanation approaches.