Local Explanations and Self-Explanations for Assessing Faithfulness in black-box LLMs (2409.13764v1)

Published 18 Sep 2024 in cs.CL and cs.AI

Abstract: This paper introduces a novel task for assessing the faithfulness of LLMs using local perturbations and self-explanations. LLMs often require additional context to answer certain questions correctly. To identify that context, we propose a new, efficient explainability technique inspired by the commonly used leave-one-out approach. With it, we identify the parts of the input that are sufficient and necessary for the LLM to generate correct answers, and these parts serve as explanations. We then propose a faithfulness metric that compares these crucial parts with the model's self-explanations. We validate the approach on the Natural Questions dataset, demonstrating its effectiveness in explaining model decisions and assessing faithfulness.
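
To make the described pipeline concrete, below is a minimal sketch of a leave-one-out style check: drop one context sentence at a time, re-query the model, flag as necessary any sentence whose removal breaks the answer, and compare that set with the sentences the model cites in its self-explanation. The `query_llm` callable, the substring-based correctness check, and the Jaccard-style overlap score are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a leave-one-out perturbation check and a simple
# overlap-based faithfulness score. `query_llm`, the substring correctness
# test, and the Jaccard comparison are assumptions for illustration only.
from typing import Callable, List, Set


def necessary_sentences(
    question: str,
    context_sentences: List[str],
    gold_answer: str,
    query_llm: Callable[[str, str], str],
) -> Set[int]:
    """Indices of sentences whose removal makes the model answer incorrectly."""
    full_context = " ".join(context_sentences)
    # Only meaningful if the model is correct with the full context.
    if gold_answer.lower() not in query_llm(question, full_context).lower():
        return set()

    necessary = set()
    for i in range(len(context_sentences)):
        # Leave one sentence out and re-query the model.
        reduced = " ".join(s for j, s in enumerate(context_sentences) if j != i)
        answer = query_llm(question, reduced)
        if gold_answer.lower() not in answer.lower():
            necessary.add(i)  # removing sentence i breaks the answer
    return necessary


def faithfulness_score(necessary: Set[int], self_explained: Set[int]) -> float:
    """Jaccard overlap between perturbation-derived and self-explained sentences."""
    if not necessary and not self_explained:
        return 1.0
    return len(necessary & self_explained) / len(necessary | self_explained)
```

In practice, the sentence indices cited in a self-explanation would have to be extracted from the model's free-text rationale; the paper's metric presumably handles that mapping, which is omitted here.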
