Bridging Neural Activity and Self-Explanations in LLMs
Large language models (LLMs) have become remarkably proficient at generating free-text explanations to justify their predictions. Known as self-Natural Language Explanations (self-NLEs), these explanations ostensibly describe the reasoning behind a model's output. However, their faithfulness, that is, whether they accurately reflect the model's internal decision-making process, remains an open concern. The paper "Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in LLMs" addresses this issue by proposing a framework for measuring the faithfulness of self-NLEs through direct analysis of the model's internal neural states.
Framework Overview
The authors introduce "NeuroFaith," a flexible framework designed to quantify the faithfulness of self-NLEs by comparing them with interpretations derived from the model's hidden states. NeuroFaith concurs that self-NLEs should mirror the internal reasoning pathways within the model, thus rendering explanations more faithful to the underlying computations. The framework comprises three core components: location, circuit, and interpreter.
- Location: This specifies which part of the Transformer architecture is examined, typically the residual stream (RS), multi-head attention (MHA), or multi-layer perceptron (MLP). Each choice offers a different view of how the model processes information.
- Circuit: A circuit is a sparse, ordered subgraph of the model containing the units involved in the prediction task, identified through manual analysis or automated discovery techniques. Restricting analysis to a circuit ensures that the relevant sub-computation, rather than the whole network, is examined.
- Interpreter: The interpreter translates hidden states into human-readable outputs. Concept-based interpretations can be produced with sparse autoencoders, while free-text interpretations rely on methods such as Selfie and Patchscopes, which use LLMs to generate textual descriptions of hidden states.
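To make the composition concrete, here is a minimal Python sketch, not the authors' code, of how a location, a circuit, and an interpreter could be wired together. The class names, the site labels, and the `decode_fn` hook are illustrative assumptions.

```python
# Minimal sketch of NeuroFaith-style components; names and signatures are
# illustrative assumptions, not the paper's API.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

import torch


@dataclass
class Location:
    """Where in the Transformer to read a hidden state."""
    layer: int
    site: str  # assumed labels: "residual_stream", "mha", or "mlp"


@dataclass
class Circuit:
    """A sparse, ordered set of locations tied to the prediction task."""
    locations: List[Location]


class Interpreter:
    """Maps a hidden state to a human-readable interpretation (free text here)."""
    def __init__(self, decode_fn: Callable[[torch.Tensor], str]):
        self.decode_fn = decode_fn  # e.g. a Selfie/Patchscopes-style decoder

    def interpret(self, hidden_state: torch.Tensor) -> str:
        return self.decode_fn(hidden_state)


class NeuroFaithSketch:
    """Reads interpretations off the circuit so they can be compared to a self-NLE."""
    def __init__(self, circuit: Circuit, interpreter: Interpreter):
        self.circuit = circuit
        self.interpreter = interpreter

    def neural_interpretations(
        self, hidden_states: Dict[Tuple[int, str], torch.Tensor]
    ) -> List[Tuple[Location, str]]:
        # hidden_states: (layer, site) -> tensor, collected via forward hooks
        return [
            (loc, self.interpreter.interpret(hidden_states[(loc.layer, loc.site)]))
            for loc in self.circuit.locations
        ]
```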
Evaluating Faithfulness
NeuroFaith measures faithfulness by analyzing the consistency between self-NLEs and neural interpretations. Faithfulness can be evaluated locally and globally, with local faithfulness dependent on whether the self-NLE aligns with specific neural interpretations at designated layers and indices. The global measure aggregates local faithfulness across the model's circuitry, offering a more comprehensive view of an explanation's fidelity.
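As an illustration, the sketch below shows one way local and global faithfulness could be computed. The consistency check (a simple case-insensitive mention test) and the mean aggregation are assumptions made for exposition, not the paper's exact scoring.

```python
# Hedged sketch of local vs. global faithfulness under a simple consistency check.
from typing import List, Tuple


def local_faithfulness(self_nle: str, interpretation: str) -> bool:
    """True if the self-NLE is consistent with the interpretation at one location.
    Here 'consistent' is approximated by a case-insensitive mention (assumption)."""
    return interpretation.lower() in self_nle.lower()


def global_faithfulness(self_nle: str,
                        interpretations: List[Tuple[object, str]]) -> float:
    """Aggregate local faithfulness across the circuit (mean, by assumption)."""
    scores = [local_faithfulness(self_nle, text) for _, text in interpretations]
    return sum(scores) / len(scores) if scores else 0.0
```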
Application in Multi-Hop Reasoning
The paper applies NeuroFaith to 2-hop reasoning tasks, assessing both the correctness of predictions and the faithfulness of explanations with respect to bridge objects, the intermediate entities that connect the two reasoning steps (for instance, in "In which country is the city where X was born?", the birth city is the bridge object). This distinguishes cases where the model reliably reasons through the bridge object from cases of shortcut learning, where the model answers correctly without an explicit intermediate reasoning step.
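The sketch below illustrates one way such a case analysis could be coded; the category labels and the decision rule are illustrative assumptions rather than the paper's exact taxonomy.

```python
# Illustrative triage of a 2-hop example given where the bridge object shows up.
def classify_two_hop(prediction_correct: bool,
                     bridge_in_neural: bool,
                     bridge_in_self_nle: bool) -> str:
    """Classify a 2-hop example (assumed labels, for exposition only)."""
    if prediction_correct and bridge_in_neural and bridge_in_self_nle:
        return "faithful reasoning"       # explanation matches the internal bridge
    if prediction_correct and not bridge_in_neural:
        return "possible shortcut"        # correct answer without an explicit hop
    if prediction_correct and bridge_in_neural and not bridge_in_self_nle:
        return "unfaithful explanation"   # internal hop not reported in the self-NLE
    return "incorrect prediction"
```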
Experimental Findings
Experiments on the Wikidata-2-hop dataset with the Gemma-2-2B and Gemma-2-9B models show varying degrees of correctness and faithfulness. Notably, the larger model tends to achieve better predictive performance while still exhibiting shortcut learning, highlighting a complex interaction between model size and reasoning fidelity.
Implications and Future Directions
The implications of this research extend across both practical and theoretical dimensions. By providing a structured methodology for decoding LLM reasoning processes, NeuroFaith can improve transparency and support model alignment. The framework could be adapted to a variety of language tasks and combined with concept-based or other explanatory approaches, moving toward more explainable AI. Future work might focus on refining circuit discovery and exploring more sophisticated interpreters to give LLMs more accurate introspective capabilities. As LLMs continue to spread across diverse applications, robust interpretability will remain a crucial requirement for their responsible deployment and integration.