Explainability for Large Language Models

Updated 19 March 2026

Explainability for Large Language Models is a field that defines model transparency by quantifying aspects such as faithfulness, truthfulness, plausibility, and contrastivity.
It employs diverse methods including feature-attribution, attention analysis, counterfactual editing, and mechanistic interpretability to elucidate complex model behaviors.
Empirical evaluations use metrics like human–reasoning agreement, fidelity scores, and sensitivity tests to balance trade-offs between interpretability, stability, and performance.

LLM explainability addresses the challenge of rendering model behaviors, internal mechanisms, and output rationales intelligible to expert humans. Despite sustained advances in model scale, accuracy, and deployment across domains, LLMs remain high-dimensional black boxes whose predictions can be brittle, nontransparent, and difficult to interpret, even across closely related models or fine-tuned variants. The field has thus evolved rapidly, bringing methodological innovations, rigorous statistical frameworks, domain-specific applications, and growing recognition of the epistemic, ethical, and regulatory importance of explanation in contemporary AI.

1. Formalisms and Foundational Principles

The explainability of LLMs is formally defined along multiple axes that capture what constitutes a satisfactory explanation in technical, human, and regulatory senses. Recent work synthesizes four principal dimensions (Herrera, 18 May 2025):

Faithfulness: The degree to which explanations accurately reflect the internal computations and causal logic of the model. Formally, for model $M$, input $x$, and explanation $e$, faithfulness can be operationalized as $\mathrm{Faith}(M,x,e)=\mathrm{sim}(M(x),\,M(e(x)))$, where $\mathrm{sim}$ quantifies output or latent similarity under explanation-driven perturbations.
Truthfulness: The alignment of output and explanations with external, ground-truth facts and the absence of hallucinated content. It can be quantified by the fraction of hallucinated claims, e.g., $\mathrm{Truth}(M(x))=1-\frac{\#\{\text{hallucinated claims in }M(x)\}}{\#\{\text{claims in }M(x)\}}$.
Plausibility: The extent to which explanations are coherent and convincing to human readers, regardless of internal model alignment. Human–Reasoning Agreement (HRA) expresses this as $\mathrm{Plau}(e)=\mathbb{E}_{(x,r)}[\mathrm{sim}(e(x),r)]$ for annotated rationales $r$.
Contrastivity: The degree to which explanations clarify the factors driving a decision $y$ over an alternative $y'$, which can be formalized as $\Delta_e(y,y')=\phi(y)-\phi(y')$, with $\phi$ representing explanation vectors (e.g., SHAP values).
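The four definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the choice of cosine and Jaccard as the similarity function $\mathrm{sim}$, the toy model `toy_model`, and the toy explainer `toy_explain` are all assumptions made for the example, and the hallucination flags are assumed to come from some external fact-checking step.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length score vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def jaccard(a, b):
    """Set overlap, used here as one possible stand-in for sim(e(x), r)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def faithfulness(model, x, explain):
    """Faith(M, x, e) = sim(M(x), M(e(x))), with cosine as the assumed sim."""
    return cosine(model(x), model(explain(x)))

def truthfulness(hallucinated_flags):
    """1 - (#hallucinated claims / #claims); one boolean flag per claim."""
    return 1.0 - sum(hallucinated_flags) / len(hallucinated_flags)

def plausibility(pairs, sim=jaccard):
    """Mean similarity between explanations and human rationales."""
    return sum(sim(e, r) for e, r in pairs) / len(pairs)

def contrastivity(phi_y, phi_y_alt):
    """Delta_e(y, y') = phi(y) - phi(y'), element-wise."""
    return [a - b for a, b in zip(phi_y, phi_y_alt)]

# Hypothetical toy model: returns [positive evidence, negative evidence].
TOY_WEIGHTS = {"great": 2.0, "dull": -2.0}

def toy_model(tokens):
    weights = [TOY_WEIGHTS.get(t, 0.0) for t in tokens]
    return [sum(w for w in weights if w > 0), -sum(w for w in weights if w < 0)]

def toy_explain(tokens):
    """Hypothetical explainer: keep only tokens the toy model reacts to."""
    return [t for t in tokens if t in TOY_WEIGHTS]

faith = faithfulness(toy_model, ["the", "film", "was", "great"], toy_explain)
truth = truthfulness([False, True, False, False])
plaus = plausibility([(["a", "b"], ["a", "b"]), (["a"], ["b"])])
delta = contrastivity([0.5, 0.25], [0.25, 0.25])
```

In this toy setting the explanation keeps every token the model depends on, so the faithfulness score is maximal; a real explainer on a real LLM would generally score lower.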

Tensions between these objectives are formalized as a multi-objective optimization problem: explanations that maximize faithfulness may not maximize plausibility or contrastivity, and vice versa. In some cases the competing objectives define an explicit trade-off surface (Herrera, 18 May 2025).
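One way to make the trade-off concrete is to compute the Pareto front over candidate explanations, i.e. the set of explanations that no other candidate beats on every objective at once. The sketch below assumes each candidate has already been scored on (faithfulness, plausibility, contrastivity); the method names and all scores are invented for illustration and are not results from the cited work.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Keep candidates that no other candidate dominates.

    candidates maps a name to (faithfulness, plausibility, contrastivity).
    """
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(dominates(other, scores) for other in candidates.values())
    }

# Hypothetical scores, for illustration only.
candidates = {
    "attention_heatmap": (0.40, 0.80, 0.30),
    "integrated_gradients": (0.85, 0.45, 0.50),
    "free_text_rationale": (0.30, 0.90, 0.20),
    "random_baseline": (0.10, 0.20, 0.10),
}
front = pareto_front(candidates)
```

Here the random baseline is dominated and drops out, while the remaining three candidates each win on at least one axis, which is the situation the trade-off surface describes.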

Definitions also distinguish local explanations, which target a single prediction, from global explanations, which capture features or knowledge encoded across the parameter distribution (Zhao et al., 2023, Luo et al., 2024, Palikhe et al., 26 Jun 2025). For an LLM $f:\mathcal{X}\rightarrow\mathbb{R}^C$ and input $x=(x_1,...,x_n)$, local explanations attribute a relevance $R_i(x)$ to each token $x_i$ (such that $\sum_i R_i(x)\approx f_c(x)$ for the explained output component $c$), while global importance averages these attributions over a data distribution.
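The local/global distinction can be illustrated with a toy additive scorer, for which token relevances sum exactly to the model output. The weights in `TOKEN_WEIGHTS` and the two-document corpus are hypothetical, and averaging absolute relevance per document is only one of several reasonable ways to aggregate local attributions into a global score.

```python
from collections import defaultdict

# Hypothetical token weights standing in for a trained model.
TOKEN_WEIGHTS = {"great": 1.5, "boring": -2.0, "plot": 0.25}

def score(tokens):
    """Toy additive model output f(x) for a single class."""
    return sum(TOKEN_WEIGHTS.get(t, 0.0) for t in tokens)

def local_attribution(tokens):
    """Per-token relevance R_i(x); exact here because the model is additive."""
    return [(t, TOKEN_WEIGHTS.get(t, 0.0)) for t in tokens]

def global_importance(corpus):
    """Mean absolute relevance of each token across all documents."""
    totals = defaultdict(float)
    for tokens in corpus:
        for token, relevance in local_attribution(tokens):
            totals[token] += abs(relevance)
    return {token: total / len(corpus) for token, total in totals.items()}

corpus = [["great", "plot"], ["boring", "plot"]]
local = local_attribution(corpus[0])
importance = global_importance(corpus)
```

For a real LLM the sum of relevances only approximates the output, and the size of that gap is itself a useful diagnostic.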

2. Methodological Taxonomy

LLM explainability techniques are categorized by both model architecture (encoder-only, decoder-only, encoder–decoder) and explanatory paradigm (ante-hoc vs. post-hoc; fine-tuning vs. prompting) (Palikhe et al., 26 Jun 2025, Zhao et al., 2023, Mumuni et al., 17 Jan 2025):

Feature-Attribution and Input Saliency

Gradient-based: Saliency scores $s_j=\frac{\partial f(x)}{\partial x_j}$ or gradient × input scores $R_j=x_j\cdot\frac{\partial f(x)}{\partial x_j}$, with integrated gradients (IG) capturing path-integrated attributions from a baseline (Zhao et al., 2023, Mumuni et al., 17 Jan 2025, Luo et al., 2024).
Perturbation-based: Systematically mask or occlude tokens ($f_{\setminus\mathcal{M}}(x)$) and observe the change in prediction. LIME fits a local surrogate linear model to such perturbations, while SHAP estimates Shapley values by marginalizing over token subsets (Dehghani et al., 27 May 2025, Mumuni et al., 17 Jan 2025).
Decomposition: Layer-wise relevance propagation (LRP) propagates the class probability backward through the network layer by layer, conserving total relevance at each layer (Bogaert et al., 2024, Bogaert et al., 2024).
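The gradient-based and perturbation-based scores above can be compared on a small differentiable function. The sketch below is an assumption-laden toy: `f` is a three-feature logistic model standing in for an LLM output, gradients are taken numerically by central differences rather than by automatic differentiation, and the all-zeros baseline is one common but arbitrary choice.

```python
import math

def f(x):
    """Toy smooth scalar model: a logistic unit with fixed, made-up weights."""
    w = [1.0, -2.0, 0.5]
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def grad(func, x, eps=1e-5):
    """Central-difference gradient of func at x."""
    g = []
    for j in range(len(x)):
        hi, lo = list(x), list(x)
        hi[j] += eps
        lo[j] -= eps
        g.append((func(hi) - func(lo)) / (2 * eps))
    return g

def grad_x_input(func, x):
    """Gradient x input: R_j = x_j * df/dx_j."""
    return [xj * gj for xj, gj in zip(x, grad(func, x))]

def integrated_gradients(func, x, baseline, steps=200):
    """Midpoint-rule approximation of IG along the straight path to x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + gj for t, gj in zip(total, grad(func, point))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

def occlusion(func, x, fill=0.0):
    """Drop in output when feature j is replaced by a fill value."""
    base = func(x)
    return [base - func(x[:j] + [fill] + x[j + 1:]) for j in range(len(x))]

x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
gxi = grad_x_input(f, x)
ig = integrated_gradients(f, x, baseline)
occ = occlusion(f, x)
```

A useful sanity check is the completeness property of integrated gradients: the attributions should sum to approximately `f(x) - f(baseline)`, which neither gradient × input nor occlusion guarantees.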

Attention and Representation Analysis

Raw attention scores and variants (gradient × attention, attention rollout) are employed for both visualization and interpretation, though their faithfulness as explanations remains debated (Zhao et al., 2023, Mumuni et al., 17 Jan 2025).
Probing: Freeze LLM weights, train simple classifiers to predict linguistic or factual properties from hidden states $h^l(x)$ , yielding layer- and token-specific global insights (Palikhe et al., 26 Jun 2025, Zhao et al., 2024).

Example- and Counterfactual-based Methods

Adversarial/counterfactual editing: Identify minimal input changes that induce label flips (e.g., CREST, Polyjuice), enabling contrastive explanations (Zhao et al., 2023, Randl et al., 2024).
Self-explanations: Chain-of-thought (CoT) prompting generates intermediate reasoning steps, which serve as extractive or counterfactual rationales. Counterfactual self-explanations prompt the model for minimally perturbed texts altering its own predictions and permit direct faithfulness testing (Cahlik et al., 14 Mar 2025, Randl et al., 2024).
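The counterfactual logic above can be sketched with a toy classifier: search for single-token substitutions that flip the label, and check whether a proposed counterfactual actually flips the model's own prediction. The lexicon classifier `predict` and the candidate list are hypothetical stand-ins; tools such as CREST or Polyjuice generate edits with learned models rather than by exhaustive substitution.

```python
# Hypothetical sentiment lexicon standing in for a real classifier.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "awful": -2.0}

def predict(tokens):
    """Toy classifier: positive unless the summed lexicon score is negative."""
    return "pos" if sum(LEXICON.get(t, 0.0) for t in tokens) >= 0 else "neg"

def single_token_counterfactuals(tokens, candidates):
    """All one-token substitutions that change the predicted label."""
    original = predict(tokens)
    found = []
    for i in range(len(tokens)):
        for cand in candidates:
            if cand == tokens[i]:
                continue
            edited = tokens[:i] + [cand] + tokens[i + 1:]
            if predict(edited) != original:
                found.append((i, cand, edited))
    return found

def is_valid_counterfactual(original, proposed):
    """Faithfulness check: does the proposed edit really flip the model?"""
    return predict(original) != predict(proposed)

tokens = ["the", "food", "was", "good"]
flips = single_token_counterfactuals(tokens, ["bad", "fine"])
valid = is_valid_counterfactual(tokens, ["the", "food", "was", "awful"])
invalid = is_valid_counterfactual(tokens, ["the", "food", "was", "fine"])
```

The last two lines mirror the self-explanation test: an edit that reads like a plausible counterfactual ("fine") still fails the check if the model's prediction does not change.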

Mechanistic Interpretability

Circuit discovery, activation patching, cross-layer tracing: Explicit subnetwork and circuit extraction, activation patching (restoring/ablating intermediate activations), and functional attribution to attention heads/neuron clusters (Zhao et al., 2024, Atakishiyev et al., 20 Oct 2025, Zhang et al., 24 May 2025). Hierarchical frameworks (MSMA) decompose hidden states into nested semantic manifolds—local (word), intermediate (sentence), and global (discourse)—with geometric and information-theoretic alignment across scales (Zhang et al., 24 May 2025).
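Activation patching can be illustrated on a tiny network: run a clean and a corrupted input, copy one hidden activation from the clean run into the corrupted run, and record how far the output moves. The weights `W` and `V` and both inputs are made up for the example; in practice the patched components would be attention heads or MLP activations inside a transformer.

```python
# Hypothetical weights for a 2-input, 3-hidden-unit, 1-output ReLU network.
W = [[1.0, -1.0], [0.5, 0.5], [0.0, 2.0]]
V = [2.0, -1.0, 0.5]

def hidden(x):
    """ReLU hidden activations."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]

def readout(h):
    """Linear output layer."""
    return sum(v * hi for v, hi in zip(V, h))

def patch_effects(clean_x, corrupted_x):
    """Output change when each hidden unit is restored to its clean value."""
    clean_h, corrupted_h = hidden(clean_x), hidden(corrupted_x)
    base = readout(corrupted_h)
    effects = []
    for k in range(len(clean_h)):
        patched = list(corrupted_h)
        patched[k] = clean_h[k]  # restore one activation from the clean run
        effects.append(readout(patched) - base)
    return effects

clean_x, corrupted_x = [1.0, 0.0], [0.0, 1.0]
effects = patch_effects(clean_x, corrupted_x)
total_gap = readout(hidden(clean_x)) - readout(hidden(corrupted_x))
```

Unit 0 carries most of the difference between the two runs, and unit 1 carries none. The per-unit effects add up to the total gap here only because the readout is linear; in deeper networks, interactions between components mean they generally do not.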

Model-Agnostic Statistical Approaches

Context-Length Probing: Quantifies importance of individual context tokens in causal LMs by measuring changes in output distributions as context is truncated (Cífka et al., 2022).
SMILE: Input perturbation followed by weighted regression on output shift (measured via Wasserstein/ECDF distances), producing token-level importances and heat maps for any LLM, regardless of internals (Dehghani et al., 27 May 2025).
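The idea behind context-length probing can be sketched by growing the context one token at a time, starting from the token nearest the target, and recording how much each added token changes the model's score for the target. This is a simplification: the sketch tracks the log-probability of a single target token rather than the full output distribution, and `toy_logprob` with its `CUES` table is a hypothetical stand-in for a causal LM.

```python
import math

# Hypothetical cue strengths standing in for a causal language model.
CUES = {"Paris": {"France": 2.0, "capital": 1.0}}

def toy_logprob(context, target):
    """Log-sigmoid of summed cue strengths for the target token."""
    logit = sum(CUES.get(target, {}).get(tok, 0.0) for tok in context)
    return -math.log1p(math.exp(-logit))

def context_length_scores(tokens, target, logprob=toy_logprob):
    """Change in target log-probability as each earlier token is added."""
    scores = []
    previous = logprob([], target)
    for length in range(1, len(tokens) + 1):
        context = tokens[len(tokens) - length:]
        current = logprob(context, target)
        scores.append((context[0], current - previous))
        previous = current
    return scores

tokens = ["The", "capital", "of", "France", "is"]
scores = context_length_scores(tokens, "Paris")
important = [tok for tok, delta in scores if delta > 0]
```

The scores are listed from the nearest context token to the farthest, and they telescope: their sum equals the difference between the full-context and empty-context log-probabilities.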

Ontological and Concept-Bottleneck Techniques

Grounding explanations in curated ontologies enables explicit alignment between model predictions, domain concepts, and logical inference, and supports rigorous rationalization and compliance (Amara et al., 2024). Concept-bottleneck models insert a human-interpretable layer of discrete concepts as an interface to the prediction module (Atakishiyev et al., 20 Oct 2025).
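A concept-bottleneck pipeline can be sketched as two stages: map the input to named concepts, then predict from the concepts alone, so the active concepts double as the explanation. In the toy below, keyword rules stand in for a learned concept predictor, and the ticket-routing task, concept names, weights, and threshold are all hypothetical.

```python
# Hypothetical concept detectors; a real model would learn these.
CONCEPT_RULES = {
    "mentions_refund": lambda text: "refund" in text.lower(),
    "mentions_charge": lambda text: "charge" in text.lower(),
    "mentions_delivery": lambda text: "deliver" in text.lower(),
}
# Hypothetical weights of the label predictor over concepts.
LABEL_WEIGHTS = {"mentions_refund": 1.0, "mentions_charge": 1.0, "mentions_delivery": -1.0}

def extract_concepts(text):
    """Stage 1: input -> human-readable boolean concepts."""
    return {name: rule(text) for name, rule in CONCEPT_RULES.items()}

def predict_from_concepts(concepts):
    """Stage 2: label depends only on the concepts, which form the explanation."""
    active = [name for name, on in concepts.items() if on]
    score = sum(LABEL_WEIGHTS[name] for name in active)
    return ("billing" if score >= 2.0 else "other"), active

concepts = extract_concepts("I want a refund for the double charge")
label, explanation = predict_from_concepts(concepts)

# Intervention: a reviewer overrides one concept and re-runs stage 2.
corrected = dict(concepts, mentions_refund=False)
label_after, _ = predict_from_concepts(corrected)
```

Because the label depends only on the concept layer, a human can correct a single concept and see the downstream prediction change, which is the main practical appeal of the design.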

3. Empirical Evaluation, Metrics, and Benchmarks

Quantitative and qualitative assessment of LLM explanations utilizes multiple metrics and benchmark protocols:

Plausibility: Agreement with human-annotated rationales, using IoU, F1, AUPRC, or ranking metrics such as Kendall's $\tau$ (Zhao et al., 2023, Palikhe et al., 26 Jun 2025). Human–Reasoning Agreement (HRA) is central (Herrera, 18 May 2025).
Faithfulness: Deletion/insertion curves, measuring output change as highly-attributed tokens are masked/inserted. Counterfactual validity, perturbation tests, and prediction-flip rates are also routine (Randl et al., 2024, Luo et al., 2024).
Stability/Robustness: Variance (or Jaccard similarity) of explanations across fine-tuning seeds, input perturbations, or independent runs (Bogaert et al., 2024, Dehghani et al., 27 May 2025).
Fidelity/Surrogate Accuracy: R² between local surrogate (e.g., SMILE, LIME) and true output differences under perturbation (Dehghani et al., 27 May 2025).
Model- and Explanation-Level Scores: The BELL benchmark computes aggregate explainability via averaged coherence, uncertainty, and cosine similarity, penalized by hallucination rates (Ahmed et al., 22 Apr 2025).
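Several of the metrics listed above reduce to short functions. The sketch below covers token-level IoU and F1 (plausibility), a deletion curve (faithfulness), top-k Jaccard overlap between two attribution vectors (stability), and R² (surrogate fidelity). The toy scorer `toy_score` and all inputs are assumptions for illustration; real evaluations mask tokens in an actual model and average over a dataset.

```python
def iou(pred, gold):
    """Intersection over union of two token (or index) sets."""
    pred, gold = set(pred), set(gold)
    return len(pred & gold) / len(pred | gold) if pred | gold else 1.0

def token_f1(pred, gold):
    """F1 between predicted and annotated rationale tokens."""
    pred, gold = set(pred), set(gold)
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def deletion_curve(model, tokens, attributions, mask="[MASK]"):
    """Model output as tokens are masked from most to least attributed."""
    order = sorted(range(len(tokens)), key=lambda i: attributions[i], reverse=True)
    current = list(tokens)
    curve = [model(current)]
    for i in order:
        current[i] = mask
        curve.append(model(current))
    return curve

def topk_jaccard(attr_a, attr_b, k):
    """Overlap of the k highest-attributed positions in two explanations."""
    def top(attr):
        return set(sorted(range(len(attr)), key=lambda i: attr[i], reverse=True)[:k])
    return iou(top(attr_a), top(attr_b))

def r_squared(y_true, y_pred):
    """Coefficient of determination between model and surrogate outputs."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical additive scorer used only to exercise deletion_curve.
def toy_score(tokens):
    return sum({"good": 2.0, "very": 0.5}.get(t, 0.0) for t in tokens)

curve = deletion_curve(toy_score, ["very", "good", "film"], [0.5, 2.0, 0.0])
stability = topk_jaccard([0.9, 0.1, 0.5], [0.2, 0.8, 0.6], k=2)
```

A steep early drop in the deletion curve suggests the attribution ranks truly influential tokens first, while a low top-k Jaccard between two runs signals unstable explanations.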

Standard datasets include ZsRE, CounterFact, TruthfulQA, RealToxicityPrompts, and FairPrism, each probing distinct axes of factuality, harmfulness, or fairness (Luo et al., 2024, Mumuni et al., 17 Jan 2025).

4. Empirical Findings and Core Challenges

Empirical results underline critical phenomena and trade-offs:

Sensitivity to Training Randomness: Two studies (Bogaert et al., 2024, Bogaert et al., 2024) show that even when LLMs achieve nearly identical accuracies, word-level attribution explanations (e.g., via LRP) can exhibit high variance across fine-tuning seeds; by contrast, deterministic feature-based models give stable but less accurate explanations.
Signal-to-Noise in Explanations: Under the (1,1,1) paradigm (word-level, univariate, first-order summaries), transformer LLM explanations have lower between-word signal and far higher within-word noise than simple baselines, yielding SNR $<1$ (≈0.25), such that randomness-induced noise overwhelms any interpretable "signal" (Bogaert et al., 2024). This pattern persists even after normalization or post-processing of heatmaps.
Interpretability-Informativeness Trade-off: Simpler models give higher SNR and sharper attributions but capture less task complexity; LLMs display higher accuracy but less explainable logic at the univariate level. Explanation complexity (e.g., multi-token, multi-channel) often sacrifices plausibility for informativeness (Bogaert et al., 2024, Atakishiyev et al., 20 Oct 2025).
Domain-Specific Effects: In safety-critical fields (healthcare/autonomous driving), rationale extraction and counterfactual stress-testing increase user trust and interpretability. Benchmarking frameworks such as BELL facilitate systematic, cross-model comparisons within explicit domains (Atakishiyev et al., 20 Oct 2025, Ahmed et al., 22 Apr 2025).
Faithfulness and Plausibility Diverge: Extractive explanations can align well with human judgments but fail to reflect model-internal causal logic. Counterfactual self-explanations, by contrast, attain both high faithfulness (prediction flip upon minimal edit) and high similarity to original reasoning (Randl et al., 2024).
Mechanistic and Multi-Scale Approaches Open the "Black Box" at Cost: Mechanistic discovery, manifold alignment, and circuit tracing yield interpretability at component, layer, and cross-scale levels, yet require careful balancing of geometric faithfulness, information preservation, curvature regularization, and computational cost (Zhang et al., 24 May 2025, Zhao et al., 2024).

5. Applications and Utilization of Explanations

Practical applications of LLM explainability include:

Model Debugging and Editing: Identification and causal editing of knowledge neurons or activation pathways enables correction of misinformation, bias suppression, or injection of new facts, e.g., ROME and mass-editing (Mumuni et al., 17 Jan 2025, Zhao et al., 2024).
Controlled Generation and Bias Mitigation: Using attribution and intervention on interpretable features, LLM outputs can be steered toward truthfulness (via ITI), safety (toxic heads), or fairness (demographic bias suppression) (Mumuni et al., 17 Jan 2025, Zhao et al., 2024).
Human-AI Collaboration and Regulation: Explainability pipelines—statistically robust attribution heatmaps (SMILE), rationale chains, and ontologically-anchored outputs—support expert scrutiny, compliance, trust calibration, and contestability in regulatory frameworks (e.g., GDPR, EU AI Act) (Atakishiyev et al., 20 Oct 2025, Herrera, 18 May 2025, Dehghani et al., 27 May 2025, Amara et al., 2024).
Learning and Model Improvement: Explanation-driven regularization, explanation-based prompt tuning, and human-in-the-loop interventions can enhance out-of-distribution robustness and user trust (Zhao et al., 2023, Luo et al., 2024).

6. Open Challenges and Future Research Directions

Ongoing and future research is driven by persistent limitations:

Explanation Stability and Reproducibility: Quantifying and reducing variability in explanations—by ensembling, regularization, or improved sensitivity metrics—is essential for scientific and regulatory acceptance (Bogaert et al., 2024, Bogaert et al., 2024).
Scalability and Efficiency: Faithful mechanistic explanation and circuit discovery for 100B-parameter models remain computationally intractable; efficient surrogate and approximation methods are under exploration (Mumuni et al., 17 Jan 2025, Palikhe et al., 26 Jun 2025).
Unified Faithfulness Metrics: There is as yet no consensus on faithfulness metrics and benchmarks that can function across architectures, explanation types, and real-world conditions (Palikhe et al., 26 Jun 2025).
Audience- and Domain-Adaptive Explanations: Global vs. local, mechanistic vs. narrative, and cognitively accessible vs. technically complete explanations must be custom-tailored to stakeholder roles (auditors, clinicians, regulators, end users) (Herrera, 18 May 2025, Atakishiyev et al., 20 Oct 2025).
Integration with Causal and Symbolic Reasoning: Structural causal models, logic engines, and ontology-driven frameworks promise higher-order explanations, contestability, and intervention capability at the cost of additional complexity (Herrera, 18 May 2025, Amara et al., 2024).
Regulatory and Ethical Compliance: Full compliance with legal requirements for intelligibility, auditability, and redress remains an open alignment challenge, compounded by opacity constraints and "irreducible" model complexity (Herrera, 18 May 2025, Atakishiyev et al., 20 Oct 2025).

Emergent research themes include: multi-scale geometric and information-theoretic frameworks, concept bottlenecks, lifelong and temporal XAI, hybrid neuro-symbolic pipelines, and adversarial/human-in-the-loop benchmarks (Zhang et al., 24 May 2025, Atakishiyev et al., 20 Oct 2025, Palikhe et al., 26 Jun 2025).

7. Representative Comparison: Simplicity, Stability, and SNR

The following table summarizes selected findings on explanation stability, informativeness, and trade-offs between feature-based and LLM-based models (Bogaert et al., 2024, Bogaert et al., 2024):

Model & Explanation	Accuracy (%)	Explanation Variance ($\sigma^2$)	SNR ($S/N$)
CamemBERT+LRP	82.3	0.045	~0.25
Feature-based (SVM)	68.7	0.012	$\infty$

Feature-based models yield higher signal-to-noise and more stable, sparse heatmaps, while LLMs offer superior predictive accuracy but with explanations dominated by noise at the word-level, univariate granularity. The implication is that sophisticated LLM reasoning likely occupies richer, higher-dimensional manifolds not captured by simplistic attribution schemes, necessitating future explanation frameworks capable of modeling multi-token, multi-channel, and higher-order interactions without sacrificing intelligibility or practical usability (Bogaert et al., 2024, Zhang et al., 24 May 2025, Atakishiyev et al., 20 Oct 2025).
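A signal-to-noise ratio of this kind can be sketched from attributions collected over several training seeds. The definition used below is an assumption made for illustration and may differ from the one in the cited studies: signal is the variance of the seed-averaged attribution across words, and noise is the average across-seed variance within each word. All attribution values are invented.

```python
from statistics import mean, pvariance

def explanation_snr(attr_by_seed):
    """Between-word signal divided by within-word (across-seed) noise.

    attr_by_seed: one attribution vector per seed, same words in each.
    """
    per_word = list(zip(*attr_by_seed))
    signal = pvariance([mean(word) for word in per_word])
    noise = mean([pvariance(word) for word in per_word])
    return signal / noise if noise > 0 else float("inf")

# Hypothetical attributions for two words over two seeds.
consistent = explanation_snr([[1.0, 0.0], [0.8, 0.2]])      # seeds mostly agree
contradictory = explanation_snr([[1.0, 0.0], [0.0, 1.0]])   # seeds disagree
deterministic = explanation_snr([[1.0, 0.0], [1.0, 0.0]])   # no seed variation
```

Under this definition a deterministic model has zero within-word noise and therefore an infinite ratio, which is consistent with the $\infty$ entry for the feature-based model in the table, while seeds that contradict each other drive the ratio toward zero.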