Applied Explainability for Large Language Models: A Comparative Study

Published 15 Apr 2026 in cs.CL, cs.AI, and cs.LG | (2604.15371v1)

Abstract: LLMs achieve strong performance across many natural language processing tasks, yet their decision processes remain difficult to interpret. This lack of transparency creates challenges for trust, debugging, and deployment in real-world systems. This paper presents an applied comparative study of three explainability techniques: Integrated Gradients, Attention Rollout, and SHAP, on a fine-tuned DistilBERT model for SST-2 sentiment classification. Rather than proposing new methods, the focus is on evaluating the practical behavior of existing approaches under a consistent and reproducible setup. The results show that gradient-based attribution provides more stable and intuitive explanations, while attention-based methods are computationally efficient but less aligned with prediction-relevant features. Model-agnostic approaches offer flexibility but introduce higher computational cost and variability. This work highlights key trade-offs between explainability methods and emphasizes their role as diagnostic tools rather than definitive explanations. The findings provide practical insights for researchers and engineers working with transformer-based NLP systems. This is a preprint and has not undergone peer review.


Authors (1)

Venkata Abhinandan Kancharla

Summary

The paper's main contribution is an empirical comparison of Integrated Gradients, SHAP, and Attention Rollout on a fine-tuned DistilBERT model.
It finds that Integrated Gradients provides the most stable and faithful attributions, while attention-based approaches are often misaligned with prediction-relevant tokens.
The study highlights practical implications for debugging and deploying NLP systems, emphasizing the need for rigorous evaluation of explainability tools.

Comparative Analysis of Explainability Techniques for Transformer-based NLP Models

Introduction

Transformer-based LLMs such as BERT and its derivatives have become central to state-of-the-art NLP, delivering strong performance on language understanding tasks. However, their architectural complexity and reliance on stacked attention layers raise concerns about transparency, particularly in operational contexts that demand accountability, trust, and debugging. Despite the proliferation of explainable AI (XAI) techniques targeting these architectures, a notable gap remains between theoretical explainability frameworks and their practical applicability. This paper, "Applied Explainability for Large Language Models: A Comparative Study" (2604.15371), presents an empirical comparison of Integrated Gradients, Attention Rollout, and SHAP for post-hoc explanation of a fine-tuned DistilBERT model, focusing on practical criteria such as faithfulness, stability, and interpretability.

Methodological Overview

The paper establishes a taxonomy of explainability techniques relevant to transformer-based LLMs, emphasizing four primary classes: attention-based methods, gradient-based attribution, feature attribution, and example-based explanations.

Attention-based methods extract information from self-attention weights, commonly visualized to highlight token interplay but often criticized for weak causal alignment.
Gradient-based attribution, epitomized by Integrated Gradients (IG), accumulates the gradients of the model output along a path from a baseline input to the actual input, yielding token-level influence scores.
Feature attribution approaches, such as SHAP, perturb input representations and aggregate model responses to estimate Shapley value-based importance scores, offering model-agnostic flexibility at substantial computational cost.
Example-based explanations (e.g., TracIn), which trace the influence of specific training instances, are positioned as valuable for dataset auditing but less prevalent in direct interpretability workflows.
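
The path-integral idea behind Integrated Gradients can be illustrated on a toy model. The sketch below is not the paper's setup: it swaps DistilBERT for a hypothetical three-feature logistic model (`WEIGHTS` and `BIAS` are made-up values), assumes an all-zeros baseline, and approximates the integral with a midpoint Riemann sum. In a real transformer the same computation would typically be applied to token embeddings rather than raw features.

```python
import math

# Hypothetical toy model standing in for a classifier head; values are made up.
WEIGHTS = [2.0, -1.0, 0.0]
BIAS = 0.1

def model(x):
    """Positive-class probability of a logistic model."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS
    return 1.0 / (1.0 + math.exp(-z))

def model_grad(x):
    """Analytic gradient of model(x) with respect to each input feature."""
    p = model(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]

def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    grad_sum = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad_sum = [s + g for s, g in zip(grad_sum, model_grad(point))]
    # Scale the averaged gradient by the input-baseline difference.
    return [(xi - b) * s / steps for xi, b, s in zip(x, baseline, grad_sum)]

x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)

# Completeness: attributions should sum to model(x) - model(baseline).
completeness_gap = abs(sum(attributions) - (model(x) - model(baseline)))
print(attributions, completeness_gap)
```

The completeness check at the end is a useful sanity test in practice: a large gap usually means the number of integration steps is too small for the chosen baseline.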

Experimental evaluation was conducted using a frozen, fine-tuned DistilBERT model on the SST-2 sentiment classification dataset, enabling reproducible and consistent comparison. The primary evaluation criteria were faithfulness (alignment with prediction-relevant features), stability (consistency across runs), and human interpretability.
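
This summary does not state how faithfulness and stability were scored, so the following is only one plausible way to make the two criteria concrete. It assumes a deletion-based check for faithfulness (remove the top-attributed tokens and measure the drop in predicted probability) and mean pairwise cosine similarity for stability. The lexicon model, the token list, and both attribution vectors are hypothetical stand-ins, not outputs from the study.

```python
import math

# Hypothetical toy sentiment model: per-token logit contributions.
LEXICON = {"wonderful": 2.0, "engaging": 1.5, "dull": -2.0}

def positive_prob(tokens):
    """Positive-class probability under the toy lexicon model."""
    z = sum(LEXICON.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-z))

def deletion_drop(tokens, attributions, k):
    """Probability drop after removing the k highest-attributed tokens."""
    order = sorted(range(len(tokens)), key=lambda i: attributions[i], reverse=True)
    removed = set(order[:k])
    kept = [t for i, t in enumerate(tokens) if i not in removed]
    return positive_prob(tokens) - positive_prob(kept)

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mean_pairwise_cosine(runs):
    """Average cosine similarity over all pairs of attribution vectors."""
    pairs = [(i, j) for i in range(len(runs)) for j in range(i + 1, len(runs))]
    return sum(cosine(runs[i], runs[j]) for i, j in pairs) / len(pairs)

tokens = ["a", "wonderful", "and", "engaging", "film"]
sentiment_attr = [0.0, 0.6, 0.0, 0.4, 0.0]   # highlights sentiment words
structural_attr = [0.5, 0.0, 0.4, 0.0, 0.1]  # highlights function words

drop_sentiment = deletion_drop(tokens, sentiment_attr, k=2)
drop_structural = deletion_drop(tokens, structural_attr, k=2)
stability = mean_pairwise_cosine([sentiment_attr, sentiment_attr, structural_attr])
print(drop_sentiment, drop_structural, stability)
```

Under this proxy, an explanation that highlights tokens the model actually relies on produces a large probability drop, while one that highlights structural tokens produces little or none.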

Empirical Findings

Quantitative Results

Integrated Gradients emerged as the most stable and faithful method, providing consistent token-level attributions across multiple evaluations of similar inputs. SHAP demonstrated notable variability and sensitivity to input configuration and background sampling, resulting in unstable attribution distributions. Attention Rollout was computationally efficient but failed to faithfully identify sentiment-relevant tokens, often prioritizing syntactic or structural elements.
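
One source of the run-to-run variability attributed to SHAP is that Shapley values are usually estimated by sampling. The sketch below illustrates this with a plain permutation-sampling Shapley estimator, which is a simpler technique than the estimators in the SHAP library and is not the paper's configuration. The per-token logits in `WEIGHTS` are hypothetical, and "absent" tokens are assumed to contribute nothing, which stands in for the background distribution.

```python
import math
import random

# Hypothetical per-token logit contributions for a five-token input.
WEIGHTS = [0.0, 2.0, 0.0, 1.5, 0.0]

def value(present):
    """Model output when only the tokens flagged in `present` are kept."""
    z = sum(w for w, keep in zip(WEIGHTS, present) if keep)
    return 1.0 / (1.0 + math.exp(-z))

def sampled_shapley(n_features, n_permutations, seed):
    """Permutation-sampling estimate of Shapley values."""
    rng = random.Random(seed)
    phi = [0.0] * n_features
    for _ in range(n_permutations):
        order = list(range(n_features))
        rng.shuffle(order)
        present = [False] * n_features
        prev = value(present)
        for i in order:
            # Marginal contribution of feature i given the features added so far.
            present[i] = True
            cur = value(present)
            phi[i] += cur - prev
            prev = cur
    return [p / n_permutations for p in phi]

# Estimates from a small and a large sampling budget, over several seeds.
few = [sampled_shapley(5, 5, seed=s) for s in range(3)]
many = [sampled_shapley(5, 4000, seed=s) for s in range(3)]
for label, runs in (("5 permutations", few), ("4000 permutations", many)):
    print(label, [round(r[1], 3) for r in runs])
```

Comparing the printed estimates for token 1 across seeds shows how much the attribution depends on the sampling budget. Every model evaluation here is a forward pass in a real setting, which is where the computational cost comes from.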

Qualitative Insights

IG consistently highlighted sentiment-bearing tokens such as adjectives and intensifiers (e.g., "wonderful", "engaging"), closely mirroring human intuition and aligning with model predictions. In contrast, attention-based explanations frequently emphasized structural tokens (e.g., [CLS], stopwords), reducing their practical interpretability. SHAP outputs, although capable of detecting relevant tokens, were visually noisier and required careful configuration for meaningful analysis.
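
The tendency of attention-based explanations to favor structural tokens is easier to see once the rollout computation is written out. The following is a minimal sketch of Attention Rollout on hand-made matrices: each layer's head-averaged attention is mixed equally with the identity (to account for residual connections) and the layers are multiplied together. The three-token, two-layer attention values are hypothetical, chosen so that every position attends mostly to position 0 (playing the role of [CLS]).

```python
def matmul(a, b):
    """Plain-Python matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def attention_rollout(layers):
    """Roll out head-averaged attention matrices, first layer first.

    Each matrix is assumed to be row-stochastic (rows sum to 1).
    """
    n = len(layers[0])
    rollout = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn in layers:
        # Mix attention with the identity to model the residual connection.
        mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
        rollout = matmul(mixed, rollout)
    return rollout

# Hypothetical head-averaged attention: every token looks mostly at position 0.
layer = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.7, 0.1, 0.2],
]
rollout = attention_rollout([layer, layer])
cls_row = rollout[0]  # relevance of each token from the [CLS] position
print(cls_row)
```

Note that nothing in this computation refers to the predicted class or the output logit. The scores describe how information is mixed across positions, which is one reason they can diverge from the tokens that actually drive the prediction.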

Failure Modes

All methods exhibited characteristic failure cases:

Attention-based techniques often produced misaligned explanations that emphasized irrelevant tokens, consistent with prior skepticism about their faithfulness as causal explanations.
SHAP was highly sensitive to input perturbations and background distribution, complicating large-scale, reproducible analysis.
Integrated Gradients depends on access to model gradients and on the choice of baseline; however, no severe instability was observed in the conducted experiments.

Trade-off Analysis

Method	Strengths	Limitations	Practical Usefulness
Integrated Gradients	Stable, high faithfulness	Gradient access, baseline selection	Reliable for debugging and practical NLP analysis
SHAP	Flexible, model-agnostic	Computational overhead, unstable attributions on transformer models	Limited scalability; suited to targeted qualitative cases
Attention Rollout	Fast, easy computation	Weak causal alignment, emphasis on structural tokens	Less reliable for faithful explanation

Practical Implications

The comparative analysis underscores the need for practitioners to evaluate explainability tools rigorously, not just by theoretical soundness but by operational criteria. IG offers consistency and interpretability suitable for debugging and stakeholder communication within production ML systems. Attention-based methods, although efficient, provide explanations of limited faithfulness, suggesting utility mainly for exploratory analysis. SHAP's model-agnostic flexibility is offset by its instability and computational expense, relegating it to targeted, qualitative investigation rather than routine deployment.

Explainability is most impactful when attribution methods expose genuine model reasoning paths, aiding root-cause analysis of predictions and misclassifications. However, the observed failure cases caution against over-reliance on single techniques or naive interpretation of highlighted tokens.

Limitations and Generalizability

The study's scope is restricted by the use of SST-2 (short English sentiment reviews) and a medium-scale DistilBERT model. The behavior of explainability methods may differ markedly in longer document settings, multi-label tasks, domain-specific (e.g., biomedical, legal) or multilingual corpora, and larger instruction-tuned LLMs. Human interpretation bias further complicates the assessment of explanation quality, necessitating domain-informed, contextual validation.

Theoretical and Practical Implications

The delineated trade-offs have direct implications for the evolving landscape of XAI in LLMs. The practical unreliability of attention weights as explanations is a notable finding, challenging the prevalent use of attention visualization tools in model reporting and debugging. Within the scope of this study, the stability and faithfulness of IG support gradient-based attribution as a core diagnostic technique in applied NLP, whereas the computational limitations of SHAP highlight scalability as a critical bottleneck for model-agnostic explanation frameworks.

Future Directions

Extending the analysis to larger and instruction-tuned LLMs, more complex tasks, and diverse languages/domains will be essential to generalize findings. There remains scope for developing explainability mechanisms that reconcile efficiency with causal faithfulness, particularly in multimodal or sequential decision contexts. Aligning attention-based explanations with robust causal evidence is a promising theoretical direction, and standardized evaluation protocols for operational effectiveness are needed.

Conclusion

In summary, the study provides an empirically grounded comparison of prevalent XAI methods for transformer-based sentiment classification, revealing substantial differences in faithfulness, stability, and practicality. Integrated Gradients is found to offer consistent and intuitive explanations, making it preferable for production-aligned interpretability tasks. Attention-based and model-agnostic methods, while useful in specific contexts, require careful consideration given their observed limitations. Ongoing research is needed to enhance the scalability, faithfulness, and domain adaptability of explainability tools for LLMs to ensure responsible and accountable machine learning deployment.
