- The paper’s main contribution is an empirical comparison of Integrated Gradients, SHAP, and attention-rollout on a fine-tuned DistilBERT model.
- It reveals that Integrated Gradients provide the most stable and faithful attributions, while attention-based approaches often misalign with prediction-relevant tokens.
- The study highlights practical implications for debugging and deploying NLP systems, emphasizing the need for rigorous evaluation of explainability tools.
Introduction
Transformer-based LLMs such as BERT and its derivatives have become central to state-of-the-art NLP, enabling high-fidelity language understanding and generation. However, their inherent architectural complexity and reliance on deep attention mechanisms have led to concerns regarding transparency, particularly in operational contexts demanding accountability, trust, and debugging. Despite the proliferation of explainable AI (XAI) techniques targeting these architectures, there remains a notable gap between theoretical explainability frameworks and their practical applicability. This paper, "Applied Explainability for LLMs: A Comparative Study" (2604.15371), presents an empirical comparison of Integrated Gradients, attention-rollout, and SHAP for post-hoc explanation of a fine-tuned DistilBERT model, focusing on real-world criteria such as faithfulness, stability, and interpretability.
Methodological Overview
The paper establishes a taxonomy of explainability techniques relevant to transformer-based LLMs, emphasizing four primary classes: attention-based methods, gradient-based attribution, feature attribution, and example-based explanations.
- Attention-based methods extract information from self-attention weights, commonly visualized to highlight token interplay but often criticized for weak causal alignment.
- Gradient-based attribution, epitomized by Integrated Gradients (IG), accumulates the model's gradients along a path from a baseline input to the actual input, yielding token-level influence scores.
- Feature attribution approaches, such as SHAP, perturb input representations and aggregate model responses to estimate Shapley value-based importance scores, offering model-agnostic flexibility at substantial computational cost.
- Example-based explanations (e.g., TracIn), which trace the influence of specific training instances, are positioned as valuable for dataset auditing but less prevalent in direct interpretability workflows.
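The gradient-based entry in the list above can be made concrete with a toy example. What follows is a minimal sketch, not the paper's implementation. It assumes a hypothetical three-feature logistic model in place of DistilBERT, an all-zeros baseline, and a midpoint Riemann sum with 200 steps. All names and weights are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x, w, b):
    """Toy differentiable classifier (assumption): sigmoid of a linear score."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def grad(x, w, b):
    """Analytic gradient of the toy model with respect to each input feature."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]

def integrated_gradients(x, baseline, w, b, steps=200):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        g = grad(point, w, b)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    # Scale the averaged gradient by the input-minus-baseline difference.
    return [(xi - bi) * gi for xi, bi, gi in zip(x, baseline, avg_grad)]

# Illustrative values only; none of these come from the paper.
w = [2.0, -1.0, 0.5]
b = 0.1
x = [1.0, 0.5, -1.0]
baseline = [0.0, 0.0, 0.0]

attr = integrated_gradients(x, baseline, w, b)
# Completeness: attributions should sum to F(x) - F(baseline).
gap = abs(sum(attr) - (model(x, w, b) - model(baseline, w, b)))
print(attr, gap)
```

The completeness check at the end is one reason IG is attractive in practice: the attributions account for the whole change in model output relative to the baseline, so a different baseline yields a different decomposition.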
Experimental evaluation was conducted using a frozen, fine-tuned DistilBERT model on the SST-2 sentiment classification dataset, enabling reproducible and consistent comparison. The primary evaluation criteria were faithfulness (alignment with prediction-relevant features), stability (consistency across runs), and human interpretability.
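The summary does not state how stability was scored. One plausible way to quantify "consistency across runs", offered here purely as an assumption, is the mean pairwise Spearman rank correlation between the attribution vectors produced by repeated runs on the same input. The attribution values below are invented for illustration and are not results from the paper.

```python
from itertools import combinations

def ranks(values):
    """Rank positions (0 = smallest); this sketch assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order):
        r[idx] = rank
    return r

def spearman(a, b):
    """Spearman rank correlation for two equal-length vectors without ties."""
    n = len(a)
    ra, rb = ranks(a), ranks(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def stability(runs):
    """Mean pairwise rank correlation across repeated attribution runs."""
    pairs = list(combinations(runs, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)

# Hypothetical per-token attributions from three runs on one sentence.
stable_runs = [[0.9, 0.1, 0.4, 0.05], [0.8, 0.2, 0.5, 0.1], [0.85, 0.15, 0.45, 0.02]]
noisy_runs = [[0.9, 0.1, 0.4, 0.05], [0.1, 0.9, 0.05, 0.4], [0.4, 0.05, 0.9, 0.1]]

print(stability(stable_runs), stability(noisy_runs))
```

A rank-based score ignores differences in scale between methods and asks only whether the same tokens are ranked as most important each time, which is usually what matters when attributions are read as highlights.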
Empirical Findings
Quantitative Results
Integrated Gradients emerged as the most stable and faithful method, providing consistent token-level attributions across multiple evaluations of similar inputs. SHAP demonstrated notable variability and sensitivity to input configuration and background sampling, resulting in unstable attribution distributions. Attention Rollout was computationally efficient but failed to faithfully identify sentiment-relevant tokens, often prioritizing syntactic or structural elements.
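To show why attention rollout is cheap, here is a minimal sketch of the commonly used rollout recipe: average the heads in each layer, mix the result with the identity matrix to account for residual connections, renormalize the rows, and multiply the layers together. The two 3x3 matrices are hypothetical head-averaged attention maps, not values from the paper's model.

```python
def matmul(a, b):
    """Plain-Python matrix product of two square matrices."""
    n = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(len(b[0]))]
            for i in range(len(a))]

def add_residual(attn):
    """Mix attention with the identity (residual path), then renormalize rows."""
    n = len(attn)
    mixed = [[0.5 * attn[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    return [[v / sum(row) for v in row] for row in mixed]

def attention_rollout(layers):
    """layers: head-averaged attention matrices, first layer first."""
    rollout = add_residual(layers[0])
    for attn in layers[1:]:
        # Later layers are applied on the left of the accumulated product.
        rollout = matmul(add_residual(attn), rollout)
    return rollout

# Hypothetical attention maps for tokens ([CLS], "wonderful", "film").
layers = [
    [[0.6, 0.2, 0.2], [0.5, 0.3, 0.2], [0.5, 0.2, 0.3]],
    [[0.7, 0.2, 0.1], [0.6, 0.2, 0.2], [0.6, 0.1, 0.3]],
]

rollout = attention_rollout(layers)
cls_row = rollout[0]  # relevance of each input token to the [CLS] position
print(cls_row)
```

The computation needs only matrix products over attention maps from one forward pass, with no gradients and no reference to the predicted label. In this toy case most of the mass in the [CLS] row falls on [CLS] itself, which mirrors the tendency to emphasize structural tokens described above.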
Qualitative Insights
IG consistently highlighted sentiment-bearing tokens such as adjectives and intensifiers (e.g., "wonderful", "engaging"), closely mirroring human intuition and aligning with model predictions. In contrast, attention-based explanations frequently emphasized structural tokens (e.g., [CLS], stopwords), reducing their practical interpretability. SHAP outputs, although capable of detecting relevant tokens, were visually noisier and required careful configuration for meaningful analysis.
Failure Modes
All methods exhibited characteristic failure cases:
- Attention-based techniques often produced misaligned explanations that emphasized irrelevant tokens, consistent with prior skepticism about their faithfulness as causal explanations.
- SHAP was highly sensitive to input perturbations and background distribution, complicating large-scale, reproducible analysis.
- Integrated Gradients depend on access to model gradients and baseline selection; however, no severe instability was observed in the conducted experiments.
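The background sensitivity noted for SHAP can be illustrated with exact Shapley values on a toy function. This is a sketch under stated assumptions: it enumerates all feature orderings rather than using the sampling estimators of the SHAP library, it replaces "absent" features with a single background vector, and the model `x0 * x1 + x2` is hypothetical.

```python
from itertools import permutations

def model(x):
    """Toy model (assumption) with an interaction between features 0 and 1."""
    return x[0] * x[1] + x[2]

def shapley_values(f, x, background):
    """Exact Shapley values; absent features take their background value."""
    n = len(x)
    phi = [0.0] * n
    perms = list(permutations(range(n)))
    for order in perms:
        current = list(background)
        prev = f(current)
        for i in order:
            # Reveal feature i and record its marginal contribution.
            current[i] = x[i]
            new = f(current)
            phi[i] += (new - prev) / len(perms)
            prev = new
    return phi

x = [1.0, 1.0, 1.0]
bg_zero = [0.0, 0.0, 0.0]
bg_shifted = [1.0, 0.0, 0.0]  # differs from bg_zero only in feature 0

phi_zero = shapley_values(model, x, bg_zero)
phi_shifted = shapley_values(model, x, bg_shifted)
print(phi_zero, phi_shifted)
```

Both calls explain the same input and the same prediction, yet the credit assigned to features 0 and 1 changes with the background vector. When the background is sampled rather than fixed, this dependence appears as run-to-run variability.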
Trade-off Analysis
| Method | Strengths | Limitations | Practical Usefulness |
| --- | --- | --- | --- |
| Integrated Gradients | Stable, high faithfulness | Gradient access, baseline selection | Reliable for debugging and practical NLP analysis |
| SHAP | Flexible, model-agnostic | Computational overhead, instability under transformers | Restricted scalability; suitable for qualitative cases |
| Attention Rollout | Fast, easy computation | Weak causal alignment, emphasis on structural tokens | Less reliable for faithful explanation |
Practical Implications
The comparative analysis underscores that practitioners should evaluate explainability tools not only for theoretical soundness but also against operational criteria. IG offers consistency and interpretability suitable for debugging and stakeholder communication within production ML systems. Attention-based methods, although efficient, provide explanations of limited faithfulness, suggesting utility mainly for exploratory analysis. SHAP's model-agnostic flexibility is offset by its instability and computational expense, which relegates it to targeted, qualitative investigation rather than routine deployment.
Explainability is most impactful when attribution methods expose genuine model reasoning paths, aiding root-cause analysis of predictions and misclassifications. However, the observed failure cases caution against over-reliance on single techniques or naive interpretation of highlighted tokens.
Limitations and Generalizability
The study's scope is restricted by the use of SST-2 (short English sentiment reviews) and a medium-scale DistilBERT model. The behaviour of explainability methods may differ markedly in longer document settings, multi-label tasks, domain-specific (e.g., biomedical, legal) or multilingual corpora, and larger instruction-tuned LLMs. Human interpretation bias further complicates the assessment of explanation quality, necessitating domain-informed, contextual validation.
Theoretical and Practical Implications
The delineated trade-offs have direct implications for the evolving landscape of XAI in LLMs. The practical unreliability of attention weights as explanations is a notable finding, challenging the prevalent use of attention visualization tools in model reporting and debugging. The stability and faithfulness of IG confirm gradient-based attribution as a core diagnostic technique in applied NLP, whereas the computational limitations of SHAP highlight scalability as a critical bottleneck for model-agnostic explanation frameworks.
Future Directions
Extending the analysis to larger and instruction-tuned LLMs, more complex tasks, and diverse languages/domains will be essential to generalize findings. There remains scope for developing explainability mechanisms that reconcile efficiency with causal faithfulness, particularly in multimodal or sequential decision contexts. Aligning attention-based explanations with robust causal evidence is a promising theoretical direction, and standardized evaluation protocols for operational effectiveness are needed.
Conclusion
In summary, the study provides an empirically grounded comparison of prevalent XAI methods for transformer-based sentiment classification, revealing substantial differences in faithfulness, stability, and practicality. Integrated Gradients are found to offer consistent and intuitive explanations, making them preferable for production-aligned interpretability tasks. Attention-based and model-agnostic methods, while useful in specific contexts, require careful consideration given their observed limitations. Ongoing research is needed to enhance the scalability, faithfulness, and domain adaptability of explainability tools for LLMs to ensure responsible and accountable machine learning deployment.