Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
41 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

From Understanding to Utilization: A Survey on Explainability for Large Language Models (2401.12874v2)

Published 23 Jan 2024 in cs.CL and cs.AI
From Understanding to Utilization: A Survey on Explainability for Large Language Models

Abstract: Explainability for LLMs is a critical yet challenging aspect of natural language processing. As LLMs are increasingly integral to diverse applications, their "black-box" nature sparks significant concerns regarding transparency and ethical use. This survey underscores the imperative for increased explainability in LLMs, delving into both the research on explainability and the various methodologies and tasks that utilize an understanding of these models. Our focus is primarily on pre-trained Transformer-based LLMs, such as LLaMA family, which pose distinctive interpretability challenges due to their scale and complexity. In terms of existing methods, we classify them into local and global analyses, based on their explanatory objectives. When considering the utilization of explainability, we explore several compelling methods that concentrate on model editing, control generation, and model enhancement. Additionally, we examine representative evaluation metrics and datasets, elucidating their advantages and limitations. Our goal is to reconcile theoretical and empirical understanding with practical implementation, proposing exciting avenues for explanatory techniques and their applications in the LLMs era.

Introduction

In the domain of NLP, LLMs stand at the forefront of current technological advancements, distinguished by their impressive array of capabilities. This surge in effectiveness is met with inherent complexities—most notably, the opaque nature of these models, which impede the transparency necessary for trust and ethical application. Recognizing these challenges, this paper expounds on explainability within the context of Transformer-based pre-trained LLMs.

Explainability Methods for LLMs

The classification of methods for discerning model reasoning is an essential facet of this paper. These have been compartmentalized into Local and Global Analysis strategies. Local Analysis pinpoints the specific inputs, such as tokens, that influence the model's outcomes, exploring techniques like feature attribution analysis. On the global scale, methods such as probes endeavor to understand the broader linguistic knowledge encapsulated within a model's architecture.

The role of attention mechanisms, particularly multi-head self-attention (MHSA) and feed-forward neural networks (FFN), is scrutinized for a more profound comprehension of the intermediate processes. Attention distribution, gradient attribution, and vocabulary projections are some of the mechanisms under investigation. These approaches enable dissection of the complexities within Transformer blocks to extract insights about LLM operations.

Applications of Explainability

Beyond theoretical understanding, explainability intersects with practical applications, aiming to refine LLMs in terms of functionality and ethical alignment. Incorporating explainability insights into model editing facilitates precise modifications without compromising overall performance on unrelated tasks. Additionally, leveraging these insights can optimize model capacity, especially in processing extended text lengths and In-Context Learning. Furthermore, explainability stands as a pillar in the development of responsible AI, providing pathways for reducing hallucinations and aligning ethical outcomes with human values.

Evaluation and Future Directions

An assessment of explanation plausibility and the aftermath of model editing is paramount for gauging the effectiveness of attribution methods. Datasets like ZsRE and CounterFact emerge as valuable assets for evaluating factual editing. To appraise truthfulness, the TruthfulQA benchmark becomes instrumental, with a focus on both the veracity and informativeness of output.

The future trajectory involves crafting explainability methods that resonate with various model frameworks and harnessing said explainability to facilitate the construction of trustworthy and human-value aligned LLMs. As these models evolve, clarity and fairness will become increasingly pivotal in harnessing their full potential for benefit, positioning explainability not as an option but as a cornerstone of LLM development and deployment.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (78)
  1. 2023. The devil is in the neurons: Interpreting and mitigating social biases in language models. In Openreview for International Conference on Learning Representations 2024.
  2. Samira Abnar and Willem Zuidema. 2020. Quantifying attention flow in transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4190–4197, Online. Association for Computational Linguistics.
  3. Haozhe An and Rachel Rudinger. 2023. Nichelle and nancy: The influence of demographic attributes and tokenization length on first name biases. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 388–401, Toronto, Canada. Association for Computational Linguistics.
  4. A general language assistant as a laboratory for alignment.
  5. A diagnostic study of explainability techniques for text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3256–3274, Online. Association for Computational Linguistics.
  6. Layer normalization.
  7. Grad-sam: Explaining transformers via gradient self-attention maps. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, CIKM ’21, page 2882–2887, New York, NY, USA. Association for Computing Machinery.
  8. "will you find these shortcuts?" a protocol for evaluating the faithfulness of input salience methods for text classification.
  9. Eliciting latent predictions from transformers with the tuned lens.
  10. Longformer: The long-document transformer.
  11. Language models are few-shot learners.
  12. Generating hierarchical explanations on text classification via feature interaction detection. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5578–5593, Online. Association for Computational Linguistics.
  13. Evaluating large language models trained on code.
  14. Dola: Decoding by contrasting layers improves factuality in large language models.
  15. A toy model of universality: Reverse engineering how networks learn group operations.
  16. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8493–8502, Dublin, Ireland. Association for Computational Linguistics.
  17. Analyzing transformers in embedding space. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16124–16170, Toronto, Canada. Association for Computational Linguistics.
  18. Jump to conclusions: Short-cutting transformers with linear transformations.
  19. A survey on in-context learning.
  20. Joseph Enguehard. 2023. Sequential integrated gradients: a simple but effective method for explaining language models.
  21. Pathologies of neural models make interpretations difficult. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3719–3728, Brussels, Belgium. Association for Computational Linguistics.
  22. Measuring the mixing of contextual information in the transformer. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8698–8714, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  23. FairPrism: Evaluating fairness-related harms in text generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6231–6251, Toronto, Canada. Association for Computational Linguistics.
  24. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369, Online. Association for Computational Linguistics.
  25. Dissecting recall of factual associations in auto-regressive language models.
  26. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 30–45, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  27. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  28. Overthinking the truth: Understanding how language models process false demonstrations.
  29. Self-attention attribution: Interpreting information interactions inside transformer.
  30. In-context learning creates task vectors.
  31. Inspecting and editing knowledge representations in language models.
  32. John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.
  33. Transformer-patcher: One mistake worth one neuron.
  34. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.
  35. Shahar Katz and Yonatan Belinkov. 2023. Interpreting transformer’s attention dynamic memory and visualizing the semantic information flow of gpt.
  36. The (un)reliability of saliency methods.
  37. Investigating the influence of noise and distractors on the interpretation of neural networks.
  38. Attention is not only a weight: Analyzing transformers with vector norms. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7057–7075, Online. Association for Computational Linguistics.
  39. Analyzing feed-forward blocks in transformers through the lens of attention map.
  40. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada. Association for Computational Linguistics.
  41. Emergent world representations: Exploring a sequence model trained on a synthetic task.
  42. Inference-time intervention: Eliciting truthful answers from a language model.
  43. Pmet: Precise model editing in a transformer. ArXiv, abs/2308.08742.
  44. Alpaca: A new semi-analytic model for metal absorption lines emerging from clumpy galactic environments.
  45. Truthfulqa: Measuring how models mimic human falsehoods.
  46. Lost in the middle: How language models use long contexts.
  47. Scott Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions.
  48. Locating and editing factual associations in gpt.
  49. Mass-editing memory in a transformer.
  50. Memory-based model editing at scale.
  51. DecompX: Explaining transformers decisions by propagating token decomposition. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2649–2664, Toronto, Canada. Association for Computational Linguistics.
  52. GlobEnc: Quantifying global token attribution by incorporating the whole encoder layer in transformers. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 258–271, Seattle, United States. Association for Computational Linguistics.
  53. OpenAI. 2023. Gpt-4 technical report.
  54. Training language models to follow instructions with human feedback.
  55. Judea Pearl et al. 2000. Models, reasoning and inference. Cambridge, UK: CambridgeUniversityPress, 19(2):3.
  56. Copen: Probing conceptual knowledge in pre-trained language models.
  57. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.
  58. Efficiently scaling transformer inference.
  59. Perturbation augmentation for fairer NLP. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9496–9521, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  60. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  61. What are you token about? dense retrieval as distributions over the vocabulary. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2481–2498, Toronto, Canada. Association for Computational Linguistics.
  62. "why should i trust you?": Explaining the predictions of any classifier.
  63. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp.
  64. Integrated directional gradients: Feature interaction attribution for neural NLP models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 865–878, Online. Association for Computational Linguistics.
  65. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3319–3328. PMLR.
  66. Function vectors in large language models.
  67. Llama 2: Open foundation and fine-tuned chat models.
  68. Transformer Circuits. 2022. Mechanistic interpretations of transformer circuits. Accessed: [insert access date here].
  69. Attention is all you need. In NIPS.
  70. Causal mediation analysis for interpreting neural nlp: The case of gender bias.
  71. Interpretability in the wild: a circuit for indirect object identification in gpt-2 small.
  72. Label words are anchors: An information flow perspective for understanding in-context learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9840–9855, Singapore. Association for Computational Linguistics.
  73. Ethical and social risks of harm from language models.
  74. Efficient streaming language models with attention sinks.
  75. Local interpretation of transformer based on linear decomposition. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10270–10287, Toronto, Canada. Association for Computational Linguistics.
  76. Editing large language models: Problems, methods, and opportunities.
  77. Few-shot out-of-domain transfer learning of natural language explanations in a label-abundant setup. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3486–3501, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  78. Explainability for large language models: A survey.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (2)
  1. Haoyan Luo (3 papers)
  2. Lucia Specia (68 papers)
Citations (16)
X Twitter Logo Streamline Icon: https://streamlinehq.com
Youtube Logo Streamline Icon: https://streamlinehq.com