A Primer on the Inner Workings of Transformer-based Language Models (arXiv:2405.00208v3)

Published 30 Apr 2024 in cs.CL

Abstract: The rapid progress of research aimed at interpreting the inner workings of advanced LLMs has highlighted a need for contextualizing the insights gained from years of work in this area. This primer provides a concise technical introduction to the current techniques used to interpret the inner workings of Transformer-based LLMs, focusing on the generative decoder-only architecture. We conclude by presenting a comprehensive overview of the known internal mechanisms implemented by these models, uncovering connections across popular approaches and active research directions in this area.

  241. C. McDougall and J. Bloom. Sae-vis: Announcement post. LessWrong, 2024. URL https://www.lesswrong.com/posts/nAhy6ZquNY7AD3RkD/sae-vis-announcement-post-1.
  242. Copy suppression: Comprehensively understanding an attention head. Arxiv, 2023. URL https://arxiv.org/abs/2310.04625.
  243. Acquisition of chess knowledge in alphazero. Proceedings of the National Academy of Sciences, 119(47):e2206625119, 2022. doi: 10.1073/pnas.2206625119. URL https://www.pnas.org/doi/abs/10.1073/pnas.2206625119.
  244. The hydra effect: Emergent self-repair in language model computations. Arxiv, 2023. URL https://arxiv.org/abs/2307.15771.
  245. Locating and editing factual associations in GPT. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp.  17359–17372. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/6f1d43d5a82a37e89b0665b33bf3a182-Abstract-Conference.html.
  246. Mass-editing memory in a transformer. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=MkbcAHIYgyS.
  247. Effects of parameter norm growth during transformer training: Inductive bias from gradient descent. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  1766–1781, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.133. URL https://aclanthology.org/2021.emnlp-main.133.
  248. A tale of two circuits: Grokking as competition of sparse and dense subnetworks. ArXiv, abs/2303.11873, 2023. URL https://api.semanticscholar.org/CorpusID:257636667.
  249. A mechanism for solving relational tasks in transformer language models, 2023. URL https://arxiv.org/abs/2305.16130.
  250. Circuit component reuse across tasks in transformer language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=fpoAYV6Wsk.
  251. Are sixteen heads really better than one? In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/2c601ad9d2ff9bc8b282670cdd54f69f-Paper.pdf.
  252. How to dissect a Muppet: The structure of transformer embedding spaces. Transactions of the Association for Computational Linguistics, 10:981–996, 2022. doi: 10.1162/tacl_a_00501. URL https://aclanthology.org/2022.tacl-1.57.
  253. Using captum to explain generative language models. In L. Tan, D. Milajevs, G. Chauhan, J. Gwinnup, and E. Rippeth (eds.), Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pp.  165–173, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.nlposs-1.19. URL https://aclanthology.org/2023.nlposs-1.19.
  254. Distributed representations of words and phrases and their compositionality. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger (eds.), Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html.
  255. B. Millidge and S. Black. The singular value decompositions of transformer weight matrices are highly interpretable. AI Alignment Forum, 2022. URL https://www.alignmentforum.org/posts/mkbGjzxD8d8XqKHzA/the-singular-value-decompositions-of-transformer-weight.
  256. B. Millidge and E. Winsor. Basic facts about language model internals. AI Alignment Forum, 2023. URL https://www.alignmentforum.org/posts/PDLfpRwSynu73mxGw/basic-facts-about-language-model-internals-1.
  257. Large language models: A survey, 2024. URL https://arxiv.org/abs/2402.06196.
  258. Fast model editing at scale. In International Conference on Learning Representations, 2022a. URL https://openreview.net/forum?id=0DcZxeWfOPt.
  259. Memory-based model editing at scale. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  15817–15831. PMLR, 17–23 Jul 2022b. URL https://proceedings.mlr.press/v162/mitchell22a.html.
  260. GlobEnc: Quantifying global token attribution by incorporating the whole encoder layer in transformers. In M. Carpuat, M.-C. de Marneffe, and I. V. Meza Ruiz (eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  258–271, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.19. URL https://aclanthology.org/2022.naacl-main.19.
  261. DecompX: Explaining transformers decisions by propagating token decomposition. In A. Rogers, J. Boyd-Graber, and N. Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  2649–2664, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.149. URL https://aclanthology.org/2023.acl-long.149.
  262. Quantifying context mixing in transformers. In A. Vlachos and I. Augenstein (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp.  3378–3400, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.eacl-main.245. URL https://aclanthology.org/2023.eacl-main.245.
  263. R. Molina. Traveling words: A geometric interpretation of transformers. Arxiv, 2023. URL https://arxiv.org/abs/2309.07315.
  264. A glitch in the matrix? locating and detecting language model grounding with fakepedia, 2024. URL https://arxiv.org/abs/2312.02073.
  265. Transformer debugger. https://github.com/openai/transformer-debugger, 2024.
  266. N. Nanda. Induction mosaic. Neel Nanda Blog, 2022a. URL https://neelnanda.io/mosaic.
  267. N. Nanda. Neuroscope: A website for mechanistic interpretability of language models. Website, 2022b. URL https://neuroscope.io/.
  268. N. Nanda. Attribution patching: Activation patching at industrial scale. https://www.neelnanda.io/mechanistic-interpretability/attribution-patching, 2023.
  269. N. Nanda and J. Bloom. Transformerlens. Github Repository, 2022. URL https://github.com/neelnanda-io/TransformerLens.
  270. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=9XFSbDPmdW.
  271. Emergent linear representations in world models of self-supervised sequence models. In Y. Belinkov, S. Hao, J. Jumelet, N. Kim, A. McCarthy, and H. Mohebbi (eds.), Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pp.  16–30, Singapore, December 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.blackboxnlp-1.2. URL https://aclanthology.org/2023.blackboxnlp-1.2.
  272. Fact finding: Attempting to reverse-engineer factual recall on the neuron level. AI Alignment Forum, 2023c. URL https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall.
  273. Interpreting context look-ups in transformers: Investigating attention-mlp interactions, 2024. URL https://arxiv.org/abs/2402.15055.
  274. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pp.  3395–3403, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819.
  275. Investigating the limitations of transformers with simple arithmetic tasks, 2021.
  276. nostalgebraist. Interpreting GPT: the logit lens. AI Alignment Forum, 2020. URL https://www.alignmentforum.org/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens.
  277. B.-D. Oh and W. Schuler. Token-wise decomposition of autoregressive language model hidden states for analyzing model predictions. In A. Rogers, J. Boyd-Graber, and N. Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  10105–10117, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.562. URL https://aclanthology.org/2023.acl-long.562.
  278. C. Olah. Mechanistic interpretability, variables, and the importance of interpretable bases. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/2022/mech-interp-essay.
  279. C. Olah. Distributed representations: Composition & superposition. Transformer Circuits Thread, 2023. URL https://transformer-circuits.pub/2023/superposition-composition/index.html.
  280. An overview of early vision in inceptionv1. Distill, 2020a. doi: 10.23915/distill.00024.002. https://distill.pub/2020/circuits/early-vision.
  281. Zoom in: An introduction to circuits. Distill, 2020b. doi: 10.23915/distill.00024.001. URL https://distill.pub/2020/circuits/zoom-in.
  282. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision Research, 37(23):3311–3325, 1997. ISSN 0042-6989. doi: https://doi.org/10.1016/S0042-6989(97)00169-7. URL https://www.sciencedirect.com/science/article/pii/S0042698997001697.
  283. In-context learning and induction heads. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.
  284. Competition of mechanisms: Tracing how language models handle facts and counterfactuals. Computing Research Repository, arXiv:2402.11655, 2024. URL https://arxiv.org/abs/2402.11655.
  285. G. PAIR Team. Saliency: Framework-agnostic implementation for state-of-the-art saliency methods, 2023. URL https://github.com/PAIR-code/saliency.
  286. Future lens: Anticipating subsequent tokens from a single hidden state. In J. Jiang, D. Reitter, and S. Deng (eds.), Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), pp.  548–560, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.conll-1.37. URL https://aclanthology.org/2023.conll-1.37.
  287. L. Parcalabescu and A. Frank. On measuring faithfulness or self-consistency of natural language explanations, 2023.
  288. The linear representation hypothesis and the geometry of large language models. Arxiv, 2023a. URL https://arxiv.org/abs/2311.03658.
  289. Trak: attributing model behavior at scale. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023b.
  290. Does transformer interpretability transfer to rnns?, 2024.
  291. J. Pearl. Direct and indirect effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, UAI’01, pp.  411–420, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1558608001.
  292. J. Pearl. Causality. Cambridge University Press, 2 edition, 2009. doi: 10.1017/CBO9780511803161.
  293. Dissecting contextual word embeddings: Architecture and representation. In E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp.  1499–1509, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1179. URL https://aclanthology.org/D18-1179.
  294. Combining feature and instance attribution to detect artifacts. In S. Muresan, P. Nakov, and A. Villavicencio (eds.), Findings of the Association for Computational Linguistics: ACL 2022, pp.  1934–1946, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.153. URL https://aclanthology.org/2022.findings-acl.153.
  295. C. Pierse. Transformers Interpret, February 2021. URL https://github.com/cdpierse/transformers-interpret.
  296. Information-theoretic probing for linguistic structure. In D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  4609–4622, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.420. URL https://aclanthology.org/2020.acl-main.420.
  297. Grokking: Generalization beyond overfitting on small algorithmic datasets, 2022.
  298. Fine-tuning enhances existing mechanisms: A case study on entity tracking. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=8sKcAWOf2D.
  299. Outlier dimensions that disrupt transformers are driven by frequency. In Y. Goldberg, Z. Kozareva, and Y. Zhang (eds.), Findings of the Association for Computational Linguistics: EMNLP 2022, pp.  1286–1304, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-emnlp.93. URL https://aclanthology.org/2022.findings-emnlp.93.
  300. Cross-lingual consistency of factual knowledge in multilingual language models. In H. Bouamor, J. Pino, and K. Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  10650–10666, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.658. URL https://aclanthology.org/2023.emnlp-main.658.
  301. Training dynamics of contextual n-grams in language models, 2023. URL https://arxiv.org/abs/2311.00863.
  302. Learning to generate reviews and discovering sentiment. Arxiv, 2017. URL https://arxiv.org/abs/1704.01444.
  303. Improving language understanding by generative pre-training. OpenAI Blog, 2018. URL https://openai.com/research/language-unsupervised.
  304. Language models are unsupervised multitask learners. OpenAI Blog, 2019. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
  305. A. Raganato and J. Tiedemann. An analysis of encoder representations in transformer-based machine translation. In T. Linzen, G. Chrupała, and A. Alishahi (eds.), Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp.  287–297, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5431. URL https://aclanthology.org/W18-5431.
  306. S. Rajamanoharan. Progress update 1 from the gdm mech interp team. improving ghost grads. AI Alignment Forum, 2024. URL https://www.alignmentforum.org/posts/C5KAZQib3bzzpeyrg/progress-update-1-from-the-gdm-mech-interp-team-full-update.
  307. Improving dictionary learning with gated sparse autoencoders. ArXiv, 2024.
  308. Null it out: Guarding protected attributes by iterative nullspace projection. In D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.  7237–7256, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.647. URL https://aclanthology.org/2020.acl-main.647.
  309. Linear adversarial concept erasure. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  18400–18421. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/ravfogel22a.html.
  310. "why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pp.  1135–1144, 2016.
  311. A Primer in BERTology: What We Know About How BERT Works. Transactions of the Association for Computational Linguistics, 8:842–866, 01 2021. ISSN 2307-387X. doi: 10.1162/tacl_a_00349. URL https://doi.org/10.1162/tacl_a_00349.
  312. Outlier dimensions encode task specific knowledge. In H. Bouamor, J. Pino, and K. Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  14596–14605, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.901. URL https://aclanthology.org/2023.emnlp-main.901.
  313. C. Rushing and N. Nanda. Explorations of self-repair in language models, 2024. URL https://arxiv.org/abs/2402.15390.
  314. Toward transparent ai: A survey on interpreting the inner structures of deep neural networks. Arxiv, 2023. URL https://arxiv.org/abs/2207.13243.
  315. Attention lens: A tool for mechanistically interpreting the attention head information retrieval mechanism, 2023.
  316. S. Sanyal and X. Ren. Discretized integrated gradients for explaining language models. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  10285–10299, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.805. URL https://aclanthology.org/2021.emnlp-main.805.
  317. Inseq: An interpretability toolkit for sequence generation models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pp.  421–435, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-demo.40. URL https://aclanthology.org/2023.acl-demo.40.
  318. Quantifying the plausibility of context reliance in neural machine translation. In The Twelfth International Conference on Learning Representations (ICLR 2024), Vienna, Austria, May 2024. OpenReview. URL https://openreview.net/forum?id=XTHfNGI3zT.
  319. A multimodal automated interpretability agent. Arxiv, 2024. URL https://arxiv.org/abs/2404.14394.
  320. L. S. Shapley. A value for n-person games. In H. W. Kuhn and A. W. Tucker (eds.), Contributions to the Theory of Games II, pp.  307–317. Princeton University Press, Princeton, 1953.
  321. Taking features out of superposition with sparse autoencoders. AI Alignment Forum, 2022. URL https://www.alignmentforum.org/posts/z6QQJbtpkEAX3Aojj/interim-research-report-taking-features-out-of-superposition.
  322. Locating and editing factual associations in mamba, 2024a.
  323. The truth is in there: Improving reasoning with layer-selective rank reduction. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=ozX92bu8VA.
  324. N. Shazeer. Glu variants improve transformer. ArXiv, 2020.
  325. Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pp.  3145–3153. JMLR.org, 2017.
  326. Computationally efficient measures of internal neuron importance. ArXiv, abs/1807.09946, 2018. URL https://api.semanticscholar.org/CorpusID:50787065.
  327. The probabilities also matter: A more faithful metric for faithfulness of free-text explanations in large language models, 2024.
  328. Deep inside convolutional networks: Visualising image classification models and saliency maps. In The Second International Conference on Learning Representations, 2014. URL http://arxiv.org/abs/1312.6034.
  329. What needs to go right for an induction head? a mechanistic study of in-context learning circuits and their formation, 2024a.
  330. Rethinking interpretability in the era of large language models. ArXiv, abs/2402.01761, 2024b. URL https://api.semanticscholar.org/CorpusID:267412530.
  331. Mimic: Minimally modified counterfactuals in the representation space. Arxiv, 2024c. URL https://arxiv.org/abs/2402.09631.
  332. When explanations lie: Why many modified BP attributions fail. In H. D. III and A. Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.  9046–9057. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/sixt20a.html.
  333. Smoothgrad: removing noise by adding noise, 2017.
  334. P. Smolensky. Neural and conceptual interpretation of PDP models, pp.  390–431. MIT Press, Cambridge, MA, USA, 1986. ISBN 0262631105.
  335. Localizing paragraph memorization in language models, 2024. URL https://arxiv.org/abs/2403.19851.
  336. A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. In H. Bouamor, J. Pino, and K. Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  7035–7052, Singapore, December 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.435. URL https://aclanthology.org/2023.emnlp-main.435.
  337. Understanding arithmetic reasoning in language models using causal mediation analysis. Arxiv, 2023b. URL https://arxiv.org/abs/2305.15054.
  338. Finding experts in transformer models, 2020.
  339. Self-conditioning pre-trained language models. International Conference on Machine Learning, 2022. URL https://proceedings.mlr.press/v162/cuadros22a/cuadros22a.pdf.
  340. Massive activations in large language models, 2024. URL https://arxiv.org/abs/2402.17762.
  341. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp.  3319–3328. JMLR.org, 2017.
  342. Attribution patching outperforms automated circuit discovery. Arxiv, 2023. URL https://arxiv.org/abs/2310.10348.
  343. B2T connection: Serving stability and performance in deep transformers. In A. Rogers, J. Boyd-Graber, and N. Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL 2023, pp.  3078–3095, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.192. URL https://aclanthology.org/2023.findings-acl.192.
  344. Language-specific neurons: The key to multilingual capabilities in large language models, 2024. URL https://arxiv.org/abs/2402.16438.
  345. Transformers as support vector machines, 2024.
  346. Circuits updates - february 2024. update on dictionary learning improvements. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/feb-update/index.html.
  347. BERT rediscovers the classical NLP pipeline. In A. Korhonen, D. Traum, and L. Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  4593–4601, Florence, Italy, July 2019a. Association for Computational Linguistics. doi: 10.18653/v1/P19-1452. URL https://aclanthology.org/P19-1452.
  348. What do you learn from context? probing for sentence structure in contextualized word representations. In International Conference on Learning Representations, 2019b. URL https://openreview.net/forum?id=SJzSgnRcKX.
  349. The language interpretability tool: Extensible, interactive visualizations and analysis for NLP models. In Q. Liu and D. Schlangen (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp.  107–118, Online, October 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.15. URL https://aclanthology.org/2020.emnlp-demos.15.
  350. Interactive prompt debugging with sequence salience. Arxiv, 2024. URL https://arxiv.org/abs/2404.07498.
  351. Scan and snap: Understanding training dynamics and token composition in 1-layer transformer, 2023.
  352. JoMA: Demystifying multilayer transformers via joint dynamics of MLP and attention. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=LbJqRGNYCf.
  353. R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996. doi: https://doi.org/10.1111/j.2517-6161.1996.tb02080.x. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1996.tb02080.x.
  354. Linear representations of sentiment in large language models. Arxiv, 2023. URL https://arxiv.org/abs/2310.15154.
  355. W. Timkey and M. van Schijndel. All bark and no bite: Rogue dimensions in transformer language models obscure representational quality. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  4527–4546, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.372. URL https://aclanthology.org/2021.emnlp-main.372.
  356. LLMs represent contextual tasks as compact function vectors. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=AwyxtyMwaG.
  357. Llama 2: Open foundation and fine-tuned chat models. Arxiv, 2023. URL https://arxiv.org/abs/2307.09288.
  358. Lm transparency tool: Interactive tool for analyzing transformer language models. Arxiv, 2024. URL https://arxiv.org/abs/2404.07004.
  359. Activation addition: Steering language models without optimization, 2023.
  360. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. ArXiv, abs/2305.04388, 2023. URL https://api.semanticscholar.org/CorpusID:258556812.
  361. A. Variengien. Some common confusion about induction heads. LessWrong, 2023. URL https://www.lesswrong.com/posts/nJqftacoQGKurJ6fv/some-common-confusion-about-induction-heads.
  362. A. Variengien and E. Winsor. Look before you leap: A universal emergent decomposition of retrieval tasks in language models, 2023. URL https://arxiv.org/abs/2312.10091.
  363. Explaining grokking through circuit efficiency. ArXiv, abs/2309.02390, 2023. URL https://api.semanticscholar.org/CorpusID:261557247.
  364. A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation, 2023. URL https://arxiv.org/abs/2307.03987.
  365. Explanations can reduce overreliance on ai systems during decision-making. Proc. ACM Hum.-Comput. Interact., 7(CSCW1), apr 2023. doi: 10.1145/3579605. URL https://doi.org/10.1145/3579605.
  366. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
  367. Residual networks behave like ensembles of relatively shallow networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pp.  550–558, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819.
  368. J. Vig. A multiscale visualization of attention in the transformer model. In M. R. Costa-jussà and E. Alfonseca (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp.  37–42, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-3007. URL https://aclanthology.org/P19-3007.
  369. Investigating gender bias in language models using causal mediation analysis. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  12388–12401. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/hash/92650b2e92217715fe312e6fa7b90d82-Abstract.html.
  370. E. Voita and I. Titov. Information-theoretic probing with minimum description length. In B. Webber, T. Cohn, Y. He, and Y. Liu (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  183–196, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.14. URL https://aclanthology.org/2020.emnlp-main.14.
  371. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. In K. Inui, J. Jiang, V. Ng, and X. Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  4396–4406, Hong Kong, China, November 2019a. Association for Computational Linguistics. doi: 10.18653/v1/D19-1448. URL https://aclanthology.org/D19-1448.
  372. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In A. Korhonen, D. Traum, and L. Màrquez (eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.  5797–5808, Florence, Italy, July 2019b. Association for Computational Linguistics. doi: 10.18653/v1/P19-1580. URL https://aclanthology.org/P19-1580.
  373. Analyzing the source and target contributions to predictions in neural machine translation. In C. Zong, F. Xia, W. Li, and R. Navigli (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  1126–1140, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.91. URL https://aclanthology.org/2021.acl-long.91.
  374. Neurons in large language models: Dead, n-gram, positional. Arxiv, 2023. URL https://arxiv.org/abs/2309.04827.
  375. Transformers learn in-context by gradient descent. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp.  35151–35174. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/von-oswald23a.html.
  376. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=NpsVSN6o4ul.
  377. Llmcheckup: Conversational examination of large language models via interpretability tools, 2024.
  378. Knowledge editing for large language models: A survey. ArXiv, abs/2310.16218, 2023b. URL https://api.semanticscholar.org/CorpusID:264487359.
  379. Finding skill neurons in pre-trained transformer-based language models. In Y. Goldberg, Z. Kozareva, and Y. Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  11132–11152, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.765. URL https://aclanthology.org/2022.emnlp-main.765.
  380. On the safety of interpretable machine learning: A maximum deviation approach. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems, volume 35, pp.  9866–9880. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/402e12102d6ec3ea3df40ce1b23d423a-Paper-Conference.pdf.
  381. Thinking like transformers. In M. Meila and T. Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp.  11080–11090. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/weiss21a.html.
  382. Transformers are uninterpretable with myopic methods: a case study with bounded dyck grammars. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp.  38723–38766. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/79ba1b827d3fc58e129d1cbfc8ff69f2-Paper-Conference.pdf.
  383. Gradient-based language model red teaming, 2024.
  384. Transformers: State-of-the-art natural language processing. In Q. Liu and D. Schlangen (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp.  38–45, Online, October 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6. URL https://aclanthology.org/2020.emnlp-demos.6.
  385. B. Wright and L. Sharkey. Addressing feature suppression in saes. AI ALIGNMENT FORUM, 2024. URL https://www.alignmentforum.org/posts/3JuSjTZyMzaSeTxKk/addressing-feature-suppression-in-saes.
  386. Retrieval head mechanistically explains long-context factuality. Arxiv, 2024a. URL https://arxiv.org/abs/2404.15574.
  387. Causal proxy models for concept-based model explanations. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023a.
  388. Interpretability at scale: Identifying causal mechanisms in alpaca. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp.  78205–78226. Curran Associates, Inc., 2023b. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/f6a8b109d4d4fd64c75e94aaf85d9697-Paper-Conference.pdf.
  389. Reft: Representation finetuning for language models, 2024b.
  390. pyvene: A library for understanding and improving pytorch models via interventions, 2024c.
  391. A reply to makelov et al. (2023)’s "interpretability illusion" arguments, 2024d.
  392. Efficient streaming language models with attention sinks. Arxiv, 2023. URL https://arxiv.org/abs/2309.17453.
  393. An explanation of in-context learning as implicit bayesian inference. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=RdJVFCHjUMI.
  394. On layer normalization in the transformer architecture. In Proceedings of the 37th International Conference on Machine Learning. JMLR.org, 2020.
  395. Local interpretation of transformer based on linear decomposition. In A. Rogers, J. Boyd-Graber, and N. Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  10270–10287, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.572. URL https://aclanthology.org/2023.acl-long.572.
  396. Editing large language models: Problems, methods, and opportunities. In H. Bouamor, J. Pino, and K. Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  10222–10240, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.632. URL https://aclanthology.org/2023.emnlp-main.632.
  397. K. Yin and G. Neubig. Interpreting language models with contrastive explanations. In Y. Goldberg, Z. Kozareva, and Y. Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.  184–198, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.14. URL https://aclanthology.org/2022.emnlp-main.14.
  398. Characterizing mechanisms for factual recall in language models. In H. Bouamor, J. Pino, and K. Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.  9924–9959, Singapore, December 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.615. URL https://aclanthology.org/2023.emnlp-main.615.
  399. White-box transformers via sparse rate reduction. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b. URL https://openreview.net/forum?id=THfl8hdVxH.
  400. Z. Yu and S. Ananiadou. Locating factual knowledge in large language models: Exploring the residual stream and analyzing subvalues in vocabulary space, 2024. URL https://arxiv.org/abs/2312.12141.
  401. Attention satisfies: A constraint-satisfaction lens on factual errors of language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=gfFVATffPd.
  402. M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (eds.), Computer Vision – ECCV 2014, pp.  818–833, Cham, 2014. Springer International Publishing. ISBN 978-3-319-10590-1.
  403. B. Zhang and R. Sennrich. Root mean square layer normalization. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2019. Curran Associates Inc.
  404. F. Zhang and N. Nanda. Towards best practices of activation patching in language models: Metrics and methods. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Hf17y6u9BC.
  405. Opt: Open pre-trained transformer language models, 2022.
  406. Z. Zhao and B. Shan. Reagent: A model-agnostic feature attribution method for generative language models, 2024.
  407. The clock and the pizza: Two stories in mechanistic explanation of neural networks. In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp.  27223–27250. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/56cbfbf49937a0873d451343ddc8c57d-Paper-Conference.pdf.
  408. Object detectors emerge in deep scene cnns. In International Conference on Learning Representations (ICLR), 2015.
  409. What algorithms can transformers learn? a study in length generalization. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=AssIuHnmHX.
  410. Representation engineering: A top-down approach to ai transparency. Arxiv, 2023. URL https://arxiv.org/abs/2310.01405.
Authors (4)
  1. Javier Ferrando (15 papers)
  2. Gabriele Sarti (21 papers)
  3. Arianna Bisazza (43 papers)
  4. Marta R. Costa-jussà (73 papers)
Citations (31)

Summary

Exploring Transformer LLMs: A Comprehensive Look at Their Inner Workings

Understanding the Transformer Components and Their Roles

Transformers have become the backbone of modern NLP applications, driving advances across a wide range of tasks. Their architecture is built around self-attention, which lets the model weigh the relevance of every other position in the input when computing each token's representation. The architecture breaks down into several key components:

  • Embedding layer: Maps tokens (words or subwords) to high-dimensional vectors. This is where the input tokens first become representations that the model can work with.
  • Attention mechanism: Determines how much focus, or 'attention', the model gives to each part of the input when computing a token's updated representation. It is structured around queries, keys, and values: the query of the current token is matched against the keys of the preceding tokens, and the resulting weights form a weighted sum of the corresponding values (see the sketch after this list).
  • Feedforward neural networks (FFN): After attention has been applied, the Transformer applies a position-wise FFN to each position separately and identically. Because it acts on every position independently, the FFN cannot move information between tokens; instead it adds non-linear transformation capacity and is often interpreted as a key-value memory that stores features learned during training.
  • Normalization and residual connections: These are interleaved with the attention and FFN blocks to stabilize training and to enable deeper networks by mitigating the vanishing gradient problem.
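To make the query-key-value machinery concrete, here is a minimal sketch of single-head scaled dot-product attention with a causal mask, as used in decoder-only models. It uses NumPy, and all names and shapes are our own illustrative choices rather than anything prescribed by the paper; a real head would first project the residual stream with learned matrices W_Q, W_K, W_V, which are omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Q, K, V: (seq_len, d_head) arrays for a single attention head."""
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)               # pairwise query-key similarities
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)         # block attention to future positions
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ V                               # weighted sum of value vectors

# Toy usage: 4 tokens, head dimension 8; self-attention uses the same input for Q, K, V.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(causal_attention(x, x, x).shape)  # (4, 8)
```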

Decoding the Mechanisms Within: Attention Details

Attention heads within a Transformer can exhibit a variety of behaviors and specialize in particular operations, such as focusing on specific parts of a sentence, tracking syntactic relations, or managing sequence positions. For example:

  • Positional heads: Attend chiefly according to where tokens sit in the sequence (for example, to the immediately preceding token), which is crucial for tracking sentence structure.
  • Syntactic heads: Attend along grammatical relations in the input, helping the model capture the syntactic structure of a sentence.
  • Induction heads: Complete repeated patterns in the context: having seen a sequence like '... A B ... A', they attend back to the token that followed the earlier occurrence of 'A' and promote copying it (see the sketch after this list).
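As a concrete illustration of the induction pattern, the sketch below computes a simple prefix-matching score from a matrix of attention weights: how much each repeated token attends to the position right after its previous occurrence. The function and the toy attention matrix are our own illustrative constructions, not code from the paper; a real analysis would use weights extracted from a trained head.

```python
import numpy as np

def prefix_matching_score(tokens, attn):
    """Average attention from each repeated token back to the position
    immediately after that token's previous occurrence.

    tokens: list of token ids; attn: (seq_len, seq_len) attention weights."""
    scores = []
    for q in range(1, len(tokens)):
        prev = [k for k in range(q) if tokens[k] == tokens[q]]
        if prev:  # the query token has occurred before
            scores.append(attn[q, prev[-1] + 1])
    return float(np.mean(scores)) if scores else 0.0

# Toy example following the pattern "... A B ... A", with A = 5 and B = 2.
tokens = [5, 2, 9, 5]
attn = np.eye(4)                       # placeholder weights for positions 0-2
attn[3] = [0.0, 0.9, 0.05, 0.05]       # an induction head puts mass on "B" (position 1)
print(prefix_matching_score(tokens, attn))  # 0.9
```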

Understanding how these different heads operate can provide insights into how Transformers manage to extract meaning from text and make accurate predictions or generate coherent text sequences.

Behaviour Localization in Transformers

One of the key areas of Transformer interpretability is localizing which parts of the model are responsible for a specific output. This helps not only with debugging models but also with auditing them for unfair or biased behavior. Techniques for input and model-component attribution are central here. They help us determine:

  • Which input parts influence model decisions: For example, knowing which words in a sentence led to a particular sentiment classification (a gradient-based sketch of this idea follows the list).
  • How different components like attention heads contribute to decisions: This can include understanding whether certain heads are focusing more on syntactic structures or positional information.
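A common input-attribution baseline is gradient-times-input: take the gradient of an output score with respect to the input embeddings and multiply it elementwise by those embeddings. The PyTorch sketch below applies it to a deliberately tiny stand-in model; the embedding size, the mean-pooled linear head, and all names are illustrative assumptions, not the paper's setup.

```python
import torch

torch.manual_seed(0)
vocab_size, d_model = 10, 16
emb = torch.nn.Embedding(vocab_size, d_model)
head = torch.nn.Linear(d_model, 2)              # tiny stand-in for a model's output head

tokens = torch.tensor([3, 7, 1, 4])
x = emb(tokens).detach().requires_grad_(True)   # attribute w.r.t. the input embeddings
score = head(x.mean(dim=0))[1]                  # scalar score for the class of interest
score.backward()

# Gradient x input: per-token relevance of each embedding to the chosen score.
saliency = (x.grad * x).sum(dim=-1)
print(saliency)                                 # one signed score per input token
```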

Sifting Through Layers: Information Decoding

Decoding what each layer and component of a Transformer is doing can be likened to assembling a complex puzzle. Each layer may encode different kinds of information, from syntactic features to contextual semantics. Techniques such as probing classifiers and direct analysis of internal activations provide a window into these operations.

For instance, probing can tell us whether a particular layer holds more syntactic information than others, which may influence how subsequent layers process semantic content (see the sketch below).
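A probe is typically a small supervised classifier trained on frozen activations; if it decodes a property well, that property is at least linearly present in the representation. Here is a minimal sketch with scikit-learn, where random vectors and a synthetic label stand in for real layer activations and a real linguistic annotation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_examples, d_model = 200, 32
hidden = rng.normal(size=(n_examples, d_model))  # placeholder for frozen layer activations
labels = (hidden[:, 0] > 0).astype(int)          # synthetic stand-in for a linguistic property

# Train a linear probe on one split and evaluate it on a held-out split.
probe = LogisticRegression(max_iter=1000).fit(hidden[:150], labels[:150])
print("probe accuracy:", probe.score(hidden[150:], labels[150:]))
```

High held-out accuracy alone does not prove the model itself uses the property; control tasks or information-theoretic variants of probing are commonly used to guard against the probe simply memorizing.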

Future Pathways in AI Transparency

Looking ahead, the journey toward fully understanding and interpreting Transformer-based models is far from complete. Continued effort is needed to develop more sophisticated tools that yield deeper insight into these complex models. Moreover, as these models become more deeply integrated into societal applications, their interpretability will be key to maintaining trust in, and the reliability of, AI systems.

In conclusion, while Transformers are a powerful tool in the AI arsenal, unlocking their full potential safely and ethically requires continuous, rigorous exploration of their inner workings. Understanding these details not only helps improve model performance but also helps ensure that AI advances remain equitable and comprehensible to all.
