Backward Lens: Projecting Language Model Gradients into the Vocabulary Space (2402.12865v1)

Published 20 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Understanding how Transformer-based Language Models (LMs) learn and recall information is a key goal of the deep learning community. Recent interpretability methods project weights and hidden states obtained from the forward pass to the models' vocabularies, helping to uncover how information flows within LMs. In this work, we extend this methodology to LMs' backward pass and gradients. We first prove that a gradient matrix can be cast as a low-rank linear combination of its forward and backward passes' inputs. We then develop methods to project these gradients into vocabulary items and explore the mechanics of how new information is stored in the LMs' neurons.
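To make the low-rank claim concrete, here is a minimal PyTorch sketch (not the authors' released code; the toy layer dimensions and the random unembedding matrix E are illustrative stand-ins). For a linear layer y = Wx with scalar loss L, the weight gradient dL/dW is the outer product of the backward-pass input (dL/dy) and the forward-pass input x, so it is rank 1 for a single token; reading the backward-pass input through an unembedding matrix then gives a "backward lens"-style projection into vocabulary logits.

```python
# Minimal sketch (assumptions: toy linear layer, random unembedding E;
# not the paper's implementation). For y = W x with scalar loss L, the
# weight gradient dL/dW equals the outer product of the backward-pass
# input dL/dy and the forward-pass input x, hence rank 1 per token.
import torch

torch.manual_seed(0)
d_in, d_out, vocab = 8, 8, 16

W = torch.randn(d_out, d_in, requires_grad=True)
x = torch.randn(d_in)                      # forward-pass input to the layer
y = W @ x                                  # forward pass
target = torch.randn(d_out)
loss = ((y - target) ** 2).sum()
loss.backward()

g = (2 * (y - target)).detach()            # backward-pass input: dL/dy
# The autograd-computed gradient matches the outer product g x^T ...
assert torch.allclose(W.grad, torch.outer(g, x), atol=1e-5)
# ... and is therefore rank 1.
print("rank(dL/dW) =", torch.linalg.matrix_rank(W.grad).item())   # -> 1

# "Backward lens"-style projection: read the backward-pass input through
# a (hypothetical, random) unembedding matrix E to get vocabulary logits.
E = torch.randn(vocab, d_out)              # stand-in for the model's unembedding
grad_logits = E @ g
print("vocabulary items most promoted by this gradient:",
      grad_logits.topk(3).indices.tolist())
```

For a batch or sequence of tokens, the same argument makes the gradient a sum of per-token rank-1 terms, which is the low-rank linear combination of forward- and backward-pass inputs that the abstract describes.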

Authors (4)
  1. Shahar Katz (5 papers)
  2. Yonatan Belinkov (111 papers)
  3. Mor Geva (58 papers)
  4. Lior Wolf (217 papers)
Citations (4)
