
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals (2402.11655v2)

Published 18 Feb 2024 in cs.CL

Abstract: Interpretability research aims to bridge the gap between empirical success and our scientific understanding of the inner workings of LLMs. However, most existing research focuses on analyzing a single mechanism, such as how models copy or recall factual knowledge. In this work, we propose a formulation of competition of mechanisms, which focuses on the interplay of multiple mechanisms instead of individual mechanisms and traces how one of them becomes dominant in the final prediction. We uncover how and where mechanisms compete within LLMs using two interpretability methods: logit inspection and attention modification. Our findings show traces of the mechanisms and their competition across various model components and reveal attention positions that effectively control the strength of certain mechanisms. Code: https://github.com/francescortu/comp-mech. Data: https://huggingface.co/datasets/francescortu/comp-mech.

Authors (6)
  1. Francesco Ortu (4 papers)
  2. Zhijing Jin (68 papers)
  3. Diego Doimo (11 papers)
  4. Mrinmaya Sachan (124 papers)
  5. Alberto Cazzaniga (12 papers)
  6. Bernhard Schölkopf (412 papers)
Citations (12)

Summary

  • The paper introduces the 'competition of mechanisms' framework to trace how LLMs prioritize factual and counterfactual information.
  • It employs logit inspection and attention modification on models like GPT-2 and Pythia-6.9B to uncover layer-wise roles and specialized attention heads.
  • Findings reveal that attention blocks mainly promote the counterfactual prediction, and that larger models increasingly favor factual recall as the semantic similarity between the factual and counterfactual attributes grows.

Competition of Mechanisms: Tracing How LLMs Handle Facts and Counterfactuals

The paper "Competition of Mechanisms: Tracing How LLMs Handle Facts and Counterfactuals" investigates the interaction of multiple mechanisms within LLMs to understand which mechanisms predominate when processing factual and counterfactual information. This work contributes to the field of interpretability research, departing from existing studies which generally focus on individual mechanisms such as knowledge recall or token copying within LLMs. Instead, it introduces a new framework called the "competition of mechanisms," to explore how various underlying mechanisms interact and lead to a model’s ultimate decision.

Methodology

The paper employs two primary interpretability methods: logit inspection and attention modification. Logit inspection projects the internal state of the residual stream onto the vocabulary space via the unembedding matrix, allowing token-specific logits to be examined at each layer of the model. Attention modification deliberately alters selected attention weights (for instance, scaling the attention a head pays to a given position) to test how they causally influence model behavior.
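As a rough illustration of logit inspection, the sketch below uses the TransformerLens library to project GPT-2's residual stream at each layer onto the vocabulary and compare the logits of a factual and a counterfactual candidate token at the last position. The prompt and the candidate tokens (" Apple" vs. " Google") are illustrative assumptions, not the paper's exact data or pipeline.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# Illustrative prompt in the spirit of a fact/counterfact conflict (not the paper's dataset).
prompt = "Redefine: the iPhone was developed by Google. The iPhone was developed by"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

fact_id = model.to_single_token(" Apple")    # factual attribute (assumed example)
cfact_id = model.to_single_token(" Google")  # counterfactual attribute (assumed example)

for layer in range(model.cfg.n_layers):
    # Residual stream after this layer, at the last (prediction) position.
    resid = cache["resid_post", layer][0, -1]
    # Project onto the vocabulary via the final layer norm and unembedding matrix.
    logits = model.ln_final(resid) @ model.W_U
    print(f"layer {layer:2d}  "
          f"factual={logits[fact_id].item():.2f}  "
          f"counterfactual={logits[cfact_id].item():.2f}")
```

Reading off where the counterfactual logit overtakes the factual one gives a layer-by-layer trace of the competition at a single position.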

These methods are applied to autoregressive LLMs, specifically GPT-2 and Pythia-6.9B, using a dataset in which factual attributes conflict with counterfactual statements. By analyzing the logits and attention patterns, the authors identify where and how factual or counterfactual knowledge becomes dominant, tracing the contributions of different model components such as attention blocks and MLP layers.
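A hedged sketch of such component-level tracing is shown below: it decomposes the factual-versus-counterfactual logit difference into per-layer direct contributions from attention blocks and MLPs at the last position. The prompt and candidate tokens are the same illustrative assumptions as above, and the decomposition ignores the final layer norm for simplicity; it is not claimed to match the paper's exact attribution procedure.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "Redefine: the iPhone was developed by Google. The iPhone was developed by"
_, cache = model.run_with_cache(model.to_tokens(prompt))

fact_id = model.to_single_token(" Apple")    # factual attribute (assumed example)
cfact_id = model.to_single_token(" Google")  # counterfactual attribute (assumed example)
# Direction in residual space whose dot product gives the factual-minus-counterfactual
# logit difference (final layer norm scaling ignored for simplicity).
direction = model.W_U[:, fact_id] - model.W_U[:, cfact_id]

for layer in range(model.cfg.n_layers):
    attn_out = cache["attn_out", layer][0, -1]  # attention block output, last position
    mlp_out = cache["mlp_out", layer][0, -1]    # MLP output, last position
    print(f"layer {layer:2d}  "
          f"attn_delta={(attn_out @ direction).item():+.2f}  "
          f"mlp_delta={(mlp_out @ direction).item():+.2f}")
```

Negative values indicate a component pushing toward the counterfactual token, positive values toward the factual one.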

Key Findings

  1. Layer-Wise Mechanism Dynamics: The paper finds that, in the initial layers of GPT-2, factual knowledge is predominantly encoded in the subject position, while counterfactual information is primarily stored in the attribute position. As information propagates through the model, the attention blocks play a significant role in transferring this information to the final sequence position where it influences the prediction.
  2. Component Contributions: Attention blocks contribute substantially to promoting counterfactual predictions, while MLPs contribute to a lesser extent. Only in the final layer does the attention block slightly favor the factual token; overall, the attention heads read more from the attribute position when shaping the model's output.
  3. Role of Specific Attention Heads: Certain attention heads are highly specialized, promoting either the factual or the counterfactual token, and show a pronounced attention pattern toward the attribute position. Enhancing the attention scores of these specialized heads increased the models' rate of predicting the factual token (a hedged sketch of such an intervention follows this list).
  4. Impact of Semantic Similarity: The paper observes that the competition between mechanisms intensifies with increased semantic similarity between the factual and counterfactual attributes. Larger models exhibit stronger reliance on factual recall in such scenarios, suggesting an enhanced capacity to store and retrieve factual information as model size grows.
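The attention-modification experiments can be sketched along the following lines: intervene on the attention pattern of a single head so that the last position attends more (or less) strongly to the attribute position, then observe how the next-token prediction changes. The layer, head index, and scaling factor below are hypothetical placeholders, not the values reported in the paper, and the hook-based setup is only one way to implement such an intervention.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = "Redefine: the iPhone was developed by Google. The iPhone was developed by"
tokens = model.to_tokens(prompt)

# Hypothetical choices for illustration only: which head to modify and by how much.
LAYER, HEAD, ALPHA = 9, 6, 5.0
# Locate the counterfactual attribute (" Google") in the tokenized prompt.
attr_pos = (tokens[0] == model.to_single_token(" Google")).nonzero()[0].item()

def scale_attribute_attention(pattern, hook):
    # pattern: [batch, n_heads, query_pos, key_pos].
    # Scale the attention the last (prediction) position pays to the attribute
    # position for one head, then renormalize so the row still sums to 1.
    pattern[:, HEAD, -1, attr_pos] *= ALPHA
    pattern[:, HEAD, -1, :] /= pattern[:, HEAD, -1, :].sum(dim=-1, keepdim=True)
    return pattern

logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(f"blocks.{LAYER}.attn.hook_pattern", scale_attribute_attention)],
)
# Next-token prediction after the intervention.
print(model.to_string(logits[0, -1].argmax().item()))
```

Sweeping the scaling factor (or zeroing the weight) for candidate heads is one way to probe which heads control the balance between the factual and counterfactual mechanisms.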

Implications and Future Work

This work underscores the nuanced interactions within LLMs when presented with competing mechanisms, emphasizing the importance of understanding such dynamics to improve both interpretability and reliability of LLMs. The findings have practical implications for enhancing model accuracy, particularly in scenarios where factual correctness is essential.

Future developments could build on this framework to explore larger and more complex models, extending the analyses to a wider variety of datasets and linguistic structures. Understanding the variability in prompt structures and exploring additional mechanisms could lead to more sophisticated tuning of attention mechanisms, potentially enabling better control over factual and counterfactual predictions in LLM applications.

In summary, this paper advances our comprehension of the internal mechanics of LLMs by presenting a novel approach to interpret how these models prioritize between factual recall and counterfactual adaptation. The insights gained from this research provide a solid foundation for improving current models and addressing challenges associated with model interpretability and reliability.
