Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models (2311.04131v6)
Abstract: While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret. Recent work has aimed to reverse engineer transformer models into human-readable representations called circuits that implement algorithmic functions. We extend this research by analyzing and comparing circuits for similar sequence continuation tasks, which include increasing sequences of Arabic numerals, number words, and months. By applying circuit interpretability analysis, we identify a key sub-circuit in both GPT-2 Small and Llama-2-7B responsible for detecting sequence members and for predicting the next member in a sequence. Our analysis reveals that semantically related sequences rely on shared circuit subgraphs with analogous roles. Additionally, we show that this sub-circuit affects performance on various math-related prompts, such as intervaled sequences, Spanish number word and month continuation, and natural language word problems. Overall, documenting shared computational structures enables better predictions of model behavior, identification of errors, and safer editing procedures. This mechanistic understanding of transformers is a critical step towards building more robust, aligned, and interpretable LLMs.
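The circuit analysis described in the abstract is typically carried out by ablating or patching individual attention heads and measuring the effect on the model's next-token prediction for a sequence continuation prompt. Below is a minimal sketch of this style of experiment, assuming the TransformerLens library; the layer and head indices are illustrative placeholders, not the specific heads identified in the paper.

```python
# Minimal sketch of a head-ablation experiment on a sequence continuation
# prompt, assuming the TransformerLens library. The layer/head indices are
# placeholders for illustration, not the heads reported in the paper.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

prompt = "1 2 3 4"   # increasing Arabic numeral sequence
answer = " 5"        # expected next member
tokens = model.to_tokens(prompt)
answer_id = model.to_single_token(answer)

# Clean run: logit assigned to the correct next member.
clean_logits = model(tokens)
clean_score = clean_logits[0, -1, answer_id].item()

# Ablated run: zero out the output of a single attention head.
LAYER, HEAD = 9, 1   # placeholder indices

def zero_head(z, hook):
    # z has shape [batch, position, head, d_head]
    z[:, :, HEAD, :] = 0.0
    return z

ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("z", LAYER), zero_head)],
)
ablated_score = ablated_logits[0, -1, answer_id].item()

print(f"correct-answer logit: clean={clean_score:.3f}, ablated={ablated_score:.3f}")
```

A large drop in the correct-answer logit when a particular head is removed is the kind of evidence used to assign that head a role in the sequence continuation sub-circuit.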
Authors: Michael Lan, Fazl Barez, Philip Torr