Trainable Transformer in Transformer (2307.01189v2)
Abstract: Recent works attribute the capability of in-context learning (ICL) in large pre-trained language models to implicitly simulating and fine-tuning an internal model (e.g., linear or 2-layer MLP) during inference. However, such constructions require large memory overhead, which makes simulation of more sophisticated internal models intractable. In this work, we propose an efficient construction, Transformer in Transformer (in short, TinT), that allows a transformer to simulate and fine-tune complex models internally during inference (e.g., pre-trained language models). In particular, we introduce innovative approximation techniques that allow a TinT model with fewer than 2 billion parameters to simulate and fine-tune a 125 million parameter transformer model within a single forward pass. TinT accommodates many common transformer variants, and its design ideas also improve the efficiency of past instantiations of simple models inside transformers. We conduct end-to-end experiments to validate the internal fine-tuning procedure of TinT on various language modeling and downstream tasks. For example, even with a limited one-step budget, we observe that TinT for an OPT-125M model improves performance by 4-16% absolute on average compared to OPT-125M. These findings suggest that large pre-trained language models are capable of performing intricate subroutines. To facilitate further work, a modular and extensible codebase for TinT is included.
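To make the abstract's core idea concrete, the sketch below illustrates what a single TinT forward pass conceptually accomplishes: run the internal model on the in-context demonstrations, take one gradient step on its parameters, and then predict on the query with the updated weights. This is not the paper's construction (TinT realizes these steps with attention and MLP modules of the outer transformer); here ordinary PyTorch code and a toy linear internal model stand in, and all names (`icl_forward`, `demo_x`, `query_x`, the learning rate) are illustrative assumptions.

```python
# Conceptual sketch only: the effect of one TinT forward pass, with plain
# Python standing in for the outer simulator and a toy linear internal model.
import torch
import torch.nn as nn
import torch.nn.functional as F

def icl_forward(inner_model: nn.Linear, demo_x, demo_y, query_x, lr=0.1):
    """One 'outer' forward pass: fine-tune the internal model on the
    in-context demonstrations, then predict on the query."""
    # 1. Simulated forward pass of the internal model on the demonstrations.
    pred = inner_model(demo_x)
    loss = F.mse_loss(pred, demo_y)

    # 2. Simulated backward pass: gradients of the internal model's loss.
    grads = torch.autograd.grad(loss, list(inner_model.parameters()))

    # 3. Simulated single gradient step (the "limited one-step budget").
    weight, bias = [p - lr * g for p, g in zip(inner_model.parameters(), grads)]

    # 4. Prediction with the updated internal model, all within one call.
    return query_x @ weight.t() + bias

# Toy usage: a 4-dimensional linear internal model and 8 demonstrations.
torch.manual_seed(0)
inner = nn.Linear(4, 1)
demo_x, demo_y = torch.randn(8, 4), torch.randn(8, 1)
query_x = torch.randn(1, 4)
print(icl_forward(inner, demo_x, demo_y, query_x))
```

In TinT itself, steps 1-4 are carried out by the outer transformer's own layers, which is what makes the memory-efficient approximation techniques described in the abstract necessary.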