In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization (2402.14951v1)

Published 22 Feb 2024 in stat.ML, cs.CL, and cs.LG

Abstract: We study the \emph{in-context learning} (ICL) ability of a \emph{Linear Transformer Block} (LTB) that combines a linear attention component and a linear multi-layer perceptron (MLP) component. For ICL of linear regression with a Gaussian prior and a \emph{non-zero mean}, we show that LTB can achieve nearly Bayes optimal ICL risk. In contrast, any estimator that uses only linear attention must incur an irreducible additive approximation error. Furthermore, we establish a correspondence between LTB and one-step gradient descent estimators with learnable initialization ($\mathsf{GD}\text{-}\mathbf{\beta}$), in the sense that every $\mathsf{GD}\text{-}\mathbf{\beta}$ estimator can be implemented by an LTB estimator and every optimal LTB estimator that minimizes the in-class ICL risk is effectively a $\mathsf{GD}\text{-}\mathbf{\beta}$ estimator. Finally, we show that $\mathsf{GD}\text{-}\mathbf{\beta}$ estimators can be efficiently optimized with gradient flow, despite a non-convex training objective. Our results reveal that LTB achieves ICL by implementing $\mathsf{GD}\text{-}\mathbf{\beta}$, and they highlight the role of MLP layers in reducing approximation error.

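To make the $\mathsf{GD}\text{-}\mathbf{\beta}$ estimator described in the abstract concrete, below is a minimal sketch of a one-step gradient descent predictor for in-context linear regression with a learnable initialization. The function name `gd_beta_predict`, the scalar step size, and the synthetic data are illustrative assumptions for this sketch; the paper's exact parameterization (for instance, a learnable preconditioning matrix rather than a scalar step) may differ.

```python
import numpy as np

def gd_beta_predict(X, y, x_query, beta_init, step_size):
    """Sketch of a one-step GD estimator with learnable initialization (GD-beta).

    X         : (n, d) in-context inputs
    y         : (n,)   in-context targets
    x_query   : (d,)   query input
    beta_init : (d,)   learnable initialization (the 'beta' in GD-beta)
    step_size : float  learnable step size (assumed scalar here; the paper
                       may instead learn a full preconditioner)
    """
    n = X.shape[0]
    # Gradient of the in-context least-squares loss at beta_init:
    #   L(beta) = 1/(2n) * sum_i (x_i^T beta - y_i)^2
    grad = X.T @ (X @ beta_init - y) / n
    # One gradient step from the learnable initialization.
    beta_one_step = beta_init - step_size * grad
    # Predict the query label with the updated weight vector.
    return x_query @ beta_one_step

# Example usage with synthetic data (illustrative values, not from the paper).
rng = np.random.default_rng(0)
d, n = 5, 20
beta_star = rng.normal(size=d) + 1.0          # task vector with non-zero mean
X = rng.normal(size=(n, d))
y = X @ beta_star + 0.1 * rng.normal(size=n)
x_query = rng.normal(size=d)
print(gd_beta_predict(X, y, x_query, beta_init=np.ones(d), step_size=0.1))
```

In this reading, `beta_init` and `step_size` play the role of the parameters that the paper shows can be trained efficiently with gradient flow, and the learnable initialization is what lets the estimator exploit the non-zero prior mean.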
Authors (3)
  1. Ruiqi Zhang (58 papers)
  2. Jingfeng Wu (34 papers)
  3. Peter L. Bartlett (86 papers)
Citations (11)