Asymptotic theory of in-context learning by linear attention (2405.11751v1)

Published 20 May 2024 in stat.ML, cond-mat.dis-nn, and cs.LG

Abstract: Transformers have a remarkable ability to learn and execute tasks based on examples provided within the input itself, without explicit prior training. It has been argued that this capability, known as in-context learning (ICL), is a cornerstone of Transformers' success, yet questions about the necessary sample complexity, pretraining task diversity, and context length for successful ICL remain unresolved. Here, we provide a precise answer to these questions in an exactly solvable model of ICL of a linear regression task by linear attention. We derive sharp asymptotics for the learning curve in a phenomenologically-rich scaling regime where the token dimension is taken to infinity; the context length and pretraining task diversity scale proportionally with the token dimension; and the number of pretraining examples scales quadratically. We demonstrate a double-descent learning curve with increasing pretraining examples, and uncover a phase transition in the model's behavior between low and high task diversity regimes: In the low diversity regime, the model tends toward memorization of training tasks, whereas in the high diversity regime, it achieves genuine in-context learning and generalization beyond the scope of pretrained tasks. These theoretical insights are empirically validated through experiments with both linear attention and full nonlinear Transformer architectures.
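
The setup described in the abstract can be made concrete with a small numerical sketch. The Python example below is illustrative only and is not the paper's exact parameterization: it uses the common simplification in which a single linear-attention head acting on a regression prompt reduces to a bilinear readout y_hat = x_q^T Gamma h with context summary h = (1/N) sum_i y_i x_i, and it fits Gamma by ridge regression on pretraining sequences rather than by gradient descent. All sizes (d, N, K, n_train) are arbitrary small choices, far from the proportional asymptotic regime analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (the paper's asymptotics take the token dimension d to
# infinity, with context length and task diversity proportional to d and the
# number of pretraining sequences growing quadratically in d).
d = 16          # token dimension
N = 32          # context length (in-context examples per sequence)
K = 8           # pretraining task diversity (number of distinct weight vectors)
n_train = 4000  # number of pretraining sequences
n_test = 500

def sample_batch(n, tasks):
    """Draw n sequences: a context (X, y) plus a query, with y = <w, x>."""
    w = tasks[rng.integers(len(tasks), size=n)]       # (n, d) task vectors
    X = rng.standard_normal((n, N, d)) / np.sqrt(d)   # context inputs
    y = np.einsum('nkd,nd->nk', X, w)                 # context labels
    xq = rng.standard_normal((n, d)) / np.sqrt(d)     # query inputs
    yq = np.einsum('nd,nd->n', xq, w)                 # query labels
    return X, y, xq, yq

def features(X, y, xq):
    """Simplified linear attention: prediction is xq^T Gamma h, linear in Gamma."""
    h = np.einsum('nk,nkd->nd', y, X) / N             # (n, d) context summary
    return np.einsum('nd,ne->nde', xq, h).reshape(len(xq), -1)  # vec of outer products

# Pretraining tasks: K fixed directions; evaluation tasks: fresh Gaussian vectors.
train_tasks = rng.standard_normal((K, d))
test_tasks = rng.standard_normal((n_test, d))

# Fit Gamma by ridge regression of the query label on the bilinear features.
X, y, xq, yq = sample_batch(n_train, train_tasks)
Phi = features(X, y, xq)                              # (n_train, d*d)
lam = 1e-3
G = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d * d), Phi.T @ yq)

def icl_error(tasks):
    Xt, yt, xqt, yqt = sample_batch(n_test, tasks)
    pred = features(Xt, yt, xqt) @ G
    return np.mean((pred - yqt) ** 2)

print("Query error on pretraining tasks:", icl_error(train_tasks))
print("Query error on unseen tasks:     ", icl_error(test_tasks))
```

Rerunning this sketch with K much smaller or much larger than d should show, qualitatively, the gap between error on seen and unseen tasks closing as task diversity grows; the paper derives this memorization-to-ICL transition exactly in the proportional limit.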

Authors (5)
  1. Yue M. Lu (52 papers)
  2. Mary I. Letey (2 papers)
  3. Jacob A. Zavatone-Veth (19 papers)
  4. Anindita Maiti (8 papers)
  5. Cengiz Pehlevan (81 papers)
Citations (8)