
Trained Transformers Learn Linear Models In-Context

(arXiv:2306.09927)
Published Jun 16, 2023 in stat.ML, cs.AI, cs.CL, and cs.LG

Abstract

Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL): given a short prompt sequence of tokens from an unseen task, they can formulate relevant per-token and next-token predictions without any parameter updates. Embedding a sequence of labeled training data and unlabeled test data as a prompt thus allows a transformer to behave like a supervised learning algorithm. Indeed, recent work has shown that when training transformer architectures over random instances of linear regression problems, these models' predictions mimic those of ordinary least squares. Towards understanding the mechanisms underlying this phenomenon, we investigate the dynamics of ICL in transformers with a single linear self-attention layer trained by gradient flow on linear regression tasks. We show that despite non-convexity, gradient flow with a suitable random initialization finds a global minimum of the objective function. At this global minimum, when given a test prompt of labeled examples from a new prediction task, the transformer achieves prediction error competitive with the best linear predictor over the test prompt distribution. We additionally characterize the robustness of the trained transformer to a variety of distribution shifts and show that although a number of shifts are tolerated, shifts in the covariate distribution of the prompts are not. Motivated by this, we consider a generalized ICL setting where the covariate distributions can vary across prompts. We show that although gradient flow succeeds at finding a global minimum in this setting, the trained transformer is still brittle under mild covariate shifts. We complement this finding with experiments on large, nonlinear transformer architectures which we show are more robust under covariate shifts.
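
To make the setup in the abstract concrete, here is a minimal sketch in Python/NumPy of in-context linear regression prompts, an ordinary least squares baseline, and a single linear self-attention (LSA) prediction head. The parameterization of the LSA head by one d x d matrix `W` and the choice `W = I` for isotropic Gaussian covariates are simplifying assumptions made for illustration; they are not the paper's exact construction or trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20  # covariate dimension, number of labeled examples per prompt

def sample_prompt():
    """Sample one ICL prompt: n labeled pairs plus an unlabeled query point."""
    w = rng.standard_normal(d)        # task vector, drawn fresh for every prompt
    X = rng.standard_normal((n, d))   # labeled covariates x_1, ..., x_n
    y = X @ w                         # noiseless labels y_i = <w, x_i>
    x_query = rng.standard_normal(d)  # test covariate whose label is predicted
    return X, y, x_query, w

def ols_prediction(X, y, x_query):
    """Ordinary least squares fit on the prompt, evaluated at the query."""
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return x_query @ w_hat

def lsa_prediction(X, y, x_query, W):
    """One linear self-attention head, reduced here (an assumption) to a single
    d x d matrix W. Its prediction is x_query^T W (sum_i x_i y_i) / n, i.e.
    attention without the softmax; when W roughly equals the inverse covariate
    covariance, this approximates the best linear predictor on long prompts."""
    return x_query @ W @ (X.T @ y) / n

# For isotropic Gaussian covariates the inverse covariance is the identity,
# so W = I is a natural stand-in for the trained head (illustrative only).
W_star = np.eye(d)

X, y, x_query, w = sample_prompt()
print("true label     :", x_query @ w)
print("OLS prediction :", ols_prediction(X, y, x_query))
print("LSA prediction :", lsa_prediction(X, y, x_query, W_star))
```

With longer prompts (larger n), the LSA and OLS predictions nearly coincide, mirroring the claim that the trained attention layer is competitive with the best linear predictor over the prompt distribution; if the test covariates are drawn from a shifted distribution, a fixed `W` is no longer matched to their covariance, which is the kind of brittleness under covariate shift the abstract describes.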

