Emergent Mind

Published Jun 16, 2023 in stat.ML, cs.AI, cs.CL, and cs.LG

Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL): given a short prompt sequence of tokens from an unseen task, they can formulate relevant per-token and next-token predictions without any parameter updates. By embedding a sequence of labeled training data and unlabeled test data as a prompt, transformers can be made to behave like supervised learning algorithms. Indeed, recent work has shown that when transformer architectures are trained over random instances of linear regression problems, these models' predictions mimic those of ordinary least squares. Towards understanding the mechanisms underlying this phenomenon, we investigate the dynamics of ICL in transformers with a single linear self-attention layer trained by gradient flow on linear regression tasks. We show that despite non-convexity, gradient flow with a suitable random initialization finds a global minimum of the objective function. At this global minimum, when given a test prompt of labeled examples from a new prediction task, the transformer achieves prediction error competitive with the best linear predictor over the test prompt distribution. We additionally characterize the robustness of the trained transformer to a variety of distribution shifts and show that although a number of shifts are tolerated, shifts in the covariate distribution of the prompts are not. Motivated by this, we consider a generalized ICL setting where the covariate distributions can vary across prompts. We show that although gradient flow succeeds at finding a global minimum in this setting, the trained transformer is still brittle under mild covariate shifts. We complement this finding with experiments on large, nonlinear transformer architectures, which we show are more robust under covariate shifts.
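The in-context linear regression setup described above can be sketched in a few lines: each prompt consists of labeled pairs from a random linear task plus an unlabeled query, an OLS fit on the prompt gives the baseline predictor, and (as related work on linear self-attention suggests) the trained one-layer linear attention output on isotropic Gaussian covariates reduces to an averaged inner-product form resembling one gradient step. The closed-form attention prediction below is an illustrative assumption for the identity-covariance case, not the paper's exact trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 2000  # covariate dimension, number of labeled prompt examples

# Sample one random linear regression task: y_i = <w, x_i>
w = rng.normal(size=d)
X = rng.normal(size=(N, d))   # prompt covariates ~ N(0, I_d)
y = X @ w                     # noiseless labels
x_query = rng.normal(size=d)  # unlabeled test input appended to the prompt

# Baseline: ordinary least squares fit on the labeled prompt examples
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
pred_ols = x_query @ w_ols

# Hypothetical trained linear self-attention prediction for identity
# covariance: (1/N) * sum_i y_i <x_i, x_query>, i.e. one gradient step
# from zero initialization on the prompt's squared loss
pred_lsa = (y @ X) @ x_query / N

print(pred_ols, pred_lsa, x_query @ w)
```

With noiseless labels the OLS prediction matches the true label exactly, while the attention-style average approaches it as the prompt length N grows, since (1/N) X^T X concentrates around the identity.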

