How Well Can Transformers Emulate In-context Newton's Method? (2403.03183v1)
Abstract: Transformer-based models have demonstrated remarkable in-context learning capabilities, prompting extensive research into their underlying mechanisms. Recent studies have suggested that Transformers can implement first-order optimization algorithms for in-context learning, and even second-order ones in the case of linear regression. In this work, we study whether Transformers can perform higher-order optimization methods beyond the case of linear regression. We establish that linear attention Transformers with ReLU layers can approximate second-order optimization algorithms for the task of logistic regression and achieve $\epsilon$ error with a number of layers that grows only logarithmically in $1/\epsilon$. As a by-product, we demonstrate that even linear attention-only Transformers can implement a single step of Newton's iteration for matrix inversion with merely two layers. These results suggest that the Transformer architecture can implement complex algorithms beyond gradient descent.
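To make the abstract's objects concrete, the sketch below implements the two classical algorithms whose Transformer emulation the paper studies: the Newton (Schulz) iteration $X_{k+1} = X_k(2I - AX_k)$ for matrix inversion, and Newton's method for logistic regression in which the Hessian is inverted approximately by that same iteration. This is a minimal NumPy rendering of the textbook algorithms, not the paper's Transformer construction; the function names, the initialization $X_0 = A^\top / (\|A\|_1 \|A\|_\infty)$, the small ridge term, and the iteration counts are illustrative choices rather than details taken from the paper.

```python
import numpy as np


def newton_schulz_inverse(A, num_iters=15):
    """Approximate A^{-1} with the Schulz iteration X_{k+1} = X_k (2I - A X_k),
    which converges quadratically whenever ||I - A X_0|| < 1."""
    n = A.shape[0]
    # One standard initialization that guarantees convergence (illustrative choice).
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I = np.eye(n)
    for _ in range(num_iters):
        X = X @ (2 * I - A @ X)
    return X


def newton_logistic_regression(X, y, outer_iters=10, inner_iters=15, ridge=1e-4):
    """Newton's method for (ridge-stabilized) logistic regression, with the
    Hessian inverted approximately by the Schulz iteration above."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(outer_iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))         # sigmoid predictions
        grad = X.T @ (p - y) / n                 # gradient of the logistic loss
        H = X.T @ (X * (p * (1 - p))[:, None]) / n + ridge * np.eye(d)  # Hessian
        w = w - newton_schulz_inverse(H, inner_iters) @ grad            # Newton step
    return w


# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)
print(newton_logistic_regression(X, y))  # should land close to w_true
```

Note that the Schulz update consists only of matrix-matrix products and additions, operations that are natural candidates for linear attention layers, which is consistent with the abstract's claim that a single Newton step for matrix inversion fits in two attention-only layers.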
Authors: Angeliki Giannou, Liu Yang, Tianhao Wang, Dimitris Papailiopoulos, Jason D. Lee