On the Role of Depth and Looping for In-Context Learning with Task Diversity (2410.21698v1)

Published 29 Oct 2024 in cs.LG, math.ST, stat.ML, and stat.TH

Abstract: The intriguing in-context learning (ICL) abilities of deep Transformer models have lately garnered significant attention. By studying in-context linear regression on unimodal Gaussian data, recent empirical and theoretical works have argued that ICL emerges from Transformers' abilities to simulate learning algorithms like gradient descent. However, these works fail to capture the remarkable ability of Transformers to learn multiple tasks in context. To this end, we study in-context learning for linear regression with diverse tasks, characterized by data covariance matrices with condition numbers ranging over $[1, \kappa]$, and highlight the importance of depth in this setting. More specifically, (a) we show theoretical lower bounds of $\log(\kappa)$ (or $\sqrt{\kappa}$) linear attention layers in the unrestricted (or restricted) attention setting, and (b) we show that multilayer Transformers can indeed solve such tasks with a number of layers that matches the lower bounds. However, we show that this expressivity of multilayer Transformers comes at the price of robustness. In particular, multilayer Transformers are not robust even to distributional shifts as small as $O(e^{-L})$ in Wasserstein distance, where $L$ is the depth of the network. We then demonstrate that Looped Transformers -- a special class of multilayer Transformers with weight-sharing -- not only exhibit similar expressive power but are also provably robust under mild assumptions. Beyond out-of-distribution generalization, we also show that Looped Transformers are the only models that exhibit a monotonic behavior of loss with respect to depth.
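To make the abstract's setting concrete, here is a minimal numpy sketch, not the paper's construction: it samples one in-context linear regression task whose data covariance has condition number $\kappa$, and emulates a looped (weight-shared) model by reapplying a single shared gradient-descent step for a chosen number of loops. The function names (`sample_task`, `looped_gd`) and the use of plain gradient descent as a stand-in for a looped linear-attention layer are illustrative assumptions; the paper's actual layer constructions and its $\log(\kappa)$ / $\sqrt{\kappa}$ depth bounds are not reproduced here.

```python
# Minimal sketch (not the paper's construction): in-context linear regression
# with task diversity, where the data covariance has condition number kappa.
# A looped (weight-shared) model is emulated by reapplying one shared
# gradient-descent step; prior work cited in the abstract argues a linear
# attention layer can simulate such a step.
import numpy as np

rng = np.random.default_rng(0)

def sample_task(d, n, kappa):
    """Sample one in-context regression task with covariance condition number kappa."""
    eigs = np.linspace(1.0, kappa, d)                 # eigenvalues spread over [1, kappa]
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))      # random orthogonal basis
    cov_sqrt = Q @ np.diag(np.sqrt(eigs)) @ Q.T       # symmetric square root of Sigma
    X = rng.normal(size=(n, d)) @ cov_sqrt            # context inputs x_i ~ N(0, Sigma)
    w_star = rng.normal(size=d) / np.sqrt(d)          # task vector
    y = X @ w_star                                    # noiseless labels
    return X, y, w_star

def looped_gd(X, y, loops, lr):
    """Apply the same update `loops` times (stand-in for a looped, weight-shared layer)."""
    w = np.zeros(X.shape[1])
    for _ in range(loops):
        w = w - lr * X.T @ (X @ w - y) / len(y)       # one shared step, reused every loop
    return w

d, n, kappa = 8, 256, 20.0
X, y, w_star = sample_task(d, n, kappa)
lr = 1.0 / np.linalg.eigvalsh(X.T @ X / n)[-1]        # step size ~ 1 / largest eigenvalue
for L in (1, 10, 50, 200):
    w_hat = looped_gd(X, y, L, lr)
    print(f"loops={L:3d}  ||w_hat - w*|| = {np.linalg.norm(w_hat - w_star):.4f}")
```

In this plain-gradient-descent stand-in, the in-context error shrinks monotonically as the number of loops (depth) grows, in line with the monotonic loss behavior the abstract attributes to Looped Transformers; reaching the $\log(\kappa)$ (unrestricted) or $\sqrt{\kappa}$ (restricted) layer counts requires the more refined constructions studied in the paper.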

