
Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions (2310.03016v1)

Published 4 Oct 2023 in cs.LG and cs.CL

Abstract: In order to understand the in-context learning phenomenon, recent works have adopted a stylized experimental framework and demonstrated that Transformers can learn gradient-based learning algorithms for various classes of real-valued functions. However, the limitations of Transformers in implementing learning algorithms, and their ability to learn other forms of algorithms are not well understood. Additionally, the degree to which these capabilities are confined to attention-based models is unclear. Furthermore, it remains to be seen whether the insights derived from these stylized settings can be extrapolated to pretrained LLMs. In this work, we take a step towards answering these questions by demonstrating the following: (a) On a test-bed with a variety of Boolean function classes, we find that Transformers can nearly match the optimal learning algorithm for 'simpler' tasks, while their performance deteriorates on more 'complex' tasks. Additionally, we find that certain attention-free models perform (almost) identically to Transformers on a range of tasks. (b) When provided a teaching sequence, i.e. a set of examples that uniquely identifies a function in a class, we show that Transformers learn more sample-efficiently. Interestingly, our results show that Transformers can learn to implement two distinct algorithms to solve a single task, and can adaptively select the more sample-efficient algorithm depending on the sequence of in-context examples. (c) Lastly, we show that extant LLMs, e.g. LLaMA-2, GPT-4, can compete with nearest-neighbor baselines on prediction tasks that are guaranteed to not be in their training set.
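To make the abstract's stylized setup concrete, here is a minimal, hypothetical sketch (not the authors' code; all function names and parameters are my own) of in-context learning on one of the 'simpler' Boolean classes the paper considers, monotone conjunctions. It builds a prompt of (x, f(x)) pairs, evaluates the kind of nearest-neighbor baseline the paper compares LLaMA-2 and GPT-4 against, and constructs a short teaching sequence that uniquely identifies the target function within the class.

```python
# Illustrative sketch only (assumed setup, not the paper's implementation):
# in-context learning of monotone conjunctions over Boolean inputs.
import random

def sample_conjunction(dim, k):
    """Pick k relevant coordinates; f(x) = 1 iff all of them are 1."""
    relevant = random.sample(range(dim), k)
    f = lambda x: int(all(x[i] == 1 for i in relevant))
    return f, relevant

def sample_prompt(f, dim, n_examples):
    """Draw in-context examples (x, f(x)) with uniformly random Boolean inputs."""
    xs = [[random.randint(0, 1) for _ in range(dim)] for _ in range(n_examples)]
    return [(x, f(x)) for x in xs]

def nearest_neighbour_predict(prompt, query):
    """Predict the label of the in-context example closest in Hamming distance."""
    hamming = lambda a, b: sum(ai != bi for ai, bi in zip(a, b))
    _, y_best = min(prompt, key=lambda xy: hamming(xy[0], query))
    return y_best

def teaching_sequence(relevant, dim):
    """Teaching sequence for a monotone conjunction: one positive example with
    exactly the relevant variables set to 1, plus one negative per relevant
    variable (flipping it to 0), which pins down f uniquely in the class."""
    pos = [1 if i in relevant else 0 for i in range(dim)]
    seq = [(pos, 1)]
    for i in relevant:
        neg = list(pos)
        neg[i] = 0
        seq.append((neg, 0))
    return seq

if __name__ == "__main__":
    random.seed(0)
    dim, k, n_ctx, n_test = 20, 3, 30, 200
    f, relevant = sample_conjunction(dim, k)
    prompt = sample_prompt(f, dim, n_ctx)
    correct = sum(
        nearest_neighbour_predict(prompt, q) == f(q)
        for q in ([random.randint(0, 1) for _ in range(dim)] for _ in range(n_test))
    )
    print(f"relevant vars: {sorted(relevant)}  NN accuracy: {correct / n_test:.2f}")
    print(f"teaching sequence ({k + 1} examples): {teaching_sequence(relevant, dim)}")
```

The teaching-sequence helper illustrates why in-context learning can become more sample-efficient in setting (b): a handful of carefully chosen examples already determines the target function, whereas uniformly random prompts leave it underdetermined for much longer.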

Authors (4)
  1. Satwik Bhattamishra (13 papers)
  2. Arkil Patel (14 papers)
  3. Phil Blunsom (87 papers)
  4. Varun Kanade (41 papers)
Citations (31)