Towards Understanding How Transformers Learn In-context Through a Representation Learning Lens (2310.13220v2)

Published 20 Oct 2023 in cs.LG

Abstract: Pre-trained LLMs based on Transformers have demonstrated remarkable in-context learning (ICL) abilities: with just a few demonstration examples, the models can perform new tasks without any parameter updates. However, the mechanism underlying ICL remains an open question. In this paper, we explore the ICL process in Transformers through the lens of representation learning. First, leveraging kernel methods, we derive a dual model for a single softmax attention layer: the ICL inference process of the attention layer aligns with the training procedure of its dual model, producing token representation predictions that are equivalent to the dual model's test outputs. We analyze the training process of this dual model from a representation learning standpoint and derive a generalization error bound related to the number of demonstration tokens. We then extend our theoretical conclusions to more complicated scenarios, including a single Transformer layer and multiple attention layers. Furthermore, drawing inspiration from existing representation learning methods, especially contrastive learning, we propose potential modifications for the attention layer. Finally, we design experiments to support our findings.
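
As a loose illustration of the kernel-methods lens mentioned in the abstract (not the paper's exact dual-model construction), the sketch below checks numerically that a single softmax attention read-out over a set of demonstration tokens coincides with a kernel-weighted (Nadaraya-Watson style) average of their values under the exponential kernel K(q, k) = exp(q·k). The shapes, variable names, and random data are illustrative assumptions.

```python
# Minimal sketch, assuming random demonstration keys/values and one query.
# It shows that softmax attention over demonstrations equals a kernel-weighted
# average of their values, which is the starting point of the kernel view of ICL.
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 16                      # embedding dim, number of demonstration tokens
K = rng.normal(size=(n, d))       # demonstration keys
V = rng.normal(size=(n, d))       # demonstration values
q = rng.normal(size=d)            # query token

# Softmax attention output for the query.
scores = K @ q
attn = np.exp(scores - scores.max())
attn /= attn.sum()
attention_out = attn @ V

# Kernel-regression view: exponential-kernel weighted average over the same tokens.
kernel = np.exp(K @ q - scores.max())
kernel_out = (kernel[:, None] * V).sum(axis=0) / kernel.sum()

print(np.allclose(attention_out, kernel_out))  # True: the two views coincide
```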

Authors (2)
  1. Ruifeng Ren (4 papers)
  2. Yong Liu (721 papers)
Citations (9)