In-Context Learning with Representations: A Formal Analysis of Transformers' Contextual Generalization Capabilities
Introduction
The paper "In-Context Learning with Representations: Contextual Generalization of Trained Transformers" by Tong Yang, Yu Huang, Yingbin Liang, and Yuejie Chi addresses a critical question within the field of machine learning: how transformers can generalize to unseen examples during inference through in-context learning (ICL). The paper places a significant emphasis on understanding the training dynamics of transformers on non-linear regression tasks, with a particular focus on how these models can learn contextual information from prompts to generalize effectively.
Analytical Framework and Assumptions
The authors begin with a structured problem setup in which the template function for each task lies in the linear span of a set of basis functions. Each prompt contains examples whose labels are corrupted by Gaussian noise, and the challenge is to learn these templates in-context even though the prompts are underdetermined.
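In symbols (using shorthand m, β_k, φ_k, σ², and N of my own choosing rather than the paper's exact notation), each task's template lies in the span of m basis functions, and each prompt supplies noisy evaluations of it:

$$
f(x) \;=\; \sum_{k=1}^{m} \beta_k \,\phi_k(x), \qquad y_i \;=\; f(x_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0,\sigma^2), \quad i = 1,\dots,N.
$$

The prompt is underdetermined in the sense that its N examples do not pin down the coefficients uniquely, so the model must still infer a useful estimate of the template from context.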
Key assumptions in the analysis include:
- The coefficient vector of the representation map is drawn from a specified distribution with zero mean and unit variance.
- The initial parameters are configured so that each matrix formed from the columns of the prompt token matrix has full row rank (a quick numerical check of this kind of condition is sketched after this list).
- Learning rates for parameter updates are specified to ensure stability and convergence.
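As a small illustration of the rank condition in the second assumption, one can check numerically that a generically drawn prompt token matrix has full row rank; the dimensions and the Gaussian draw below are illustrative choices of mine, not the paper's exact construction of the token matrix.

```python
# Illustrative check only: dimensions and the Gaussian token draw are assumptions,
# not the paper's exact prompt-token construction.
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 5                               # input dimension and prompt length (assumed)
tokens = rng.standard_normal((d + 1, N))  # columns play the role of prompt tokens (x_i, y_i)
print("full row rank:", np.linalg.matrix_rank(tokens) == d + 1)
```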
Main Contributions and Theoretical Findings
The central contributions of this paper can be summarized as follows:
- Convergence Guarantee: The training loss of a one-layer multi-head softmax attention transformer trained by gradient descent is shown to converge linearly to a global minimum, provided the initialization conditions are met and suitable learning rates are chosen.
- Inference Performance: After training, the transformer effectively performs ridge regression over the basis functions to predict unseen labels. The iteration complexity for achieving a target inference accuracy is derived, highlighting the transformer's efficiency in estimating the underlying template even with limited prompt length (a minimal numerical sketch of this training-and-inference picture follows this list).
- Overcoming Previous Constraints: The paper relaxes several stringent conditions imposed in earlier studies, such as orthogonality of the data, the need for long prompts, and restrictive initialization. It also offers insights into the impact of the number of attention heads, favoring multi-head configurations over a single head.
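To make the training-and-inference picture above concrete, the following is a minimal sketch rather than the paper's construction: the architecture (a generic one-layer multi-head softmax attention built from torch.nn.MultiheadAttention plus a linear readout), the tanh random-feature basis, the dimensions, the learning rate, and the ridge parameter lam are all illustrative assumptions, and freshly sampled prompts stand in for the paper's training setup. The sketch trains the small attention model by gradient descent on synthetic prompts and then compares its query prediction against explicit ridge regression over the same basis functions, which is the behavior the paper attributes to the trained transformer at inference time; close agreement in this toy setting is not guaranteed, since the theorem concerns the paper's specific architecture and training regime.

```python
# Minimal sketch, not the paper's exact parameterization: dimensions, basis,
# learning rate, and ridge parameter below are illustrative assumptions.
import torch

torch.manual_seed(0)
d, m, N, H = 3, 8, 5, 4         # input dim, number of basis functions, prompt length, heads
lam = 0.1                       # ridge parameter (assumed)
W_basis = torch.randn(d, m)     # fixed random-feature basis: phi(x) = tanh(x W)

def basis(x):
    return torch.tanh(x @ W_basis)

def make_prompt():
    """One task: template f = sum_k beta_k * phi_k, with N noisy examples plus one query."""
    beta = torch.randn(m)               # task coefficients with zero mean and unit variance
    X = torch.randn(N + 1, d)
    y = basis(X) @ beta
    y[:N] += 0.1 * torch.randn(N)       # Gaussian label noise on the in-context examples
    return X, y

def split_tokens(X, y):
    """Prompt tokens carry (x_i, y_i); the query token carries (x_query, 0)."""
    prompt = torch.cat([X[:N], y[:N, None]], dim=-1).unsqueeze(0)        # (1, N, d+1)
    query = torch.cat([X[N:], torch.zeros(1, 1)], dim=-1).unsqueeze(0)   # (1, 1, d+1)
    return prompt, query

class OneLayerAttention(torch.nn.Module):
    """Generic one-layer multi-head softmax attention with a linear readout."""
    def __init__(self, dim, heads):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.readout = torch.nn.Linear(dim, 1)

    def forward(self, prompt, query):
        out, _ = self.attn(query, prompt, prompt)   # the query token attends to the prompt
        return self.readout(out).squeeze()

model = OneLayerAttention(d + 1, H)
opt = torch.optim.SGD(model.parameters(), lr=0.05)   # plain gradient descent (rate assumed)

for step in range(2001):
    X, y = make_prompt()
    pred = model(*split_tokens(X, y))
    loss = (pred - y[N]) ** 2                        # squared error on the query label
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(f"step {step:4d}  squared error {loss.item():.4f}")

# At inference time, compare against explicit ridge regression over the basis
# functions on a fresh, unseen prompt.
X, y = make_prompt()
Phi = basis(X[:N])
beta_hat = torch.linalg.solve(Phi.T @ Phi + lam * torch.eye(m), Phi.T @ y[:N])
ridge_pred = (basis(X[N:]) @ beta_hat).item()
with torch.no_grad():
    attn_pred = model(*split_tokens(X, y)).item()
print(f"ridge {ridge_pred:.3f}  attention {attn_pred:.3f}  true {y[N].item():.3f}")
```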
Implications and Speculative Outlook
Practical Implications: The analysis provides a robust framework for understanding how transformers can be trained to perform effective ICL. Practically, it suggests that a transformer pre-trained on a variety of tasks can be expected to generalize to new, unseen tasks without substantial additional training, opening up possibilities for applications that require adaptability across diverse contexts.
Theoretical Implications: The work extends the theoretical underpinnings of transformers by showing that multi-head attention mechanisms are critical for learning inherently contextual information from prompts. This aligns with the known expressive power of transformers but provides a novel theoretical insight into their generalization mechanisms via ridge regression.
Future Directions: This paper paves the way for further exploration in several directions. One could study deeper, multi-layer transformer architectures, including the interaction between layers, and their training dynamics on more complex function classes. The impact of varying the number of attention heads and the prompt length on performance could also be examined more rigorously.
Conclusion
This paper provides a rigorous and formal analysis of how transformers can acquire and utilize contextual information to generalize effectively to unseen examples and tasks. The convergence guarantees, coupled with the demonstrated inference time performance, underscore the transformer’s capabilities in performing in-context learning with representations. With its relaxed assumptions and robust theoretical framework, this paper contributes significantly to our understanding of transformers’ learning dynamics and their applicability to real-world tasks.