Analyzing the In-Context Learning Capabilities of Transformers for Simple Function Classes
The empirical paper "What Can Transformers Learn In-Context? A Case Study of Simple Function Classes" investigates the in-context learning capabilities of transformer models, specifically their ability to learn simple function classes at inference time without any parameter updates. By training transformers to in-context learn function classes such as linear functions, sparse linear functions, two-layer neural networks, and decision trees, the work addresses fundamental questions about the capabilities, limitations, and robustness of these models on tasks that would ordinarily require explicit parameter adjustment.
Key Contributions and Findings
Methodology and Setup
The paper employs a well-defined experimental setup in which transformer models are trained from scratch to in-context learn distinct function classes: linear functions, sparse linear functions, two-layer neural networks, and decision trees. The central question is whether a transformer, given a prompt of input-output pairs (in-context examples) from an unseen function in the class, can accurately predict that function's value on a new query input.
To generate training prompts, the researchers sample a random function from the specified class together with input points drawn from an isotropic Gaussian distribution. The core configuration is a transformer with 12 layers, 8 attention heads, and a 256-dimensional embedding space.
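As a minimal illustration of this setup for the linear function class (a sketch under the sampling assumptions above; the function and variable names are illustrative, not the authors' code):

```python
import numpy as np

def sample_linear_prompt(d=20, n_points=40, seed=0):
    """Sample one in-context learning prompt for the linear function class.

    Both the weight vector w and the inputs x_i are drawn from isotropic
    Gaussians, mirroring the sampling scheme described above; how the pairs
    are serialized into the transformer's input is an assumption here.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)                # random linear function f(x) = w^T x
    xs = rng.standard_normal((n_points, d))   # in-context inputs ~ N(0, I_d)
    ys = xs @ w                                # corresponding outputs
    # The prompt interleaves (x_1, y_1, ..., x_k, y_k, x_query); the model is
    # trained to predict y_query = w^T x_query from the preceding pairs.
    return xs, ys, w
```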
Empirical Results
The paper provides comprehensive empirical evidence that transformers can be trained from scratch to in-context learn linear functions with performance comparable to the optimal least squares estimator, and that this capability remains largely intact under various forms of distribution shift.
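For concreteness, here is a minimal sketch of the least-squares baseline and the error metric referenced below; dividing the squared error by the input dimension d (so a trivial zero predictor scores about 1 under the Gaussian sampling above) is an assumed convention, and the helper names are illustrative:

```python
import numpy as np

def least_squares_predict(xs, ys, x_query):
    """Ordinary least-squares baseline: fit w_hat on the observed
    in-context pairs, then predict on the query input."""
    w_hat, *_ = np.linalg.lstsq(xs, ys, rcond=None)
    return x_query @ w_hat

def normalized_squared_error(y_pred, y_true, d):
    """Squared error divided by the input dimension d (assumed normalization)."""
    return (y_pred - y_true) ** 2 / d
```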
- Linear Functions:
- Transformers trained on linear functions mimic the behavior of the least squares estimator closely. With 40 in-context examples for 20-dimensional inputs, the transformer achieves a normalized squared error of approximately 0.0006.
- The models' behavior deviates slightly under distribution shifts but remains competitive with least squares.
- Sparse Linear Functions:
- For sparse linear functions (e.g., 3 non-zero coordinates out of 20), the transformer exploits the sparsity, matching the performance of the Lasso estimator for sparsity-regularized regression (see the baseline sketch after this list).
- Decision Trees:
- Transformers trained on decision trees outperform standard greedy tree-learning algorithms and boosted ensembles such as XGBoost; with 100 in-context examples the gap in favor of the transformer is substantial.
- Two-layer Neural Networks:
- For two-layer ReLU neural networks, transformers match the performance of networks of the same architecture trained via gradient descent. Interestingly, these transformers can also in-context learn linear functions, even though they were trained only on the more complex two-layer network class.
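As a rough sketch of two of the classical baselines the transformer is compared against (the Lasso regularization strength and the tree depth below are illustrative assumptions, not the paper's exact settings):

```python
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor

def lasso_baseline(xs, ys, x_query, alpha=0.01):
    """Sparsity-aware baseline for sparse linear functions."""
    model = Lasso(alpha=alpha, fit_intercept=False).fit(xs, ys)
    return model.predict(x_query.reshape(1, -1))[0]

def greedy_tree_baseline(xs, ys, x_query, max_depth=4):
    """Standard greedy decision-tree baseline."""
    model = DecisionTreeRegressor(max_depth=max_depth).fit(xs, ys)
    return model.predict(x_query.reshape(1, -1))[0]
```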
Robustness and Distribution Shifts
A noteworthy aspect of the paper is its exploration of how robust the trained transformers are to different distribution shifts (a minimal evaluation sketch follows the list):
- Noise in Outputs: Transformers remain robust when noise is added to the output values of in-context examples.
- Subspace Variations: When input examples are restricted to a lower-dimensional subspace or when query inputs are drawn from a different distribution, the performance degrades gracefully.
- Prompt Scale Variations: Robustness differs depending on whether the scaling is applied to the prompt inputs or to the coefficients of the sampled function; the model tolerates some scale shifts better than others.
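A minimal sketch of how such shifted evaluation prompts can be constructed (the noise level, subspace dimension, and scale factor below are illustrative choices, not the paper's exact settings):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_points = 20, 40
w = rng.standard_normal(d)
xs = rng.standard_normal((n_points, d))

# 1. Noisy outputs: add Gaussian noise to the in-context labels.
ys_noisy = xs @ w + 0.5 * rng.standard_normal(n_points)

# 2. Subspace shift: restrict in-context inputs to a lower-dimensional subspace.
k = 5
xs_subspace = np.zeros((n_points, d))
xs_subspace[:, :k] = rng.standard_normal((n_points, k))

# 3. Scale shift: rescale either the prompt inputs or the function coefficients.
xs_scaled = 3.0 * xs
w_scaled = 3.0 * w
```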
Implications and Future Prospects
The results have significant implications for the understanding and development of transformer-based models. The paper suggests that transformers can implicitly implement complex learning algorithms in their forward pass, which enables them to generalize to new tasks and perform meta-learning without weight updates.
Practical Implications
- Meta-Learning: Insights from this paper can be applied to other meta-learning contexts where the adaptability to new tasks without explicit retraining is essential.
- Algorithm Discovery: Transformers’ ability to encode sophisticated learning algorithms hints at the potential for discovering new learning methods through reverse engineering trained models.
Future Directions
Several promising directions emerge from this research:
- Scalability and Complexity: Further studies can investigate scaling this approach to even higher dimensions and more complex function classes, potentially involving real-world applications.
- Comparative Analysis: Extending the comparative analysis to other model architectures such as LSTMs or convolutional networks may provide further insights into the inherent inductive biases and capabilities of these models.
- Curriculum Learning: The observed benefits of curriculum learning raise interesting questions about optimal training paradigms for large-scale models, possibly informing more efficient training techniques for LLMs and beyond (a toy schedule is sketched after this list).
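The paper attributes faster training to a curriculum that gradually increases the effective input dimension and the number of in-context examples; a toy schedule of that flavor (all constants are illustrative assumptions, not the paper's reported values) might look like:

```python
def curriculum_schedule(step, d_max=20, n_max=40,
                        d_start=5, n_start=10, increment_every=2000):
    """Toy curriculum: grow the effective input dimension and the number of
    in-context examples as training progresses. All constants here are
    illustrative assumptions, not the paper's reported schedule."""
    stages = step // increment_every
    d_cur = min(d_start + stages, d_max)
    n_cur = min(n_start + 2 * stages, n_max)
    return d_cur, n_cur
```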
Conclusion
This comprehensive empirical investigation highlights the robust in-context learning abilities of transformers across a range of simple function classes, emphasizing their potential to generalize sophisticated learning algorithms. These findings contribute to a deeper understanding of the meta-learning capabilities of transformers and pave the way for future research and practical applications in AI and machine learning.