An Explanation of In-context Learning as Implicit Bayesian Inference (2111.02080v6)

Published 3 Nov 2021 in cs.CL and cs.LG

Abstract: Large language models (LMs) such as GPT-3 have the surprising ability to do in-context learning, where the model learns to do a downstream task simply by conditioning on a prompt consisting of input-output examples. The LM learns from these examples without being explicitly pretrained to learn. Thus, it is unclear what enables in-context learning. In this paper, we study how in-context learning can emerge when pretraining documents have long-range coherence. Here, the LM must infer a latent document-level concept to generate coherent next tokens during pretraining. At test time, in-context learning occurs when the LM also infers a shared latent concept between examples in a prompt. We prove when this occurs despite a distribution mismatch between prompts and pretraining data in a setting where the pretraining distribution is a mixture of HMMs. In contrast to messy large-scale datasets used to train LMs capable of in-context learning, we generate a small-scale synthetic dataset (GINC) where Transformers and LSTMs both exhibit in-context learning. Beyond the theory, experiments on GINC exhibit large-scale real-world phenomena including improved in-context performance with model scaling (despite the same pretraining loss), sensitivity to example order, and instances where zero-shot is better than few-shot in-context learning.

An Explanation of In-context Learning as Implicit Bayesian Inference

This paper presents a theoretical framework that explains the phenomenon of in-context learning observed in LLMs such as GPT-3. The authors posit that in-context learning can be understood as implicit Bayesian inference: conditioning on a prompt allows the model to infer a latent concept shared by the prompt's examples.

Theoretical Framework

The work examines how LLMs perform in-context learning without being explicitly trained to do so during pretraining. The central assumption is that pretraining documents exhibit long-range coherence: to generate coherent next tokens, the model must infer a latent document-level concept. At test time, in-context learning occurs when the LLM likewise infers a latent concept shared by the examples in the prompt, as formalized below.
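
To make the inference view concrete, the paper's setup can be summarized as next-token prediction that marginalizes over the latent concept. The equations below paraphrase that setup with simplified notation (prompt format details are omitted).

```latex
% Pretraining distribution: a mixture over latent concepts \theta drawn from a prior p(\theta).
p(o_{1:T}) \;=\; \int_{\Theta} p(o_{1:T} \mid \theta)\, p(\theta)\, d\theta

% In-context prediction for a prompt S_n of n input-output examples and a test input x_{\mathrm{test}}:
p(y \mid S_n, x_{\mathrm{test}})
  \;=\; \int_{\Theta} p(y \mid x_{\mathrm{test}}, \theta)\, p(\theta \mid S_n, x_{\mathrm{test}})\, d\theta

% As n grows, the posterior p(\theta \mid S_n, x_{\mathrm{test}}) concentrates on the concept
% \theta^* that generated the prompt, so the prediction approaches the Bayes-optimal
% predictor for \theta^*.
```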

The authors model the pretraining distribution as a mixture of Hidden Markov Models (HMMs), with each mixture component corresponding to a latent concept. Within this framework, they prove that in-context learning emerges, with the in-context predictor approaching the Bayes-optimal predictor as the number of prompt examples grows, despite the distribution mismatch between prompts and pretraining data.

Experimental Design with GINC Dataset

To support their theoretical assertions, the authors introduce the Generative IN-Context learning (GINC) dataset. This synthetic dataset enables the authors to precisely control latent structure and distributional conditions. Both Transformers and LSTMs are shown to exhibit in-context learning when trained on GINC.
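
GINC is only described at a high level here, but its core ingredient, documents sampled from a small family of HMMs over a shared vocabulary, can be sketched as follows. This is a minimal illustration rather than the authors' released generation code; the vocabulary size, number of concepts, number of hidden states, and document length are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 50     # placeholder: small shared token vocabulary
NUM_CONCEPTS = 5    # placeholder: each latent concept is one HMM in the mixture
NUM_STATES = 10     # placeholder: hidden states per HMM

def random_stochastic(shape):
    """Draw a row-stochastic matrix (each row sums to 1)."""
    m = rng.random(shape)
    return m / m.sum(axis=-1, keepdims=True)

# One HMM per latent concept: a transition matrix over hidden states
# and an emission matrix from hidden states to tokens.
concepts = [
    {
        "trans": random_stochastic((NUM_STATES, NUM_STATES)),
        "emit": random_stochastic((NUM_STATES, VOCAB_SIZE)),
    }
    for _ in range(NUM_CONCEPTS)
]

def sample_document(length=100):
    """Sample one pretraining document: pick a latent concept, then roll out its HMM."""
    hmm = concepts[rng.integers(NUM_CONCEPTS)]
    state = rng.integers(NUM_STATES)
    tokens = []
    for _ in range(length):
        tokens.append(int(rng.choice(VOCAB_SIZE, p=hmm["emit"][state])))
        state = rng.choice(NUM_STATES, p=hmm["trans"][state])
    return tokens

# A small GINC-like pretraining corpus.
corpus = [sample_document() for _ in range(1000)]
```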

The experiments demonstrate that in-context accuracy improves with model size, the number of prompt examples, and the length of each example. Notably, larger models achieve better in-context performance even at the same pretraining loss, so the gains are not explained simply by fitting the pretraining distribution more closely. The setup also reproduces real-world phenomena such as sensitivity to example order and instances where zero-shot prompting outperforms few-shot in-context learning.
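
A typical way to measure in-context accuracy in this kind of study is to concatenate several input-output pairs generated under one latent concept, append a query input, and check whether the model's most likely next token matches the query's output. The sketch below assumes a generic `model.next_token_distribution` interface and a delimiter token; both are hypothetical placeholders, not the paper's exact evaluation harness.

```python
def build_prompt(examples, query_x, delimiter):
    """Flatten (x, y) example pairs plus a query input into one token sequence."""
    prompt = []
    for x, y in examples:
        prompt.extend(x)          # input tokens
        prompt.extend(y)          # output tokens
        prompt.append(delimiter)  # separator between examples
    prompt.extend(query_x)
    return prompt

def in_context_accuracy(model, tasks, delimiter):
    """tasks: iterable of (examples, query_x, query_y) triples sharing one latent concept."""
    correct = 0
    total = 0
    for examples, query_x, query_y in tasks:
        prompt = build_prompt(examples, query_x, delimiter)
        # Hypothetical interface: a probability distribution over the next token.
        probs = model.next_token_distribution(prompt)
        prediction = max(range(len(probs)), key=probs.__getitem__)
        correct += int(prediction == query_y)
        total += 1
    return correct / total
```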

Implications and Future Steps

The findings have several critical implications for the understanding of LLMs:

  • Theoretical Insight: The Bayesian inference perspective provides a structured way to model in-context learning mathematically, supporting further theoretical exploration.
  • Practical Impact: The research offers pathways to optimize LLM pretraining and prompting strategies to enhance in-context learning capabilities.
  • Model Scalability: The observation that in-context accuracy improves with model size, even at matched pretraining loss, sharpens the understanding of how scaling affects LLM training.

Future research directions may involve bridging the distributional mismatch between training data and prompts, exploring the capacity for extrapolation to unseen tasks, and further investigating the role of model architecture in supporting in-context learning.

Conclusion

The paper's analysis contributes significantly to the conceptualization of in-context learning as implicit Bayesian inference, providing both theoretical grounding and empirical validation through carefully designed experiments with GINC. This explanation offers clarity regarding the emergent behaviors of LLMs and opens avenues for refining AI systems' ability to perform complex tasks adaptively.

Authors (4)
  1. Sang Michael Xie (21 papers)
  2. Aditi Raghunathan (56 papers)
  3. Percy Liang (239 papers)
  4. Tengyu Ma (117 papers)
Citations (618)