- The paper shows that in-context learning, using next-token prediction without weight updates, fits within the PAC learning framework.
- The study employs extensive experiments with varied prompting strategies, revealing that increasing exemplar count significantly boosts accuracy.
- ICL demonstrates robustness to shifts in the training (exemplar) distribution but remains vulnerable to out-of-distribution test inputs, challenging its generalizability.
Is In-Context Learning Learning? An Empirical and Theoretical Analysis
Introduction
The paper "Is In-Context Learning Learning?" (2509.10414) presents a rigorous investigation into the nature of in-context learning (ICL) in autoregressive LLMs. The central question addressed is whether ICL, as performed by LLMs via next-token prediction, constitutes genuine learning in the formal sense, or merely reflects deduction and memorization of prior knowledge. The authors combine theoretical framing with extensive empirical analysis, ablating factors such as memorization, pretraining, distributional shifts, and prompt phrasing, to characterize the learning dynamics and limitations of ICL.
Theoretical Framework
The paper grounds its analysis in the PAC (Probably Approximately Correct) learning framework, reframing it to focus on the learner's robustness to distributional shifts. In this context, learning is defined as the ability to generalize from a training distribution P to a test distribution Q ≠ P with bounded error. The authors formalize ICL as a process where the LLM, conditioned on a prompt and exemplars, predicts labels for new inputs. Importantly, ICL does not update model weights; instead, it leverages the prompt and exemplars to perform ad hoc inference.
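To make this reframing concrete, the display below gives a schematic agnostic-PAC-style statement under the shift-aware reading described above. It is a sketch, not the paper's exact formalization; the symbols h_S (the predictor induced by the prompt and the exemplar set S drawn from P), err_Q (error on the shifted test distribution), epsilon, and delta_conf (the confidence parameter, distinct from the shift magnitude δ used elsewhere) are labels chosen here for illustration.

```latex
% Schematic PAC-style statement; symbols are illustrative, not the paper's notation.
\Pr_{S \sim P^{m}}\!\left[\,
  \mathrm{err}_{Q}(h_{S}) \;\le\; \min_{h \in \mathcal{H}} \mathrm{err}_{Q}(h) + \varepsilon
\,\right] \;\ge\; 1 - \delta_{\mathrm{conf}},
\qquad Q \neq P .
```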
The paper argues that, mathematically, ICL fits within the PAC learning paradigm, as the model's predictions are conditioned on observed exemplars and prompt context. However, the mechanism by which ICL encodes and utilizes information is fundamentally different from traditional learning algorithms, raising questions about its robustness and generalizability.
Experimental Design
The empirical study is extensive, involving four LLMs (GPT-4 Turbo, GPT-4o, Mixtral 8x7B, Phi-3.5 MoE), nine tasks spanning regular and context-free languages, and multiple prompting strategies (including a baseline prompt, description-based prompts, Chain-of-Thought (CoT), APO, Word Salad, and SoT variants).
Tasks include PARITY, Pattern Matching, Reversal, Stack manipulation, Hamiltonian path verification, Maze solving, and Vending Machine arithmetic. Synthetic datasets are generated to control for contamination and ensure precise distributional shifts (δ), with both in-distribution (ID) and out-of-distribution (OOD) test sets.
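As a concrete illustration of how such controlled shifts can be constructed, the sketch below generates synthetic PARITY examples whose input length grows with a shift parameter delta. This is a minimal hypothetical generator, not the paper's released data pipeline; the choice of input length as the shifted attribute, and the function names make_parity_example and make_split, are assumptions made here for illustration.

```python
import random

def make_parity_example(length: int) -> tuple[str, str]:
    """Sample a random bitstring and its PARITY label (1 if it has an odd number of 1s)."""
    bits = [random.randint(0, 1) for _ in range(length)]
    label = str(sum(bits) % 2)
    return "".join(map(str, bits)), label

def make_split(n: int, base_length: int, delta: int = 0, seed: int = 0) -> list[tuple[str, str]]:
    """Generate n examples; delta > 0 shifts the distribution by lengthening inputs.

    delta = 0 approximates in-distribution (ID) data; larger delta yields
    progressively more out-of-distribution (OOD) data relative to the exemplars.
    """
    random.seed(seed)
    return [make_parity_example(base_length + delta) for _ in range(n)]

# Exemplars drawn from the training distribution P ...
exemplars = make_split(n=50, base_length=10, delta=0, seed=1)
# ... and a shifted test set approximating Q != P.
ood_test = make_split(n=200, base_length=10, delta=4, seed=2)
```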
Main Results
Accuracy Trends and Prompting Effects
Across all tasks and models, increasing the number of exemplars ("shots") consistently improved accuracy, with best performance typically achieved at 50-100 shots—substantially higher than the few-shot regime often cited in prior literature. The gap in accuracy between models and prompting strategies narrowed as the number of shots increased, indicating that, in the limit, ICL's effectiveness is less dependent on model architecture or prompt style and more on the autoregressive mechanism.
Figure 1: Average accuracy results for all-task, PARITY, and Reversal across prompting strategies and shot counts, highlighting the narrowing gap and performance trends.
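To make the "shots" variable concrete, the sketch below assembles a k-shot prompt from labeled exemplar pairs. The template, delimiters, and instruction wording are placeholders invented here, not the prompt format used in the paper.

```python
def build_prompt(instruction: str, exemplars: list[tuple[str, str]],
                 query: str, k: int) -> str:
    """Concatenate an instruction, the first k labeled exemplars, and the query.

    Increasing k corresponds to moving from the few-shot regime toward the
    50-100 shot range where accuracy tends to saturate in the reported results.
    """
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in exemplars[:k])
    return f"{instruction}\n\n{shots}\n\nInput: {query}\nOutput:"

# Minimal usage example with hand-written PARITY exemplars.
demo_exemplars = [("0111", "1"), ("0000", "0"), ("1111", "0")]
prompt = build_prompt(
    instruction="Return 1 if the bitstring has an odd number of 1s, else 0.",
    exemplars=demo_exemplars,
    query="0110101",
    k=3,
)
```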
Description-based and CoT prompts yielded the highest peak accuracies, but CoT and APO were notably brittle to OOD inputs, with sharp declines as δ increased. Word Salad prompts, which randomized lexical content, converged to baseline performance with sufficient shots, suggesting that semantic content in prompts is less critical than the statistical regularities present in the exemplars.
Figure 2: Comparison of baseline and word-salad prompts, showing convergence in accuracy with increasing shots.
Distributional Robustness and OOD Sensitivity
ICL demonstrated robustness to changes in the training (exemplar) distribution, such as label imbalance and positionality, but was highly sensitive to shifts in the test distribution. The largest negative slopes in accuracy under OOD conditions were observed for CoT and APO prompts, indicating that adaptive or reasoning-based prompting strategies exacerbate brittleness to distributional shifts.
Figure 3: Complete set of performances per problem, illustrating consistent behaviors and the lack of robustness of CoT to OOD.
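One simple way to quantify the brittleness described above is to fit a line to accuracy as a function of the shift magnitude δ and compare slopes across prompting strategies. The sketch below does this with numpy; the accuracy values are made up purely to show the computation and are not results from the paper.

```python
import numpy as np

def ood_slope(deltas: list[float], accuracies: list[float]) -> float:
    """Least-squares slope of accuracy vs. shift magnitude delta.

    More negative slopes indicate greater brittleness to OOD inputs.
    """
    slope, _intercept = np.polyfit(deltas, accuracies, deg=1)
    return float(slope)

# Illustrative placeholder curves for two prompting strategies.
deltas = [0, 1, 2, 3, 4]
baseline_acc = [0.82, 0.80, 0.77, 0.75, 0.72]
cot_acc = [0.90, 0.82, 0.71, 0.60, 0.51]

print(ood_slope(deltas, baseline_acc))  # shallower negative slope
print(ood_slope(deltas, cot_acc))       # steeper negative slope (more brittle)
```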
Task Complexity and Generalization
Performance varied markedly across tasks, even among those with similar formal properties. For example, Pattern Matching (FSA) was solved with high accuracy, while Reversal (PDA) and Maze Solve (FSA) exhibited much lower performance. Traditional baselines (decision trees, kNN, MLP) outperformed ICL in several tasks, particularly under OOD conditions, challenging claims of LLM superiority in generalization.
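The comparison to traditional baselines can be reproduced in spirit with scikit-learn. The sketch below fits a decision tree, kNN, and a small MLP on fixed-length bitstring features; the featurization, hyperparameters, and the helper names featurize and evaluate_baselines are assumptions for illustration rather than the paper's exact protocol.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

def featurize(bitstrings: list[str]) -> list[list[int]]:
    """Turn fixed-length bitstrings into integer feature vectors."""
    return [[int(b) for b in s] for s in bitstrings]

def evaluate_baselines(train_x, train_y, test_x, test_y) -> dict[str, float]:
    """Fit classical baselines and report accuracy on a (possibly OOD) test set."""
    models = {
        "decision_tree": DecisionTreeClassifier(),
        "knn": KNeighborsClassifier(n_neighbors=5),
        "mlp": MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000),
    }
    scores = {}
    for name, model in models.items():
        model.fit(featurize(train_x), train_y)
        scores[name] = model.score(featurize(test_x), test_y)
    return scores
```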
Ablation Studies
Lexical Features
Randomizing prompt lexical content (Word Salad, SoT) had minimal impact on final accuracy in the limit, provided exemplars remained fixed. This indicates that LLMs rely more on statistical patterns in the data than on semantic understanding of the prompt.
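A minimal version of this lexical ablation can be expressed as swapping every instruction word for a random word while leaving the exemplar block untouched. The vocabulary, whitespace tokenization, and the function name word_salad below are placeholders, not the paper's Word Salad construction.

```python
import random

def word_salad(instruction: str, vocabulary: list[str], seed: int = 0) -> str:
    """Replace each word of the instruction with a random vocabulary word,
    preserving length so only the lexical (semantic) content is destroyed."""
    rng = random.Random(seed)
    return " ".join(rng.choice(vocabulary) for _ in instruction.split())

salad = word_salad(
    "Return 1 if the bitstring has an odd number of 1s, else 0.",
    vocabulary=["apple", "quietly", "seven", "orbit", "lantern", "because"],
)
# The exemplar block appended after this scrambled instruction stays unchanged,
# which is what allows accuracy to converge to the baseline with enough shots.
```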
Exemplar Positionality
Shuffling or randomizing exemplar order had negligible effect on accuracy, contradicting prior claims of LLM sensitivity to exemplar ordering.
Alternate Distributions
Imbalanced or randomly labeled exemplars did not degrade performance, but fully randomizing exemplars reduced accuracy, consistent with bias-variance tradeoff phenomena.
Compliance vs. Learning
Separating parsing errors from mislabeling revealed that compliance with output format can mask true learning performance, especially in prompts with high error rates.
Figure 4: Comparison of compliance and learning metrics, showing the impact of parsing errors on perceived performance.
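The distinction between compliance and learning can be operationalized as two separate rates, as in the hypothetical scoring helper below. The output-parsing rule (a bare "0" or "1") is an assumption made here for illustration, not the paper's parser.

```python
def compliance_and_learning(outputs: list[str], labels: list[str]) -> tuple[float, float]:
    """Return (compliance, learning) rates.

    compliance: fraction of outputs that parse into the expected label format.
    learning:   accuracy computed only over the parseable (compliant) outputs,
                so formatting failures are not conflated with mislabeling.
    """
    parsed = [(out.strip(), lab) for out, lab in zip(outputs, labels)
              if out.strip() in {"0", "1"}]
    compliance = len(parsed) / len(outputs) if outputs else 0.0
    learning = (sum(p == lab for p, lab in parsed) / len(parsed)) if parsed else 0.0
    return compliance, learning

# e.g. compliance_and_learning(["1", "0", "maybe 1"], ["1", "1", "1"]) -> (0.667, 0.5)
```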
Discussion
The findings support the claim that ICL constitutes a form of learning, as defined by PAC theory, but with significant caveats. ICL's learning is tightly coupled to the autoregressive paradigm and the representativeness of the observed data. The ad hoc encoding mechanism—reliance on prompt and exemplars—limits cross-task generalizability and robustness to OOD inputs. Adaptive prompting strategies (CoT, APO) enhance performance in-distribution but increase brittleness to distributional shifts.
The observed pathologies, such as overfitting to spurious features and inconsistent performance across formally similar tasks, mirror well-known issues in classical machine learning. The results also highlight the importance of comprehensive evaluation across multiple prompts, shots, and distributions to avoid spurious conclusions about LLM capabilities.
Implications and Future Directions
Practically, the paper suggests that deploying LLMs for unseen tasks via ICL requires careful consideration of exemplar quantity, prompt design, and distributional representativeness. Theoretically, it challenges the notion of emergent generalization in LLMs, emphasizing the limitations of autoregressive inference for robust learning.
Future research should explore reasoning models with baked-in CoT, their performance on context-sensitive languages, and further characterize the boundaries of ICL as a learning paradigm. The brittleness to OOD and inconsistency across tasks underscore the need for more principled approaches to prompt engineering and evaluation.
Conclusion
This work provides a comprehensive empirical and theoretical characterization of ICL in LLMs, demonstrating that while ICL is a valid learning mechanism, its generalization and robustness are fundamentally constrained by the autoregressive paradigm and the statistical properties of the prompt and exemplars. The nuanced accuracy trends, sensitivity to OOD, and prompt dependence have significant implications for both the deployment and evaluation of LLMs in real-world and research settings.