In-Context Learning: Methods & Challenges
- In-context learning lets large models adapt rapidly to new tasks by conditioning on input demonstrations, without modifying their weights.
- Rectification techniques correct noisy demonstration labels, preserving accuracy even when the provided examples are corrupted.
- Careful selection and curriculum ordering of examples are crucial for improving model generalization and robustness across various applications.
In-context learning (ICL) refers to the ability of models—particularly LLMs—to perform a target task solely by conditioning on examples (demonstrations) provided in the input prompt, without modifying model weights. ICL enables rapid adaptation to new tasks by leveraging context, a property that has become central to modern foundation models in both natural language and vision. The field encompasses foundational mechanisms, theoretical frameworks for generalization and sample complexity, architectural innovations for scaling, and practical strategies for robust operation under real-world conditions, such as noisy labels.
1. Formalization and Theoretical Foundations
ICL can be rigorously described as a two-phase process: (1) a pretraining phase, where a model learns a function over a distribution—often a mixture over latent tasks; and (2) a prompt-based inference phase, where a frozen model is fed a sequence of input-output examples (demonstrations) and must generalize to new queries of the same task. The theoretical underpinnings of ICL are captured by PAC (Probably Approximately Correct) frameworks, which formalize the sample complexity and generalization conditions under which ICL is efficient (Wies et al., 2023).
Under mild task-separability assumptions, the number of in-context examples needed for near-optimal performance is polynomial in the inverse of the desired accuracy and confidence, i.e., $\mathrm{poly}(1/\epsilon,\, 1/\delta)$, provided that tasks seen in pretraining are sufficiently distinguishable. Importantly, the analysis shows that ICL in this regime functions primarily as task identification: the model uses the prompt examples to infer which latent pretraining task they match (via differences in data likelihood/KL divergence), and then deploys the corresponding solution learned during pretraining. This aligns with empirical findings that randomizing label mappings in prompts often does not degrade ICL performance (Wies et al., 2023).
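To make the task-identification view concrete, the following is a minimal sketch (not from the cited work) in which the latent tasks are hypothetical linear labeling rules and the "model" scores each candidate task by the log-likelihood of the prompt demonstrations, then answers the query with the best-matching rule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent tasks: each task is a linear labeling rule y = sign(w . x).
# (A stand-in for the pretraining task mixture; purely illustrative.)
tasks = [rng.standard_normal(4) for _ in range(8)]

def task_log_likelihood(w, X, y, beta=4.0):
    """Log-likelihood of demonstrations (X, y) under a logistic model with weights w."""
    logits = beta * (X @ w) * y          # y in {-1, +1}
    return -np.logaddexp(0.0, -logits).sum()  # sum of log-sigmoid terms

# A prompt: K in-context demonstrations drawn from the true (unknown) task.
true_w = tasks[3]
X = rng.standard_normal((16, 4))
y = np.sign(X @ true_w)

# "Task identification": pick the pretraining task that best explains the
# prompt, then answer the query with that task's rule -- no weight updates.
scores = [task_log_likelihood(w, X, y) for w in tasks]
identified = tasks[int(np.argmax(scores))]
x_query = rng.standard_normal(4)
print("prediction:", np.sign(x_query @ identified))
```

In this stylized setting, more demonstrations widen the likelihood gap between the true task and its competitors, matching the polynomial sample-complexity picture above.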
2. Handling Realistic Demonstration Noise
In real-world scenarios, demonstrations used in ICL pipelines are often subject to label corruption due to annotation errors or inherent ambiguity. In-Context Learning with Noisy Labels (Kang et al., 29 Nov 2024) formalizes and addresses this challenge:
- Task Setup: Demonstration labels in the context are assumed to be corrupted at a given noise rate via uniform label flipping. The model’s goal is to perform ICL robustly despite this noise.
- Baselines: Techniques adapted from supervised noisy-label literature (correction, weighting, reordering, selection) offer partial mitigation, replacing or annotating demonstrations according to classifier confidence.
- Rectification Method: The principal innovation is the rectification model: a generative model trained (on a small, clean subset) to process entire noisy demonstration sequences and output a corrected label sequence. By jointly leveraging the full demonstration context, this approach provides superior robustness and stability, maintaining high ICL accuracy even at high noise rates. The rectification accuracy is quantified as
$\tau = \frac{1}{NK}\sum_{n=1}^{N}\sum_{k=1}^{K}\mathbf{1}\left(y^k_n = \tilde{y}^k_n\right)$
where $N$ is the number of demonstration sets, $K$ the number of demonstrations per set, $y^k_n$ the true label, and $\tilde{y}^k_n$ the rectified label.
- Integration: The rectification step is a plug-and-play preprocessor, not requiring LLM parameter updates.
Empirical results demonstrate that, as label noise in demonstrations increases, unmitigated ICL performance degrades rapidly (e.g., from 78.4% to 51.5% accuracy at high noise rates). In contrast, rectification maintains robust performance (76.2–78.4%) and lower variance across runs, outperforming correction and filtering strategies, especially when auxiliary classifiers are limited or unreliable (Kang et al., 29 Nov 2024).
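The rectification-accuracy metric $\tau$ above is straightforward to compute; a minimal NumPy sketch (toy labels, illustrative only) evaluates it for a batch of demonstration sets:

```python
import numpy as np

def rectification_accuracy(true_labels, rectified_labels):
    """tau = (1 / NK) * sum_n sum_k 1[y_n^k == y~_n^k]: the fraction of
    demonstration labels the rectification model restores correctly."""
    true_labels = np.asarray(true_labels)          # shape (N, K)
    rectified_labels = np.asarray(rectified_labels)
    return float((true_labels == rectified_labels).mean())

# Toy check: N=2 demonstration sets of K=4 labels each; 6 of 8 restored.
y_true = [[0, 1, 1, 0], [1, 0, 0, 1]]
y_rect = [[0, 1, 0, 0], [1, 0, 0, 0]]
print(rectification_accuracy(y_true, y_rect))  # 0.75
```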
3. Example Selection and Diversity in ICL Prompts
Selecting high-quality, diverse, and relevant demonstrations is essential for maximizing ICL performance and has been the subject of extensive empirical and methodological study:
- Selection Sensitivity: The performance of ICL is highly sensitive to which demonstrations are included, particularly in the few-shot regime (Zhang et al., 2023, Li et al., 2023). Poorly selected examples can dramatically degrade performance, while well-chosen ones enhance generalization, stability, and robustness.
- Automated Selection: Methods such as LENS (Li et al., 2023) employ a two-stage filter-then-search strategy: first, a progressive InfoScore-based filtering retains globally informative candidates (measured by their impact on LLM predictive distributions), then a diversity-guided iterative search (e.g., beam search with diversity penalization) assembles support sets that maximize both informativeness and diversity.
- RL-Optimized Policies: Recent frameworks train parameter-efficient retrieval/reward heads alongside frozen LLMs to optimize demonstration selection and ordering via reinforcement learning, leading to in-context demonstration sets with higher representativeness and diversity, and outperforming dense retrievers and random selection (Long et al., 14 Aug 2024).
- Order and Diversity Trade-offs: Optimizing for both semantic similarity (for discrimination gain) and label diversity is critical, as LLMs can otherwise default to copying prevalent labels (diminishing ICL's actual task generalization) (Long et al., 11 Apr 2024).
Results in vision (Zhang et al., 2023, Sun et al., 2023) and NLP show that both relevance and diversity are essential, but as context windows grow (e.g., in long-context language models, LCLMs), the marginal utility of sophisticated selection may diminish relative to sheer prompt size (Baek et al., 22 Dec 2024).
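As an illustration of the relevance/diversity trade-off in selection, here is a simplified greedy, MMR-style selector over candidate embeddings. It is a stand-in for filter-then-search pipelines such as LENS, not the published algorithm; the `diversity` weight and embedding inputs are assumptions:

```python
import numpy as np

def select_demonstrations(query_emb, cand_embs, k=4, diversity=0.5):
    """Greedy selection trading off relevance to the query against
    redundancy with already-chosen demonstrations (MMR-style)."""
    cand_embs = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    relevance = cand_embs @ q                     # cosine similarity to query
    chosen = []
    for _ in range(k):
        if chosen:
            # Max similarity to anything already selected = redundancy.
            redundancy = (cand_embs @ cand_embs[chosen].T).max(axis=1)
        else:
            redundancy = np.zeros(len(cand_embs))
        score = (1 - diversity) * relevance - diversity * redundancy
        score[chosen] = -np.inf                   # never re-pick an example
        chosen.append(int(np.argmax(score)))
    return chosen

rng = np.random.default_rng(1)
pool = rng.standard_normal((100, 32))             # candidate demonstration embeddings
print(select_demonstrations(rng.standard_normal(32), pool, k=4))
```

Setting `diversity=0` recovers pure nearest-neighbor retrieval; larger values penalize redundant demonstrations, echoing the label-copying failure mode noted above.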
4. Curriculum and Instruction in Demonstration Sequences
The ordering and format of demonstrations directly influence ICL effectiveness. In-Context Curriculum Learning (ICCL) (Liu et al., 16 Feb 2024) proposes that demonstrations be ordered from easy to hard, emulating the principles of curriculum learning. Empirical results demonstrate that curriculum-ordered prompts produce significant performance gains in open-source, instruction-tuned LLMs, especially for more challenging tasks and larger models. Notably, this benefit emerges after instruction tuning, not pretraining, highlighting the importance of prompt structure for parameter utilization efficiency and generalization.
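A minimal sketch of curriculum ordering at prompt-assembly time, assuming a scalar difficulty proxy (here, token count; ICCL itself relies on judged difficulty, so the proxy is an illustrative assumption):

```python
def curriculum_order(demos, difficulty):
    """Order demonstrations easy-to-hard before prompt assembly (ICCL-style).
    `difficulty` is any scalar-valued callable; judged difficulty scores
    would slot in the same way."""
    return sorted(demos, key=difficulty)

def build_prompt(demos, query):
    lines = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    return "\n\n".join(lines + [f"Input: {query}\nLabel:"])

demos = [("a long and winding example sentence", "B"),
         ("short case", "A"),
         ("a medium-length example", "A")]
# Token count as a crude difficulty proxy (an assumption, for illustration).
ordered = curriculum_order(demos, difficulty=lambda d: len(d[0].split()))
print(build_prompt(ordered, "another query"))
```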
Explicit instructions or hypothesis-class descriptions, as investigated in ICL-HCG (Lin et al., 27 Feb 2025), further enhance model generalization and enable systematic extension of ICL methodologies, with particularly marked improvements when deployed within attention-based architectures (e.g., Transformers, Mamba).
5. Scaling Laws: Context, Task, and Architectural Properties
Distinctions between context-scaling (performance increases with more in-context examples) and task-scaling (performance increases with more pretraining tasks) are pivotal in understanding architectural efficacy for ICL (Abedsoltan et al., 16 Oct 2024). Key findings include:
- Transformers naturally exhibit both context- and task-scaling, enabled by their ability to compute data-dependent feature maps over prompts.
- Standard MLPs, in contrast, exhibit only task-scaling unless explicitly provided with data-dependent features computed from the context. Hybrid architectures incorporating both vectorized representations and feature maps can recover both scaling behaviors.
- ICL effectiveness saturates as demonstration count increases, exhibiting "technical debt"—that is, diminishing efficiency and suboptimality relative to oracle estimators as context grows, due to a non-vanishing excess risk in the many-shot regime (Joo et al., 7 Feb 2025). Theoretically, this inefficiency is intrinsic and cannot be resolved by increasing model size or context window alone.
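The context-scaling mechanism can be illustrated with a data-dependent feature map as simple as a kernel smoother over the prompt: with no weight updates, its estimate improves as the number of in-context examples grows, which a fixed-feature predictor that ignores the prompt cannot do. A toy one-dimensional sketch (all settings illustrative):

```python
import numpy as np

def kernel_smoother(ctx_x, ctx_y, query_x, bandwidth=0.3):
    """Data-dependent feature map over the prompt: weight each in-context
    example by kernel similarity to the query. More demonstrations yield a
    better estimate (context-scaling) with frozen parameters."""
    d = query_x[:, None] - ctx_x[None, :]
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    return (w @ ctx_y) / w.sum(axis=1)

rng = np.random.default_rng(2)
f = lambda x: np.sin(3 * x)                # unknown target function
xq = np.linspace(-1, 1, 5)                 # query points
for n in (8, 64, 512):                     # growing context length
    xs = rng.uniform(-1, 1, n)
    ys = f(xs) + 0.1 * rng.standard_normal(n)
    err = np.abs(kernel_smoother(xs, ys, xq) - f(xq)).mean()
    print(f"context={n:4d}  mean abs error={err:.3f}")
```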
6. Application Domains, Robustness, and Extensions
The ICL paradigm has catalyzed advances in domains beyond language:
- Computer Vision: Visual in-context learning leverages prompt selection and fusion (ensemble over diverse spatial arrangements) for improved segmentation and detection, matching or outperforming meta-learning baselines (Sun et al., 2023). Automated and supervised retrieval methods, as well as pixel-level feature matching, enable context offerings well-matched to query images (Zhang et al., 2023).
- Imbalanced Regression: Localized and bias-corrected prompt selection (e.g., via inverse density neighbor retrieval) delivers robust performance on regression tasks with skewed label distributions, outperforming in-weight learning in underrepresented regions (Nejjar et al., 28 May 2024); a minimal retrieval sketch follows this list.
- Reinforcement Learning: Frameworks such as ICEE balance exploration-exploitation trade-offs at inference by leveraging epistemic uncertainty learned in sequence models, eschewing explicit Bayesian computation or online gradient updates (Dai et al., 11 Mar 2024). Transformers can be trained to approximate temporal-difference methods in their forward pass, establishing ICL as a generic meta-learning mechanism for RL (Wang et al., 22 May 2024).
- 3D Point Cloud Processing: The Point-In-Context architecture applies prompt-driven, in-context segmentation and multitask learning to 3D data by jointly sampling spatially aligned tokens and leveraging dynamic, context-dependent labeling (Liu et al., 18 Apr 2024).
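Returning to the imbalanced-regression item above, here is a minimal sketch of inverse-density neighbor retrieval: candidates are scored by feature proximity downweighted by label density, so rare-label regions are not crowded out of the prompt. This is a simplified reading of density-aware retrieval, not the published method; the bandwidth, scoring rule, and data are illustrative assumptions:

```python
import numpy as np

def inverse_density_neighbors(query_x, X, y, k=8, bandwidth=0.5):
    """Rank prompt candidates by feature distance scaled by label density:
    small distance AND low label density (rare region) score best."""
    # Kernel density estimate over labels: low density = underrepresented.
    diffs = (y[:, None] - y[None, :]) / bandwidth
    density = np.exp(-0.5 * diffs ** 2).mean(axis=1)
    dist = np.linalg.norm(X - query_x, axis=1)
    score = dist * density
    return np.argsort(score)[:k]

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 6))                 # candidate features
y = np.abs(rng.standard_normal(200)) ** 2         # skewed label distribution
print(inverse_density_neighbors(rng.standard_normal(6), X, y, k=8))
```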
7. Open Problems and Future Directions
Despite significant progress, challenges remain:
- Noisy, Adversarial, or Out-of-Distribution Demonstrations: While rectification and robust selection strategies represent advances, more refined approaches for semi-supervised, streaming, or adversarially perturbed contexts demand attention.
- Scaling to Massive Contexts: LCLMs make example selection less sensitive in the many-shot regime, shifting emphasis to data augmentation and context population rather than selection (Baek et al., 22 Dec 2024).
- Theoretical Gaps: The quantification of ICL's intrinsic inefficiency at scale ("technical debt") motivates new adaptive methods combining ICL with on-the-fly adaptation or parameter-based updates (Joo et al., 7 Feb 2025).
- Algorithmic Mechanisms: Deeper understanding of the interplay between in-context feature construction, attention mechanisms, and higher-order reasoning in demonstrations (e.g., concept-aware or curriculum-based prompting) will further elucidate the limits and strengths of the in-context paradigm.
In summary: In-context learning methods unify a growing spectrum of techniques for leveraging demonstrated context in frozen models for rapid adaptation. The field is now characterized by formal analysis of sample complexity and scaling, algorithmic innovations for selection and robustness (especially under noise), data- and concept-driven curriculum construction, and empirical validation across modalities and tasks. While state-of-the-art methods such as rectification fortify ICL against practical noise, overarching challenges related to efficiency, scaling, and distribution shift remain prominent areas for research.