Scaling with In-Context Exemplars
- Scaling with in-context exemplars is a technique that uses diverse and numerous prompt examples to improve model performance and reduce compositional generalization gaps.
- It leverages innovations like structured prompting and efficient transformers, enabling long-context processing and significant accuracy gains while reducing evaluation variance.
- Adaptive and automated exemplar selection methods optimize diversity, relevance, and ordering, yielding practical improvements in robustness and transferability across tasks.
Scaling with in-context exemplars refers to the systematic effects—both empirical and theoretical—observed when the number, diversity, and selection strategy of prompt-based examples provided to a pretrained model are increased, with the goal of improving the performance, robustness, and generality of in-context learning (ICL). This topic encompasses architectural, algorithmic, statistical, and practical considerations that govern how large language and vision models respond as more and better-selected demonstrations are incorporated. The following sections delineate the foundational concepts, methodological developments, empirical trends, theoretical analyses, and open research directions regarding scaling with in-context exemplars, drawing exclusively upon authoritative research literature.
1. Theoretical Foundations: ICL, Generalization Gaps, and Scaling Laws
In-context learning is characterized by a model’s ability to perform new tasks solely by conditioning on a set of demonstration exemplars present in the context, without updating model parameters. The effectiveness of ICL thus depends on how information from these exemplars is processed and generalized.
A key metric introduced is the compositional generalization gap: the relative loss in performance when models face out-of-distribution (OOD) compositions compared to in-distribution (ID) ones. This is formalized as:
$$\Delta_{\text{comp}} \;=\; \frac{\text{Acc}_{\text{ID}} - \text{Acc}_{\text{OOD}}}{\text{Acc}_{\text{ID}}},$$
so that smaller values indicate better compositional generalization. As models are scaled (i.e., more parameters), this relative gap decreases, indicating improved robustness to novel compositional combinations in semantic parsing tasks (Hosseini et al., 2022).
Recent work frames ICL as approximate Bayesian inference, showing that model improvements with increasing exemplars can be captured by Bayesian scaling laws. The expected probability assigned to the correct prediction after $n$ in-context examples takes the form:
$$\mathbb{E}\big[p(\text{correct} \mid n)\big] \;=\; \frac{\sum_{m} \rho_m\, p_m^{K n + 1}}{\sum_{m} \rho_m\, p_m^{K n}},$$
where $K$ denotes ICL efficiency, $\rho_m$ is the prior on candidate task $T_m$, and $p_m$ is the per-example probability of the observed data under task $T_m$ (Arora et al., 21 Oct 2024). These laws subsume earlier power-law trends and offer interpretability for how task prior, per-example information, and sample size govern scaling.
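As a numerical illustration of this scaling law, the following minimal sketch computes the expected probability of the correct prediction as the number of in-context examples grows. It assumes a small discrete set of candidate tasks; the priors, per-example probabilities, and efficiency value are illustrative, not figures from the cited work.

```python
import numpy as np

def bayesian_icl_curve(priors, per_example_probs, K, n_examples):
    """Expected probability of the correct prediction after n in-context
    examples, under the Bayesian scaling law sketched above.

    priors            : task priors rho_m (should sum to 1)
    per_example_probs : per-example probabilities p_m under each candidate task
    K                 : ICL efficiency (how much each exemplar sharpens the posterior)
    n_examples        : iterable of shot counts n
    """
    priors = np.asarray(priors, dtype=float)
    p = np.asarray(per_example_probs, dtype=float)
    curve = []
    for n in n_examples:
        numerator = np.sum(priors * p ** (K * n + 1))
        denominator = np.sum(priors * p ** (K * n))
        curve.append(numerator / denominator)
    return np.array(curve)

# Illustrative values (hypothetical): two candidate tasks; the "true" task
# assigns higher per-example probability but starts with a low prior.
shots = range(0, 65, 8)
curve = bayesian_icl_curve(priors=[0.1, 0.9],
                           per_example_probs=[0.8, 0.3],
                           K=0.5,
                           n_examples=shots)
for n, v in zip(shots, curve):
    print(f"n={n:2d}  expected p(correct) = {v:.3f}")
```

With these toy values the curve starts near the prior-weighted mixture and climbs toward the per-example probability of the dominant task, mirroring the saturation behavior the law predicts.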
2. Model Architectures and Efficient Scaling Beyond Few-Shot
Architectural constraints—especially input window and attention complexity—have classically limited scaling to few-shot settings. Several innovations overcome these bottlenecks:
- Structured Prompting: By dividing demonstrations into groups and applying right-aligned position embeddings, structured prompting enables efficient context utilization with linear rather than quadratic attention complexity. The test input attends jointly to pre-encoded group representations using rescaled weights, allowing thousands of exemplars per prompt with significant gains (3–5% absolute accuracy on text tasks) and sharply reduced evaluation variance (Hao et al., 2022).
- Efficient Transformers for Long-Range Contexts: Architectures such as EvaLM use attention chunking, compression, and circular positional embeddings, supporting context lengths of up to 256k tokens at test time and achieving optimal performance with demonstration sets substantially larger than classical limits (best accuracy at ∼12k-token prompts). However, benefits eventually saturate or decrease, suggesting inherent upper bounds determined by model capacity and long-range encoding fidelity (Li et al., 2023).
- Structured Attention Models (SAICL): These architectures restrict attention so that each demonstration attends within itself and to the test input, but not to other demonstrations. This reduces attention complexity from quadratic to linear in the number of demonstrations (roughly from $O((nL)^2)$ to $O(nL^2)$ for $n$ demonstrations of length $L$) and is inherently permutation-invariant. Empirically, SAICL matches or surpasses full-attention baselines at up to 3.4× speed-up and continues to yield performance improvements as the number of demonstrations increases (Cai et al., 2023). A minimal sketch of this block-structured attention pattern appears after this list.
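To make the block structure concrete, here is a minimal NumPy sketch of an attention mask in the spirit of structured prompting and SAICL: each demonstration attends within itself and to the test-input block, while the test input attends to all tokens. It illustrates only the masking pattern; position-embedding realignment, attention rescaling across groups, and caching from the cited methods are omitted, and the helper name `structured_attention_mask` is hypothetical.

```python
import numpy as np

def structured_attention_mask(demo_lengths, query_length):
    """Boolean attention mask (True = may attend): each demonstration attends
    within itself and to the test-input block, but not to other
    demonstrations; the test input attends to every token."""
    total = sum(demo_lengths) + query_length
    mask = np.zeros((total, total), dtype=bool)
    q_start = sum(demo_lengths)

    start = 0
    for length in demo_lengths:
        block = slice(start, start + length)
        mask[block, block] = True                           # within-demonstration attention
        mask[block, q_start:q_start + query_length] = True  # demonstration -> test input
        start += length

    mask[q_start:, :] = True                                # test input attends to everything
    return mask

# Example: three demonstrations of 4 tokens each, followed by a 3-token test input.
mask = structured_attention_mask([4, 4, 4], query_length=3)
print(mask.astype(int))
```

Because each demonstration block is processed independently of the others, the cost of adding one more demonstration grows with that demonstration's length rather than with the full prompt length.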
3. Selection, Ordering, and Diversity of Exemplars
As the number of exemplars increases, their selection and ordering become significant for both efficiency and model performance:
- Diversity and Relevance Joint Optimization: CEIL leverages conditional determinantal point processes (DPPs) to model both the relevance of each exemplar to the target input and the diversity across the demonstration set, using a quality-diversity kernel of the form
$$L_{ij} \;=\; q_i\,\phi_i^{\top}\phi_j\,q_j,$$
where $q_i$ scores the relevance of exemplar $i$ to the test input and $\phi_i$ is a normalized embedding whose inner products capture similarity (and hence diversity) among exemplars. Parameterizing the tradeoff between relevance and diversity, CEIL achieves state-of-the-art results on multiple NLP tasks, with gains of up to 6.4 points on semantic parsing and notable transferability to unseen tasks (Ye et al., 2023). A simple selection sketch based on this kind of kernel appears after this list.
- Ordering and Automated Sequence Optimization: The EASE framework demonstrates that both which exemplars are chosen and their order in the prompt can substantially impact downstream accuracy. EASE formulates exemplar selection as a black-box optimization problem over ordered sequences and applies a neural bandit algorithm to efficiently explore this space, outperforming retrieval-based and subset-selection baselines in 17 out of 19 tasks tested (Wu et al., 25 May 2024).
- Adaptive Selection and Redundancy Avoidance: Adaptive-Prompt iteratively selects exemplars guided by model feedback (uncertainty or entropy), dynamically updating the uncertainty of each candidate as the demonstration set grows. This adaptivity avoids redundant knowledge coverage and provides consistent, statistically significant improvements over fixed, non-adaptive selection strategies (e.g., +0.8% on GSM8K) (Cai et al., 23 Dec 2024).
- Active Selection via Value Functions: By modeling the ICL process as associative memory retrieval (Hopfield networks), exemplar quality can be analytically decomposed into "instance error" and "contextual error." Selecting exemplars that minimize both—ideally through a Monte Carlo estimate of their empirical value function—improves scaling, especially when context limits force compact, high-utility selection (Zhao, 2023).
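The selection sketch referenced above: a minimal implementation of quality-diversity exemplar selection, assuming precomputed exemplar embeddings and relevance scores. The greedy MAP routine is a generic DPP heuristic, not CEIL's trained retriever, and the helper names `build_kernel` and `greedy_map_selection` are illustrative.

```python
import numpy as np

def build_kernel(embeddings, relevance, tradeoff=1.0):
    """Quality-diversity kernel L_ij = q_i * <phi_i, phi_j> * q_j.

    embeddings : (n, d) exemplar embeddings (rows are L2-normalized below)
    relevance  : (n,) relevance scores of each exemplar to the test input
    tradeoff   : exponent controlling how strongly relevance dominates diversity
    """
    phi = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q = np.asarray(relevance, dtype=float) ** tradeoff
    return q[:, None] * (phi @ phi.T) * q[None, :]

def greedy_map_selection(L, k):
    """Greedy MAP inference for a DPP: repeatedly add the exemplar that
    maximizes log det of the selected submatrix, favoring exemplars that are
    relevant but dissimilar to those already chosen."""
    n = L.shape[0]
    selected = []
    for _ in range(k):
        best, best_score = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            score = logdet if sign > 0 else -np.inf
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected

# Illustrative usage with random embeddings and relevance scores.
rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 32))
rel = rng.uniform(0.1, 1.0, size=50)
L = build_kernel(emb, rel, tradeoff=2.0)
print("chosen exemplar indices:", greedy_map_selection(L, k=4))
```

The first pick is driven purely by relevance (the 1x1 determinant is just the squared quality term), while later picks are penalized for similarity to exemplars already chosen, which is exactly the relevance-diversity tradeoff the kernel encodes.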
4. Impact of Model Scale and Pretraining
Model scaling exhibits non-uniform effects on ICL:
- Emergent Capabilities and Priors: Larger LLMs exhibit a transition from reliance on semantic priors (pretraining knowledge) to learning arbitrary input-label mappings provided by in-context exemplars. For example, only sufficiently large models can override semantic priors with flipped labels or succeed in semantically-unrelated label settings (Wei et al., 2023).
- Robustness to Down-Scaling: Evidence from dense scaling and weight-pruning experiments shows that fact recall degrades rapidly once more than roughly 30% of parameters are removed, whereas ICL, i.e., the model's ability to process and learn from prompt demonstrations, remains robust even under aggressive parameter reduction (up to 60–70% sparsity). For tasks that leverage context (overriding stored "facts" or learning parameterized functions through exemplars), smaller or pruned models can perform nearly as well as their larger counterparts (Jin et al., 2023).
- Pretraining on Code: Exposure to code during pretraining enhances compositional generalization in in-context learning, as seen in the superior OOD compositional performance of Codex and CodeGen over pure natural language pretraining (Hosseini et al., 2022).
5. Limitations, Data Regimes, and the Role of Long Contexts
- Limits of Sample Selection in the Many-Shot Regime: With the rise of long context LLMs (LCLMs), prompt length allows hundreds or thousands of exemplars per query. Experimental evidence indicates that when context can be densely filled, sophisticated selection or ordering methods offer little benefit over random sampling. The bottleneck shifts to acquiring or generating enough valid examples to fill the window, including via data augmentation, which can yield ∼5% accuracy improvement in low-resource settings (Baek et al., 22 Dec 2024).
- Plateauing and Diminishing Returns: Both empirical results and scaling law analyses reveal performance plateaus: further increasing context length, demonstration count, or prompt diversity beyond a certain point yields diminishing or even negative returns due to model capacity, tokenization, and memory limitations (Li et al., 2023, Baek et al., 22 Dec 2024).
- Interaction with Alignment and Safety: The emergence of many-shot "jailbreaking"—restoring capabilities suppressed by fine-tuning through in-context demonstrations—demonstrates that Bayesian scaling laws accurately predict when suppressed behaviors reappear. Alignment post-training mostly affects task priors, but with enough in-context examples, base capabilities resurface as the posterior shifts. This underscores the challenge for safety interventions relying on post-training prior manipulation alone (Arora et al., 21 Oct 2024).
6. Cross-Modal Extensions and Domain-Specific Adaptations
- Vision and Logical Reasoning: In large vision models, performance is highly sensitive to prompt selection, with careful retrieval of semantically and structurally relevant exemplars (via supervised or unsupervised methods) yielding substantial gains (up to 8% mIoU in segmentation tasks) (Zhang et al., 2023). For math and logical reasoning, model performance improves when selection leverages not just semantic similarity but also the structural similarity of the underlying reasoning graph, implemented via methods such as graph kernels (RGER) (Lin et al., 17 Sep 2024). A simplified retrieval sketch combining semantic and structural similarity appears after this list.
- Dialogue and Long-Context Tasks: Stabilizing in-context performance in dialogue state tracking requires meta-learning over prompt sets and saliency-aware compression. This enables inclusion of more exemplars within limited context budgets and consistently reduces variance in model responses (Chen et al., 2023).
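As a simplified proxy for structure-aware retrieval in the spirit of RGER (referenced above), the sketch below ranks candidate exemplars by a weighted mix of embedding similarity and a crude structural score, here just Jaccard overlap over the edge sets of toy reasoning graphs rather than a proper graph kernel; all names and data are illustrative.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def structural_similarity(edges_a, edges_b):
    """Jaccard overlap between edge sets of two (toy) reasoning graphs.
    A crude stand-in for a proper graph kernel."""
    a, b = set(edges_a), set(edges_b)
    return len(a & b) / max(len(a | b), 1)

def rank_exemplars(query_emb, query_edges, pool, alpha=0.5):
    """Score each candidate by a weighted mix of semantic (embedding) and
    structural (reasoning-graph) similarity; return indices, best first."""
    scores = [
        alpha * cosine(query_emb, cand["emb"])
        + (1 - alpha) * structural_similarity(query_edges, cand["edges"])
        for cand in pool
    ]
    return list(np.argsort(scores)[::-1])

# Illustrative pool: embeddings are random; edges encode reasoning-step orderings.
rng = np.random.default_rng(1)
pool = [
    {"emb": rng.normal(size=16), "edges": [("given", "add"), ("add", "answer")]},
    {"emb": rng.normal(size=16), "edges": [("given", "subtract"), ("subtract", "answer")]},
    {"emb": rng.normal(size=16), "edges": [("given", "multiply"), ("multiply", "answer")]},
]
query_emb = rng.normal(size=16)
query_edges = [("given", "subtract"), ("subtract", "answer")]
print("ranked exemplar indices:", rank_exemplars(query_emb, query_edges, pool))
```

Swapping the Jaccard score for a real graph kernel and learned relevance weights would bring this closer to the structure-aware retrieval described in the cited work.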
7. Open Challenges and Future Directions
The following themes characterize the current frontier:
- Upper Bounds and Efficiency: Identifying how far performance can scale given context, memory, and computational constraints is an ongoing topic. Efficient transformer variants and structured attention mechanisms are likely to remain important for pushing these bounds.
- Scaling Laws and Interpretability: The adoption of Bayesian and other theoretically justified scaling laws, which include explicit terms for priors, per-example likelihoods, and efficiency, provides both interpretability and predictive fidelity. This invites further research into model-based selection, dynamic adaptation, and interpretability in scaling regimes (Arora et al., 21 Oct 2024).
- Cross-Domain Generalization and Transfer: Frameworks such as CEIL demonstrate compositional transferability, and exemplar retrievers trained on one backbone or task domain may generalize to others, suggesting the potential for universal retrieval modules.
- Automated and Adaptive Selection: The move toward principled, automated, and adaptive selection mechanisms continues to dominate methodological advances, spanning DPP-based subset optimization, neural bandit optimization, active and adaptive feedback-driven strategies, and reinforcement learning with self-evaluation.
- Robustness to Distribution Shift and Adversarial Examples: Approaches that integrate deeper semantic and structural features in selection (DQ-LoRe, RGER) outperform simple retrieval even under distributional shift, indicating a promising direction for scaling ICL to more challenging, real-world tasks (Xiong et al., 2023, Lin et al., 17 Sep 2024).
- Model Alignment and Safety Trade-offs: Methods such as RIDE exploit restyled demonstration exemplars to tune alignment properties (balancing factuality and safety) without fine-tuning, using structured prompt engineering and hierarchical search, with consistent improvements observed across alignment benchmarks (e.g., +0.22 on just-eval-instruct) (Hua et al., 17 Feb 2025).
Scaling with in-context exemplars thus spans architectural design, statistical learning theory, selection and adaptation strategies, and empirical scaling laws. The field continues to evolve rapidly, guided by both theoretical insights and the practical limits imposed by model capacity, computation, and the changing frontier of long-context-window architectures.