In-Context Exemplar Selection

Updated 17 August 2025
  • In-context exemplar selection is the process of choosing and ordering informative examples to optimize performance in large language models.
  • It employs diverse methodologies—from heuristic similarity retrieval to reinforcement learning—that can improve accuracy by measurable margins.
  • Learned selection policies transfer across tasks, reduce performance variance, and adapt to model-specific nuances even with unbalanced exemplar sets.

In-context exemplar selection is the process of identifying and assembling the small set of labeled demonstration instances provided to LLMs or other foundation models as part of the prompt for in-context learning (ICL). Since ICL performance often depends strongly on which examples are chosen, their order, diversity, and intrinsic informativeness, the principled selection of these exemplars is a critical lever for maximizing downstream task performance, stability, and generalization. The growing body of research has produced a wide spectrum of methodologies, ranging from heuristic similarity-based retrieval and diversity-augmenting approaches to formal optimization and reinforcement learning frameworks. Recent work also demonstrates that the impact of example selection may diminish for models with strong emergent capabilities or extremely long context windows, yet robust selection remains central in many practical and challenging scenarios.
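
As a concrete illustration of the simplest family of methods mentioned above, the sketch below performs heuristic similarity-based retrieval: it embeds a labeled candidate pool and a test query with a sentence encoder and returns the k nearest neighbors as demonstrations. This is a minimal sketch assuming the sentence-transformers library; the encoder name and toy pool are illustrative, and this is not the learned selection policy discussed in the sections that follow.

```python
# Minimal sketch of similarity-based exemplar retrieval, a common ICL baseline.
# Assumes the sentence-transformers package; the encoder and toy data are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

def retrieve_exemplars(query: str, pool: list[tuple[str, str]], k: int = 4) -> list[tuple[str, str]]:
    """Return the k (text, label) pairs from the pool most similar to the query."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    pool_emb = encoder.encode([text for text, _ in pool], normalize_embeddings=True)
    query_emb = encoder.encode([query], normalize_embeddings=True)[0]
    scores = pool_emb @ query_emb              # cosine similarity (embeddings are normalized)
    top_idx = np.argsort(-scores)[:k]          # indices of the k highest-scoring candidates
    return [pool[i] for i in top_idx]

# Toy usage: retrieve two sentiment demonstrations for a new query.
pool = [("The plot was dull.", "negative"), ("A delightful surprise.", "positive"),
        ("The acting felt wooden.", "negative"), ("Beautifully shot and paced.", "positive")]
print(retrieve_exemplars("I loved the soundtrack.", pool, k=2))
```

Retrieval of this kind is cheap and widely used, but, as the next section discusses, which examples end up in the prompt can still swing accuracy dramatically.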

1. Sensitivity and Instability in Example Selection

The performance of in-context learning is highly sensitive to the identities, composition, and order of the exemplars in the prompt. Empirical findings show that for models such as GPT-2, the accuracy achieved by different random 4-shot prompt draws can vary by over 30 percentage points and, for multi-class tasks, sometimes fall below the level of random guessing (Zhang et al., 2022). This sensitivity can be only partially mitigated by simple strategies such as reordering demonstration examples or prompt calibration (e.g., with content-free inputs for baseline correction). Such stochasticity highlights the need for systematic, data-driven selection policies capable of actively identifying informative, robust demonstration sets.
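
One minimal way to quantify this sensitivity is to score many random k-shot draws and report the spread of accuracies. The sketch below assumes a hypothetical evaluate_prompt(examples, eval_set) helper that formats the demonstrations, queries the model, and returns accuracy; the shot count and number of draws are illustrative.

```python
# Sketch: quantify ICL sensitivity by scoring many random k-shot prompt draws.
# `evaluate_prompt(examples, eval_set)` is a hypothetical helper that formats the
# demonstrations, queries the language model, and returns accuracy on eval_set.
import random
import statistics

def sensitivity_study(train_pool, eval_set, evaluate_prompt, k=4, n_draws=50, seed=0):
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_draws):
        draw = rng.sample(train_pool, k)           # one random k-shot demonstration set
        accuracies.append(evaluate_prompt(draw, eval_set))
    return {
        "mean": statistics.mean(accuracies),
        "stdev": statistics.stdev(accuracies),
        "min": min(accuracies),                    # worst-case draw
        "max": max(accuracies),                    # best-case draw; the spread can exceed 30 points
    }
```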

This instability is further compounded in settings with complex or distributionally mismatched tasks, where heuristics such as lexical similarity, label balance, or naive diversity may not capture the nuanced signals required for effective model adaptation.

2. Formalizations and Algorithms for Exemplar Selection

To move beyond ad hoc or greedy instance selection, contemporary research formalizes exemplar selection as sequential or combinatorial optimization problems:

  • Markov Decision Process (MDP) Formulation: Example selection is formulated as a sequential decision process, where the policy incrementally selects demonstrations to construct the prompt. The state comprises the current sequence of (input, label) pairs, and the action space is the pool of remaining candidates plus termination (Zhang et al., 2022). Rewards are defined as marginal improvements in model performance upon adding a new example, resulting in a telescoping reward structure.
  • Reinforcement Learning (RL) Approach: Within this MDP, off-policy RL (specifically Q-learning with a conservative Q-learning regularizer) is used to optimize the selection policy. The Q-function is approximated by a neural network over state-action features, which may include the number of selected examples, predicted label probabilities, and the entropy of the model's predictions on candidate examples. This approach allows the learned policy to generalize to unseen tasks and, in practice, yields average improvements of up to 12% over baseline strategies (a schematic sketch of this selection loop follows this list).
  • Reward Shaping and Telescoping Structure: The reward $r(s, a) = f(\text{LM}[s + a]) - f(\text{LM}[s])$ directly aligns the cumulative reward with observed in-context performance gains, ensuring that the sum along a trajectory recapitulates the net benefit of the selected sequence.
  • Baseline and Comparative Methods: The active RL and sequential strategies are benchmarked against random selection, max-entropy active selection, “best-of-n” robust sampling procedures, and instance-level heuristics. RL policies consistently outperform these, particularly in settings with smaller or less robust models.
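
The sketch below illustrates the shape of this formulation rather than the exact algorithm of Zhang et al. (2022): states are sequences of selected (input, label) pairs, actions append a candidate or terminate, the reward is the marginal change in a performance score f, and Q-learning is run over simple hand-crafted state-action features. The score_prompt callable, the linear Q-approximation, and the specific features are assumptions made for illustration; the original work uses a neural Q-network with a conservative regularizer.

```python
# Sketch of exemplar selection as an MDP with a telescoping reward and epsilon-greedy
# Q-learning over a linear Q-function (illustrative, not the authors' implementation).
# `score_prompt(examples)` is a hypothetical callable returning a scalar performance
# estimate f(LM[examples]); pool items are (text, integer_label) pairs.
import numpy as np

STOP = None  # terminal action: stop adding demonstrations

def features(state, action):
    """Tiny hand-crafted state-action features."""
    is_stop = 1.0 if action is STOP else 0.0
    label = 0.0 if action is STOP else float(action[1])   # label assumed to be a numeric class id
    return np.array([1.0, float(len(state)), is_stop, label])

def q_learning_selection(pool, score_prompt, episodes=200, k_max=4,
                         alpha=0.05, gamma=1.0, epsilon=0.2, seed=0):
    """Learn linear Q-weights for sequential exemplar selection."""
    rng = np.random.default_rng(seed)
    w = np.zeros(4)                                        # linear Q-function weights
    for _ in range(episodes):
        state, remaining = [], list(pool)
        prev_score = score_prompt(state)                   # f(LM[.]) with no demonstrations yet
        while len(state) < k_max and remaining:
            actions = remaining + [STOP]
            if rng.random() < epsilon:                     # epsilon-greedy exploration
                action = actions[int(rng.integers(len(actions)))]
            else:
                action = max(actions, key=lambda a: w @ features(state, a))
            if action is STOP:                             # terminating yields no further reward
                w += alpha * (0.0 - w @ features(state, STOP)) * features(state, STOP)
                break
            next_state = state + [action]
            new_score = score_prompt(next_state)
            reward = new_score - prev_score                # telescoping reward r(s, a)
            remaining.remove(action)
            next_q = max(w @ features(next_state, a) for a in remaining + [STOP])
            td_error = reward + gamma * next_q - w @ features(state, action)
            w += alpha * td_error * features(state, action)
            state, prev_score = next_state, new_score
    return w
```

Because the reward telescopes, the return of a full trajectory equals f(LM[final prompt]) minus f(LM[empty prompt]), so maximizing expected return is the same as maximizing the end-to-end in-context performance gain.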

3. Transferability and Systematicity of Learned Selection Policies

A defining property of learned example selection policies is their ability to generalize:

  • Transfer Across Tasks: Policies trained on one task or set of demonstration examples can be effectively applied to new tasks from the same domain. Empirical evaluation reveals a persistent performance gain (5.8% on average) over the best non-learned baselines, even in challenging transfer scenarios.
  • Systematic Cues Recovered by Policies: Interestingly, the RL-based selectors often yield counterintuitive but empirically strong demonstration configurations, such as prompts with unbalanced label distributions that perform better than strictly balanced alternatives. This suggests that exemplar selection policies can adapt to the idiosyncratic “reading” or information acquisition patterns of pretrained LMs, which do not always align with human intuitions or traditional statistical criteria.
  • Variance Reduction: The learned selectors are able to reduce output performance variance by focusing on example sets that are more consistently informative, mitigating the risk of poor task adaptation due to unlucky prompt draws.

4. Quantitative and Practical Considerations

Performance and practical deployment are central to any selection method:

| Deployment Model | Average Improvement | Notes |
|---|---|---|
| GPT-2 | 11.8% | Over random selection on seen tasks, via RL-policy selection (AGNews, Amazon, SST-2, TREC) |
| GPT-2 | 12.1% | Over the max-entropy baseline |
| GPT-2 (transfer) | 5.8% | On unseen tasks |
| GPT-3 Ada | Small positive gain | Transfer of RL-selected demonstrations; gains diminish as model size rises |
| GPT-3 Babbage and larger | Diminished gains | Larger LMs' inherent in-context robustness eclipses selection gains |

For the RL-based example selection approach (Zhang et al., 2022), the computational cost is manageable thanks to an efficient offline Q-learning variant and a minimal feature space. The approach requires neither fine-tuning the LLM itself nor intensive test-time computation beyond applying the selection policy.
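
To make the "minimal feature space" concrete, the sketch below assembles the kind of lightweight state-action features described earlier: the number of already-selected demonstrations plus a candidate's predicted label distribution and its entropy. The label_probs input is assumed to come from a single forward pass of the LM on the candidate; the exact feature set used in the original work may differ.

```python
# Sketch of a minimal state-action feature vector for the selection policy.
# `label_probs` is assumed to be the LM's predicted label distribution for the
# candidate example (e.g., from one forward pass); the exact features may differ.
import numpy as np

def selection_features(num_selected: int, label_probs) -> np.ndarray:
    p = np.asarray(label_probs, dtype=float)
    p = p / p.sum()                                    # normalize defensively
    entropy = -np.sum(p * np.log(p + 1e-12))           # LM uncertainty on this candidate
    return np.concatenate(([float(num_selected), entropy], p))

# Example: two demonstrations already selected; candidate with a confident prediction.
print(selection_features(2, [0.85, 0.10, 0.05]))
```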

However, as model size increases and in-context learning becomes inherently more robust, the marginal benefit of sophisticated example selection decreases. This effect is particularly pronounced in larger models (e.g., GPT-3 Curie and beyond), indicating emergent in-context generalization capabilities that diminish the relative value of selection effort.

5. Properties and Nuances of Effective Exemplars

The process of learning selection policies reveals nuanced criteria for "good demonstrations":

  • Beyond Label Balance: Policies may favor unbalanced label distributions in some binary classification tasks if such configurations offer lower performance variance or empirically greater benefit for specific model architectures (a simple probe of this effect is sketched after this list).
  • Idiosyncratic Model Preferences: What constitutes a “good set” is model-dependent: LMs acquire task signal and exhibit context sensitivity in highly non-transparent ways. Learned selection policies uncover these systematic but opaque phenomena, suggesting that explicit optimization is necessary where hand-crafted heuristics cannot access model-specific preferences.
  • Compounding Effects: Demonstration order, calibration, and inter-example synergy, all of which are handled implicitly by RL-based sequential selection, can have a non-negligible impact on performance, particularly at small shot counts.
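
A simple probe of the label-balance observation above is to group random k-shot draws by label composition and compare the mean and variance of their accuracies. The sketch below reuses the same hypothetical evaluate_prompt helper as in the earlier sensitivity sketch and assumes binary integer labels.

```python
# Sketch: compare k-shot prompts by label composition (e.g., 2/2 vs. 3/1 vs. 4/0)
# to check whether strictly balanced demonstrations are actually best for a given LM.
# `evaluate_prompt(examples, eval_set)` is the same hypothetical helper as before;
# pool items are (text, label) pairs with binary integer labels.
import random
import statistics
from collections import defaultdict

def composition_study(train_pool, eval_set, evaluate_prompt, k=4, n_draws=200, seed=0):
    rng = random.Random(seed)
    by_composition = defaultdict(list)
    for _ in range(n_draws):
        draw = rng.sample(train_pool, k)
        n_positive = sum(1 for _, label in draw if label == 1)
        by_composition[n_positive].append(evaluate_prompt(draw, eval_set))
    return {f"{pos}/{k - pos}": (statistics.mean(accs),
                                 statistics.stdev(accs) if len(accs) > 1 else 0.0)
            for pos, accs in sorted(by_composition.items())}
```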

6. Limitations and Open Directions

Despite their substantial benefits, learning-based selection schemes for in-context exemplars face several open challenges and limitations:

  • Scaling to Larger LM Architectures: The observed diminishing returns as LMs gain emergent in-context capabilities raise the question of how selection policies should adapt for next-generation models with even longer context windows or more robust internal representations.
  • Feature Engineering Constraints: The RL framework in (Zhang et al., 2022) uses intentionally minimal features to promote transferability, but richer input representations (e.g., embeddings, syntactic or logical cues) may allow more nuanced or domain-adaptive selection in future work.
  • Prompt Length and Budget Constraints: Current approaches are tailored to short demonstration budgets (e.g., k=4) and can be susceptible to token-length limitations; extending them to longer prompts or variable prompt formats is an area for further development.
  • Unsupervised and Heterogeneous Pools: Real-world deployments often involve noisy labels, unlabeled pools, or heterogeneous example sources. Adaptations of these policies and reward functions for unsupervised, semi-supervised, or active learning integration are needed to enhance robustness.
  • Interplay with Order and Calibration: The impact of demo order, prompt calibration, and joint optimization of selection/order/calibration remains incompletely understood.

7. Implications for Research and Practice

The formalization of in-context exemplar selection as a sequential, reward-driven policy learning problem is a foundational advance for improving the stability and reliability of inference-only learning with LLMs. This line of work illustrates that:

  • Prompt construction is a principled, model-sensitive process in which the choice of exemplars acts as an implicit interface to the pretrained knowledge substrate.
  • Active policies yield measurable and robust performance improvements, variance reduction, and generalization to new tasks, especially in non-trivial, low-shot settings.
  • As models scale, the marginal value of sophisticated selection diminishes, but the framework remains critical for cases where model robustness, sample scarcity, or demonstration heterogeneity are primary concerns.

This paradigm sets the stage for subsequent research into learnable, transferable, and adaptive prompt engineering—where model-in-the-loop selection and precise optimization over demonstration subsets will continue to shape the practice of in-context learning in NLP and beyond.

References

  • Zhang, Y., Feng, S., & Tan, C. (2022). Active Example Selection for In-Context Learning. EMNLP 2022.