In-Context Examples Selection
- In-context examples selection is the process of choosing and ordering demonstration examples to maximize prediction quality in LLMs.
- It leverages similarity-based, diversity-enhanced, and structure-aware methods to reduce redundancy and boost information coverage.
- Recent research shows that optimal example selection significantly improves generalization, sample efficiency, and robustness across diverse tasks.
In-context examples selection refers to the methods and algorithms used to identify, from a larger pool, the optimal set and ordering of demonstration examples to include in the prompt for in-context learning (ICL) with LLMs or analogous architectures. The quality and relevance of these examples directly influence ICL generalization, sample efficiency, and robustness across domains. The following sections review the principal directions, algorithms, empirical results, and current open problems in in-context example selection, with a focus on findings from recent research literature.
1. Motivation, Problem Formulation, and Basic Principles
ICL enables LLMs to perform new tasks by conditioning on a handful of labeled demonstration examples; no parameter tuning is involved. The core technical challenge is: given a pool of candidate demonstrations and a query instance, select a subset of size k (often subject to a token-budget constraint) such that the expected prediction quality for the query (and possibly for a distribution of queries) is maximized. The combinatorial nature of the selection (and, in many contexts, the ordering) makes naive approaches intractable or suboptimal.
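One way to write this objective is sketched below. The notation is introduced here for illustration only and is not taken from the cited works: $\mathcal{D}$ is the candidate pool, $x_q$ the query with reference output $y_q$, $S$ the ordered demonstration sequence, $k$ its size, $B$ the token budget, $f_{\mathrm{LLM}}$ the frozen model, and $Q$ a task quality metric.

```latex
S^{*} \;=\; \arg\max_{\substack{S \subseteq \mathcal{D},\; |S| = k \\ \mathrm{tokens}(S) \le B}}
\; \mathbb{E}_{(x_q,\, y_q)} \Big[ Q\big( f_{\mathrm{LLM}}(S, x_q),\; y_q \big) \Big],
\qquad
\#\{\text{ordered } S\} \;=\; \frac{|\mathcal{D}|!}{(|\mathcal{D}| - k)!}
```

The count on the right shows why exhaustive search is impractical: with a pool of 1,000 candidates and k = 4 there are already roughly 10^12 ordered candidate prompts.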
Common heuristics and frameworks include:
- Sampling randomly (serving as a lower bound and a baseline).
- Similarity-based retrieval (dense or sparse embeddings, n-gram overlap, BM25, or more specialized metrics).
- Coverage-maximizing or diversity-enhancing selection (to reduce redundancy and increase information spread).
- Machine-learned or model-aware scoring (regression-based, influence-based, Bayesian reasoning, sequential or RL-inspired methods).
- Task- or domain-adapted selection, leveraging idiosyncratic properties such as syntax, error type, or program structure.
The performance sensitivity to example choice has been established across modalities (NLP, vision, speech, code), tasks (classification, generation, translation, parsing), and LLM families.
2. Similarity-Based and Coverage-Driven Selection
Lexical and semantic similarity: The most widely adopted paradigm retrieves the most similar instances by cosine similarity in pretrained embedding space (e.g., SBERT, CLIP), lexical overlap (BM25, BLEU), or n-gram recall (Agrawal et al., 2022, Zhang et al., 2023, Zebaze et al., 2024). For in-context MT, BM25-based selection and its n-gram coverage-enhancing extension (R-BM25) significantly exceed random sampling in both in-domain and out-of-domain scenarios, yielding up to +20 BLEU improvements in domain transfer (Agrawal et al., 2022).
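As a minimal sketch of lexical retrieval, the following scores a small pool with a standard Okapi BM25 formula and keeps the top-k sources. The whitespace tokenizer, the `k1` and `b` defaults, the helper names, and the toy translation pool are all assumptions made for illustration; this is not the R-BM25 variant described above.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def select_top_k(query, pool, k):
    """Return indices of the k (source, target) pairs whose source scores highest."""
    scores = bm25_scores(query, [src for src, _ in pool])
    ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical English-French demonstration pool.
pool = [
    ("the cat sat on the mat", "le chat s'est assis sur le tapis"),
    ("stock prices fell sharply today", "les cours ont fortement chuté aujourd'hui"),
    ("a cat slept on the sofa", "un chat dormait sur le canapé"),
]
top = select_top_k("the cat slept on the mat", pool, k=2)
print(top)
```

In this toy pool the finance sentence shares no term with the query, so it receives a score of zero and the two cat sentences are selected.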
BERTScore-Recall and set-level coverage: For tasks exhibiting compositional or multi-step structure (semantic parsing, math QA), selection via BERTScore-Recall (BSR) or similar coverage metrics demonstrably improves ICL by favoring demonstrations that collectively "cover" the salient tokenwise embedding features of the test query (Gupta et al., 2023). The submodular "Set-BSR" maximizes set-level coverage and is optimized greedily, giving up to +17% EM in compositional splits and outperforming trained retrievers.
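The greedy set-level idea can be illustrated with a much simpler coverage function than BERTScore-Recall. The sketch below measures coverage as the fraction of query tokens that appear in at least one selected example; this exact-match proxy, the function names, and the toy queries are assumptions for illustration and do not reproduce Set-BSR's embedding-based scoring.

```python
def coverage(query_tokens, selected_token_sets):
    """Fraction of query tokens covered by at least one selected example."""
    if not query_tokens:
        return 0.0
    covered = set().union(*selected_token_sets)
    return len(query_tokens & covered) / len(query_tokens)


def greedy_set_coverage(query, candidates, k):
    """Greedily add the candidate with the largest marginal coverage gain."""
    q = set(query.lower().split())
    cand_tokens = [set(c.lower().split()) for c in candidates]
    selected = []
    for _ in range(min(k, len(candidates))):
        current = coverage(q, [cand_tokens[i] for i in selected])
        best_i, best_gain = None, -1.0
        for i in range(len(candidates)):
            if i in selected:
                continue
            gain = coverage(q, [cand_tokens[j] for j in selected + [i]]) - current
            if gain > best_gain:  # strict comparison: ties keep the earlier candidate
                best_i, best_gain = i, gain
        selected.append(best_i)
    return selected


query = "list flights from boston to denver on monday"
candidates = [
    "list flights from boston to chicago",
    "show flights from boston to dallas",
    "which airlines fly to denver on monday",
    "list flights from boston to atlanta",
]
chosen = greedy_set_coverage(query, candidates, k=2)
print(chosen)
```

Ranking candidates independently would pick the two near-identical "list flights from boston" examples; the set-level objective instead pairs one of them with the "denver on monday" example, because the second duplicate adds no new coverage.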
Diversity-enhanced MMR: Pure top-k similarity selection leads to redundant, near-duplicate contexts in both sparse and dense spaces. Maximum Marginal Relevance (MMR)-style reranking greedily balances similarity to the query with diversity among the selected examples. Empirical results on classification challenges show consistent gains that become more pronounced as context size grows (Kapuriya et al., 3 May 2025). The trade-off parameter controlling diversity is best tuned per task, although results are reported to be robust across a range of values.
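A minimal sketch of MMR-style reranking follows. It uses the standard MMR score (a weighted difference between relevance to the query and the maximum similarity to anything already selected); the Jaccard similarity, the parameter name `lam`, its default of 0.5, and the toy sentences are assumptions for illustration, not settings from the cited work.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def mmr_select(query, candidates, k, lam=0.5, sim=jaccard):
    """Greedy MMR: lam weights relevance, (1 - lam) penalizes redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = sim(query, candidates[i])
            redundancy = max(
                (sim(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = "great movie with a great cast"
candidates = [
    "great movie with a great cast and plot",
    "great movie with a great cast and story",
    "terrible movie with a weak cast",
]
diverse = mmr_select(query, candidates, k=2, lam=0.5)
similarity_only = mmr_select(query, candidates, k=2, lam=1.0)
print(diverse, similarity_only)
```

With `lam=1.0` the procedure reduces to plain top-k similarity and returns the two near-duplicate positive reviews; with `lam=0.5` the second pick switches to the contrasting negative review.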
Linear regression-based theory: Explicit diversity aids ICL via two mechanisms—ensuring sufficient coverage of the test feature space, and balancing parameter estimation error in under-sampled subspaces—both rigorously analyzed using the ICL-as-linear-regression framework (2505.19426). Theoretical results confirm that diversity is essential for complex, multi-step reasoning, OOD robustness, and "hard" splits, although pure similarity may suffice (or even excel) on single-step or IID tasks.
3. Syntactic and Structure-Aware Example Selection
Dependency- and syntax-based selection: For translation and syntax-heavy tasks, example selection based on deep syntactic structures (i.e., dependency trees or ASTs) has yielded state-of-the-art results. Syntax similarity is operationalized via tree kernels (counting shared subtrees), or by embedding dependency trees as multivariate polynomials and computing bidirectional Polynomial Distance (Manhattan distance of term vectors) (Tang et al., 2024, Tang et al., 2024).
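The general idea of comparing trees through term vectors can be sketched as follows. This is only a rough stand-in: each tree is reduced to counts of root-to-node label paths, and two trees are compared by the Manhattan distance between those counts. The nested-tuple tree encoding, the path-based terms, and the function names are assumptions made here; the actual tree-to-polynomial conversion and the bidirectional form of the distance in the cited work are not reproduced.

```python
from collections import Counter


def path_terms(tree, prefix=()):
    """Count root-to-node label paths in a tree given as (label, [children])."""
    label, children = tree
    path = prefix + (label,)
    terms = Counter([path])
    for child in children:
        terms += path_terms(child, path)
    return terms


def manhattan_distance(terms_a, terms_b):
    """Sum of absolute count differences over all terms seen in either tree."""
    keys = set(terms_a) | set(terms_b)
    return sum(abs(terms_a[key] - terms_b[key]) for key in keys)


# Hypothetical dependency skeletons labeled by relation type.
query_tree = ("root", [("nsubj", []), ("obj", [("det", [])])])
same_shape = ("root", [("nsubj", []), ("obj", [("det", [])])])
subordinate = ("root", [("nsubj", []), ("advcl", [("mark", []), ("nsubj", [])])])

q_terms = path_terms(query_tree)
d_same = manhattan_distance(q_terms, path_terms(same_shape))
d_sub = manhattan_distance(q_terms, path_terms(subordinate))
print(d_same, d_sub)
```

Candidates would then be ranked by ascending distance, so the structurally identical tree is preferred over the one containing a subordinate clause.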
In machine translation, pairing polynomial-distance–based syntax selection with BM25 lexical selection and concatenating both types in the prompt maximizes coverage of latent linguistic patterns, such as subordination, tense or construction. Ensemble (BM25+Polynomial) selection yielded the highest COMET scores in 11 of 12 translation directions, surpassing both word-overlap and learned regression scoring baselines (e.g., CTQScorer) (Tang et al., 2024).
In grammatical error correction, stratified two-stage pipelines (fast BM25/embedding filtering, followed by rigorous structure-based reranking) consistently provided +2 to +4 F₀.₅ improvements over surface-matching (Tang et al., 2024). Parsing ungrammatical inputs via specialized parsers (GOPar), and upweighting error-tagged substructures, further boosts performance.
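The two-stage pattern (cheap filter, then costlier rerank) can be sketched generically. In the example below, unigram overlap stands in for the fast BM25/embedding filter and bigram overlap stands in for the structure-based reranker; both scorers, the function names, and the toy sentences are assumptions for illustration, and a real system would plug in a parser-based similarity at the second stage.

```python
def two_stage_select(query, pool, cheap_score, costly_score, shortlist_size, k):
    """Filter the pool with a cheap scorer, then rerank the shortlist with a costly one."""
    shortlist = sorted(pool, key=lambda ex: cheap_score(query, ex), reverse=True)
    shortlist = shortlist[:shortlist_size]
    reranked = sorted(shortlist, key=lambda ex: costly_score(query, ex), reverse=True)
    return reranked[:k]


def unigram_overlap(a, b):
    """Stage 1 stand-in: number of shared word types (ignores word order)."""
    return len(set(a.split()) & set(b.split()))


def bigrams(text):
    toks = text.split()
    return set(zip(toks, toks[1:]))


def bigram_overlap(a, b):
    """Stage 2 stand-in: shared adjacent word pairs (order-sensitive)."""
    return len(bigrams(a) & bigrams(b))


query = "he go to school every day"
pool = [
    "she go to school every day",
    "every day she goes to school",
    "to school he go every day",
    "the weather is nice",
]
picked = two_stage_select(
    query, pool, unigram_overlap, bigram_overlap, shortlist_size=3, k=2
)
print(picked)
```

The costly scorer only runs on the shortlist, which is the point of the cascade: the scrambled sentence wins the first stage on word overlap but drops to second once word order is taken into account.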
SCOI: Syntax-augmented coverage: Recent advances exploit set-level coverage of syntactic features by extracting term vectors via a tree-to-polynomial algorithm and maximizing average "path" coverage (as opposed to pairwise similarity), then alternating greedy selection by syntax and word coverage (Tang et al., 2024). Alternating these metrics avoids overcovering superficial aspects and ensures demonstration sets expose both common and rare syntactic templates, outperforming all learning-free baselines and sometimes even regression-based CTQScorer.
4. Learning-Based, Sequential, and Model-aware Techniques
Learned regression models (CTQScorer/ensemble): Supervised regression over multi-feature vectors has been shown to outperform all single-feature or similarity-based baselines in MT by blending semantic, quality, length, and familiarity metrics (Kumar et al., 2023). CTQScorer, trained on COMET or COMET-QE "oracle" scores, generalizes across languages and can incorporate additional task-specific features.
Sequential and RL-based selection: Example selection can be framed as a sequential Markov decision process (MDP). Approaches such as Se² (Liu et al., 2024) model example selection as sequential decisions where the addition of each example is conditioned on the preceding sequence and test input; LLM feedback provides the reward signal. Beam search over a learned bi-encoder yields diverse and contextualized sequences, achieving a 42% relative improvement over random selection and outperforming the UPRISE and AES retrievers across 23 diverse NLP tasks. RL approaches using offline Q-learning and conservative regularization (e.g., CQL) discover generalizable policies that transfer modestly to new tasks or models, with an average +5.8% gain on held-out tasks for GPT-2 (Zhang et al., 2022).
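The sequential view can be illustrated with a small beam search in which each step extends partial demonstration sequences by one example and keeps the best-scoring extensions. The `score_sequence` callable is a placeholder: in a system like Se² it would be a learned model trained from LLM feedback, whereas the toy scorer here simply counts covered query tokens. All names and data below are assumptions for illustration.

```python
def beam_search_select(query, pool, score_sequence, k, beam_width=2):
    """Build length-k example sequences step by step, keeping the top beams."""
    beams = [((), 0.0)]  # (tuple of pool indices, score)
    for _ in range(k):
        expanded = []
        for seq, _ in beams:
            for i in range(len(pool)):
                if i in seq:
                    continue
                new_seq = seq + (i,)
                examples = [pool[j] for j in new_seq]
                expanded.append((new_seq, score_sequence(query, examples)))
        if not expanded:
            break
        expanded.sort(key=lambda item: item[1], reverse=True)
        beams = expanded[:beam_width]
    return [list(seq) for seq, _ in beams]


def toy_score(query, examples):
    """Hypothetical stand-in scorer: number of query tokens the sequence covers."""
    covered = set()
    for ex in examples:
        covered |= set(ex.split())
    return float(len(set(query.split()) & covered))


pool = ["red car", "blue house", "red house", "green tree"]
sequences = beam_search_select(
    "translate red car and blue house", pool, toy_score, k=2, beam_width=2
)
print(sequences)
```

Because each extension is scored in the context of the examples already chosen, the search prefers the complementary pair over two overlapping examples. The toy scorer is order-insensitive, so it cannot show the ordering effects that a learned sequence scorer could capture.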
Bayesian inference and inverse example selection: The ByCS framework formalizes ICL example selection in terms of maximizing a posterior probability, which is approximated via an inverse inference step in which the roles of the test input and the candidate example are swapped (Wang et al., 2024). By decoding inverse-predicted "labels" for candidates and ranking by text similarity to the ground truth, ByCS captures model-specific input/output interactions. This approach generalizes across modalities (speech, text, vision) and consistently outperforms vanilla similarity-based selection, especially for open-ended response spaces.
Influence-based and data compression selection: Example influence can be estimated via marginal changes in ICL performance across sampled subsets. Positive- and negative-influence sets exhibit up to 16.3% test accuracy variance (Nguyen et al., 2023). Alternative methods approximate dataset compression: select a candidate pool for each query with high BM25 recall, then rerank using influence functions derived from Fisher meta-gradients in the early layers of the LLM (Sun et al., 2024). Combined influence and relevance scoring achieves +5.9% average improvement across NLP benchmarks.
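The subset-sampling estimate of influence can be sketched as follows: sample random demonstration subsets, evaluate each, and compare the average score of subsets that contain an example with the average of those that do not. The `evaluate` callable is a placeholder for running ICL on a validation set; the toy version below is a hypothetical stand-in in which example 0 helps and example 1 hurts. Function names, subset sizes, and the seed are assumptions for illustration.

```python
import random


def _mean(values):
    return sum(values) / len(values) if values else 0.0


def estimate_influence(pool_size, evaluate, subset_size, n_subsets, seed=0):
    """Influence of example i = mean score with i minus mean score without i."""
    rng = random.Random(seed)
    with_scores = [[] for _ in range(pool_size)]
    without_scores = [[] for _ in range(pool_size)]
    for _ in range(n_subsets):
        subset = set(rng.sample(range(pool_size), subset_size))
        score = evaluate(subset)
        for i in range(pool_size):
            bucket = with_scores if i in subset else without_scores
            bucket[i].append(score)
    return [
        _mean(with_scores[i]) - _mean(without_scores[i]) for i in range(pool_size)
    ]


def toy_evaluate(subset):
    """Hypothetical validation accuracy: example 0 helps, example 1 hurts."""
    return 0.5 + (0.2 if 0 in subset else 0.0) - (0.2 if 1 in subset else 0.0)


influence = estimate_influence(4, toy_evaluate, subset_size=2, n_subsets=200)
print(influence)
```

Examples with the highest estimated influence would be preferred as demonstrations and those with negative influence filtered out. In practice every call to `evaluate` costs a batch of LLM inferences, so the number of sampled subsets is the main budget knob.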
5. Example Generation and Pool Construction
Demonstration Augmentation for Translation (DAT): When human-annotated pools are unavailable (low-resource settings), prompt the LLM to generate synthetic example pairs relevant and diverse with respect to the query (Lee et al., 31 May 2025). An MMR-style criterion is applied among generated candidates (measured by multi-n-gram recall to the query and pairwise similarities), filtering for coverage and minimal redundancy. These synthetic pools can outperform fixed few-shot and retrieval-based baselines, especially when dynamic accumulation is employed at test time.
Hybrid static-dynamic selection (STAYKATE): In domains with limited annotation budgets, maximize information retention by combining static representativeness sampling (from unsupervised data, using entropy-based "anchors") and dynamic similarity retrieval over labeled pools (Zhu et al., 2024). This hybrid method improves F1 on scientific NER and anchors subtle semantics that dynamic-only retrieval often misses.
Analysis of LLM context scaling: In long-context, many-shot regimes, the bottleneck is increasingly the size of the example pool rather than the effectiveness of fine-grained selection (Baek et al., 2024). When the number of in-context examples is large, sophisticated selection methods converge to the performance of random sampling, and context utilization and lightweight data augmentation become the main determinants of further gain.
6. Empirical Findings, Comparisons, and Best Practices
The table summarizes major selection approaches, representative algorithms, and relative empirical gains as reported in the literature.
| Approach | Key Algorithm(s)/Heuristic | Notable Empirical Gain |
|---|---|---|
| Similarity-based | BM25, n-gram recall, dense retriever | +2–20 BLEU (MT OOD); +5–11 laCOMET (low-res MT) |
| Coverage/diversity-based | BERTScore-Recall, Set-BSR, MMR rerank | +9–17 EM (semantic parsing); +2–10 absolute (complex tasks) |
| Syntactic structure | Polynomial Distance, Tree Kernels | +2–3 COMET (MT); +2–4 F₀.₅ (GEC); 11/12 top directions |
| Supervised regression | CTQScorer | +2–6 COMET (MT, multilingual) |
| Sequential/RL | Se², Active Q-learning, AES | +4–12% accuracy; 42% rel gain vs random |
| Bayesian/inverse inf. | ByCS | −7 to −10% WER; +3 acc (open-ended) vs similarity baselines |
| Influence/Compression | Data compression, influence scores | +5.9% accuracy/F1 (NLP tasks) |
| Example generation | DAT, progressive accumulation | +1.3–2.8 COMET (low-res MT) |
Best practices distilled from these works include:
- Optimize demonstration set for coverage on reasoning- or structure-rich tasks; prioritize similarity for single-step/IID classification.
- For translation and syntax-sensitive tasks, ensemble lexical and syntactic selection is more robust than either alone.
- Incorporate model or task-specific signals, e.g., through LLM-in-the-loop, inverse inference, or specialized regression for higher order gains.
- In many-shot, long-context regimes (LCLMs), random sampling suffices if context is fully utilized; focus on data augmentation when pool size is insufficient.
- Always precompute relevant embeddings, parse trees, or feature sets for efficiency.
- Tune coverage, diversity, or MMR trade-off parameters per task on held-out validation to maximize incremental benefit.
7. Open Problems, Limitations, and Future Directions
Current limitations in in-context example selection include:
- Most methods ignore cross-example interactions except for explicit sequential or set-based objectives and may not efficiently model higher-order dependencies.
- Some high-performing strategies (e.g., structure-based or regression) are computationally intensive at retrieval time, motivating hybrid cascades or two-stage filtering.
- Empirical generalization is predominantly evaluated in high-resource tasks and languages; extending to true low-resource zero-shot regimes (with sparse or synthetic demonstration pools) remains challenging.
- Theories of why particular strategies succeed are recent (linear regression, diversity-coverage trade-offs) and may not directly transfer to all LLM architectures or tasks.
- Example order within the prompt is a significant source of variance, especially for small numbers of demonstrations and less robust LLMs; approaches that jointly optimize selection and ordering are nascent.
Open research avenues include joint optimization of example selection and order, integration of coverage/diversity with active learning, scaling structure-aware selection to low-resource and non-English domains, end-to-end learning of retrieval criteria from reward signals, and extending influence-based and Bayes-driven strategies to multitask or chain-of-thought prompts.
The field is rapidly evolving, with a significant shift from simple similarity-based heuristics to multifaceted, model-aware, and task-tailored selection pipelines. Methods that synergize similarity, diversity, and structure while adapting to the resource profile and reasoning demands of the task are proving most effective (Tang et al., 2024, Gupta et al., 2023, 2505.19426, Tang et al., 2024).