
Towards Compute-Optimal Many-Shot In-Context Learning (2507.16217v1)

Published 22 Jul 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Long-context LLMs are able to process inputs containing up to several million tokens. In the scope of in-context learning (ICL), this translates into using hundreds/thousands of demonstrations in the input prompt, enabling many-shot ICL. In practice, a fixed set of demonstrations is often selected at random in many-shot settings due to (1) high inference costs, (2) the benefits of caching and reusing computations, and (3) the similar performance offered by this strategy compared to others when scaled. In this work, we propose two straightforward strategies for demonstration selection in many-shot ICL that improve performance with minimal computational overhead. Our first method combines a small number of demonstrations, selected based on their similarity to each test sample, with a disproportionately larger set of random demonstrations that are cached. The second strategy improves the first by replacing random demonstrations with those selected using centroids derived from test sample representations via k-means clustering. Our experiments with Gemini Pro and Flash across several datasets indicate that our strategies consistently outperform random selection and surpass or match the most performant selection approach while supporting caching and reducing inference cost by up to an order of magnitude. We also show that adjusting the proportion of demonstrations selected based on different criteria can balance performance and inference cost in many-shot ICL.

Summary

  • The paper presents compute-optimal demonstration selection methods that combine similarity-based and cacheable random strategies to achieve near-optimal performance in many-shot settings.
  • Hybrid methods reduce inference cost by reusing cached demonstrations and only updating a small dynamic subset per test sample.
  • Empirical results demonstrate that these strategies outperform random selection while matching the benefits of full similarity-based selection at a fraction of the cost.

Compute-Optimal Demonstration Selection for Many-Shot In-Context Learning

This paper addresses the challenge of demonstration selection in many-shot in-context learning (ICL) for long-context LLMs. As LLMs scale to multi-million token contexts, practitioners increasingly employ hundreds or thousands of demonstrations in a single prompt. However, the naive approach of random demonstration selection, while cache-friendly and computationally efficient, is suboptimal in terms of performance. Conversely, similarity-based selection, which tailors demonstrations to each test sample, is computationally prohibitive due to the inability to cache and reuse prompt computations. The authors propose two hybrid strategies that combine the performance benefits of similarity-based selection with the efficiency of cacheable selection, and empirically demonstrate that these methods achieve near-optimal performance at a fraction of the inference cost.

Problem Formulation and Motivation

In many-shot ICL, the input prompt consists of a large set of demonstrations (examples) and a test query. The two main constraints are:

  • Inference Cost: The quadratic complexity of the attention mechanism in Transformers makes per-sample prompt customization expensive, especially as the number of demonstrations increases.
  • Diminishing Returns of Selection Criteria: As the number of demonstrations grows, the marginal benefit of sophisticated selection (e.g., similarity) over random selection decreases, as shown both in prior work and in the authors' experiments.

The practical goal is to maximize ICL performance while minimizing inference cost, ideally by leveraging caching mechanisms.
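
As a back-of-the-envelope sketch of why caching matters (all numbers hypothetical, chosen only for illustration): with fully per-sample selection, every demonstration token must be processed anew for every test sample, whereas a hybrid approach pays for its cached portion only once.

n_samples, tokens_per_demo = 1_000, 100   # hypothetical workload
total_shots, s = 100, 20                  # 20 dynamic demos, 80 cached
full_selection = n_samples * total_shots * tokens_per_demo      # 10,000,000
hybrid = (total_shots - s) * tokens_per_demo + n_samples * s * tokens_per_demo
print(full_selection / hybrid)            # ~5x fewer uncached prompt tokens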

Proposed Hybrid Selection Strategies

The authors introduce two strategies that partition the demonstration set into a small, dynamic subset and a large, static (cacheable) subset:

  1. Hybrid Similarity-Random Selection: For each test sample, select the s demonstrations most similar to the test query (using embedding-based cosine similarity), and combine them with r randomly selected demonstrations that are fixed and cached across all test samples. Typically, r ≫ s (e.g., s = 20, r = 80 in a 100-shot prompt).
  2. Hybrid Similarity-k-Means Selection: Replace the random cached subset with a diverse set of demonstrations selected via k-means clustering. Centroids are computed from test sample representations, mapped to the demonstration pool, and the closest demonstrations to each centroid are selected. This ensures diversity and semantic coverage in the cached set.

Both strategies allow the majority of the prompt to be cached, with only a small per-sample dynamic component, thus reducing the number of unique forward passes required during inference.
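
Any inference stack that exposes prefix caching can realize this split. Below is a minimal sketch with Hugging Face transformers; the paper itself uses Gemini models served via API, so this is only an illustrative stand-in, and cached_prompt and dynamic_suffix are placeholder strings.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # illustrative model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Encode the static portion (cached demonstrations) exactly once
prefix_ids = tok(cached_prompt, return_tensors="pt").input_ids
with torch.no_grad():
    prefix = model(prefix_ids, use_cache=True)

# Per test sample, only the dynamic suffix (similar demos + query) is run
# through the model; the prefix key-value cache is reused
suffix_ids = tok(dynamic_suffix, return_tensors="pt").input_ids
with torch.no_grad():
    out = model(suffix_ids, past_key_values=prefix.past_key_values,
                use_cache=True)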

Implementation Details

  • Similarity Computation: Embeddings are obtained using a high-quality embedding model (e.g., Gecko), and cosine similarity is used for nearest neighbor search.
  • k-Means Clustering: The number of clusters is determined via the elbow method (a sketch follows this list), and centroids are computed on test sample embeddings. Demonstrations closest to each centroid are selected for the cached set.
  • Prompt Construction: The prompt is constructed by concatenating the cached demonstrations (random or k-means) with the per-sample similar demonstrations and the test query.
  • Caching: Key-value caching is used to avoid recomputation for the static portion of the prompt, compatible with standard LLM inference APIs.
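
The elbow heuristic is not fully specified in this summary; one common variant (a sketch, assuming scikit-learn) picks the k at which the inertia curve bends most sharply:

import numpy as np
from sklearn.cluster import KMeans

def choose_k_elbow(embeddings, k_max=20):
    # Within-cluster inertia for k = 1 .. k_max
    inertias = [KMeans(n_clusters=k, n_init=10).fit(embeddings).inertia_
                for k in range(1, k_max + 1)]
    # The elbow is where the decrease in inertia flattens fastest, i.e.,
    # the largest second difference; +2 maps the index back to a value of k
    return int(np.argmax(np.diff(inertias, 2))) + 2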

Pseudocode for Hybrid Similarity-Random Selection

import random

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Sample the r cached demonstrations once, up front; re-sampling them per
# test sample would defeat prompt caching.
def sample_cached_demos(demo_pool, r, seed=0):
    return random.Random(seed).sample(demo_pool, r)

def select_demonstrations(test_sample, demo_pool, s, cached_demos, embedding_model):
    # Embed the test sample and the demonstration pool
    # (in practice, demo embeddings would be precomputed once)
    test_emb = embedding_model.encode([test_sample])
    demo_embs = embedding_model.encode(demo_pool)
    # Cosine similarity between the test sample and every demonstration
    similarities = cosine_similarity(test_emb, demo_embs)[0]
    # Select the top-s most similar demonstrations
    similar_indices = np.argsort(similarities)[-s:]
    similar_demos = [demo_pool[i] for i in similar_indices]
    # Cached demonstrations first, so their key-value computations can be
    # reused across samples; then the dynamic demos and the test query
    return cached_demos + similar_demos + [test_sample]

Pseudocode for Hybrid Similarity-k-Means Selection

import numpy as np
from sklearn.cluster import KMeans

def select_kmeans_demonstrations(test_samples, demo_pool, k, embedding_model):
    # Embed the test samples and the demonstration pool
    test_embs = embedding_model.encode(test_samples)
    demo_embs = embedding_model.encode(demo_pool)
    # Cluster the test embeddings; centroids summarize the semantic
    # regions of the test distribution
    kmeans = KMeans(n_clusters=k, n_init=10).fit(test_embs)
    # Map each centroid to its nearest demonstration in the pool
    selected_demos = []
    for centroid in kmeans.cluster_centers_:
        distances = np.linalg.norm(demo_embs - centroid, axis=1)
        selected_demos.append(demo_pool[int(np.argmin(distances))])
    return selected_demos  # used as the cached, static set
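
A plausible way to wire the two pieces together (hypothetical usage; embedding_model stands in for an encoder such as Gecko): build the cached set once, then reuse it for every prompt.

# Build the cacheable set once via k-means, then assemble per-sample prompts
cached_demos = select_kmeans_demonstrations(test_samples, demo_pool, k=80,
                                            embedding_model=embedding_model)
prompts = [select_demonstrations(x, demo_pool, s=20,
                                 cached_demos=cached_demos,
                                 embedding_model=embedding_model)
           for x in test_samples]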

Empirical Results

Experiments are conducted on Gemini Pro and Flash models across diverse datasets (ANLI, TREC, GSM Plus, MetaTool, and BBH subsets). Key findings include:

  • Performance: Both hybrid strategies consistently outperform random selection and match or exceed the performance of full similarity-based selection, especially in the many-shot regime.
  • Inference Cost: The hybrid methods reduce inference cost by up to an order of magnitude compared to per-sample similarity-based selection, due to the ability to cache the majority of the prompt.
  • Pareto Efficiency: Pareto analysis shows that the hybrid methods achieve a favorable trade-off between performance and cost, dominating both random and similarity-only baselines.
  • Low-Data Regime: Even when the demonstration pool is small, tuning the ratio of similar to cached demonstrations yields 3–6% performance gains over using the full pool without selection.

Theoretical and Practical Implications

  • Scalability: The hybrid strategies scale linearly with the number of cached tokens, in contrast to the quadratic scaling of per-sample prompt customization.
  • Prompt Optimization: The ratio of similar to cached demonstrations can be treated as a hyperparameter, allowing practitioners to balance performance and cost for specific applications (see the sketch after this list).
  • Generalization: The k-means-based approach encourages diversity and semantic coverage, which is beneficial in tasks with heterogeneous test distributions.
  • Compatibility: The methods are agnostic to the underlying LLM and caching implementation, making them broadly applicable in production settings.
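
One way to tune the similar:cached ratio is a small grid sweep on a held-out set (a sketch; dev_accuracy is a hypothetical evaluation routine that runs the task at a given split and returns a metric):

# Hypothetical sweep over the similar:cached split at a fixed 100-shot budget
splits = [(s, 100 - s) for s in (0, 5, 10, 20, 40)]
scores = {(s, r): dev_accuracy(n_similar=s, n_cached=r) for s, r in splits}
best_s, best_r = max(scores, key=scores.get)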

Limitations and Future Directions

  • Embedding Quality: The effectiveness of similarity-based selection depends on the quality of the embedding model. Poor embeddings may degrade performance.
  • Dynamic Test Distributions: The k-means approach assumes access to test sample representations; in streaming or online settings, incremental clustering or adaptive selection may be required (see the sketch after this list).
  • Demonstration Pool Size: In extremely low-resource settings, the benefits of selection diminish, but the hybrid approach still outperforms naive baselines.
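
For the streaming case flagged above, one plausible adaptation (not from the paper) is incremental clustering, e.g. with scikit-learn's MiniBatchKMeans:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Sketch: update centroids as test embeddings arrive in batches, then
# periodically re-derive the cached set as in select_kmeans_demonstrations().
# test_embedding_batches is a hypothetical iterable of (B, d) arrays.
clusterer = MiniBatchKMeans(n_clusters=80, n_init=3)
for batch in test_embedding_batches:
    clusterer.partial_fit(batch)
centroids = clusterer.cluster_centers_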

Future work may explore adaptive selection strategies that dynamically adjust the ratio of similar to cached demonstrations based on observed performance, as well as extensions to multimodal and multilingual ICL scenarios.

Conclusion

The paper provides a principled and practical solution to the demonstration selection problem in many-shot ICL for long-context LLMs. By combining a small, dynamic set of similar demonstrations with a large, cacheable set (random or k-means-selected), the proposed strategies achieve near-optimal performance with substantially reduced inference cost. These methods are immediately applicable to large-scale LLM deployments and offer a robust framework for compute-efficient prompt engineering in the many-shot regime.
