
Curriculum Demonstration Selection for In-Context Learning (2411.18126v2)

Published 27 Nov 2024 in cs.CL

Abstract: LLMs have shown strong in-context learning (ICL) abilities with a few demonstrations. However, one critical challenge is how to select demonstrations to elicit the full potential of LLMs. In this paper, we propose Curriculum Demonstration Selection (CDS), a novel demonstration selection method for ICL. Instead of merely using similarity, CDS additionally partitions samples by their complexity measurements. Following curriculum learning, CDS then selects demonstrations from easy to difficult. Thus the selected demonstrations cover a wide range of difficulty levels, enabling LLMs to learn from varied complexities within the training set. Experiments demonstrate that our CDS consistently outperforms baseline methods, achieving notable improvements across nine LLMs on three benchmarks. Moreover, CDS proves especially effective in enhancing LLM performance in solving challenging problems.

Summary

  • The paper introduces Curriculum Demonstration Selection (CDS), a method that improves in-context learning in LLMs by partitioning examples by difficulty and selecting diverse examples from each level.
  • Experiments across various models and tasks like mathematical reasoning, commonsense reasoning, and code generation show CDS consistently outperforms baseline methods.
  • The study suggests that the diversity of difficulty levels in demonstrations is key for better generalization, particularly on harder problems, rather than the specific ordering of examples.

The paper investigates how LLMs can be made more effective at in-context learning by carefully selecting which examples to show them. In in-context learning, an LLM is given a few examples (demonstrations) within its input prompt to guide its performance on a new task. While previous approaches have often relied on simply choosing examples that are similar to the test case, this work takes a different direction by organizing the examples by difficulty. The main idea is to expose the model to a balanced mix of examples ranging from simple to complex—much like a curriculum—to help the model gradually improve its ability to tackle varied tasks.


Motivation and Importance

  • In-Context Learning Limitations:

LLMs can perform tasks without additional training by looking at examples provided in the prompt. However, their performance can vary greatly depending on which examples are chosen. Traditional methods like random sampling or selecting only the most similar examples may not cover the full range of complexities needed for robust learning.

  • Curriculum Inspiration:

The concept of curriculum learning comes from the idea that learning can be more effective when it starts with simpler examples and gradually moves to more challenging ones. This paper’s approach applies this intuition to demonstration selection in LLM prompts.


Core Idea: Curriculum Demonstration Selection (CDS)

  • Partitioning by Difficulty:

The method begins by dividing a pool of training examples into several groups based on their measured complexity. For instance, in tasks such as math reasoning, examples might be organized by grade level or other complexity measures. This way, each group represents a different level of difficulty.

  • Selecting Diverse Examples:

Once the dataset has been partitioned, one demonstration is selected from each difficulty group. This ensures that the final set of in-context examples covers a broad spectrum of challenges rather than clustering around a single difficulty level.

  • Retrieval Methods (a code sketch of both variants follows after this list):
    • Similarity-Based Retrieval: An embedding model converts each example into a numerical representation, and its similarity to the test query is computed. The example most similar to the test query within each difficulty group is then chosen.
    • Random Retrieval: Alternatively, one example is picked at random from each group. In experiments, both variants improved performance over a baseline of random selection from the whole pool, with similarity-based retrieval often giving an edge.
  • Demonstration Order:

After the examples are collected, they are typically shuffled before being placed in the prompt. This shuffling avoids unintended biases from a fixed ordering (such as always presenting examples from easiest to hardest); the experiments indicate that it is the diversity of difficulty levels, not their order, that matters most.
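The selection procedure described above can be sketched compactly. The snippet below is a minimal, illustrative implementation rather than the paper's exact code: it assumes each training example carries a numeric "difficulty" field, partitions the pool into equal-sized bins, uses sentence-transformers with the "all-MiniLM-L6-v2" encoder as a stand-in for the unnamed embedding model, and shuffles the final picks as described.

```python
import random

from sentence_transformers import SentenceTransformer, util  # assumed embedding backend


def cds_select(train_pool, test_query, num_groups=4, use_similarity=True, seed=0):
    """Minimal CDS sketch: partition by difficulty, pick one demo per group, shuffle.

    `train_pool` is a list of dicts with a "text" field and a numeric
    "difficulty" field (e.g. grade level); the complexity measure itself
    is dataset-specific and not fixed by this sketch.
    """
    rng = random.Random(seed)

    # 1. Partition the pool into difficulty groups (equal-sized bins here).
    ranked = sorted(train_pool, key=lambda ex: ex["difficulty"])
    size = max(1, len(ranked) // num_groups)
    groups = [ranked[i * size:(i + 1) * size] for i in range(num_groups)]
    groups[-1].extend(ranked[num_groups * size:])  # fold any remainder into the last bin
    groups = [g for g in groups if g]              # drop empty bins for tiny pools

    # 2. Select one demonstration per group: most similar to the query, or random.
    if use_similarity:
        model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
        query_emb = model.encode(test_query, convert_to_tensor=True)
        picks = []
        for group in groups:
            embs = model.encode([ex["text"] for ex in group], convert_to_tensor=True)
            best = int(util.cos_sim(query_emb, embs).argmax())
            picks.append(group[best])
    else:
        picks = [rng.choice(group) for group in groups]

    # 3. Shuffle so the prompt carries no fixed easy-to-hard ordering.
    rng.shuffle(picks)
    return picks
```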


Experimental Evaluation

  • Benchmarks and Tasks:
    • Mathematical Reasoning: On a dataset whose problems are categorized into different complexity levels and span topics such as algebra, geometry, and number theory, the curriculum approach helped the models perform better overall, with substantial gains on the more challenging problems.
    • Commonsense Reasoning: On a dataset mixing questions from different grade levels, the approach showed consistent improvements over baseline methods.
    • Code Generation: On a code benchmark, the method improved both the correctness and efficiency of the generated code.
  • Results Across Models:

Various open-source LLMs like Llama-2 (both 7B and 13B parameter versions), Llama-3, Mistral, and Qwen were tested. Across these models, the curriculum-based selection consistently yielded better performance compared to methods that did not incorporate a curriculum.

  • Sensitivity to Example Ordering:

Experiments also showed that ordering the examples by increasing difficulty did not necessarily improve performance compared to random ordering. This suggests that simply having a diverse set of demonstrations is the key factor for improved performance, rather than the specific sequence.


Key Takeaways and Recommendations

  • Broader Coverage in Demonstrations:

By ensuring that demonstrations include examples across all difficulty levels, LLMs are better equipped to handle both simple and complex tasks.

  • Improved Robustness:

The curriculum-based method (CDS) tends to be more stable and reliable across various test cases, particularly for problems that are more difficult—where previous methods might struggle.

  • Practical Application:

When setting up in-context learning prompts, practitioners should consider not just the similarity of the examples to the target query but also the diversity of their complexity; this combination has been shown to yield significant performance gains. A minimal prompt-assembly sketch is given below.
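As a concrete illustration of that recommendation, the hypothetical helper below turns demonstrations returned by a CDS-style selector into a few-shot prompt. The Q/A template and the "text"/"answer" field names are illustrative assumptions, not the paper's format.

```python
def build_prompt(demonstrations, test_query):
    """Assemble a few-shot prompt from selected demonstrations (hypothetical template)."""
    shots = "\n\n".join(f"Q: {ex['text']}\nA: {ex['answer']}" for ex in demonstrations)
    return f"{shots}\n\nQ: {test_query}\nA:"


# Example usage with a CDS-style selector like the sketch above:
# demos = cds_select(train_pool, test_query="What is 12 * 7?")
# prompt = build_prompt(demos, "What is 12 * 7?")
```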


In summary, the paper presents a new demonstration selection method that organizes training examples by difficulty before presenting them to LLMs. This curriculum-based selection leads to better generalization, particularly on more challenging problems, and offers an effective way to harness the full potential of in-context learning in LLMs.
