- The paper introduces Curriculum Demonstration Selection (CDS), a method that improves in-context learning in LLMs by partitioning examples by difficulty and selecting diverse examples from each level.
- Experiments across various models and tasks like mathematical reasoning, commonsense reasoning, and code generation show CDS consistently outperforms baseline methods.
- The study suggests that it is the diversity of difficulty levels among the demonstrations, rather than their specific ordering, that drives better generalization, particularly on harder problems.
The paper investigates how LLMs can be made more effective at in-context learning by carefully selecting which examples to show them. In in-context learning, an LLM is given a few examples (demonstrations) in its input prompt to guide its behavior on a new task. While previous approaches have often relied on simply choosing examples that are similar to the test case, this work takes a different direction by organizing the examples by difficulty. The main idea is to expose the model to a balanced mix of examples ranging from simple to complex, much like a curriculum, so that it is better prepared to handle tasks across the full range of difficulty.
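As a concrete illustration, the sketch below assembles a few-shot prompt from a handful of demonstrations; the "Q:/A:" template and the toy demonstrations are illustrative assumptions, not the paper's exact prompt format.

```python
# Minimal sketch of assembling a few-shot (in-context learning) prompt for a
# plain text-completion model. The "Q:/A:" template and the demonstrations
# below are illustrative assumptions, not taken from the paper.

def build_few_shot_prompt(demonstrations, test_question):
    """Concatenate worked examples, then append the unanswered test question."""
    blocks = [f"Q: {d['question']}\nA: {d['answer']}" for d in demonstrations]
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)

demos = [
    {"question": "What is 2 + 3?", "answer": "5"},
    {"question": "Solve for x: 3x - 9 = 0.", "answer": "x = 3"},
]
print(build_few_shot_prompt(demos, "What is 12 * 7?"))
```

Which examples end up in `demonstrations`, and how varied they are, is exactly what CDS controls.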
Motivation and Importance
- In-Context Learning Limitations:
LLMs can perform tasks without additional training by looking at examples provided in the prompt. However, their performance can vary greatly depending on which examples are chosen. Traditional methods like random sampling or selecting only the most similar examples may not cover the full range of complexities needed for robust learning.
- Inspiration from Curriculum Learning:
The concept of curriculum learning comes from the idea that learning can be more effective when it starts with simpler examples and gradually moves to more challenging ones. This paper's approach applies that intuition to demonstration selection in LLM prompts.
Core Idea: Curriculum Demonstration Selection (CDS)
- Partitioning by Difficulty:
The method begins by dividing a pool of training examples into several groups according to their measured difficulty; in math reasoning, for instance, problems might be organized by grade level or another complexity measure. Each group thus represents a distinct level of difficulty.
- Selecting Diverse Examples:
Once the dataset has been partitioned, one demonstration is selected from each difficulty group. This ensures that the final set of in-context examples covers a broad spectrum of challenges rather than clustering around a single difficulty level.
- Retrieval Methods:
- Similarity-Based Retrieval:
- An embedding model converts each example and the test query into vector representations; within each difficulty group, the example whose embedding is most similar to the query's is then chosen.
- Random Retrieval:
- Alternatively, one can randomly pick an example from each group. In experiments, both methods improved performance over a baseline that samples randomly from the whole pool, with similarity-based selection often giving an edge.
- Demonstration Order:
Once selected, the examples are typically shuffled before being placed in the prompt. Shuffling avoids unintended biases from a fixed ordering (such as always presenting examples from easiest to hardest), and the paper's experiments indicate that it is the diversity of difficulty levels, not their order, that matters most. A sketch of the full selection pipeline follows this list.
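Putting these pieces together, the following is a minimal sketch of a CDS-style selector written from the description above, not the authors' implementation. It assumes each candidate example is a dict with "text" and a numeric "difficulty" field, and the `all-MiniLM-L6-v2` sentence-transformers model is a placeholder choice of embedding model.

```python
# A minimal CDS-style selector sketched from the description above (not the
# authors' code). Assumed schema: each pool example is a dict with "text" and
# a numeric "difficulty" key; the embedding model below is a placeholder.

import random
import numpy as np
from sentence_transformers import SentenceTransformer

def cds_select(pool, test_query, num_groups=5, use_similarity=True, seed=0):
    """Return one demonstration per difficulty group, in shuffled order."""
    rng = random.Random(seed)

    # 1. Partition the pool into contiguous difficulty buckets (easy -> hard).
    ranked = sorted(pool, key=lambda ex: ex["difficulty"])
    bounds = np.linspace(0, len(ranked), num_groups + 1, dtype=int)
    groups = [ranked[bounds[i]:bounds[i + 1]] for i in range(num_groups)]

    # 2. Pick one example per bucket, either by embedding similarity to the
    #    test query or uniformly at random.
    if use_similarity:
        model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder
        query_vec = model.encode(test_query)

        def pick(group):
            vecs = model.encode([ex["text"] for ex in group])
            sims = vecs @ query_vec / (
                np.linalg.norm(vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
            )
            return group[int(np.argmax(sims))]
    else:
        def pick(group):
            return rng.choice(group)

    selected = [pick(g) for g in groups if g]

    # 3. Shuffle so the prompt order does not encode difficulty.
    rng.shuffle(selected)
    return selected
```

The returned demonstrations can then be dropped into a few-shot prompt like the one sketched earlier.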
Experimental Evaluation
- Benchmarks and Tasks:
- Mathematical Reasoning:
- Using a dataset whose problems span several subject areas (e.g., algebra, geometry, number theory) and are annotated with difficulty levels, the curriculum approach helped the models perform better overall, with substantial boosts on the more challenging problems.
- Commonsense Reasoning:
- On a dataset that mixes questions from different grade levels, the approach showed consistent improvements over baseline methods.
- Code Generation:
- Using a code benchmarking dataset, the method improved both the correctness and efficiency of the generated code.
- Results Across Models:
Various open-source LLMs, such as Llama-2 (both the 7B and 13B parameter versions), Llama-3, Mistral, and Qwen, were tested. Across these models, curriculum-based selection consistently yielded better performance than methods that did not incorporate a curriculum.
- Sensitivity to Example Ordering:
Experiments also showed that ordering the examples by increasing difficulty did not necessarily improve performance compared to random ordering. This suggests that simply having a diverse set of demonstrations is the key factor for improved performance, rather than the specific sequence.
Key Takeaways and Recommendations
- Broader Coverage in Demonstrations:
By ensuring that demonstrations include examples across all difficulty levels, LLMs are better equipped to handle both simple and complex tasks.
- Stability Across Difficulty Levels:
The curriculum-based method (CDS) tends to be more stable and reliable across test cases, particularly on the more difficult problems where previous methods might struggle.
- Practical Recommendation:
When setting up in-context learning prompts, practitioners should consider not only how similar the examples are to the target query but also how diverse they are in complexity; the paper shows this leads to significant improvements in LLM performance.
In summary, the paper presents a new demonstration selection method that organizes training examples by difficulty before presenting them to LLMs. This curriculum-based selection leads to better generalization, particularly on more challenging problems, and offers an effective way to harness the full potential of in-context learning in LLMs.