This paper addresses the performance degradation of LLMs in many-shot in-context learning (ICL) scenarios, where providing a large number of demonstration examples (shots) can lead to a plateau or even a decline in performance. The authors identify two primary causes: a suboptimal negative log-likelihood (NLL) optimization objective and increasing data noise with more demonstrations.
To tackle these issues, the paper introduces DR-ICL, a novel fine-tuning optimization method that enhances many-shot ICL through Differentiated Learning and Advantage-based Reweighting objectives.
- Global Perspective: Differentiated Learning:
- This component aims to ensure that the model's performance with many-shot demonstrations is superior to its zero-shot performance.
- During training, the model is optimized on both many-shot sequences and their corresponding zero-shot versions, derived using Parallel Context Windows (PCW) to mask out the demonstration context.
- The differentiated loss combines the many-shot NLL L_many with the corresponding zero-shot NLL L_zero, weighted by a hyperparameter λ; the goal is to drive L_many below L_zero (a minimal sketch of this objective follows the method list below).
- Local Perspective: Advantage-based Reweighting:
- This component dynamically adjusts the weights of individual demonstrations within a many-shot sequence to mitigate the impact of noisy or less informative examples. It is inspired by reinforcement learning's advantage function.
- Importance Sampling: The sequence of demonstrations is divided into reweighting windows of size m. For each demonstration in a reweighting window, the preceding window acts as a sampling window; importance weights computed from the demonstrations' NLL losses select the n most significant demonstrations from that sampling window.
- Advantage Functions: The reward for a demonstration is the difference between its NLL loss and the average loss of the demonstrations sampled from the previous window.
- The cumulative advantage is then computed from these rewards using a temperature parameter τ that amplifies positive rewards (performance improvements).
- Reweighting: The NLL loss for the many-shot demonstrations is then reweighted, scaling each demonstration's loss term by its cumulative advantage (see the reweighting sketch below).
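The global objective can be pictured with a small sketch. The hinge form, the function name, and the variable names below are illustrative assumptions rather than the paper's exact formulation; the sketch only captures the intended relationship (push the many-shot NLL below the zero-shot NLL, with λ controlling the strength of that pressure).

```python
import torch

def differentiated_loss(nll_many: torch.Tensor,
                        nll_zero: torch.Tensor,
                        lam: float = 0.2) -> torch.Tensor:
    """Combine many-shot and zero-shot NLL so training favors the many-shot view.

    nll_many: scalar NLL of the target given the full k-shot context.
    nll_zero: scalar NLL of the same target with the demonstrations
              masked out (the PCW-style zero-shot view).
    lam:      weight of the differentiated term (the summary reports 0.2 / 0.4).
    """
    # Penalize the model only when the many-shot loss is not yet below
    # the zero-shot loss; otherwise the plain many-shot NLL is optimized.
    gap = torch.clamp(nll_many - nll_zero, min=0.0)
    return nll_many + lam * gap
```

In a training loop, nll_many and nll_zero would come from two forward passes over the same sample, one with the demonstrations visible and one with them masked.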
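The local reweighting step can likewise be sketched over the per-demonstration NLL losses of a single many-shot sequence. The choice of "lowest NLL" as the sampling criterion, the sign of the reward, and the exponential advantage are assumptions made for illustration; only the window/baseline/reweighting structure follows the description above.

```python
import torch

def reweighted_nll(demo_losses: torch.Tensor,
                   window_size: int = 4,
                   sample_size: int = 2,
                   tau: float = 1.0) -> torch.Tensor:
    """Advantage-based reweighting of per-demonstration NLL losses.

    demo_losses: 1-D tensor with the NLL of each demonstration, in order.
    window_size: size m of each reweighting window.
    sample_size: number n of demonstrations kept from the preceding
                 (sampling) window to form the baseline.
    tau:         temperature controlling how strongly rewards are amplified.
    """
    weights = torch.ones_like(demo_losses)
    for start in range(window_size, demo_losses.numel(), window_size):
        window = demo_losses[start:start + window_size]
        prev = demo_losses[start - window_size:start]
        # Importance sampling: keep the n most informative demonstrations
        # from the preceding window (here: lowest NLL) as the baseline.
        k = min(sample_size, prev.numel())
        baseline = torch.topk(prev, k=k, largest=False).values.mean()
        # Reward: how much better (lower) each demonstration's loss is
        # than the baseline from the previous window.
        reward = baseline - window
        # Advantage: exponential scaling with temperature tau amplifies
        # positive rewards (improvements) and damps negative ones.
        weights[start:start + window_size] = torch.exp(reward / tau).detach()
    # Reweighted NLL: noisy demonstrations receive smaller weights.
    return (weights * demo_losses).mean()
```

Demonstrations whose loss beats the previous window's baseline get weights above 1, so informative examples dominate the loss while noisy ones are down-weighted.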
Recognizing the lack of suitable benchmarks for many-shot ICL fine-tuning, the authors developed the Many-Shot ICL Benchmark (MICLB).
- MICLB is a large-scale benchmark covering shot numbers from 1 to 350 within sequences up to 8,000 tokens.
- It includes 50 distinct datasets across 7 prominent NLP task types (QA, Reasoning, Summarization, Clustering, Classification, Retrieval, Reranking), totaling over three million samples.
- It facilitates evaluation of many-shot ICL strategies for both in-domain and out-of-domain scenarios.
Experimental Setup:
- Base Models: Llama-2-7b-chat-hf and Mistral-7B-Instruct-v0.2.
- Baselines:
- NFT (No Fine-tuning): the foundation models used without any additional tuning.
- IT (Instruction Tuning): Fine-tuning with zero-shot examples.
- MetaICL: Fine-tuning with many-shot examples.
- Evaluation Metrics: Accuracy (for QA, clustering, logical reasoning, classification, retrieval), Distinct-3/ROUGE-1/BLEU-1 (for summarization), P@k/R@k/G@k (for reranking).
- Hyperparameters: the differentiated-learning weight λ (0.2 for Llama-2, 0.4 for Mistral), the temperature τ, the sampling size n, and the reweighting window size m.
Key Results:
- DR-ICL demonstrated significant improvements in many-shot setups across various tasks, for both in-domain and out-of-domain datasets, compared to NFT, IT, and MetaICL.
- It showed more stable performance as the number of shots k increased, mitigating the rise-and-fall trend observed in other methods. For instance, on CLSClusteringS2S, DR-ICL maintained high performance with increasing k, while IT's performance degraded and MetaICL fluctuated.
- DR-ICL achieved lower performance variance across different k-shot values (average variance of 1.56E-03 versus 2.38E-03 for MetaICL; a small sketch of this aggregation follows the results list).
- The advantage-based reweighting mechanism effectively reduced sensitivity to data noise, leading to more stable loss convergence during training.
- Ablation studies confirmed the contributions of both the global differentiated learning and local reweighting components. For example, on WinoWhy, DR-ICL achieved an average accuracy of 0.51, while removing global or local components resulted in 0.47 and 0.44, respectively.
- A moderate reweighting window size proved beneficial compared with both smaller and overly large sampling ranges.
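The variance statistic above can be read as follows; the task names and numbers in this sketch are made up, and the paper's exact aggregation may differ. It only illustrates "variance of accuracy across k-shot settings, averaged over datasets".

```python
import numpy as np

# Hypothetical accuracies of one method on two tasks at increasing k
# (e.g. k = 1, 5, 50, 350); the values below are purely illustrative.
acc_by_task = {
    "task_a": [0.58, 0.61, 0.60, 0.59],
    "task_b": [0.42, 0.47, 0.46, 0.48],
}

# Variance of accuracy across shot counts, averaged over tasks:
# a smaller value means performance stays flatter as k grows.
avg_variance = np.mean([np.var(accs) for accs in acc_by_task.values()])
print(f"average variance across k-shot values: {avg_variance:.2e}")
```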
Conclusions:
The DR-ICL framework, by combining global differentiated learning (prioritizing many-shot over zero-shot) and local advantage-based reweighting (dynamically adjusting demonstration importance), effectively addresses the challenges of suboptimal objectives and data noise in many-shot ICL. The introduction of the MICLB dataset provides a valuable resource for future research in this area. The experiments show that DR-ICL leads to improved and more stable performance for LLMs in scenarios with a large number of in-context demonstrations.