This paper addresses the performance degradation of LLMs in many-shot in-context learning (ICL) scenarios, where providing a large number of demonstration examples (shots) can lead to a plateau or even a decline in performance. The authors identify two primary causes: a suboptimal negative log-likelihood (NLL) optimization objective and increasing data noise with more demonstrations.
To tackle these issues, the paper introduces DR-ICL, a novel fine-tuning optimization method that enhances many-shot ICL through Differentiated Learning and Advantage-based Reweighting objectives.
- Global Perspective: Differentiated Learning:
- This component aims to ensure that the model's performance with many-shot demonstrations is superior to its zero-shot performance.
- During training, the model is optimized on both many-shot sequences and their corresponding zero-shot versions, derived using Parallel Context Windows (PCW) to mask out the demonstration context.
- The differentiated loss combines the many-shot NLL L_many with the corresponding zero-shot NLL L_zero, weighted by a hyperparameter λ; the goal is to drive L_many below L_zero (a minimal sketch of this objective follows the method list below).
- Local Perspective: Advantage-based Reweighting:
- This component dynamically adjusts the weights of individual demonstrations within a many-shot sequence to mitigate the impact of noisy or less informative examples. It is inspired by reinforcement learning's advantage function.
- Importance Sampling: The sequence of demonstrations is divided into reweighting windows of size m. For each demonstration in a reweighting window, the preceding window acts as a sampling window; importance weights computed from the demonstrations' NLL losses select the n most significant demonstrations from that sampling window.
- Advantage Functions: The reward for a demonstration is the difference between its NLL loss and the average loss of the demonstrations sampled from the previous window.
- The cumulative advantage is then computed from these rewards using a temperature parameter τ that amplifies positive rewards (performance improvements).
- Reweighting: The NLL loss for the many-shot demonstrations is then reweighted, scaling each demonstration's loss term by its cumulative advantage (see the reweighting sketch below).
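The global objective can be pictured with a small sketch. The hinge form, the function name, and the variable names below are illustrative assumptions rather than the paper's exact formulation; the sketch only captures the intended relationship (push the many-shot NLL below the zero-shot NLL, with λ controlling the strength of that pressure).

```python
import torch

def differentiated_loss(nll_many: torch.Tensor,
                        nll_zero: torch.Tensor,
                        lam: float = 0.2) -> torch.Tensor:
    """Combine many-shot and zero-shot NLL so training favors the many-shot view.

    nll_many: scalar NLL of the target given the full k-shot context.
    nll_zero: scalar NLL of the same target with the demonstrations
              masked out (the PCW-style zero-shot view).
    lam:      weight of the differentiated term (the summary reports 0.2 / 0.4).
    """
    # Penalize the model only when the many-shot loss is not yet below
    # the zero-shot loss; otherwise the plain many-shot NLL is optimized.
    gap = torch.clamp(nll_many - nll_zero, min=0.0)
    return nll_many + lam * gap
```

In a training loop, nll_many and nll_zero would come from two forward passes over the same sample, one with the demonstrations visible and one with them masked.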
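The local reweighting step can likewise be sketched over the per-demonstration NLL losses of a single many-shot sequence. The choice of "lowest NLL" as the sampling criterion, the sign of the reward, and the exponential advantage are assumptions made for illustration; only the window/baseline/reweighting structure follows the description above.

```python
import torch

def reweighted_nll(demo_losses: torch.Tensor,
                   window_size: int = 4,
                   sample_size: int = 2,
                   tau: float = 1.0) -> torch.Tensor:
    """Advantage-based reweighting of per-demonstration NLL losses.

    demo_losses: 1-D tensor with the NLL of each demonstration, in order.
    window_size: size m of each reweighting window.
    sample_size: number n of demonstrations kept from the preceding
                 (sampling) window to form the baseline.
    tau:         temperature controlling how strongly rewards are amplified.
    """
    weights = torch.ones_like(demo_losses)
    for start in range(window_size, demo_losses.numel(), window_size):
        window = demo_losses[start:start + window_size]
        prev = demo_losses[start - window_size:start]
        # Importance sampling: keep the n most informative demonstrations
        # from the preceding window (here: lowest NLL) as the baseline.
        k = min(sample_size, prev.numel())
        baseline = torch.topk(prev, k=k, largest=False).values.mean()
        # Reward: how much better (lower) each demonstration's loss is
        # than the baseline from the previous window.
        reward = baseline - window
        # Advantage: exponential scaling with temperature tau amplifies
        # positive rewards (improvements) and damps negative ones.
        weights[start:start + window_size] = torch.exp(reward / tau).detach()
    # Reweighted NLL: noisy demonstrations receive smaller weights.
    return (weights * demo_losses).mean()
```

Demonstrations whose loss beats the previous window's baseline get weights above 1, so informative examples dominate the loss while noisy ones are down-weighted.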
Recognizing the lack of suitable benchmarks for many-shot ICL fine-tuning, the authors developed the Many-Shot ICL Benchmark (MICLB).
- MICLB is a large-scale benchmark covering shot numbers from 1 to 350 within sequences up to 8,000 tokens.
- It includes 50 distinct datasets across 7 prominent NLP task types (QA, Reasoning, Summarization, Clustering, Classification, Retrieval, Reranking), totaling over three million samples.
- It facilitates evaluation of many-shot ICL strategies for both in-domain and out-of-domain scenarios.
Experimental Setup:
- Base Models: Llama-2-7b-chat-hf and Mistral-7B-Instruct-v0.2.
- Baselines:
- NFT (No Fine-tuning): the foundation models used without any additional tuning.
- IT (Instruction Tuning): Fine-tuning with zero-shot examples.
- MetaICL: Fine-tuning with many-shot examples.
- Evaluation Metrics: Accuracy (for QA, clustering, logical reasoning, classification, retrieval), Distinct-3/ROUGE-1/BLEU-1 (for summarization), P@k/R@k/G@k (for reranking).
- Hyperparameters: the differentiated-learning weight λ (0.2 for Llama-2, 0.4 for Mistral), the temperature τ, the sampling size n, and the reweighting window size m.
Key Results:
- DR-ICL demonstrated significant improvements in many-shot setups across various tasks, for both in-domain and out-of-domain datasets, compared to NFT, IT, and MetaICL.
- It showed more stable performance as the number of shots k increased, mitigating the rise-and-fall trend observed in other methods. For instance, on CLSClusteringS2S, DR-ICL maintained high performance with increasing k, while IT's performance degraded and MetaICL fluctuated.
- DR-ICL achieved lower performance variance across different k-shot values (average variance of 1.56E-03 versus 2.38E-03 for MetaICL; a small sketch of this aggregation follows the results list).
- The advantage-based reweighting mechanism effectively reduced sensitivity to data noise, leading to more stable loss convergence during training.
- Ablation studies confirmed the contributions of both the global differentiated learning and local reweighting components. For example, on WinoWhy, DR-ICL achieved an average accuracy of 0.51, while removing global or local components resulted in 0.47 and 0.44, respectively.
- A moderate reweighting window size proved beneficial compared with both smaller and overly large sampling ranges.
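The variance statistic above can be read as follows; the task names and numbers in this sketch are made up, and the paper's exact aggregation may differ. It only illustrates "variance of accuracy across k-shot settings, averaged over datasets".

```python
import numpy as np

# Hypothetical accuracies of one method on two tasks at increasing k
# (e.g. k = 1, 5, 50, 350); the values below are purely illustrative.
acc_by_task = {
    "task_a": [0.58, 0.61, 0.60, 0.59],
    "task_b": [0.42, 0.47, 0.46, 0.48],
}

# Variance of accuracy across shot counts, averaged over tasks:
# a smaller value means performance stays flatter as k grows.
avg_variance = np.mean([np.var(accs) for accs in acc_by_task.values()])
print(f"average variance across k-shot values: {avg_variance:.2e}")
```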
Conclusions:
The DR-ICL framework, by combining global differentiated learning (prioritizing many-shot over zero-shot) and local advantage-based reweighting (dynamically adjusting demonstration importance), effectively addresses the challenges of suboptimal objectives and data noise in many-shot ICL. The introduction of the MICLB dataset provides a valuable resource for future research in this area. The experiments show that DR-ICL leads to improved and more stable performance for LLMs in scenarios with a large number of in-context demonstrations.