
Understanding performance degradation with more in-context examples

Determine why performance degrades in many-shot in-context learning as the number of in-context examples in the prompt increases, with specific focus on the Hendrycks MATH dataset, where accuracy declines as the shot count grows, and explain why negative log-likelihood trends fail to account for this behavior.


Background

Across several tasks, the paper reports that many-shot in-context learning generally improves performance, but in some settings (e.g., Hendrycks MATH) accuracy declines once the shot count exceeds a certain point (roughly 125 shots). The authors investigated next-token prediction loss (negative log-likelihood) as a function of context length and found that it continues to decrease even as downstream performance plateaus or declines.

This mismatch suggests that common long-context metrics (e.g., NLL) do not capture the mechanisms driving performance deterioration with larger prompts. Understanding why adding more in-context examples can harm problem-solving accuracy is left unresolved and identified as a limitation, motivating targeted research to explain and remedy the phenomenon.
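As a purely illustrative aid, the sketch below shows one way such a comparison could be set up: it computes the mean negative log-likelihood of the answer tokens alongside a crude exact-match accuracy for increasing shot counts, using an open-weights causal language model via Hugging Face Transformers. The model name, prompt template, and the `examples` / `test_set` variables are assumptions made for illustration only and do not reproduce the paper's actual setup, which evaluates Gemini 1.5 Pro.

```python
# Minimal sketch (not the paper's setup): compare mean answer NLL with exact-match
# accuracy as the number of in-context examples grows. The model, prompt template,
# and the `examples` / `test_set` structures are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder open-weights model; short context, illustration only
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def answer_nll(prompt: str, answer: str) -> float:
    """Mean negative log-likelihood of the answer tokens given the prompt."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits[:, i] predicts token i+1, so score positions prompt_len-1 .. -2.
    targets = full_ids[0, prompt_len:]
    log_probs = torch.log_softmax(logits[0, prompt_len - 1:-1], dim=-1)
    return -log_probs[torch.arange(targets.shape[0]), targets].mean().item()

def evaluate(n_shots, examples, test_set):
    """Return (mean answer NLL, exact-match accuracy) for a given shot count."""
    context = "".join(f"Problem: {q}\nSolution: {a}\n\n" for q, a in examples[:n_shots])
    nlls, n_correct = [], 0
    for question, gold in test_set:
        prompt = context + f"Problem: {question}\nSolution:"
        nlls.append(answer_nll(prompt, " " + gold))
        inputs = tok(prompt, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False,
                             pad_token_id=tok.eos_token_id)
        completion = tok.decode(out[0][inputs.input_ids.shape[1]:],
                                skip_special_tokens=True)
        n_correct += int(gold.strip() in completion)  # crude exact-match proxy
    return sum(nlls) / len(nlls), n_correct / len(test_set)

# Illustrative shot counts; the reported MATH degradation appears beyond ~125 shots.
# for n in (4, 25, 125, 250, 500):
#     nll, acc = evaluate(n, examples, test_set)
#     print(f"{n:>4} shots: NLL={nll:.3f}  accuracy={acc:.2%}")
```

If the mismatch described above holds, a run of this kind would show the NLL column continuing to fall with more shots while the accuracy column plateaus or drops.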

References

Another limitation of our work is that we don't completely understand why performance can sometimes degrade with more examples in the prompt (for example, for MATH). Our analysis found that negative log-likelihood trends are insufficient to explain this degradation, and future work should focus on investigating new research directions to shed light on the matter.

Many-Shot In-Context Learning (arXiv:2404.11018, Agarwal et al., 17 Apr 2024), Limitations paragraph (following the Conclusion).