
Cause of prompt-induced reduction in the GSM8k–GSM1k performance gap

Determine the causal mechanism by which the LM Evaluation Harness “chain-of-thought” alternative prompt, which employs non-GSM8k n-shot examples, reduces the observed GSM8k–GSM1k performance gap across models, and dramatically so for some models such as Mixtral-8x22B-v0.1. Ascertain whether this effect is driven by activation of GSM8k memorization when GSM8k-based n-shot examples are used, or by other factors such as answer formatting or prompt structure.
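To make the contrast concrete, here is a minimal sketch of the two n-shot prompt conditions, assuming a generic Question/Answer template; the exemplar texts are abbreviated placeholders, and the harness's actual “chain-of-thought” template may differ.

```python
# Minimal sketch of the two few-shot prompt conditions under comparison.
# Exemplar texts are abbreviated placeholders, not the harness's templates.

# Condition A: n-shot exemplars drawn from GSM8k itself (placeholder text).
GSM8K_SHOTS = [
    ("Natalia sold clips to 48 of her friends in April, and then she sold "
     "half as many clips in May. How many clips did Natalia sell altogether?",
     "In May she sold 48 / 2 = 24 clips, so 48 + 24 = 72. The answer is 72."),
]

# Condition B: grade-school problems not taken from GSM8k (placeholder text).
NON_GSM8K_SHOTS = [
    ("There are 15 trees in the grove. Grove workers will plant trees today. "
     "After they are done, there will be 21 trees. How many did they plant?",
     "There were 15 trees and now there are 21, so 21 - 15 = 6. The answer is 6."),
]

def build_prompt(shots: list[tuple[str, str]], question: str) -> str:
    """Assemble an n-shot chain-of-thought prompt in a generic Q/A format."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)
```

Under the memorization hypothesis, condition A's exemplars may cue recall of memorized GSM8k material while condition B's do not, even though both elicit the same step-by-step reasoning format.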


Background

The paper reports that an alternative prompt, which uses non-GSM8k examples and an altered answer format, generally decreases the GSM8k–GSM1k performance gap by about 1% and can sharply reduce apparent overfitting for certain models. The authors explicitly state that the exact cause of this difference is impossible to know without access to model training details.

They hypothesize that prompting with GSM8k examples might trigger benchmark memorization, whereas non-GSM8k examples may avoid this. Understanding the mechanism would inform the design of robust evaluation protocols that minimize contamination effects and yield more reliable assessments of reasoning ability.
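A sketch of how the hypothesis could be probed, assuming a hypothetical `evaluate` hook that wraps an actual harness run (the hook and the dataset/condition labels are illustrative, not a real API): if the GSM8k–GSM1k gap shrinks specifically when GSM8k exemplars are replaced, that pattern is consistent with memorization activation rather than formatting effects.

```python
from typing import Callable

# Hypothetical hook: accuracy of `model` on `dataset` under a prompt
# condition; in practice it would wrap a real evaluation-harness run.
EvalFn = Callable[[str, str, str], float]

def gap(evaluate: EvalFn, model: str, condition: str) -> float:
    """GSM8k minus GSM1k accuracy for one model under one prompt condition."""
    return (evaluate(model, "gsm8k", condition)
            - evaluate(model, "gsm1k", condition))

def memorization_signal(evaluate: EvalFn, model: str) -> float:
    """How much the gap shrinks when GSM8k exemplars are swapped for
    non-GSM8k ones; large positive values are consistent with the
    memorization hypothesis, values near zero with formatting effects."""
    return (gap(evaluate, model, "gsm8k_shots")
            - gap(evaluate, model, "non_gsm8k_shots"))
```

Because the alternative prompt changes both the exemplar source and the answer format, fully isolating the mechanism would additionally require a condition that varies only one of these factors at a time.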

References

However, for some models (e.g. Mixtral-8x22B-v0.1), this reduces the amount of observed overfitting dramatically. While the exact cause of this difference is impossible to know, especially without access to model details such as their training set, our hypothesis is that prompting a model with GSM8k is more likely to activate the “memorization” portion of a model than if it is prompted by non-GSM8k grade school math problems.

A Careful Examination of Large Language Model Performance on Grade School Arithmetic (Zhang et al., arXiv:2405.00332, 1 May 2024), Appendix, Section: Results with An Alternative Prompt