Cause of prompt-induced reduction in the GSM8k–GSM1k performance gap
Determine the causal mechanism by which the LM Evaluation Harness’s alternative “chain-of-thought” prompt, which uses non-GSM8k n-shot examples, reduces the observed GSM8k–GSM1k performance gap across models, dramatically so for some models such as Mixtral-8x22B-v0.1. Ascertain whether this effect is driven by activation of GSM8k memorization when GSM8k-based n-shot examples are used, or by other factors such as answer formatting or prompt structure.
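A natural first step is to reproduce the gap under both prompt conditions. The sketch below uses the LM Evaluation Harness Python API; `gsm8k` and `gsm8k_cot` are standard harness task names, while `gsm1k` and `gsm1k_cot` are hypothetical placeholders for locally registered tasks built from the GSM1k data, and the exact metric key may differ across harness versions.

```python
# Sketch: measure the GSM8k–GSM1k accuracy gap under two prompt conditions.
# Assumes lm-evaluation-harness is installed and that GSM1k tasks have been
# registered locally; "gsm1k" / "gsm1k_cot" are placeholder task names.
import lm_eval

MODEL_ARGS = "pretrained=mistralai/Mixtral-8x22B-v0.1,dtype=bfloat16"


def accuracy(results: dict, task: str) -> float:
    # Metric key is an assumption; newer harness versions report
    # e.g. "exact_match,strict-match" and "exact_match,flexible-extract".
    return results["results"][task]["exact_match,strict-match"]


def gsm_gap(suffix: str) -> float:
    """Evaluate GSM8k and GSM1k under one prompt condition and return the gap."""
    tasks = [f"gsm8k{suffix}", f"gsm1k{suffix}"]
    out = lm_eval.simple_evaluate(model="hf", model_args=MODEL_ARGS, tasks=tasks)
    return accuracy(out, tasks[0]) - accuracy(out, tasks[1])


# Standard prompt: n-shot examples drawn from GSM8k itself.
gap_gsm8k_shots = gsm_gap("")
# Chain-of-thought alternative prompt: non-GSM8k n-shot examples.
gap_cot_shots = gsm_gap("_cot")
print(f"gap with GSM8k shots: {gap_gsm8k_shots:.3f}, "
      f"gap with CoT prompt: {gap_cot_shots:.3f}")
```

Keeping the answer-extraction logic identical across the two conditions helps separate the effect of the n-shot example source from answer-formatting effects.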
References
However, for some models (e.g. Mixtral-8x22B-v0.1), this reduces the amount of observed overfitting dramatically. While the exact cause of this difference is impossible to know, especially without access to model details such as their training set, our hypothesis is that prompting a model with GSM8k is more likely to activate the “memorization” portion of a model than if it is prompted by non-GSM8k grade school math problems.
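One way to probe the memorization hypothesis directly, sketched below under the assumption of access to a Hugging Face causal LM and lists of GSM8k and GSM1k question strings, is to compare the model’s mean per-token log-likelihood on the two datasets: markedly higher likelihood on GSM8k items would be consistent with (though not proof of) memorization of GSM8k.

```python
# Sketch of a log-likelihood memorization probe; the model name and the
# question lists are illustrative assumptions, not part of the original setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mixtral-8x22B-v0.1"  # any Hugging Face causal LM works here
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()


@torch.no_grad()
def mean_token_logprob(text: str) -> float:
    """Mean log-probability per token of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids.to(model.device)
    loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return -loss.item()


def corpus_logprob(questions: list[str]) -> float:
    return sum(mean_token_logprob(q) for q in questions) / len(questions)


# gsm8k_questions and gsm1k_questions: question strings loaded elsewhere.
# delta = corpus_logprob(gsm8k_questions) - corpus_logprob(gsm1k_questions)
# A large positive delta means the model assigns GSM8k unusually high
# probability relative to comparable GSM1k items, consistent with the
# memorization hypothesis.
```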