Cause of minimal overfitting in frontier models
Determine which explanation accounts for the observed minimal overfitting of frontier and near-frontier large language models on GSM8k versus GSM1k: (i) genuine generalization arising from sufficiently advanced reasoning capabilities that enable solving novel problems even if GSM8k appeared in training, or (ii) careful avoidance of data contamination in the models’ training pipelines. Focus on models such as the proprietary Mistral Large, which show no signs of overfitting, to ascertain the dominant factor.
References
We posit two potential hypotheses for this: (1) frontier models have sufficiently advanced reasoning capability that they can generalize to new problems even if they have already seen GSM8k problems in their training set, and (2) frontier model builders may be more careful about data contamination. While it is impossible to know for certain without examining the training set of each model, one piece of evidence in favor of the former is that Mistral Large is the only model in the Mistral family to show no signs of overfitting.
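To make "signs of overfitting" concrete, the sketch below illustrates the usual way it is quantified: the drop in accuracy when a model is evaluated on the freshly written GSM1k instead of GSM8k. The model names and accuracy figures are hypothetical placeholders for illustration, not results reported in the study.

```python
# Minimal sketch, assuming overfitting is measured as the GSM8k -> GSM1k accuracy gap.
# All numbers and model names below are hypothetical, not reported results.

def overfit_gap(acc_gsm8k: float, acc_gsm1k: float) -> float:
    """Positive gap = worse on unseen GSM1k, suggesting memorization of GSM8k."""
    return acc_gsm8k - acc_gsm1k

# Hypothetical per-model accuracies: (GSM8k, GSM1k).
models = {
    "model_a": (0.78, 0.68),  # large gap -> consistent with contamination/overfitting
    "model_b": (0.92, 0.91),  # near-zero gap -> generalizes, as frontier models appear to
}

for name, (gsm8k, gsm1k) in models.items():
    gap = overfit_gap(gsm8k, gsm1k)
    print(f"{name}: GSM8k={gsm8k:.2f}  GSM1k={gsm1k:.2f}  gap={gap:+.2f}")
```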