Cause of minimal overfitting in frontier models

Determine which explanation accounts for the observed minimal overfitting of frontier and near-frontier large language models on GSM8k versus GSM1k: (i) genuine generalization, arising from reasoning capabilities advanced enough to solve novel problems even if GSM8k appeared in the training data, or (ii) careful avoidance of data contamination in the models' training pipelines. Focus on models such as the proprietary Mistral Large, which shows no signs of overfitting, to ascertain the dominant factor.

Background

In its analysis, the paper observes that frontier and near-frontier models perform similarly on GSM8k and GSM1k, indicating minimal overfitting. The authors propose two hypotheses to explain this: reasoning capabilities strong enough to generalize beyond memorized benchmark problems, or more careful contamination controls during training.

The authors note that, without access to the training data, it is impossible to be certain, but they point to circumstantial evidence: Mistral Large is the only member of the Mistral family showing no signs of overfitting, which they read as favoring the reasoning-capability hypothesis. Resolving this question would clarify whether benchmark robustness in frontier models stems primarily from genuine reasoning or from dataset hygiene practices.
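To make the notion of "minimal overfitting" concrete, the gap between a model's accuracy on the public GSM8k test set and the newly constructed GSM1k set can serve as a simple indicator. The sketch below is illustrative only and is not the paper's evaluation code; the model names and accuracy figures are hypothetical placeholders.

```python
# Illustrative sketch (not the paper's code): overfitting viewed as the
# percentage-point drop from GSM8k accuracy to GSM1k accuracy.
# All model names and numbers below are hypothetical placeholders.

def overfit_gap(acc_gsm8k: float, acc_gsm1k: float) -> float:
    """Return the GSM8k-to-GSM1k accuracy drop; values near zero suggest minimal overfitting."""
    return acc_gsm8k - acc_gsm1k

hypothetical_results = {
    "frontier_model": (0.93, 0.92),  # small gap: performs similarly on both sets
    "smaller_model": (0.74, 0.63),   # large gap: consistent with memorizing GSM8k
}

for name, (gsm8k_acc, gsm1k_acc) in hypothetical_results.items():
    gap = overfit_gap(gsm8k_acc, gsm1k_acc)
    print(f"{name}: GSM8k={gsm8k_acc:.0%}, GSM1k={gsm1k_acc:.0%}, gap={gap:+.1%}")
```

Under this framing, the open question is why the gap stays near zero for frontier models: because their reasoning generalizes even to unseen problems, or because GSM8k was kept out of their training data in the first place.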

References

We posit two potential hypotheses for this: 1) frontier models have sufficiently advanced reasoning capability so that they can generalize to new problems even if they have already seen GSM8k problems in their training set, 2) frontier model builders may be more careful about data contamination. While it is impossible to know for certain without looking at the training set for each model, one piece of evidence in favor of the former is that Mistral Large is the only model in the Mistral family to show no signs of overfitting.

A Careful Examination of Large Language Model Performance on Grade School Arithmetic (arXiv:2405.00332, Zhang et al., 1 May 2024), in Analysis, Lesson 2: Other Models, Especially Frontier Models, Show No Signs of Overfitting