
A Careful Examination of Large Language Model Performance on Grade School Arithmetic (2405.00332v4)

Published 1 May 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have achieved impressive success on many benchmarks for mathematical reasoning. However, there is growing concern that some of this performance actually reflects dataset contamination, where data closely resembling benchmark questions leaks into the training data, instead of true reasoning ability. To investigate this claim rigorously, we commission Grade School Math 1000 (GSM1k). GSM1k is designed to mirror the style and complexity of the established GSM8k benchmark, the gold standard for measuring elementary mathematical reasoning. We ensure that the two benchmarks are comparable across important metrics such as human solve rates, number of steps in solution, answer magnitude, and more. When evaluating leading open- and closed-source LLMs on GSM1k, we observe accuracy drops of up to 8%, with several families of models showing evidence of systematic overfitting across almost all model sizes. Further analysis suggests a positive relationship (Spearman's r2 = 0.36) between a model's probability of generating an example from GSM8k and its performance gap between GSM8k and GSM1k, suggesting that some models may have partially memorized GSM8k. Nevertheless, many models, especially those on the frontier, show minimal signs of overfitting, and all models broadly demonstrate generalization to novel math problems guaranteed to not be in their training data.

Understanding Overfitting in LLMs through the GSM1k Benchmark

Introduction

The GSM1k benchmark was created to address significant concerns in the AI research community regarding the genuine capabilities of LLMs. These concerns revolve around whether the impressive performance of these models on existing mathematical benchmarks stems from actual reasoning or merely from reproducing answers seen in contaminated training data. Let's dive deeper into what was uncovered.

Unveiling GSM1k: A New Benchmark

GSM1k is a fresh set of 1250 grade-school math problems designed to parallel the well-known GSM8k benchmark in style and complexity, but written by human annotators without the use of any LLMs in order to avoid data contamination. Its purpose is to evaluate the genuine reasoning capabilities of LLMs rather than their recall of training data.
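One way to make "comparable in style and complexity" concrete is a simple distributional check: for each difficulty proxy mentioned in the abstract (answer magnitude, number of solution steps), test whether GSM8k and GSM1k samples look alike. The sketch below is an illustration of that idea, not the paper's actual methodology; the per-problem fields `answer` and `num_steps` are hypothetical.

```python
# Sketch: check that two benchmarks are distributionally comparable on simple difficulty proxies.
from scipy.stats import mannwhitneyu

def difficulty_profile(problems):
    """Per-problem difficulty proxies: answer magnitude and number of solution steps."""
    magnitudes = [abs(p["answer"]) for p in problems]
    steps = [p["num_steps"] for p in problems]
    return magnitudes, steps

def compare_benchmarks(gsm8k, gsm1k, alpha=0.05):
    """Report, for each proxy, whether the two benchmarks differ significantly."""
    for name, a, b in zip(("answer magnitude", "solution steps"),
                          difficulty_profile(gsm8k), difficulty_profile(gsm1k)):
        stat, p = mannwhitneyu(a, b, alternative="two-sided")
        verdict = "differs" if p < alpha else "comparable"
        print(f"{name}: U={stat:.0f}, p={p:.3f} -> {verdict}")
```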

  • Model Evaluation: The paper tested both open-source and proprietary models on GSM1k, including well-known models such as GPT-4, Gemini, and Claude, as well as open-source families like Llama, Mistral, and Phi (a minimal sketch of this style of evaluation follows this list).
  • Key Findings: Several model families, most notably Phi and Mistral, scored up to 13% lower on GSM1k than on GSM8k, indicating possible overfitting to the older benchmark.
  • Contrasting Performances: While some model families exhibited signs of overfitting, frontier models (e.g., Gemini, GPT, Claude) showed minimal to no overfitting, suggesting more robust generalization.
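The sketch below shows a minimal exact-match evaluation loop of the kind used for GSM8k-style benchmarks. It assumes a placeholder `model_generate` inference function and problems stored as dicts with `question` and `answer` fields; these names are illustrative, not the paper's actual harness.

```python
# Sketch: GSM8k-style exact-match accuracy, and the gap used as an overfitting signal.
import re

def extract_final_answer(text):
    """Take the number after a '####' marker if present, else the last number in the text."""
    marked = re.search(r"####\s*(-?[\d,\.]+)", text)
    candidate = marked.group(1) if marked else (re.findall(r"-?\d[\d,]*\.?\d*", text) or [""])[-1]
    return candidate.replace(",", "").rstrip(".")

def accuracy(model_generate, problems):
    """Fraction of problems whose extracted final answer matches the gold answer."""
    correct = sum(
        extract_final_answer(model_generate(p["question"])) == str(p["answer"])
        for p in problems
    )
    return correct / len(problems)

# Overfitting signal for one model: how much accuracy drops on the fresh benchmark.
# gap = accuracy(model_generate, gsm8k_test) - accuracy(model_generate, gsm1k)
```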

The Indicator of Overfitting

The paper identifies a statistical signal that points to overfitting:

  • Probability Relationship: There is a positive correlation (Spearman's r² = 0.32) between a model's probability of generating examples from GSM8k and its performance gap between GSM8k and GSM1k. This suggests that many models have partially memorized GSM8k, a telltale sign of overfitting (a rough sketch of this analysis appears below).
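As a rough illustration of the idea (not the paper's exact procedure), one can correlate, across models, a memorization proxy with each model's GSM8k-minus-GSM1k accuracy gap. Both input dictionaries below are hypothetical.

```python
# Sketch: correlate a memorization proxy with the GSM8k/GSM1k accuracy gap across models.
# Both inputs are hypothetical, keyed by model name:
#   per_model_loglik[m] = mean per-token log-likelihood model m assigns to GSM8k test items
#   per_model_gap[m]    = accuracy on GSM8k minus accuracy on GSM1k for model m
from scipy.stats import spearmanr

def overfitting_correlation(per_model_loglik, per_model_gap):
    models = sorted(per_model_loglik)
    memorization = [per_model_loglik[m] for m in models]
    gap = [per_model_gap[m] for m in models]
    rho, p_value = spearmanr(memorization, gap)
    # The paper reports the squared coefficient (r^2), so square the Spearman rho.
    return rho ** 2, p_value
```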

Implications and Future Predictions

  • Practical Implications: Recognizing overfit models and understanding their limitations can lead to more honest assessments of LLM capabilities and guide more efficient use of resources in model training and development.
  • Theoretical Advances: These findings push the understanding of "generalization" within AI, prompting more rigorous testing environments that better measure true model capability beyond memorized data.
  • Future of AI Benchmarks: To avoid further contamination, the authors are withholding GSM1k from public release for now. Similar controlled releases could guide the development of more challenging, contamination-free benchmarks.

Model Capabilities Beyond Overfitting

Interestingly, the paper also highlights an essential nuance in the debate on AI's reasoning abilities:

  • Generalization Skills: Despite the performance drops attributable to potential overfitting, models like Phi and Mistral still perform reasonably well on GSM1k, suggesting they retain a genuine ability to generalize beyond memorized data.

In conclusion, while the GSM1k study brings to light the serious issue of overfitting in LLM evaluation, it also presents a nuanced but hopeful view of the potential for these models to develop genuine reasoning abilities. Findings like these are likely to spur both better model training practices and more robust benchmarking tools that can accurately measure and foster true AI capabilities.

Authors (15)
  1. Hugh Zhang (13 papers)
  2. Jeff Da (10 papers)
  3. Dean Lee (104 papers)
  4. Vaughn Robinson (3 papers)
  5. Catherine Wu (2 papers)
  6. Will Song (3 papers)
  7. Tiffany Zhao (2 papers)
  8. Pranav Raja (5 papers)
  9. Dylan Slack (17 papers)
  10. Qin Lyu (3 papers)
  11. Sean Hendryx (12 papers)
  12. Russell Kaplan (5 papers)
  13. Summer Yue (12 papers)
  14. Michele Lunati (1 paper)
  15. Charlotte Zhuang (1 paper)
Citations (61)