
A Careful Examination of Large Language Model Performance on Grade School Arithmetic (2405.00332v4)

Published 1 May 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have achieved impressive success on many benchmarks for mathematical reasoning. However, there is growing concern that some of this performance actually reflects dataset contamination, where data closely resembling benchmark questions leaks into the training data, instead of true reasoning ability. To investigate this claim rigorously, we commission Grade School Math 1000 (GSM1k). GSM1k is designed to mirror the style and complexity of the established GSM8k benchmark, the gold standard for measuring elementary mathematical reasoning. We ensure that the two benchmarks are comparable across important metrics such as human solve rates, number of steps in solution, answer magnitude, and more. When evaluating leading open- and closed-source LLMs on GSM1k, we observe accuracy drops of up to 8%, with several families of models showing evidence of systematic overfitting across almost all model sizes. Further analysis suggests a positive relationship (Spearman's r2 = 0.36) between a model's probability of generating an example from GSM8k and its performance gap between GSM8k and GSM1k, suggesting that some models may have partially memorized GSM8k. Nevertheless, many models, especially those on the frontier, show minimal signs of overfitting, and all models broadly demonstrate generalization to novel math problems guaranteed to not be in their training data.

Understanding Overfitting in LLMs through the GSM1k Benchmark

Introduction

The GSM1k benchmark was created to address significant concerns in the AI research community regarding the genuine capabilities of LLMs. These concerns revolve around whether the impressive performance of these models on existing mathematical benchmarks stems from actual reasoning or merely from reproducing answers seen in contaminated training data. Let's dive deeper into what was uncovered.

Unveiling GSM1k: A New Benchmark

GSM1k is a fresh set of 1250 grade-school math problems designed to parallel the well-known GSM8k benchmark in style and complexity, but written by human annotators without the use of any LLMs in order to avoid data contamination. Its purpose is to evaluate the genuine reasoning capabilities of LLMs rather than their recall of training data.
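One way to make "comparable in style and complexity" concrete is a simple distributional check: for each difficulty proxy mentioned in the abstract (answer magnitude, number of solution steps), test whether GSM8k and GSM1k samples look alike. The sketch below is an illustration of that idea, not the paper's actual methodology; the per-problem fields `answer` and `num_steps` are hypothetical.

```python
# Sketch: check that two benchmarks are distributionally comparable on simple difficulty proxies.
from scipy.stats import mannwhitneyu

def difficulty_profile(problems):
    """Per-problem difficulty proxies: answer magnitude and number of solution steps."""
    magnitudes = [abs(p["answer"]) for p in problems]
    steps = [p["num_steps"] for p in problems]
    return magnitudes, steps

def compare_benchmarks(gsm8k, gsm1k, alpha=0.05):
    """Report, for each proxy, whether the two benchmarks differ significantly."""
    for name, a, b in zip(("answer magnitude", "solution steps"),
                          difficulty_profile(gsm8k), difficulty_profile(gsm1k)):
        stat, p = mannwhitneyu(a, b, alternative="two-sided")
        verdict = "differs" if p < alpha else "comparable"
        print(f"{name}: U={stat:.0f}, p={p:.3f} -> {verdict}")
```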

  • Model Evaluation: The paper tested both open-source and proprietary models on GSM1k, including well-known models such as GPT-4, Gemini, and Claude, as well as open-source families like Llama, Mistral, and Phi (a minimal sketch of this style of evaluation follows this list).
  • Key Findings: Several model families, most notably Phi and Mistral, scored up to 13% lower on GSM1k than on GSM8k, indicating possible overfitting to the older benchmark.
  • Contrasting Performances: While some model families exhibited signs of overfitting, frontier models (e.g., Gemini, GPT, Claude) showed minimal to no overfitting, suggesting more robust generalization.
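The sketch below shows a minimal exact-match evaluation loop of the kind used for GSM8k-style benchmarks. It assumes a placeholder `model_generate` inference function and problems stored as dicts with `question` and `answer` fields; these names are illustrative, not the paper's actual harness.

```python
# Sketch: GSM8k-style exact-match accuracy, and the gap used as an overfitting signal.
import re

def extract_final_answer(text):
    """Take the number after a '####' marker if present, else the last number in the text."""
    marked = re.search(r"####\s*(-?[\d,\.]+)", text)
    candidate = marked.group(1) if marked else (re.findall(r"-?\d[\d,]*\.?\d*", text) or [""])[-1]
    return candidate.replace(",", "").rstrip(".")

def accuracy(model_generate, problems):
    """Fraction of problems whose extracted final answer matches the gold answer."""
    correct = sum(
        extract_final_answer(model_generate(p["question"])) == str(p["answer"])
        for p in problems
    )
    return correct / len(problems)

# Overfitting signal for one model: how much accuracy drops on the fresh benchmark.
# gap = accuracy(model_generate, gsm8k_test) - accuracy(model_generate, gsm1k)
```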

The Indicator of Overfitting

The paper identifies a statistical signal that points to overfitting:

  • Probability Relationship: There is a positive correlation (Spearman's r² = 0.32) between a model's probability of generating examples from GSM8k and its performance gap between GSM8k and GSM1k. This suggests that many models have partially memorized GSM8k, a telltale sign of overfitting (a rough sketch of this analysis appears below).
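As a rough illustration of the idea (not the paper's exact procedure), one can correlate, across models, a memorization proxy with each model's GSM8k-minus-GSM1k accuracy gap. Both input dictionaries below are hypothetical.

```python
# Sketch: correlate a memorization proxy with the GSM8k/GSM1k accuracy gap across models.
# Both inputs are hypothetical, keyed by model name:
#   per_model_loglik[m] = mean per-token log-likelihood model m assigns to GSM8k test items
#   per_model_gap[m]    = accuracy on GSM8k minus accuracy on GSM1k for model m
from scipy.stats import spearmanr

def overfitting_correlation(per_model_loglik, per_model_gap):
    models = sorted(per_model_loglik)
    memorization = [per_model_loglik[m] for m in models]
    gap = [per_model_gap[m] for m in models]
    rho, p_value = spearmanr(memorization, gap)
    # The paper reports the squared coefficient (r^2), so square the Spearman rho.
    return rho ** 2, p_value
```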

Implications and Future Predictions

  • Practical Implications: Recognizing overfit models and understanding their limitations can lead to more honest assessments of LLM capabilities and guide more efficient use of resources in model training and development.
  • Theoretical Advances: These findings push the understanding of "generalization" within AI, prompting more rigorous testing environments that better measure true model capability beyond memorized data.
  • Future of AI Benchmarks: To avoid further contamination, the authors are withholding GSM1k from public release for now. Similar controlled releases could guide the development of more challenging, contamination-free benchmarks.

Model Capabilities Beyond Overfitting

Interestingly, the paper also highlights an essential nuance in the debate on AI's reasoning abilities:

  • Generalization Skills: Despite the performance drops attributable to potential overfitting, models like Phi and Mistral still perform reasonably well on GSM1k, suggesting they retain a genuine ability to generalize beyond memorized data.

In conclusion, while the GSM1k study brings to light the serious issue of overfitting in LLM evaluation, it also presents a nuanced but hopeful view of the potential for these models to develop genuine reasoning abilities. Findings like these are likely to spur both better model training practices and more robust benchmarking tools that can accurately measure and foster true AI capabilities.

Authors (15)
  1. Hugh Zhang (13 papers)
  2. Jeff Da (10 papers)
  3. Dean Lee (104 papers)
  4. Vaughn Robinson (3 papers)
  5. Catherine Wu (2 papers)
  6. Will Song (3 papers)
  7. Tiffany Zhao (2 papers)
  8. Pranav Raja (5 papers)
  9. Dylan Slack (17 papers)
  10. Qin Lyu (3 papers)
  11. Sean Hendryx (12 papers)
  12. Russell Kaplan (5 papers)
  13. Summer Yue (12 papers)
  14. Michele Lunati (1 paper)
  15. Charlotte Zhuang (1 paper)
Citations (61)