MathScale: Scaling Instruction Tuning for Mathematical Reasoning (2403.02884v1)

Published 5 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have demonstrated remarkable capabilities in problem-solving. However, their proficiency in solving mathematical problems remains inadequate. We propose MathScale, a simple and scalable method to create high-quality mathematical reasoning data using frontier LLMs (e.g., GPT-3.5). Inspired by the cognitive mechanism in human mathematical learning, it first extracts topics and knowledge points from seed math questions and then builds a concept graph, which is subsequently used to generate new math questions. MathScale exhibits effective scalability along the size axis of the math dataset that we generate. As a result, we create a mathematical reasoning dataset (MathScaleQA) containing two million math question-answer pairs. To evaluate the mathematical reasoning abilities of LLMs comprehensively, we construct MwpBench, a benchmark of Math Word Problems, which is a collection of ten datasets (including GSM8K and MATH) covering K-12, college, and competition-level math problems. We apply MathScaleQA to fine-tune open-source LLMs (e.g., LLaMA-2 and Mistral), resulting in significantly improved capabilities in mathematical reasoning. Evaluated on MwpBench, MathScale-7B achieves state-of-the-art performance across all datasets, surpassing its best peers of equivalent size by 42.9% in micro average accuracy and 43.7% in macro average accuracy.

Insights into "MathScale: Scaling Instruction Tuning for Mathematical Reasoning"

The paper "MathScale: Scaling Instruction Tuning for Mathematical Reasoning" presents a methodological approach to enhancing the mathematical reasoning abilities of LLMs through the creation of a scalable and effective dataset. This approach leverages frontier LLMs, like GPT-3.5, to generate high-quality mathematical reasoning data, thereby addressing the limitations imposed by existing datasets like GSM8K and MATH.

MathScale adopts a novel data generation pipeline, reflecting cognitive mechanisms observed in human learners. Its core stages are extracting topics and knowledge points from seed questions, constructing a concept graph from them, and synthesizing new math questions from that graph. This methodology decouples data generation from the constraints of limited existing datasets and greatly expands the volume of training data, yielding the MathScaleQA dataset of two million question-answer pairs.
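A minimal sketch of such a pipeline is shown below, assuming an arbitrary `llm` callable (prompt string in, completion string out, e.g. a thin wrapper around a GPT-3.5 chat call). The function names, prompts, and sampling sizes are illustrative assumptions, not the paper's actual implementation.

```python
import itertools
import random
from collections import defaultdict

def extract_concepts(llm, seed_question):
    """Ask the LLM for the topics and knowledge points behind one seed question."""
    reply = llm(
        "List, one per line, the math topics and knowledge points needed to solve:\n"
        + seed_question
    )
    return [line.strip() for line in reply.splitlines() if line.strip()]

def build_concept_graph(llm, seed_questions):
    """Nodes are concepts; edges link concepts that co-occur in the same seed question."""
    graph = defaultdict(set)
    for question in seed_questions:
        concepts = extract_concepts(llm, question)
        for a, b in itertools.combinations(concepts, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def sample_concepts(graph, size=4):
    """Walk the graph from a random concept to collect a small set of related concepts."""
    current = random.choice(list(graph))
    picked = [current]
    while len(picked) < size:
        candidates = [c for c in graph[current] if c not in picked]
        if not candidates:
            break
        current = random.choice(candidates)
        picked.append(current)
    return picked

def generate_qa(llm, concepts):
    """Ask the LLM to compose a new question-answer pair exercising the sampled concepts."""
    return llm(
        "Write a new math word problem with a step-by-step solution that combines: "
        + ", ".join(concepts)
    )
```

Repeating `sample_concepts` and `generate_qa` many times is what lets the synthetic dataset grow far beyond the seed set, which is the scalability the paper exploits.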

The effectiveness of MathScale is evaluated on MwpBench, a benchmark spanning K-12 to competition-level math problems that enables consistent and fair model comparisons. MathScale-7B, fine-tuned on the MathScaleQA dataset, significantly outperforms the best open-source baselines of equivalent size, with 42.9% higher micro-average accuracy and 43.7% higher macro-average accuracy across the benchmark's datasets.
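As a reminder of how the two aggregate metrics differ: micro-average accuracy weights every test question equally across the ten MwpBench datasets, while macro-average accuracy first computes accuracy per dataset and then averages those values. A minimal sketch follows; the dataset names and counts are placeholders, not MwpBench's actual sizes.

```python
def micro_macro_accuracy(results):
    """results maps dataset name -> (num_correct, num_total)."""
    total_correct = sum(correct for correct, _ in results.values())
    total_questions = sum(total for _, total in results.values())
    micro = total_correct / total_questions
    per_dataset = [correct / total for correct, total in results.values()]
    macro = sum(per_dataset) / len(per_dataset)
    return micro, macro

# Placeholder counts for illustration only.
example = {"GSM8K": (900, 1319), "MATH": (1700, 5000), "CollegeMath": (800, 2800)}
print(micro_macro_accuracy(example))  # micro weights large datasets more heavily than macro
```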

A key facet is the concept graph framework, which draws on the cognitive processes of concept compression and connection forging. This aligns with Tall's theory of mathematical learning, suggesting parallels between the MathScale pipeline and effective human learning strategies. By extracting both "topics" and "knowledge points" and recombining them, MathScale generates a more diverse dataset, which is crucial for better model generalization.
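To see why recombining topics and knowledge points expands diversity, consider a back-of-the-envelope count. The graph sizes and per-question sampling sizes below are assumptions for illustration, and the raw product ignores graph-connectivity constraints, which reduce it in practice.

```python
from math import comb

# Illustrative sizes only: 500 topic nodes and 5,000 knowledge-point nodes,
# with each synthesized question combining 2 topics and 3 knowledge points.
topics, knowledge_points = 500, 5_000
combinations = comb(topics, 2) * comb(knowledge_points, 3)
print(f"{combinations:.2e} possible concept combinations")  # ~2.60e+15
```

Even after the concept graph prunes most of these combinations, the space of plausible new questions dwarfs the original seed set.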

Although the work focuses on natural language reasoning, its implications hint at opportunities for incorporating program-based tool use, akin to methods like ToRA. Integrating such tool-based reasoning is left for future work.

The research demonstrates that the approach continues to improve as the synthetic dataset grows. The inclusion of diverse mathematical concepts and the transformation of raw seed data into structured information position MathScale favorably as a foundation for future research. It opens avenues for fine-tuning larger models such as LLaMA-2 70B, extending the dataset beyond two million examples, and integrating programming tools for more comprehensive reasoning.

In conclusion, MathScale exemplifies a strategic approach to data augmentation for AI-driven mathematical reasoning. While it delivers substantial accuracy improvements, the research encourages further exploration into combining cognitive and computational strategies for a truly holistic instruction-tuning approach.

References (32)
  1. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022.
  2. Generative AI for math: Abel. https://github.com/GAIR-NLP/abel, 2023.
  3. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  4. Corral, M. CORRAL’S VECTOR CALCULUS. 2008.
  5. Probability and statistics: The science of uncertainty. Macmillan, 2004.
  6. PAL: Program-aided language models. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp.  10764–10799. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/gao23f.html.
  7. ToRA: A tool-integrated reasoning agent for mathematical problem solving. arXiv preprint arXiv:2309.17452, 2023.
  8. Grinstead and Snell’s introduction to probability. Chance Project, 2006.
  9. Guichard, D. Calculus. 2009.
  10. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
  11. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
  12. Sequence-level knowledge distillation. In Su, J., Duh, K., and Carreras, X. (eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp.  1317–1327, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1139. URL https://aclanthology.org/D16-1139.
  13. A First Course in Linear Algebra, 2017A version (Lyryx). Lyryx, 2017.
  14. WizardMath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023.
  15. Selinger, P. Matrix theory and linear algebra, 2018. URL https://www.mathstat.dal.ca/~selinger/linear-algebra/. An introduction to linear algebra for first or second year university students. Licensed under Creative Commons CC BY 4.0 License. Last updated on October 26, 2018.
  16. Precalculus. Stitz Zeager Open Source Mathematics, 2013.
  17. TAL. TAL-SCQ5K, 2023. URL https://github.com/math-eval/TAL-SCQ5K. GitHub repository.
  18. Tall, D. How humans learn to think mathematically: Exploring the three worlds of mathematics. Cambridge University Press, 2013.
  19. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  20. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  21. Trench, W. F. Elementary Differential Equations. Brooks/Cole Thomson Learning, San Antonio, Texas, USA, 2001. URL http://ramanujan.math.trinity.edu/wtrench/texts/TRENCH_DIFF_EQNS_I.PDF. Free Edition 1.01 (December 2013).
  22. Wallace, T. Beginning and intermediate algebra. 2010.
  23. Deep neural solver for math word problems. In Proceedings of the 2017 conference on empirical methods in natural language processing, pp.  845–854, 2017.
  24. How far can camels go? exploring the state of instruction tuning on open resources. arXiv preprint arXiv:2306.04751, 2023.
  25. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
  26. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  27. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
  28. MetaMath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
  29. MAmmoTH: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023.
  30. Evaluating the performance of large language models on the Gaokao benchmark. 2023.
  31. Ape210K: A large-scale and template-rich dataset of math word problems. arXiv preprint arXiv:2009.11506, 2020.
  32. AGIEval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023.
Authors (4)
  1. Zhengyang Tang (13 papers)
  2. Xingxing Zhang (65 papers)
  3. Furu Wei (291 papers)
  4. Benyou Wang (109 papers)
Citations (26)