
MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code (2410.08196v1)

Published 10 Oct 2024 in cs.CL, cs.AI, and cs.CV

Abstract: Code has been shown to be effective in enhancing the mathematical reasoning abilities of LLMs due to its precision and accuracy. Previous works involving continued mathematical pretraining often include code that utilizes math-related packages, which are primarily designed for fields such as engineering, machine learning, signal processing, or module testing, rather than being directly focused on mathematical reasoning. In this paper, we introduce a novel method for generating mathematical code accompanied with corresponding reasoning steps for continued pretraining. Our approach begins with the construction of a high-quality mathematical continued pretraining dataset by incorporating math-related web data, code using mathematical packages, math textbooks, and synthetic data. Next, we construct reasoning steps by extracting LaTeX expressions, the conditions needed for the expressions, and the results of the expressions from the previously collected dataset. Based on this extracted information, we generate corresponding code to accurately capture the mathematical reasoning process. Appending the generated code to each reasoning step results in data consisting of paired natural language reasoning steps and their corresponding code. Combining this data with the original dataset results in a 19.2B-token high-performing mathematical pretraining corpus, which we name MathCode-Pile. Training several popular base models with this corpus significantly improves their mathematical abilities, leading to the creation of the MathCoder2 family of models. All of our data processing and training code is open-sourced, ensuring full transparency and easy reproducibility of the entire data collection and training pipeline. The code is released at https://github.com/mathLLM/MathCoder2 .

Authors (8)
  1. Zimu Lu
  2. Aojun Zhou
  3. Ke Wang
  4. Houxing Ren
  5. Weikang Shi
  6. Junting Pan
  7. Mingjie Zhan
  8. Hongsheng Li

Summary

  • The paper demonstrates a novel two-stage pretraining approach that integrates curated math texts with model-translated code to enhance LLMs' reasoning abilities.
  • The methodology builds MathCode-Pile, a 19.2B-token corpus that combines curated math-related text (including 11.2B tokens of filtered web data) with model-translated code paired to its natural language reasoning steps.
  • Experimental results highlight significant benchmark gains, with MathCoder2-Llama-3-8B achieving 38.4% on MATH and 69.9% on GSM8K tests.

Insights on MathCoder2: Enhancing Mathematical Reasoning in LLMs

The paper "MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code" presents a sophisticated approach to refining the mathematical reasoning capabilities of LLMs through a methodically curated dataset and innovative code generation. This research introduces a novel pretraining corpus, MathCode-Pile, which addresses existing gaps in current methodologies for enhancing mathematical proficiency in LLMs by integrating natural language reasoning with mathematical computations.

Methodology

The authors propose a two-step data-curation pipeline that begins with the assembly of diverse math-related texts, including web content, synthetic data, code, and textbooks. The text data is filtered meticulously with fastText classifiers so that only highly relevant mathematical content is retained. The resulting foundation, which includes 11.2B tokens of math-related web data, provides comprehensive coverage of mathematical topics.
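The filtering stage can be pictured as a classifier gating each candidate document on its math-relevance score. The sketch below is illustrative only: `math_relevance_score` is a hypothetical stand-in for a trained fastText model's `predict()` call, and the cue list and threshold are invented for the example, not taken from the paper.

```python
# Toy sketch of classifier-based corpus filtering: score each document for
# math relevance and keep only the high-scoring ones. A real pipeline would
# replace math_relevance_score with a trained fastText supervised model.

MATH_CUES = {"theorem", "proof", "equation", "integral", "polynomial", "lemma"}

def math_relevance_score(text: str) -> float:
    """Toy proxy for a classifier's P(label = math | text)."""
    words = [w.strip(".,") for w in text.lower().split()]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in MATH_CUES)
    return min(1.0, 10 * hits / len(words))

def filter_corpus(docs, threshold=0.5):
    """Keep only documents the classifier deems math-related."""
    return [d for d in docs if math_relevance_score(d) >= threshold]

docs = [
    "We prove the theorem using an integral equation.",
    "Top ten travel destinations for the summer.",
]
print(filter_corpus(docs))  # only the first document survives
```

The same shape scales to web-crawl filtering: train the classifier on labeled math/non-math examples, then stream documents through it and retain those above a tuned threshold.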

The innovative aspect of this paper lies in its second stage: the generation of paired mathematical code and reasoning steps. By extracting LaTeX expressions, together with their conditions and results, from the curated dataset and generating corresponding Python code snippets, the authors enable LLMs to better understand and replicate mathematical reasoning processes. Combining this paired data with the original dataset yields MathCode-Pile, a 19.2B-token corpus designed to systematically improve the mathematical reasoning abilities of LLMs.
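The extraction step described above starts by locating LaTeX expressions inside a reasoning passage. A minimal sketch of that sub-step, assuming standard `$...$`, `$$...$$`, and `\[...\]` delimiters (the regex and function name are illustrative, not the paper's released code):

```python
import re

# Match display math ($$...$$ or \[...\]) before inline math ($...$),
# so the longer delimiters are not consumed as two inline expressions.
LATEX_PATTERN = re.compile(
    r"\$\$(.+?)\$\$"      # display math: $$ ... $$
    r"|\\\[(.+?)\\\]"     # display math: \[ ... \]
    r"|\$(.+?)\$",        # inline math:  $ ... $
    re.DOTALL,
)

def extract_latex(text: str) -> list[str]:
    """Return every LaTeX expression found in the text, in order."""
    return [next(g for g in m.groups() if g is not None)
            for m in LATEX_PATTERN.finditer(text)]

passage = (
    "Given $a = 3$ and $b = 4$, the hypotenuse satisfies "
    r"\[c = \sqrt{a^2 + b^2}\] so $c = 5$."
)
print(extract_latex(passage))
# ['a = 3', 'b = 4', 'c = \\sqrt{a^2 + b^2}', 'c = 5']
```

Each extracted expression, plus its surrounding conditions and result, would then be handed to a model that translates it into an executable snippet, producing the paired text-and-code training examples.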

Experimental Evaluation

For empirical evaluation, MathCode-Pile was used for continued pretraining of several popular base models: Llama-3-8B, DeepSeekMath-7B, Mistral-7B, and Code-Llama-7B. Continued pretraining on MathCode-Pile produced gains across five mathematical benchmarks, including GSM8K and MATH. For instance, MathCoder2-Llama-3-8B achieved 4-shot accuracies of 38.4% on MATH and 69.9% on GSM8K, a significant improvement over the baseline models.
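The 4-shot accuracies quoted above follow a standard few-shot protocol: prepend four solved exemplars to each test question, then score the model's final answer by exact match. A minimal sketch of that harness, where `toy_model`, the exemplars, and the test items are all invented stand-ins (not GSM8K/MATH content or the paper's evaluation code):

```python
# Few-shot exact-match evaluation harness, sketched with toy data.

def build_few_shot_prompt(exemplars, question):
    """Concatenate k solved exemplars, then append the new question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\n\nQ: {question}\nA:"

def exact_match_accuracy(model, exemplars, test_set):
    """Fraction of test questions whose answer exactly matches the gold."""
    correct = 0
    for question, gold in test_set:
        prompt = build_few_shot_prompt(exemplars, question)
        if model(prompt).strip() == gold:
            correct += 1
    return correct / len(test_set)

# Toy stand-in "model" that answers arithmetic questions of the form "x + y".
def toy_model(prompt: str) -> str:
    last_q = prompt.rsplit("Q: ", 1)[1].split("\nA:")[0]
    x, _, y = last_q.split()
    return str(int(x) + int(y))

exemplars = [("1 + 1", "2"), ("2 + 2", "4"), ("3 + 3", "6"), ("4 + 4", "8")]
tests = [("5 + 7", "12"), ("10 + 2", "12"), ("6 + 6", "13")]
print(exact_match_accuracy(toy_model, exemplars, tests))  # 2 of 3 correct
```

Real evaluations additionally need answer-extraction logic (e.g. parsing the boxed answer on MATH) before the exact-match comparison, which is why reported numbers depend on the harness as well as the model.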

Notably, the paper highlights that the integration of mathematical code, despite representing a mere 14.1% of the dataset, accounts for a substantial portion of the efficacy gains. This underscores the importance of model-translated code in capturing mathematical reasoning.

Implications and Future Directions

This comprehensive approach, emphasizing the fusion of natural language with computational reasoning, broadens the scope of potential applications in education technology, research, and automated theorem proving. The rigorous methodology not only enhances mathematical abilities but also provides a transparent and reproducible framework for subsequent research.

Future work could extend this methodology to other STEM fields or involve larger models to further enhance capabilities. Moreover, experimenting with different post-training techniques, such as reinforcement learning and direct preference optimization, could yield even more remarkable results on mathematical reasoning tasks.

In conclusion, this research represents a significant contribution to the domain of mathematical reasoning in LLMs, presenting an open-sourced, reproducible framework that stands to facilitate further advancements in AI-driven mathematical problem-solving.
