
MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code (2410.08196v1)

Published 10 Oct 2024 in cs.CL, cs.AI, and cs.CV

Abstract: Code has been shown to be effective in enhancing the mathematical reasoning abilities of LLMs due to its precision and accuracy. Previous works involving continued mathematical pretraining often include code that utilizes math-related packages, which are primarily designed for fields such as engineering, machine learning, signal processing, or module testing, rather than being directly focused on mathematical reasoning. In this paper, we introduce a novel method for generating mathematical code accompanied with corresponding reasoning steps for continued pretraining. Our approach begins with the construction of a high-quality mathematical continued pretraining dataset by incorporating math-related web data, code using mathematical packages, math textbooks, and synthetic data. Next, we construct reasoning steps by extracting LaTeX expressions, the conditions needed for the expressions, and the results of the expressions from the previously collected dataset. Based on this extracted information, we generate corresponding code to accurately capture the mathematical reasoning process. Appending the generated code to each reasoning step results in data consisting of paired natural language reasoning steps and their corresponding code. Combining this data with the original dataset results in a 19.2B-token high-performing mathematical pretraining corpus, which we name MathCode-Pile. Training several popular base models with this corpus significantly improves their mathematical abilities, leading to the creation of the MathCoder2 family of models. All of our data processing and training code is open-sourced, ensuring full transparency and easy reproducibility of the entire data collection and training pipeline. The code is released at https://github.com/mathLLM/MathCoder2 .

Authors (8)
  1. Zimu Lu
  2. Aojun Zhou
  3. Ke Wang
  4. Houxing Ren
  5. Weikang Shi
  6. Junting Pan
  7. Mingjie Zhan
  8. Hongsheng Li

Summary

  • The paper demonstrates a novel two-stage pretraining approach that integrates curated math texts with model-translated code to enhance LLMs' reasoning abilities.
  • The methodology builds MathCode-Pile, a 19.2B-token corpus that combines curated math-related text (including 11.2B tokens of filtered web data) with model-translated code paired to its natural language reasoning steps.
  • Experimental results highlight significant benchmark gains, with MathCoder2-Llama-3-8B achieving 38.4% on MATH and 69.9% on GSM8K tests.

Insights on MathCoder2: Enhancing Mathematical Reasoning in LLMs

The paper "MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code" presents a sophisticated approach to refining the mathematical reasoning capabilities of LLMs through a methodically curated dataset and innovative code generation. This research introduces a novel pretraining corpus, MathCode-Pile, which addresses existing gaps in current methodologies for enhancing mathematical proficiency in LLMs by integrating natural language reasoning with mathematical computations.

Methodology

The authors propose a two-step data-curation pipeline that begins with the assembly of diverse math-related texts, including web content, synthetic data, code, and textbooks. The text data is filtered meticulously with fastText classifiers so that only highly relevant mathematical content is retained. The resulting foundation, which includes 11.2B tokens of math-related web data, provides comprehensive coverage of mathematical topics.
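The filtering stage can be pictured as a classifier gating each candidate document on its math-relevance score. The sketch below is illustrative only: `math_relevance_score` is a hypothetical stand-in for a trained fastText model's `predict()` call, and the cue list and threshold are invented for the example, not taken from the paper.

```python
# Toy sketch of classifier-based corpus filtering: score each document for
# math relevance and keep only the high-scoring ones. A real pipeline would
# replace math_relevance_score with a trained fastText supervised model.

MATH_CUES = {"theorem", "proof", "equation", "integral", "polynomial", "lemma"}

def math_relevance_score(text: str) -> float:
    """Toy proxy for a classifier's P(label = math | text)."""
    words = [w.strip(".,") for w in text.lower().split()]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in MATH_CUES)
    return min(1.0, 10 * hits / len(words))

def filter_corpus(docs, threshold=0.5):
    """Keep only documents the classifier deems math-related."""
    return [d for d in docs if math_relevance_score(d) >= threshold]

docs = [
    "We prove the theorem using an integral equation.",
    "Top ten travel destinations for the summer.",
]
print(filter_corpus(docs))  # only the first document survives
```

The same shape scales to web-crawl filtering: train the classifier on labeled math/non-math examples, then stream documents through it and retain those above a tuned threshold.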

The innovative aspect of this paper lies in its second stage: the generation of paired mathematical code and reasoning steps. By extracting LaTeX expressions, together with their conditions and results, from the curated dataset and generating corresponding Python code snippets, the authors enable LLMs to better understand and replicate mathematical reasoning processes. Combining this paired data with the original dataset yields MathCode-Pile, a 19.2B-token corpus designed to systematically improve the mathematical reasoning abilities of LLMs.
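The extraction step described above starts by locating LaTeX expressions inside a reasoning passage. A minimal sketch of that sub-step, assuming standard `$...$`, `$$...$$`, and `\[...\]` delimiters (the regex and function name are illustrative, not the paper's released code):

```python
import re

# Match display math ($$...$$ or \[...\]) before inline math ($...$),
# so the longer delimiters are not consumed as two inline expressions.
LATEX_PATTERN = re.compile(
    r"\$\$(.+?)\$\$"      # display math: $$ ... $$
    r"|\\\[(.+?)\\\]"     # display math: \[ ... \]
    r"|\$(.+?)\$",        # inline math:  $ ... $
    re.DOTALL,
)

def extract_latex(text: str) -> list[str]:
    """Return every LaTeX expression found in the text, in order."""
    return [next(g for g in m.groups() if g is not None)
            for m in LATEX_PATTERN.finditer(text)]

passage = (
    "Given $a = 3$ and $b = 4$, the hypotenuse satisfies "
    r"\[c = \sqrt{a^2 + b^2}\] so $c = 5$."
)
print(extract_latex(passage))
# ['a = 3', 'b = 4', 'c = \\sqrt{a^2 + b^2}', 'c = 5']
```

Each extracted expression, plus its surrounding conditions and result, would then be handed to a model that translates it into an executable snippet, producing the paired text-and-code training examples.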

Experimental Evaluation

For empirical evaluation, MathCode-Pile was used for continued pretraining of several popular base models: Llama-3-8B, DeepSeekMath-7B, Mistral-7B, and Code-Llama-7B. Continued pretraining on MathCode-Pile produced gains across five mathematical benchmarks, including GSM8K and MATH. For instance, MathCoder2-Llama-3-8B achieved 4-shot accuracies of 38.4% on MATH and 69.9% on GSM8K, a significant improvement over the baseline models.
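The 4-shot accuracies quoted above follow a standard few-shot protocol: prepend four solved exemplars to each test question, then score the model's final answer by exact match. A minimal sketch of that harness, where `toy_model`, the exemplars, and the test items are all invented stand-ins (not GSM8K/MATH content or the paper's evaluation code):

```python
# Few-shot exact-match evaluation harness, sketched with toy data.

def build_few_shot_prompt(exemplars, question):
    """Concatenate k solved exemplars, then append the new question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\n\nQ: {question}\nA:"

def exact_match_accuracy(model, exemplars, test_set):
    """Fraction of test questions whose answer exactly matches the gold."""
    correct = 0
    for question, gold in test_set:
        prompt = build_few_shot_prompt(exemplars, question)
        if model(prompt).strip() == gold:
            correct += 1
    return correct / len(test_set)

# Toy stand-in "model" that answers arithmetic questions of the form "x + y".
def toy_model(prompt: str) -> str:
    last_q = prompt.rsplit("Q: ", 1)[1].split("\nA:")[0]
    x, _, y = last_q.split()
    return str(int(x) + int(y))

exemplars = [("1 + 1", "2"), ("2 + 2", "4"), ("3 + 3", "6"), ("4 + 4", "8")]
tests = [("5 + 7", "12"), ("10 + 2", "12"), ("6 + 6", "13")]
print(exact_match_accuracy(toy_model, exemplars, tests))  # 2 of 3 correct
```

Real evaluations additionally need answer-extraction logic (e.g. parsing the boxed answer on MATH) before the exact-match comparison, which is why reported numbers depend on the harness as well as the model.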

Notably, the paper highlights that the integration of mathematical code, despite representing a mere 14.1% of the dataset, accounts for a substantial portion of the efficacy gains. This underscores the importance of model-translated code in capturing mathematical reasoning.

Implications and Future Directions

This comprehensive approach, emphasizing the fusion of natural language with computational reasoning, broadens the scope of potential applications in education technology, research, and automated theorem proving. The rigorous methodology not only enhances mathematical abilities but also provides a transparent and reproducible framework for subsequent research.

Future work could extend this methodology to other STEM fields or involve larger models to further enhance capabilities. Moreover, experimenting with different post-training techniques, such as reinforcement learning and direct preference optimization, could yield even more remarkable results on mathematical reasoning tasks.

In conclusion, this research represents a significant contribution to the domain of mathematical reasoning in LLMs, presenting an open-sourced, reproducible framework that stands to facilitate further advancements in AI-driven mathematical problem-solving.
