MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code (2410.08196v1)
Abstract: Code has been shown to be effective in enhancing the mathematical reasoning abilities of LLMs due to its precision and accuracy. Previous works on continued mathematical pretraining often include code that uses math-related packages, which are primarily designed for fields such as engineering, machine learning, signal processing, or module testing, rather than being directly focused on mathematical reasoning. In this paper, we introduce a novel method for generating mathematical code accompanied by corresponding reasoning steps for continued pretraining. Our approach begins with the construction of a high-quality mathematical continued-pretraining dataset by incorporating math-related web data, code using mathematical packages, math textbooks, and synthetic data. Next, we construct reasoning steps by extracting LaTeX expressions, the conditions those expressions require, and the results they produce from the collected dataset. Based on this extracted information, we generate corresponding code to accurately capture the mathematical reasoning process. Appending the generated code to each reasoning step yields data consisting of paired natural-language reasoning steps and their corresponding code. Combining this data with the original dataset results in a 19.2B-token high-performing mathematical pretraining corpus, which we name MathCode-Pile. Training several popular base models with this corpus significantly improves their mathematical abilities, leading to the MathCoder2 family of models. All of our data-processing and training code is open-sourced, ensuring full transparency and easy reproducibility of the entire data collection and training pipeline. The code is released at https://github.com/mathLLM/MathCoder2.
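As a rough illustration of the pairing described in the abstract, the sketch below shows how one extracted reasoning step (its text, conditions, LaTeX expression, and result) might be combined with generated code that re-derives the result. This is a minimal sketch under assumed data structures: the field names, the example step, and the `generate_code` / `build_pretraining_example` helpers are hypothetical, not the paper's actual extraction or generation prompts, which live in the open-sourced pipeline.

```python
# Hypothetical sketch: pair a natural-language reasoning step with generated
# code that recomputes the result of its LaTeX expression, then append the
# code to the step to form one pretraining example.

from dataclasses import dataclass


@dataclass
class ReasoningStep:
    text: str        # the original natural-language reasoning step
    conditions: str  # conditions the expression depends on
    expression: str  # LaTeX expression extracted from the step
    result: str      # expected result of evaluating the expression


def generate_code(step: ReasoningStep) -> str:
    """Stand-in for the LLM call that translates one reasoning step into
    executable code; the translation is hard-coded here for the example."""
    return (
        "from sympy import symbols, solve\n"
        "x = symbols('x')\n"
        "roots = solve(x**2 - 5*x + 6, x)  # condition: x^2 - 5x + 6 = 0\n"
        "assert sorted(roots) == [2, 3]    # matches the stated result\n"
    )


def build_pretraining_example(step: ReasoningStep) -> str:
    """Append the generated code block to the reasoning step, yielding one
    paired natural-language + code training example."""
    code = generate_code(step)
    return f"{step.text}\n```python\n{code}```\n"


step = ReasoningStep(
    text="Solving $x^2 - 5x + 6 = 0$ gives the roots $x = 2$ and $x = 3$.",
    conditions="x^2 - 5x + 6 = 0",
    expression="x = 2,\\; x = 3",
    result="2, 3",
)
print(build_pretraining_example(step))
```

In the paper's pipeline the generated code is additionally executed and checked against the extracted result before being kept; the assertion inside the generated snippet above gestures at that verification step.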
Authors: Zimu Lu, Aojun Zhou, Ke Wang, Houxing Ren, Weikang Shi, Junting Pan, Mingjie Zhan, Hongsheng Li