
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning (2310.03731v1)

Published 5 Oct 2023 in cs.CL, cs.AI, cs.CV, and cs.LG

Abstract: The recently released GPT-4 Code Interpreter has demonstrated remarkable proficiency in solving challenging math problems, primarily attributed to its ability to seamlessly reason with natural language, generate code, execute code, and continue reasoning based on the execution output. In this paper, we present a method to fine-tune open-source LLMs, enabling them to use code for modeling and deriving math equations and, consequently, enhancing their mathematical reasoning abilities. We propose a method of generating novel and high-quality datasets with math problems and their code-based solutions, referred to as MathCodeInstruct. Each solution interleaves natural language, code, and execution results. We also introduce a customized supervised fine-tuning and inference approach. This approach yields the MathCoder models, a family of models capable of generating code-based solutions for solving challenging math problems. Impressively, the MathCoder models achieve state-of-the-art scores among open-source LLMs on the MATH (45.2%) and GSM8K (83.9%) datasets, substantially outperforming other open-source alternatives. Notably, the MathCoder model not only surpasses ChatGPT-3.5 and PaLM-2 on GSM8K and MATH but also outperforms GPT-4 on the competition-level MATH dataset. The dataset and models will be released at https://github.com/mathLLM/MathCoder.

Mathematical Reasoning with MathCoder: Integrating Code into Open-Source LLMs

This paper presents MathCoder, an effort to enhance the mathematical proficiency of open-source LLMs by integrating code execution capabilities. The work is motivated by the gap in mathematical reasoning between publicly available models and closed-source counterparts such as GPT-4. The approach builds on the strong performance of GPT-4 Code Interpreter, which combines natural language reasoning with programmatic execution and thereby excels at complex mathematical tasks.

The authors propose a systematic method for fine-tuning open-source models by leveraging the concepts of natural language reasoning, code generation, and execution outputs, thereby elevating their capacity to tackle intricate mathematical problems. MathCoder introduces an innovative dataset, MathCodeInstruct, which combines math problems with their code-driven solutions, formulated in an interleaved sequence of text, code, and execution results. The solutions within this dataset echo the operational paradigm of the GPT-4 Code Interpreter, aiming to simulate its success within the open-source sphere.
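The interleaved structure described above can be sketched as data: a solution is an ordered sequence of natural-language, code, and execution blocks. The block labels and delimiter tokens below are illustrative assumptions for this sketch, not the paper's exact special tokens.

```python
# A minimal sketch of one MathCodeInstruct-style solution: natural-language
# reasoning, code, and execution output interleaved as an ordered sequence.
# The delimiter tokens here are assumptions, not the released format.

solution_blocks = [
    ("text", "Let n be the number of apples. We need 3*n + 2 = 17."),
    ("code", "n = (17 - 2) // 3\nprint(n)"),
    ("execution", "5"),
    ("text", "So there are 5 apples."),
]

def render(blocks):
    """Serialize the interleaved blocks with assumed delimiter tokens."""
    markers = {"text": "<|text|>", "code": "<|code|>", "execution": "<|execution|>"}
    return "\n".join(markers[kind] + "\n" + body for kind, body in blocks)

sample = render(solution_blocks)
```

Serializing each block behind a distinct marker is what lets a single autoregressive model learn all three modalities in one sequence.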

To construct MathCodeInstruct, the authors employ a two-phase process. First, a "seed" set of problems is drawn from the GSM8K and MATH datasets, with solutions generated in the style of GPT-4 Code Interpreter. Second, novel problems are created via a technique dubbed problem interpolation prompting, which bridges the difficulty gap between GSM8K and MATH problems. MathCoder-Initial, a model fine-tuned on the seed data, then generates solutions for these new problems, augmenting the training corpus with intermediate-level challenges. This method both enriches the dataset and improves the model's generalization.
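Problem interpolation prompting, as summarized above, pairs an easier GSM8K problem with a harder MATH problem and asks a strong model for a new problem of intermediate difficulty. The template wording below is a hedged illustration, not the paper's exact prompt.

```python
# A sketch of problem interpolation prompting. The template text is an
# assumption for illustration; the paper's actual prompt may differ.

INTERPOLATION_TEMPLATE = (
    "Problem 1 (easier):\n{gsm8k_problem}\n\n"
    "Problem 2 (harder):\n{math_problem}\n\n"
    "Write a new, self-contained math problem whose difficulty lies "
    "between Problem 1 and Problem 2. Output only the problem."
)

def build_interpolation_prompt(gsm8k_problem: str, math_problem: str) -> str:
    """Fill the template with one easy and one hard anchor problem."""
    return INTERPOLATION_TEMPLATE.format(
        gsm8k_problem=gsm8k_problem.strip(),
        math_problem=math_problem.strip(),
    )

prompt = build_interpolation_prompt(
    "Tom has 3 bags with 4 marbles each. How many marbles does he have?",
    "How many positive divisors does 2^3 * 3^2 have?",
)
```

Anchoring the request between two concrete difficulty levels is what steers the generator toward the intermediate band that neither source dataset covers.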

MathCoder's fine-tuning process is distinguished by a novel training and inference pipeline using customized tokens to guide the learning of interleaved natural language and code generation. During inference, these tokens facilitate the model's understanding of when to execute code and interpret the ensuing results, ensuring a sequential and logical solution path akin to a human problem solver.
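The execute-and-continue inference described above can be sketched as a loop: generate until the model closes a code block, run the code, append the output, and resume generation. The `generate_until` callback and the delimiter tokens below are assumptions for illustration, not the released MathCoder API, and real deployments would sandbox the execution step.

```python
# A minimal sketch of an execute-and-continue inference loop.
# `generate_until` and the token names are illustrative assumptions.
import contextlib
import io

def run_python(code: str) -> str:
    """Execute generated code and capture stdout (sandboxing omitted)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, {})
    return buf.getvalue().strip()

def solve(problem: str, generate_until, max_rounds: int = 5) -> str:
    """Alternate model generation with code execution until a final answer."""
    transcript = problem
    for _ in range(max_rounds):
        # The model emits text and possibly a code block, then stops.
        chunk, code = generate_until(transcript)
        transcript += chunk
        if code is None:           # no code emitted: reasoning is finished
            break
        result = run_python(code)  # feed execution output back to the model
        transcript += f"\n<|execution|>\n{result}\n"
    return transcript

# Demo with a stub generator standing in for the fine-tuned model.
def fake_generate(transcript):
    if "<|execution|>" not in transcript:
        return ("\n<|code|>\nprint(2+3)\n", "print(2+3)")
    return ("\nThe answer is 5.", None)

out = solve("What is 2+3?", fake_generate)
```

Stopping generation at the end of each code block is the key design choice: it lets real execution results, rather than hallucinated ones, condition the remainder of the solution.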

The MathCoder family of models demonstrates impressive results on benchmark datasets, notably reaching 45.2% on the MATH dataset and 83.9% on the GSM8K dataset, surpassing existing open-source benchmarks. The advancements achieved by MathCoder highlight the tremendous potential of integrating programmatic execution within LLMs, setting a new standard in the landscape of open-source LLMs.

The implications of this research extend beyond the reported results. Integrating executable code into natural-language LLMs could transform problem-solving in fields where computational support is indispensable. Future work could diversify the problem domains MathCoder addresses, potentially incorporating multi-modal data to tackle a wider array of challenges. Improving computational efficiency on larger datasets and further narrowing the performance gap with closed-source models remain open directions.

While the research secures significant advancements, it also underscores limitations, such as the potential over-reliance on seed data from established datasets and the need to ensure computational scalability. Nevertheless, MathCoder marks a pivotal development in nurturing open-access models capable of engaging in high-level mathematical reasoning, thus contributing a valuable resource to both academia and industry.

Authors (10)
  1. Ke Wang (529 papers)
  2. Houxing Ren (16 papers)
  3. Aojun Zhou (45 papers)
  4. Zimu Lu (10 papers)
  5. Sichun Luo (15 papers)
  6. Weikang Shi (9 papers)
  7. Renrui Zhang (100 papers)
  8. Linqi Song (93 papers)
  9. Mingjie Zhan (23 papers)
  10. Hongsheng Li (340 papers)
Citations (60)