Emergent Mind


Large language models (LLMs) have seen considerable advancements in natural language understanding tasks, yet there remains a gap to bridge before attaining true artificial general intelligence, especially concerning shortcomings in mathematical reasoning capabilities. We postulate that the inherent nature of LLM training, which focuses on predicting the probability of the next token, presents challenges in effectively modeling mathematical reasoning that demands exact calculations, both from data-driven and theoretical standpoints. In this paper, we address this challenge by enriching the data landscape and introducing a novel math dataset, enhanced with a capability to utilize a Python code interpreter. This dataset is derived from GSM8K and MATH and has been further refined through a combination of GPT-4 annotations, human review, and self-training processes, where the errors in the original GSM8K training set have been fixed. Additionally, we propose a tentative, easily replicable protocol for the fine-tuning of math-specific LLMs, which has led to a significant improvement in the performance of a 7B-parameter LLM on the GSM8K and MATH datasets. We are committed to advancing the field of mathematical reasoning in LLMs and, to that end, we have made the source code for data generation, training, and inference, as well as the model checkpoints, publicly available at https://github.com/MARIO-Math-Reasoning/MARIO. We hope this will facilitate further research and development within the community.


  • MARIO enhances LLMs for math reasoning by correcting datasets and including code for precise calculations.

  • The study corrected errors in GSM8K and built on the more challenging MATH dataset for richer problem-solving training.

  • The MATH dataset underwent error rectification and solution generation using a fine-tuned LLM and expert review.

  • A fine-tuning pipeline using the Llemma model involves CPT, SFT, and multi-task OVM to prevent overfitting and improve solution generation and evaluation.

  • MARIO sets new performance benchmarks and demonstrates strong generalization in mathematical reasoning tasks, favoring text-plus-code solutions.


Integrating a Python code interpreter into LLMs can significantly enhance their performance on mathematical reasoning tasks. In light of this, a new study introduces "MARIO: MAth Reasoning with code Interpreter Output," a dataset and fine-tuning approach for math-specific LLMs that leads to substantial improvements in this domain. This is achieved by correcting existing datasets and augmenting them with code snippets that ensure precise computation, supported by a multi-task fine-tuning approach. The work lays the groundwork for LLMs to handle tasks that require both textual reasoning and exact calculations.


A core part of this study is a dataset that combines textual analysis with code execution. The authors start from the grade-school math problems of GSM8K, whose training-set errors they corrected, and add the more challenging MATH dataset. Both went through GPT-4 annotation, human review, and self-training to rectify errors and generate accurate solutions. For the harder MATH problems, the authors used a self-training approach in which a fine-tuned LLM produced solutions, and those identified as correct were fed back into training to improve the model further. Adding questions from MetaMath increased diversity, bringing the total to approximately 28.8K solutions. The authors favor an HTML-like format for fine-tuning, which seems to match the model's pre-training corpus more closely.
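To make the text-plus-code idea concrete, here is a minimal sketch of a solution that interleaves reasoning text with Python snippets and records what the interpreter prints. The summary only says the format is "HTML-like", so the `<p>`, `<code>`, and `<output>` tag names, the sample problem, and the `run_solution` helper are all assumptions for illustration, not the authors' actual schema or tooling.

```python
import contextlib
import io
import re

# Hypothetical tag names: the source only describes the format as "HTML-like".
SOLUTION = """<p>Each box holds 12 pencils and there are 7 boxes, so multiply.</p>
<code>total = 12 * 7
print(total)</code>
<p>After giving away 15 pencils, subtract from the total.</p>
<code>print(total - 15)</code>"""

CODE_BLOCK = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def run_solution(solution: str) -> str:
    """Execute each code snippet in order and append its printed output."""
    namespace: dict = {}  # shared, so later snippets can reuse earlier variables

    def execute(match: re.Match) -> str:
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(match.group(1), namespace)  # toy input only; sandbox real model output
        return f"{match.group(0)}\n<output>{buffer.getvalue().strip()}</output>"

    return CODE_BLOCK.sub(execute, solution)


annotated = run_solution(SOLUTION)
print(annotated)
```

The point of the design is that arithmetic is delegated to the interpreter while the surrounding text carries the reasoning, so the model never has to predict the digits of a result token by token.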


At the heart of MARIO is a fine-tuning pipeline built on the Llemma model, a math-oriented version of Llama-2 that is already proficient in mathematical and coding domains. This choice matters because neither Llama-2 nor Llemma shows signs of overfitting on the relevant datasets, which supports their ability to generalize. The authors propose a three-stage pipeline: continual pre-training (CPT), supervised fine-tuning (SFT), and multi-task outcome value model (OVM) fine-tuning. With this three-stage strategy, MARIO not only generates solutions but also evaluates them, providing a self-verification mechanism that is valuable for mathematical problem-solving.
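One common way to use a model that both generates and scores solutions is to sample several candidates and let the scores decide the final answer. The sketch below illustrates two such selection rules. It is not taken from the paper: the `Candidate` class, the hard-coded scores, and both selection functions are hypothetical stand-ins for whatever the outcome value model would actually produce.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Candidate:
    answer: str   # final answer extracted from one sampled solution
    value: float  # hypothetical value-model score in [0, 1]


def best_of_n(candidates: list[Candidate]) -> str:
    """Return the answer of the single highest-scored candidate."""
    return max(candidates, key=lambda c: c.value).answer


def weighted_vote(candidates: list[Candidate]) -> str:
    """Sum scores per distinct answer and return the answer with the largest total."""
    totals: dict[str, float] = defaultdict(float)
    for c in candidates:
        totals[c.answer] += c.value
    return max(totals, key=totals.get)


# Placeholder scores; a real system would obtain them from the value model.
samples = [
    Candidate("69", 0.55),
    Candidate("84", 0.60),
    Candidate("69", 0.50),
    Candidate("70", 0.10),
]
print(best_of_n(samples))      # 84
print(weighted_vote(samples))  # 69
```

The two rules can disagree, as they do here: picking the single best score trusts the verifier completely, while score-weighted voting also rewards answers that several independent samples agree on.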

Experiments and Contributions

The reported results show MARIO setting new benchmarks on the MATH dataset among models of around 7B parameters. It also generalizes well to challenging out-of-domain math datasets. Because the OVM is trained in a multi-task fashion, the same model can both generate and evaluate solutions, covering reasoning and verification in one place. The authors note, though, that the model's strength lies in complex problems that demand an analytical breakdown of the question, which favors their text-plus-code approach over purely code-centric solutions.

In conclusion, MARIO represents a significant leap forward in the development of LLMs for mathematical reasoning, providing a reinforced dataset and fine-tuning approach that skillfully addresses limitations in exact calculations while fostering logical reasoning. The reproducible pipeline and the public release of model checkpoints suggest an open and collaborative future for research in this specialized area of AI.
