Introduction
Integrating Python code interpreter capabilities into LLMs can significantly enhance their performance on mathematical reasoning tasks. In light of this, a new paper introduces "MARIO: MAth Reasoning with code Interpreter Output," a fine-tuning pipeline for LLMs that yields substantial improvements in this domain. The gains come from meticulously correcting existing datasets and augmenting them with code snippets that guarantee precise computation, supported by a multi-task fine-tuning approach. This work lays the groundwork for advancing LLMs on tasks that require both textual reasoning and exact calculation.
Dataset
A central contribution of the paper is a robust dataset that combines textual reasoning with code execution. Starting from common grade-school math problems, the authors corrected and enhanced the GSM8K dataset and applied the same treatment to the more challenging MATH dataset. Both went through a thorough process of GPT-model annotation, expert review, and self-training to rectify errors and produce accurate solutions. For the harder MATH problems, the authors used a fine-tuned LLM in a self-training loop to identify correct solutions, which were then fed back to improve the model. Augmenting these datasets with questions from MetaMath added diversity, yielding a corpus of approximately 28.8K solutions. The authors favor an HTML-like format for fine-tuning, which they argue aligns better with the model's pre-training corpus.
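To make the text-plus-code format concrete, below is a minimal sketch of what one annotated solution might look like and how its embedded snippet can be re-executed to reproduce the recorded output. The specific tag names (<p>, <code>, <output>) and the helper function are illustrative assumptions; the paper states only that an HTML-like layout is used, not this exact schema.

```python
import re

# Hypothetical HTML-like solution record: text reasoning, executable code,
# and the captured interpreter output (tag names assumed for illustration).
solution = """
<p>Each basket holds 12 apples and there are 7 baskets, so we multiply.</p>
<code>
result = 12 * 7
print(result)
</code>
<output>84</output>
<p>Final answer: 84</p>
"""

def run_code_blocks(text: str) -> list[str]:
    """Execute each <code> block with a Python interpreter and capture stdout."""
    import io, contextlib
    outputs = []
    for block in re.findall(r"<code>(.*?)</code>", text, flags=re.DOTALL):
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(block, {})  # sandboxing omitted for brevity
        outputs.append(buf.getvalue().strip())
    return outputs

# Re-running the snippet reproduces the annotated <output>, which is how
# code execution keeps the arithmetic exact rather than hallucinated.
print(run_code_blocks(solution))  # ['84']
```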
Fine-Tuning
At the heart of MARIO is a fine-tuning pipeline built on the Llemma model, a math-oriented derivative of Llama-2 that is already proficient in mathematical and coding domains. This choice matters: neither Llama-2 nor Llemma shows signs of overfitting to the relevant evaluation sets, supporting the claim that improvements reflect genuine generalization. The authors propose a three-stage fine-tuning pipeline: continual pre-training (CPT), supervised fine-tuning (SFT), and multi-task outcome value model (OVM) fine-tuning. Through this three-stage strategy, MARIO not only generates solutions but also evaluates them, providing a self-verifying mechanism crucial for mathematical problem-solving.
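The multi-task OVM stage can be pictured as one backbone with two heads: a language-modeling head trained on next-token prediction and a value head trained to score whether a solution's final answer is correct. The sketch below is an assumed module layout with an assumed 1:1 loss weighting, offered only to illustrate the idea of jointly training generation and verification; it is not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class MultiTaskOVM(nn.Module):
    """Illustrative multi-task setup: shared backbone, LM head + value head."""

    def __init__(self, backbone: nn.Module, hidden_size: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone                            # e.g. a Llemma-style decoder
        self.lm_head = nn.Linear(hidden_size, vocab_size)   # next-token prediction
        self.value_head = nn.Linear(hidden_size, 1)         # per-token correctness score

    def forward(self, input_ids, labels=None, outcome=None):
        hidden = self.backbone(input_ids)                   # (batch, seq, hidden)
        logits = self.lm_head(hidden)
        value = torch.sigmoid(self.value_head(hidden)).squeeze(-1)  # (batch, seq)

        loss = 0.0
        if labels is not None:                              # SFT-style generation loss
            loss = loss + nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
            )
        if outcome is not None:                             # outcome value loss: was the
            target = outcome.float().unsqueeze(1).expand_as(value)  # final answer correct?
            loss = loss + nn.functional.binary_cross_entropy(value, target)
        return loss, logits, value
```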
Experiments and Contributions
Reported results show MARIO setting new benchmarks on the MATH dataset among models of roughly 7B parameters. This is further highlighted by strong generalization to challenging out-of-domain math datasets. The OVM can generate and evaluate solutions on its own, demonstrating proficiency in both reasoning and verification. The authors note, however, that the model's strength lies in complex problems that demand an analytical breakdown of the question, which favors their text-plus-code approach over purely code-centric solutions.
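One natural way to use a generate-and-verify model at inference time is best-of-n reranking: sample several candidate solutions, score each with the value model, and keep the highest-scoring one. The sketch below assumes hypothetical helpers `generate_solution` and `score_solution` standing in for the sampler and the OVM's value head; it illustrates the verification pattern, not the authors' exact procedure.

```python
# Best-of-n with a verifier: sample candidates, score them, keep the best.
# `generate_solution` and `score_solution` are hypothetical placeholders.
def solve_with_verification(question: str, n_samples: int = 8):
    candidates = [generate_solution(question, temperature=0.8)
                  for _ in range(n_samples)]
    scored = [(score_solution(question, sol), sol) for sol in candidates]
    best_score, best_solution = max(scored, key=lambda pair: pair[0])
    return best_solution, best_score
```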
In conclusion, MARIO represents a significant step forward in developing LLMs for mathematical reasoning, providing a corrected and augmented dataset and a fine-tuning approach that addresses limitations in exact calculation while fostering logical reasoning. The reproducible pipeline and the public release of model checkpoints point toward an open, collaborative future for research in this specialized area of AI.