- The paper introduces OpenMathInstruct-2 to enhance LLM math reasoning; its more concise chain-of-thought solution format improves SFT performance by 3.9% over Llama's original format.
- Data generated by a strong teacher model outperforms on-policy data from a weaker student model by 7.8%, and SFT proves resilient to up to 20% low-quality solutions.
- Expanding unique questions from 1K to 6.5K and scaling the dataset to 14M pairs lifts MATH benchmark accuracy by 15.9 points, advancing open-source AI research.
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data
The paper "OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data" addresses the ongoing challenges in enhancing mathematical reasoning within LLMs. Most state-of-the-art advancements in this area remain inaccessible due to proprietary datasets, constraining researchers' efforts to experiment with and refine data synthesis techniques. This paper proposes the OpenMathInstruct-2, a novel dataset designed to facilitate open-source advancements in this domain.
Methodology and Key Findings
The research team built the OpenMathInstruct-2 dataset by synthesizing supervised fine-tuning (SFT) data at scale with the Llama3.1 model family. Ablation experiments yielded several significant insights:
- Solution Format: Excessively verbose solutions hurt SFT performance. The paper introduces a more concise chain-of-thought (CoT) solution format that outperforms Llama's original format by 3.9% while producing solutions that are 40% shorter.
- Data Generation: Data generated by a strong teacher model surpassed on-policy data produced by a weaker student model by 7.8%, favoring a stronger teacher for synthesis (a minimal pipeline sketch appears after the dataset summary below).
- Robustness to Low-Quality Data: The SFT process proved resilient to low-quality data, with no significant performance drop even when up to 20% of the solutions were low quality.
- Question Diversity: Augmenting question diversity proved crucial: increasing the number of unique questions from 1K to 6.5K yielded a 10.5% improvement on the MATH validation set (a minimal augmentation sketch follows this list).
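To make the diversity finding concrete, below is a minimal sketch of how a seed question set might be expanded with a teacher model. The prompt wording and the `complete` helper are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical question-augmentation sketch. `complete` stands in for any
# LLM inference call (e.g. to a served Llama3.1 teacher); its name and
# signature are assumptions, not the paper's code.

AUGMENT_PROMPT = """Write a new math problem inspired by the one below.
Keep the difficulty similar, but change the numbers and the context.

Original problem:
{question}

New problem:"""

def complete(prompt: str, temperature: float = 0.7) -> str:
    """Placeholder for a teacher-model inference call."""
    raise NotImplementedError("plug in your LLM client here")

def augment_questions(seed_questions: list[str], variants_per_question: int = 4) -> list[str]:
    """Expand a small seed set of questions into a larger, more diverse pool."""
    augmented = []
    for question in seed_questions:
        for _ in range(variants_per_question):
            augmented.append(complete(AUGMENT_PROMPT.format(question=question)).strip())
    return augmented
```

In practice, newly synthesized questions would also need to be deduplicated and decontaminated against benchmark test sets before training.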
Utilizing these insights, the authors created the OpenMathInstruct-2 dataset, comprising 14 million question-solution pairs, approximately eight times larger than prior comparable open-source math datasets.
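As a rough illustration of how such question-solution pairs can be produced, the sketch below samples several chain-of-thought solutions per question from a teacher model and keeps only those whose final answer matches the reference. The helper names, and the use of `\boxed{}` to mark final answers, are assumptions for illustration.

```python
import re

def sample_solutions(question: str, n: int = 4) -> list[str]:
    """Placeholder: sample n chain-of-thought solutions for `question`
    from the teacher model (name and signature are assumptions)."""
    raise NotImplementedError("plug in your LLM client here")

def extract_final_answer(solution: str) -> str | None:
    """Pull the last \\boxed{...} answer from a solution, if any.
    (A real grader also normalizes answers and handles nested braces.)"""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None

def synthesize_pairs(examples: list[dict], samples_per_question: int = 4) -> list[dict]:
    """Keep only (question, solution) pairs whose final answer matches
    the reference; mismatched generations are discarded."""
    pairs = []
    for ex in examples:
        for solution in sample_solutions(ex["question"], samples_per_question):
            if extract_final_answer(solution) == ex["answer"]:
                pairs.append({"question": ex["question"], "solution": solution})
    return pairs
```

Even with answer-based filtering, some flawed reasoning chains that happen to reach the correct final answer will slip through; the robustness finding above suggests SFT tolerates that residual noise.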
Fine-tuning the Llama-3.1-8B-Base model on OpenMathInstruct-2 produced striking results. On the MATH benchmark, accuracy increased by 15.9 points (absolute), from 51.9% to 67.8%. OpenMath2-Llama3.1-70B achieved 71.9% accuracy, outperforming its counterpart by 3.9 points. These results substantiate the dataset's potential to substantially enhance LLM mathematical reasoning capabilities.
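For context on how such benchmark numbers are computed, a minimal grader might extract each model's final answer and compare it with the reference, as sketched below. Real MATH evaluation does much more normalization (symbolic equivalence, LaTeX cleanup), so treat this as a schematic, not the paper's evaluation code.

```python
import re

def extract_final_answer(solution: str) -> str | None:
    """Pull the last \\boxed{...} answer from a model solution, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None

def math_accuracy(model_solutions: list[str], reference_answers: list[str]) -> float:
    """Fraction of problems whose extracted answer matches the reference
    exactly (real graders also check symbolic equivalence)."""
    correct = sum(
        extract_final_answer(sol) == ref
        for sol, ref in zip(model_solutions, reference_answers)
    )
    return correct / len(reference_answers)
```

At this granularity, 67.8% on MATH's 5,000-problem test set corresponds to roughly 3,390 correctly answered problems.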
Implications and Future Work
The release of OpenMathInstruct-2 under a permissive commercial license allows for broad application and further development within the research community. By providing both the dataset and code, this work accelerates open-source efforts and democratizes access to high-quality mathematical reasoning resources.
The paper also lays solid groundwork for future research into fine-tuning methods and solution formats. The insights on verbosity, teacher-model selection, and data diversity offer practical guidance for synthesizing training datasets in broader AI domains.
Conclusion
By addressing the inaccessibility of proprietary datasets, the OpenMathInstruct-2 initiative makes a significant contribution to the open-source landscape. Transparently sharing its findings and tools, it encourages further exploration of LLM-based mathematical reasoning. Researchers can build on this work, exploring new applications and further refining data synthesis techniques.