Orca-Math: Unlocking the potential of SLMs in Grade School Math (2402.14830v1)

Published 16 Feb 2024 in cs.CL and cs.AI

Abstract: Mathematical word problem-solving has long been recognized as a complex task for small language models (SLMs). A recent study hypothesized that the smallest model size needed to achieve over 80% accuracy on the GSM8K benchmark is 34 billion parameters. To reach this level of performance with smaller models, researchers often train SLMs to generate Python code or use tools to avoid calculation errors. Additionally, they employ ensembling, where the outputs of up to 100 model runs are combined to arrive at a more accurate result. Result selection is done using consensus, majority vote, or a separate verifier model used in conjunction with the SLM. Ensembling provides a substantial boost in accuracy but at a significant cost increase from multiple calls to the model (e.g., Phi-GSM uses top-48 to boost performance from 68.2 to 81.5). In this work, we present Orca-Math, a 7-billion-parameter SLM based on Mistral-7B, which achieves 86.81% on GSM8K without the need for multiple model calls or the use of verifiers, code execution, or any other external tools. Our approach has the following key elements: (1) a high-quality synthetic dataset of 200K math problems created using a multi-agent setup where agents collaborate to create the data; (2) an iterative learning technique that enables the SLM to practice solving problems, receive feedback on its solutions, and learn from preference pairs incorporating the SLM solutions and the feedback. When trained with supervised fine-tuning alone, Orca-Math achieves 81.50% on the GSM8K pass@1 metric. With iterative preference learning, Orca-Math achieves 86.81% pass@1. Orca-Math surpasses the performance of significantly larger models such as LLaMA-2-70B, WizardMath-70B, Gemini Pro, and ChatGPT-3.5. It also significantly outperforms other smaller models while using much smaller data (hundreds of thousands vs. millions of problems).

Analysis of "Orca-Math: Unlocking the Potential of SLMs in Grade School Math"

This paper introduces Orca-Math, an approach to strengthening the mathematical problem-solving capabilities of small language models (SLMs), built on a 7-billion-parameter model derived from Mistral-7B. The paper addresses the challenge of achieving high performance on mathematical benchmarks like GSM8K without relying on resource-intensive practices such as model ensembling or extensive data augmentation. The significance of Orca-Math lies in demonstrating that a smaller model can reach 86.81% accuracy on GSM8K after training on just 200,000 synthetic math problems.
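
For context on the ensembling cost that Orca-Math avoids, below is a minimal sketch of majority-vote answer selection versus a single pass@1 call. The `solve_once` helper is purely illustrative and stands in for any single inference call to a math-solving model.

```python
from collections import Counter

def solve_once(problem: str) -> str:
    """Placeholder for one model call returning a final numeric answer.
    In ensembling setups, this would be one of up to 48-100 sampled runs."""
    raise NotImplementedError("plug in any SLM/LLM inference call here")

def majority_vote(problem: str, num_samples: int = 48) -> str:
    """Ensemble strategy: sample many solutions and keep the most common answer.
    Accuracy improves, but inference cost scales linearly with num_samples."""
    answers = [solve_once(problem) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]

def pass_at_1(problem: str) -> str:
    """Orca-Math's reported setting: a single call, with no voting, verifier, or tools."""
    return solve_once(problem)
```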

Methodological Innovations

The methodology encompasses several critical elements:

  1. Synthetic Dataset Generation: A core innovation is the creation of a 200K-problem set using a multi-agent framework. The set mixes straightforward problem transformations with more complex variations produced through multiple stages of refinement. Notably, the dataset is built with a collaborative "Agent-Instruct" setup in which agents synthesize problems at varying levels of difficulty, maintaining robust diversity (a generation sketch appears after this list).
  2. Iterative Learning Procedures: The model is refined through successive training iterations combining supervised fine-tuning (SFT) with preference learning via Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO). This iterative process integrates feedback on the model's own solutions and steers it toward better decision-making on mathematical tasks (a preference-learning sketch also appears after this list).
  3. Evaluation and Feedback Integration: Solutions generated by Orca-Math are judged with a GPT-4-based exact-match check, so the feedback is specific and aligned with expert-level mathematical reasoning. This feedback is central to the iterative improvement strategy demonstrated in the model.
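
The following is a minimal sketch of the kind of multi-agent "suggest-then-edit" loop described in item 1, in which one agent proposes ways to make a seed problem harder and another rewrites it accordingly. The agent roles, prompts, and the `chat` helper are illustrative assumptions, not the paper's exact agent configuration.

```python
def chat(system: str, user: str) -> str:
    """Placeholder for any chat-completion call (e.g., a GPT-4 endpoint)."""
    raise NotImplementedError("plug in an LLM client here")

def expand_problem(seed_problem: str, rounds: int = 2) -> list[str]:
    """Generate harder variants of a seed word problem via two cooperating agents:
    a 'suggester' proposes a way to increase difficulty, and an 'editor' rewrites
    the problem to apply it. Each round yields one new synthetic problem."""
    variants = []
    current = seed_problem
    for _ in range(rounds):
        suggestion = chat(
            system="You propose one way to make a math word problem more challenging.",
            user=current,
        )
        current = chat(
            system="You rewrite the problem to apply the suggested modification, "
                   "keeping it solvable with grade-school math.",
            user=f"Problem: {current}\nModification: {suggestion}",
        )
        variants.append(current)
    return variants
```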
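
Items 2 and 3 come together in the iterative loop: sampled solutions are checked for correctness, correct/incorrect pairs become preference data, and the model is updated with a DPO-style objective. The sketch below shows the standard DPO loss on per-sequence log-probabilities together with a simple pairing scheme; it is a generic reference implementation under those assumptions, not the authors' exact training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard Direct Preference Optimization loss.

    Each tensor holds summed log-probabilities of full solutions under the policy
    being trained or the frozen reference (SFT) model. 'Chosen' solutions are those
    the judge marked correct; 'rejected' are incorrect ones."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def build_preference_pairs(solutions: list[str], is_correct: list[bool]) -> list[tuple[str, str]]:
    """Pair each correct sampled solution with each incorrect one for the same
    problem, yielding (chosen, rejected) pairs for the next training iteration."""
    positives = [s for s, ok in zip(solutions, is_correct) if ok]
    negatives = [s for s, ok in zip(solutions, is_correct) if not ok]
    return [(pos, neg) for pos in positives for neg in negatives]
```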

Experimental Results

The Orca-Math approach outperforms several larger and more resource-dependent models, such as LLaMA-2-70B and WizardMath-70B, on GSM8K and related mathematical-reasoning evaluations. The iterative learning framework shows consistent gains at each stage. The result is notable not only for the strong performance metrics but for the model's capacity to rival much larger models with a modest dataset size and training regimen.

Implications and Future Directions

The results indicate promising avenues for future research in optimizing computational resources while enhancing the reasoning capabilities of SLMs. The techniques demonstrated could be further explored across other domains beyond mathematics, suggesting broad implications for AI’s efficiency in learning complex tasks.

Moreover, the agent-based dataset generation and preference learning strategies may inform the development of next-generation LLMs that require less data and compute while achieving higher degrees of comprehension and problem-solving accuracy. This work brings attention to the potential of carefully designed learning loops and high-quality synthetic data in empowering SLMs.

In summary, the Orca-Math framework demonstrates that smaller models can achieve near-parity with their larger counterparts through innovation in data synthesis and preference-driven learning. This research contributes substantially to ongoing discussions about the scalability and efficiency of AI models, particularly in educational applications.

Authors (4)
  1. Arindam Mitra (40 papers)
  2. Hamed Khanpour (6 papers)
  3. Corby Rosset (21 papers)
  4. Ahmed Awadallah (27 papers)