Orca-Math: Unlocking the potential of SLMs in Grade School Math (2402.14830v1)

Published 16 Feb 2024 in cs.CL and cs.AI

Abstract: Mathematical word problem-solving has long been recognized as a complex task for small language models (SLMs). A recent study hypothesized that the smallest model size needed to achieve over 80% accuracy on the GSM8K benchmark is 34 billion parameters. To reach this level of performance with smaller models, researchers often train SLMs to generate Python code or use tools to avoid calculation errors. Additionally, they employ ensembling, where the outputs of up to 100 model runs are combined to arrive at a more accurate result. Result selection is done using consensus, majority vote, or a separate verifier model used in conjunction with the SLM. Ensembling provides a substantial boost in accuracy but at a significant cost increase from multiple calls to the model (e.g., Phi-GSM uses top-48 to boost performance from 68.2 to 81.5). In this work, we present Orca-Math, a 7-billion-parameter SLM based on Mistral-7B, which achieves 86.81% on GSM8K without the need for multiple model calls or the use of verifiers, code execution, or any other external tools. Our approach has the following key elements: (1) a high-quality synthetic dataset of 200K math problems created using a multi-agent setup where agents collaborate to create the data; (2) an iterative learning technique that enables the SLM to practice solving problems, receive feedback on its solutions, and learn from preference pairs incorporating the SLM solutions and the feedback. When trained with supervised fine-tuning alone, Orca-Math achieves 81.50% on the GSM8K pass@1 metric. With iterative preference learning, Orca-Math achieves 86.81% pass@1. Orca-Math surpasses the performance of significantly larger models such as LLaMA-2-70B, WizardMath-70B, Gemini Pro, and ChatGPT-3.5. It also significantly outperforms other smaller models while using much smaller data (hundreds of thousands vs. millions of problems).

Analysis of "Orca-Math: Unlocking the Potential of SLMs in Grade School Math"

This paper introduces Orca-Math, an approach to strengthening the mathematical problem-solving capabilities of small language models (SLMs), built on a 7-billion-parameter model derived from Mistral-7B. The paper addresses the challenge of achieving high performance on mathematical benchmarks like GSM8K without relying on resource-intensive practices such as model ensembling or extensive data augmentation. The significance of Orca-Math lies in demonstrating that a smaller model can reach 86.81% accuracy on GSM8K after training on just 200,000 synthetic math problems.
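
For context on the ensembling cost that Orca-Math avoids, below is a minimal sketch of majority-vote answer selection versus a single pass@1 call. The `solve_once` helper is purely illustrative and stands in for any single inference call to a math-solving model.

```python
from collections import Counter

def solve_once(problem: str) -> str:
    """Placeholder for one model call returning a final numeric answer.
    In ensembling setups, this would be one of up to 48-100 sampled runs."""
    raise NotImplementedError("plug in any SLM/LLM inference call here")

def majority_vote(problem: str, num_samples: int = 48) -> str:
    """Ensemble strategy: sample many solutions and keep the most common answer.
    Accuracy improves, but inference cost scales linearly with num_samples."""
    answers = [solve_once(problem) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]

def pass_at_1(problem: str) -> str:
    """Orca-Math's reported setting: a single call, with no voting, verifier, or tools."""
    return solve_once(problem)
```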

Methodological Innovations

The methodology encompasses several critical elements:

  1. Synthetic Dataset Generation: A core innovation is the creation of a 200K-problem set using a multi-agent framework. The set mixes straightforward problem transformations with more complex variations produced through multiple stages of refinement. Notably, the dataset is built with a collaborative "Agent-Instruct" setup in which agents synthesize problems at varying levels of difficulty, maintaining robust diversity (a generation sketch appears after this list).
  2. Iterative Learning Procedures: The model is refined through successive training iterations combining supervised fine-tuning (SFT) with preference learning via Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO). This iterative process integrates feedback on the model's own solutions and steers it toward better decision-making on mathematical tasks (a preference-learning sketch also appears after this list).
  3. Evaluation and Feedback Integration: Solutions generated by Orca-Math are judged with a GPT-4-based exact-match check, so the feedback is specific and aligned with expert-level mathematical reasoning. This feedback is central to the iterative improvement strategy demonstrated in the model.
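
The following is a minimal sketch of the kind of multi-agent "suggest-then-edit" loop described in item 1, in which one agent proposes ways to make a seed problem harder and another rewrites it accordingly. The agent roles, prompts, and the `chat` helper are illustrative assumptions, not the paper's exact agent configuration.

```python
def chat(system: str, user: str) -> str:
    """Placeholder for any chat-completion call (e.g., a GPT-4 endpoint)."""
    raise NotImplementedError("plug in an LLM client here")

def expand_problem(seed_problem: str, rounds: int = 2) -> list[str]:
    """Generate harder variants of a seed word problem via two cooperating agents:
    a 'suggester' proposes a way to increase difficulty, and an 'editor' rewrites
    the problem to apply it. Each round yields one new synthetic problem."""
    variants = []
    current = seed_problem
    for _ in range(rounds):
        suggestion = chat(
            system="You propose one way to make a math word problem more challenging.",
            user=current,
        )
        current = chat(
            system="You rewrite the problem to apply the suggested modification, "
                   "keeping it solvable with grade-school math.",
            user=f"Problem: {current}\nModification: {suggestion}",
        )
        variants.append(current)
    return variants
```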
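
Items 2 and 3 come together in the iterative loop: sampled solutions are checked for correctness, correct/incorrect pairs become preference data, and the model is updated with a DPO-style objective. The sketch below shows the standard DPO loss on per-sequence log-probabilities together with a simple pairing scheme; it is a generic reference implementation under those assumptions, not the authors' exact training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard Direct Preference Optimization loss.

    Each tensor holds summed log-probabilities of full solutions under the policy
    being trained or the frozen reference (SFT) model. 'Chosen' solutions are those
    the judge marked correct; 'rejected' are incorrect ones."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def build_preference_pairs(solutions: list[str], is_correct: list[bool]) -> list[tuple[str, str]]:
    """Pair each correct sampled solution with each incorrect one for the same
    problem, yielding (chosen, rejected) pairs for the next training iteration."""
    positives = [s for s, ok in zip(solutions, is_correct) if ok]
    negatives = [s for s, ok in zip(solutions, is_correct) if not ok]
    return [(pos, neg) for pos in positives for neg in negatives]
```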

Experimental Results

The Orca-Math approach outperforms several larger and more resource-dependent models, such as LLaMA-2-70B and WizardMath-70B, on GSM8K and related mathematical-reasoning evaluations. The iterative learning framework shows consistent gains at each stage. The result is notable not only for the strong performance metrics but for the model's capacity to rival much larger models with a modest dataset size and training regimen.

Implications and Future Directions

The results indicate promising avenues for future research in optimizing computational resources while enhancing the reasoning capabilities of SLMs. The techniques demonstrated could be further explored across other domains beyond mathematics, suggesting broad implications for AI’s efficiency in learning complex tasks.

Moreover, the agent-based dataset generation and preference learning strategies may inform the development of next-generation LLMs that require less data and compute while achieving higher degrees of comprehension and problem-solving accuracy. This work brings attention to the potential of carefully designed learning loops and high-quality synthetic data in empowering SLMs.

In summary, the Orca-Math framework demonstrates that smaller models can achieve near-parity with their larger counterparts through innovation in data synthesis and preference-driven learning. This research contributes substantially to ongoing discussions about the scalability and efficiency of AI models, particularly in educational applications.

Authors (4)
  1. Arindam Mitra (40 papers)
  2. Hamed Khanpour (6 papers)
  3. Corby Rosset (21 papers)
  4. Ahmed Awadallah (27 papers)