Orca-Math: Unlocking the potential of SLMs in Grade School Math
Abstract: Mathematical word problem solving has long been recognized as a complex task for small language models (SLMs). A recent study hypothesized that the smallest model size needed to achieve over 80% accuracy on the GSM8K benchmark is 34 billion parameters. To reach this level of performance with smaller models, researchers often train SLMs to generate Python code or use tools to avoid calculation errors. Additionally, they employ ensembling, where the outputs of up to 100 model runs are combined to arrive at a more accurate result. Result selection is done using consensus, majority vote, or a separate verifier model used in conjunction with the SLM. Ensembling provides a substantial boost in accuracy but at a significant cost increase from multiple calls to the model (e.g., Phi-GSM uses top-48 sampling to boost performance from 68.2% to 81.5%). In this work, we present Orca-Math, a 7-billion-parameter SLM based on Mistral-7B, which achieves 86.81% on GSM8K without multiple model calls or the use of verifiers, code execution, or any other external tools. Our approach has two key elements: (1) a high-quality synthetic dataset of 200K math problems created using a multi-agent setup where agents collaborate to create the data, and (2) an iterative learning technique that enables the SLM to practice solving problems, receive feedback on its solutions, and learn from preference pairs incorporating the SLM's solutions and the feedback. When trained with supervised fine-tuning alone, Orca-Math achieves 81.50% on the GSM8K pass@1 metric. With iterative preference learning, it achieves 86.81% pass@1. Orca-Math surpasses the performance of significantly larger models such as LLAMA-2-70B, WizardMath-70B, Gemini-Pro, and ChatGPT-3.5. It also significantly outperforms other smaller models while using much less training data (hundreds of thousands vs. millions of problems).
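The majority-vote ensembling that the abstract contrasts against can be sketched as follows. This is a minimal illustration, not code from the paper: the sampling loop and the step that extracts a final numeric answer from each model completion are assumed to happen elsewhere, and only the vote itself is shown.

```python
from collections import Counter

def majority_vote(answers):
    """Select the most frequent final answer among sampled solutions.

    `answers` is a list of final answers already extracted from each
    sampled model completion (the extraction step is assumed, not shown).
    Returns None when there are no samples to vote over.
    """
    if not answers:
        return None
    # Counter.most_common(1) yields the (value, count) pair with the
    # highest count; ties are broken by first occurrence.
    value, _count = Counter(answers).most_common(1)[0]
    return value

# Hypothetical example: five sampled runs on one GSM8K problem.
samples = [42, 42, 41, 42, 7]
print(majority_vote(samples))  # 42 wins with 3 of 5 votes
```

The cost trade-off the abstract describes follows directly from this pattern: accuracy rises with the number of samples, but every extra vote is a full model call, which is what Orca-Math's single-call pass@1 result avoids.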