
MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning (2309.05653v3)

Published 11 Sep 2023 in cs.CL

Abstract: We introduce MAmmoTH, a series of open-source LLMs specifically tailored for general math problem-solving. The MAmmoTH models are trained on MathInstruct, our meticulously curated instruction tuning dataset. MathInstruct is compiled from 13 math datasets with intermediate rationales, six of which have rationales newly curated by us. It presents a unique hybrid of chain-of-thought (CoT) and program-of-thought (PoT) rationales, and also ensures extensive coverage of diverse fields in math. The hybrid of CoT and PoT not only unleashes the potential of tool use but also allows different thought processes for different math problems. As a result, the MAmmoTH series substantially outperform existing open-source models on nine mathematical reasoning datasets across all scales with an average accuracy gain between 16% and 32%. Remarkably, our MAmmoTH-7B model reaches 33% on MATH (a competition-level dataset), which exceeds the best open-source 7B model (WizardMath) by 23%, and the MAmmoTH-34B model achieves 44% accuracy on MATH, even surpassing GPT-4's CoT result. Our work underscores the importance of diverse problem coverage and the use of hybrid rationales in developing superior math generalist models.

Overview of MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning

The paper presents MAmmoTH, a series of open-source LLMs specialized for mathematical problem solving via hybrid instruction tuning. MAmmoTH models are trained on MathInstruct, an instruction-tuning dataset that combines Chain-of-Thought (CoT) and Program-of-Thought (PoT) rationales across a broad spectrum of mathematical subjects. The authors report significant performance improvements over existing open-source models on a range of mathematical reasoning benchmarks.

Core Contributions

  1. Hybrid Instruction Tuning Dataset: MathInstruct

The MathInstruct dataset is a central contribution, spanning diverse mathematical fields and difficulty levels. It compiles CoT and PoT rationales from 13 publicly available math datasets, six of which have rationales newly curated by the authors. This hybrid approach aims to leverage the strengths of both CoT, which reasons through explicit step-by-step thought processes, and PoT, which delegates calculation-heavy steps to external tools such as a Python interpreter.
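To make the CoT/PoT distinction concrete, the following sketch contrasts the two rationale styles on a toy word problem. The problem, both rationales, and the `run_pot` helper are illustrative assumptions, not examples taken from MathInstruct.

```python
# Illustrative sketch (not from the paper): a CoT rationale carries the
# arithmetic in natural language, while a PoT rationale is executable code
# whose result supplies the answer.

problem = "A store sells pens at $3 each. If Sam buys 12 pens, how much does he pay?"

# Chain-of-Thought: reasoning and arithmetic live in the text itself.
cot_rationale = (
    "Each pen costs $3 and Sam buys 12 pens. "
    "Total cost = 3 * 12 = 36. The answer is 36."
)

# Program-of-Thought: the rationale is a program; running it yields the
# answer, so the model never performs the arithmetic in-text.
pot_rationale = """
price_per_pen = 3
num_pens = 12
answer = price_per_pen * num_pens
"""

def run_pot(program: str):
    """Execute a PoT rationale and return its `answer` variable."""
    scope = {}
    exec(program, scope)  # a real pipeline would sandbox this call
    return scope["answer"]

print(run_pot(pot_rationale))  # 36
```

The practical upshot is that PoT offloads exact computation to an interpreter, which is why the hybrid suits calculation-heavy problems that pure CoT tends to get wrong.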

  2. Strong Empirical Performance

The paper reports that MAmmoTH models achieve an average accuracy gain of between 16% and 32% over existing open-source models on nine mathematical reasoning datasets across model scales. Notably, the MAmmoTH-7B model reaches 33% accuracy on the competition-level MATH dataset, outperforming the best comparable open-source model, WizardMath, by 23%, while the MAmmoTH-34B model attains 44% on MATH, surpassing even GPT-4's CoT result.

  3. Evaluation and Baselines

The evaluation setup involves both in-domain (IND) and out-of-domain (OOD) test sets, covering datasets like GSM8K, MATH, AQuA-RAT, NumGLUE, and others. By outperforming both closed- and open-source models across these evaluations, the MAmmoTH series establishes a new benchmark for open-source LLMs in mathematical problem solving.
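The comparison across in-domain and out-of-domain test sets can be summarized with a macro-average over datasets, where each dataset counts equally regardless of size. The sketch below uses this averaging scheme as an assumption; the per-dataset scores are placeholders, not numbers reported in the paper.

```python
# Hypothetical sketch of macro-average accuracy over IND and OOD test sets.
# Dataset names follow the text; the scores are made-up placeholders.

ind_scores = {"GSM8K": 0.53, "MATH": 0.33, "AQuA-RAT": 0.45, "NumGLUE": 0.62}
ood_scores = {"SVAMP": 0.67, "Mathematics": 0.46}

def macro_avg(scores):
    """Unweighted mean accuracy: every dataset contributes equally."""
    return sum(scores.values()) / len(scores)

print(f"IND macro-average: {macro_avg(ind_scores):.4f}")
print(f"OOD macro-average: {macro_avg(ood_scores):.4f}")
```

Reporting IND and OOD averages separately, as the paper does, separates memorization of training distributions from genuine generalization to unseen problem styles.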

  4. Data Engineering and Implications for Future LLM Development

The engineering of MathInstruct demonstrates the critical role of diverse problem datasets in creating robust, generalist LLMs. The integration of hybrid rationales offers two complementary ways to tackle mathematical problems, accommodating the varied nature of such tasks. The paper suggests that enriching the training data with diverse sources promotes the model's generalization ability, an insight that could shape future frameworks for domain-specific LLM training.
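The data-engineering idea above, pooling CoT and PoT examples from many source datasets into one instruction-tuning corpus, can be sketched as follows. The field names, source labels, and shuffling scheme are illustrative assumptions in the spirit of MathInstruct, not the paper's actual recipe.

```python
# Hypothetical sketch of assembling a hybrid instruction-tuning corpus from
# multiple source datasets. Records and field names are illustrative only.
import random

sources = {
    "gsm8k_cot": [{"instruction": "Q1", "output": "Step-by-step reasoning ... answer 4", "type": "CoT"}],
    "gsm8k_pot": [{"instruction": "Q1", "output": "answer = 2 + 2", "type": "PoT"}],
    "aqua_cot":  [{"instruction": "Q2", "output": "Step-by-step reasoning ... answer 7", "type": "CoT"}],
}

def build_corpus(sources, seed=0):
    """Pool all (instruction, output) pairs and shuffle them so CoT and PoT
    examples from different fields are interleaved during training."""
    rng = random.Random(seed)  # fixed seed for a reproducible ordering
    corpus = [ex for examples in sources.values() for ex in examples]
    rng.shuffle(corpus)
    return corpus

corpus = build_corpus(sources)
print(len(corpus), sorted({ex["type"] for ex in corpus}))
```

Interleaving the two rationale styles in a single corpus is what lets one model learn both to reason in text and to emit executable programs, rather than training separate specialists.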

Implications and Future Directions

The work on MAmmoTH opens several avenues for future exploration. Hybrid instruction tuning stands as a promising direction for developing LLMs in domains requiring both precise computation and complex multi-hop reasoning. Future research might expand coverage to additional branches of mathematics or adapt the approach to other scientific fields that demand both reasoning and calculation. There is also potential in examining the synergistic effects of CoT and PoT rationales on other complex reasoning challenges.

While the hybrid models demonstrate superior adaptability and accuracy across various mathematical reasoning tasks, the paper also acknowledges that broader domain coverage and the incorporation of theorem-proving tasks would further enhance LLM capabilities. As models trained under this framework show marked improvement over existing baselines, the MAmmoTH series establishes a foundation for ongoing enhancements in mathematical AI models.

In summary, the paper illustrates MAmmoTH as a pivotal step in developing LLMs for mathematical reasoning, surpassing many existing models in efficacy and offering insights that could inform subsequent developments in specialized LLMs.

Authors (8)
  1. Xiang Yue
  2. Xingwei Qu
  3. Ge Zhang
  4. Yao Fu
  5. Wenhao Huang
  6. Huan Sun
  7. Yu Su
  8. Wenhu Chen