
MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs (2402.16352v2)

Published 26 Feb 2024 in cs.CL and cs.AI

Abstract: LLMs have exhibited great potential in mathematical reasoning. However, a performance gap remains in this area between existing open-source models and closed-source models such as GPT-4. In this paper, we introduce MathGenie, a novel method for generating diverse and reliable math problems from a small-scale problem-solution dataset (denoted as seed data). We augment the ground-truth solutions of our seed data and train a back-translation model to translate the augmented solutions back into new questions. Subsequently, we generate code-integrated solutions for the new questions. To ensure the correctness of the code-integrated solutions, we employ a rationale-based strategy for solution verification. Various pretrained models, ranging from 7B to 70B, are trained on the newly curated data to test the effectiveness of the proposed augmentation technique, resulting in a family of models known as MathGenieLM. These models consistently outperform previous open-source models across five representative mathematical reasoning datasets, achieving state-of-the-art performance. In particular, MathGenieLM-InternLM2 achieves an accuracy of 87.7% on GSM8K and 55.7% on MATH, securing the best overall score among open-source LLMs.
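The pipeline the abstract describes can be sketched as a simple data-flow. The following is a minimal, hypothetical illustration only: the three helper functions (`augment_solution`, `back_translate`, `solve_with_code`) are trivial stand-ins for the fine-tuned LLM calls the paper uses, and are not part of the MathGenie release.

```python
# Hypothetical sketch of the MathGenie data-generation loop described in the
# abstract. Each helper below is a placeholder for a model call; in the paper,
# augmentation, back-translation, and code-integrated solving are each
# performed by an LLM, and verification is rationale-based.

def augment_solution(solution: str) -> str:
    # Stand-in for solution augmentation: the paper rewrites a ground-truth
    # solution into a new, varied solution before back-translation.
    return solution + " (augmented)"

def back_translate(solution: str) -> str:
    # Stand-in for the question back-translation model, which turns an
    # augmented solution into a brand-new question.
    return f"Question derived from: {solution}"

def solve_with_code(question: str) -> tuple[str, bool]:
    # Stand-in for generating a code-integrated solution plus a
    # rationale-based verification flag for that solution.
    return f"Code-integrated solution for: {question}", True

def generate_synthetic_data(seed_pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Augment -> back-translate -> solve -> keep only verified pairs."""
    synthetic = []
    for _question, solution in seed_pairs:
        new_question = back_translate(augment_solution(solution))
        new_solution, verified = solve_with_code(new_question)
        if verified:  # rationale-based verification acts as a filter
            synthetic.append((new_question, new_solution))
    return synthetic

seed = [("What is 2 + 3?", "2 + 3 = 5")]
print(generate_synthetic_data(seed))
```

The key design point the sketch preserves is that new questions come from augmented *solutions*, not from paraphrasing the seed questions, and that unverified code-integrated solutions are discarded rather than kept.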

Authors (8)
  1. Zimu Lu (10 papers)
  2. Aojun Zhou (45 papers)
  3. Houxing Ren (16 papers)
  4. Ke Wang (529 papers)
  5. Weikang Shi (9 papers)
  6. Junting Pan (30 papers)
  7. Mingjie Zhan (23 papers)
  8. Hongsheng Li (340 papers)
Citations (23)