
MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs

Published 26 Feb 2024 in cs.CL and cs.AI (arXiv:2402.16352v2)

Abstract: LLMs have exhibited great potential in mathematical reasoning. However, there remains a performance gap in this area between existing open-source models and closed-source models such as GPT-4. In this paper, we introduce MathGenie, a novel method for generating diverse and reliable math problems from a small-scale problem-solution dataset (denoted as seed data). We augment the ground-truth solutions of our seed data and train a back-translation model to translate the augmented solutions back into new questions. Subsequently, we generate code-integrated solutions for the new questions. To ensure the correctness of the code-integrated solutions, we employ a rationale-based strategy for solution verification. Various pretrained models, ranging from 7B to 70B, are trained on the newly curated data to test the effectiveness of the proposed augmentation technique, resulting in a family of models known as MathGenieLM. These models consistently outperform previous open-source models across five representative mathematical reasoning datasets, achieving state-of-the-art performance. In particular, MathGenieLM-InternLM2 achieves an accuracy of 87.7% on GSM8K and 55.7% on MATH, securing the best overall score among open-source LLMs.


Summary

  • The paper presents a novel back-translation framework that iteratively augments math solutions to generate high-quality synthetic questions.
  • It employs verification-based filtering with a code-integrated model to ensure accurate, diverse math datasets for training LLMs.
  • Empirical results show state-of-the-art performance on benchmarks like GSM8K and MATH, reaching up to 87.7% accuracy on GSM8K.


The paper "MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs" introduces a framework to improve the mathematical reasoning capabilities of LLMs through a robust data augmentation pipeline. The method generates synthetic math problems and solutions via a back-translation approach, enriching model training with diverse and reliable mathematical data while applying solution verification to ensure data quality.

MathGenie Framework

Iterative Solution Augmentation

The core idea behind MathGenie is the iterative augmentation of existing solution data, which is subsequently transformed into new questions through a back-translation model. Unlike direct question augmentation, MathGenie emphasizes transforming solutions first and generating questions from these augmented solutions. This methodology maintains the logical structure and constraints of mathematical solutions, producing high-quality, diverse question sets.

Figure 1: Framework of MathGenie. Iterative Solution Augmentation augments solutions to create new questions verified and curated through Question Back-translation and Verification-Based Solution Filtering.
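The iterative loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: `augment_fn` stands in for an LLM call that rewrites a solution (e.g., changing numbers or conditions while preserving its logical structure), and the round count is a hypothetical parameter.

```python
from typing import Callable, List

def iterative_solution_augmentation(
    seed_solutions: List[str],
    augment_fn: Callable[[str], str],
    rounds: int = 3,
) -> List[str]:
    """Repeatedly augment solutions: each round feeds the previous
    round's outputs back into the augmentation step, so later rounds
    drift further from the seed data while keeping its logical form."""
    pool = list(seed_solutions)
    frontier = list(seed_solutions)
    for _ in range(rounds):
        frontier = [augment_fn(s) for s in frontier]
        pool.extend(frontier)
    return pool

# Stand-in for an LLM augmentation call: here a deterministic tag,
# so only the iteration structure is visible.
demo = iterative_solution_augmentation(
    ["x = 2 + 3 = 5"], lambda s: s + " (augmented)", rounds=2
)
```

Growing the pool round by round (rather than augmenting the seed set once) is what lets a small seed dataset fan out into a large, varied collection of solutions to back-translate.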

Question Back-translation

The MathGenie pipeline employs a Question Back-translation model to convert the augmented solutions into corresponding math problems. This process ensures the generation of diverse and logically consistent questions. The model, denoted M_backtrans, is fine-tuned on a reversed dataset of question-solution pairs, enhancing the fidelity of generated questions while respecting the intricate constraints of mathematical reasoning.
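Constructing that reversed dataset amounts to swapping input and target in each seed pair, so the model learns solution → question rather than the usual direction. A minimal sketch (the dictionary keys `input`/`target` are illustrative, not the paper's data format):

```python
from typing import Dict, List, Tuple

def build_backtranslation_dataset(
    pairs: List[Tuple[str, str]],
) -> List[Dict[str, str]]:
    """Reverse (question, solution) pairs into (solution -> question)
    training examples for the back-translation model M_backtrans."""
    return [
        {"input": solution, "target": question}
        for question, solution in pairs
    ]

reversed_data = build_backtranslation_dataset(
    [("What is 2 + 3?", "2 + 3 = 5, so the answer is 5.")]
)
```

Once fine-tuned on such reversed pairs, the model can be run on the augmented solutions from the previous step to produce brand-new questions.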

Verification and Data Curation

Verification-Based Solution Filtering

Verification of solutions is critical in MathGenie's pipeline to ensure the accuracy of the generated data. A model, M_code, verifies code-integrated solutions by embedding code-based rationales. This step filters out incorrect solutions, retaining only those verified to be accurate, thus yielding a high-quality dataset for model training.

Figure 2: Performance of the Mistral 7B model finetuned with different scales of augmented data.
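The filtering idea can be sketched as executing each candidate's code and keeping only solutions whose computed answer checks out. This is a deliberately simplified stand-in: in the paper, the verification rationale is itself model-generated code, whereas here we simply compare a computed `answer` variable against an expected value.

```python
from typing import List

def verify_code_solution(solution_code: str, expected: str) -> bool:
    """Run the code part of a code-integrated solution in an isolated
    namespace; accept it only if the computed `answer` matches."""
    namespace: dict = {}
    try:
        exec(solution_code, namespace)  # execute the candidate solution
    except Exception:
        return False  # solutions that crash are filtered out
    return str(namespace.get("answer")) == expected

def filter_solutions(candidates: List[str], expected: str) -> List[str]:
    """Keep only the candidates that pass verification."""
    return [c for c in candidates if verify_code_solution(c, expected)]

kept = filter_solutions(
    ["answer = 2 + 3", "answer = 2 * 3"],  # one correct, one wrong
    expected="5",
)
```

Discarding unverifiable solutions trades dataset size for reliability, which the paper's ablations suggest is the right trade for training data quality.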

Evaluation and Results

Empirical Analysis

The practical enhancement of mathematical problem-solving in LLMs is evaluated through extensive experiments on diverse datasets, including GSM8K, MATH, and Simuleq. MathGenieLM, the resultant family of models, consistently surpasses prior open-source models, achieving state-of-the-art performance across these benchmarks.

Notably, MathGenieLM-InternLM2 reaches an accuracy of 87.7% on GSM8K and 55.7% on MATH, showcasing the gains in mathematical reasoning delivered by the proposed synthetic data augmentation techniques.

Ablation and Comparative Studies

A series of ablation studies isolates the impact of iterative augmentation and verification in data preparation. Moreover, MathGenie's approach is compared against other augmentation methodologies, such as MetaMath and direct question augmentation, showing its superior ability to generate reliable and effective problem sets.

Implications and Future Work

MathGenie's framework provides a scalable and cost-effective approach for generating synthetic math data, potentially applicable to other reasoning tasks. Future developments could explore reducing computational demands and extending the model's capabilities to handle visual content, thus broadening the scope of problem-solving that MathGenieLM can tackle.

Conclusion

The research presents MathGenie as a sophisticated, efficient pipeline that strengthens mathematical reasoning in LLMs by leveraging iterative solution augmentation and question back-translation combined with verification filtering. The framework notably enhances the reliability and diversity of mathematical data, culminating in improved model performance, with promising applications for diverse AI reasoning tasks.
