
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models (2309.12284v4)

Published 21 Sep 2023 in cs.CL and cs.AI

Abstract: LLMs have pushed the limits of natural language understanding and exhibited excellent problem-solving ability. Despite the great success, most existing open-source LLMs (e.g., LLaMA-2) are still far from satisfactory at solving mathematical problems due to the complex reasoning procedures involved. To bridge this gap, we propose MetaMath, a fine-tuned LLM that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions by rewriting them from multiple perspectives without extra knowledge, which results in a new dataset called MetaMathQA. Then we fine-tune the LLaMA-2 models on MetaMathQA. Experimental results on two popular benchmarks (i.e., GSM8K and MATH) for mathematical reasoning demonstrate that MetaMath outperforms a suite of open-source LLMs by a significant margin. Our MetaMath-7B model achieves 66.4% on GSM8K and 19.4% on MATH, exceeding the state-of-the-art models of the same size by 11.5% and 8.7%. Particularly, MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. We release the MetaMathQA dataset, the MetaMath models at different sizes, and the training code for public use.

MetaMath: Advancing Mathematical Reasoning in LLMs

The paper "MetaMath: Bootstrap Your Own Mathematical Questions for LLMs" introduces an innovative approach to enhancing the mathematical problem-solving capabilities of LLMs. Specifically, the authors present MetaMath, a finetuned LLM, and MetaMathQA, a novel dataset designed to train LLMs in mathematical reasoning. This essay delineates the methodologies, experimental results, and implications of this work within the field of AI and mathematical reasoning.

Methodology

The crux of MetaMath's methodology lies in bootstrapping existing mathematical questions to create a diverse, rich training dataset. The authors augment the GSM8K and MATH training sets using several techniques:

  1. Answer Augmentation: Multiple reasoning paths toward the correct answer are generated with few-shot chain-of-thought prompting, so the augmented data captures a variety of problem-solving approaches.
  2. Question Rephrasing: Questions are rephrased using GPT-3.5-Turbo to produce alternate versions of the same problem, increasing the diversity of questions available for training.
  3. Backward Reasoning: Questions are constructed so that they must be solved by reasoning backward from a given answer to a masked condition, strengthening the model's multi-step reasoning and verification abilities. Two methods are employed here:
    • Self-Verification (SV): Rewrites the question and its answer as a declarative statement, then appends a query for the masked variable.
    • FOBAR: Appends the answer directly to the original question and asks for the value of the masked variable.

These strategies culminate in the MetaMathQA dataset, a balanced mixture of forward-reasoning (answer-augmented), rephrased, and backward-reasoning (SV and FOBAR) questions; the sketch below illustrates the corresponding prompt styles.
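
To make the four augmentation styles concrete, here is a minimal illustrative sketch in Python. The prompt wording is paraphrased rather than quoted from the paper, the running beef-packs example is adapted from it, and no model call is made; in the actual pipeline each prompt would be sent to GPT-3.5-Turbo and the sampled outputs filtered for answer correctness.

```python
# Illustrative sketch of the four MetaMathQA augmentation styles.
# Prompt wording is paraphrased; the example question is adapted from the paper.

ORIGINAL_Q = ("James buys 5 packs of beef that are 4 pounds each. "
              "The price of beef is $5.50 per pound. How much did he pay?")
ANSWER = "110"

# 1. Answer augmentation: sample several chain-of-thought solutions and
#    keep only those whose final answer is correct.
answer_aug_prompt = (
    f"Question: {ORIGINAL_Q}\n"
    "Let's think step by step.\nAnswer:"
)

# 2. Rephrasing: ask the model to restate the same problem in new words.
rephrase_prompt = (
    "You are an AI assistant helping me rephrase questions.\n"
    f"Question: {ORIGINAL_Q}\nRephrase the above question:"
)

# 3. Self-Verification (SV): state the answer declaratively, mask one
#    condition as x, and ask for it.
sv_question = (
    "James buys x packs of beef that are 4 pounds each. The price of beef "
    "is $5.50 per pound. He paid 110. What is the value of x?"
)

# 4. FOBAR: keep the original question (with one number masked as x) and
#    append the known answer, then query the masked variable directly.
fobar_question = (
    "James buys x packs of beef that are 4 pounds each. The price of beef "
    "is $5.50 per pound. How much did he pay?\n"
    f"If we know the answer to the above question is {ANSWER}, "
    "what is the value of unknown variable x?"
)

for name, prompt in [("answer augmentation", answer_aug_prompt),
                     ("rephrasing", rephrase_prompt),
                     ("self-verification", sv_question),
                     ("FOBAR", fobar_question)]:
    print(f"--- {name} ---\n{prompt}\n")
```

Filtering sampled solutions by final-answer correctness is what lets answer augmentation collect multiple valid reasoning paths per question rather than a single canonical one.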

Experimental Results

Extensive experiments were conducted on two mathematical reasoning benchmarks, GSM8K and MATH, to evaluate MetaMath's performance. The results are compelling:

  • MetaMath-7B achieved 66.5% on GSM8K and 19.8% on MATH, surpassing state-of-the-art models of comparable size by 11.5% and 8.7%, respectively (a sketch of how such accuracies are scored follows this list).
  • MetaMath-70B slightly outperformed GPT-3.5-Turbo on GSM8K, reaching an accuracy of 82.3%.
  • Ablation studies indicated that combining answer augmentation with question rephrasing and backward reasoning tasks significantly improved mathematical reasoning performance compared to simpler augmentation methods.
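
As a rough illustration of how such accuracies can be scored, the sketch below extracts the final numeric answer from a completion and compares it with the reference by exact match. The "The answer is:" extraction pattern follows the solution format MetaMathQA uses; the toy data is invented purely for the example.

```python
import re

# Minimal GSM8K-style scorer: compare the model's final number against the
# reference answer. MetaMathQA solutions end with "The answer is: <value>",
# which makes extraction straightforward.
def extract_answer(completion: str) -> str | None:
    match = re.search(r"The answer is:?\s*(-?[\d,]+(?:\.\d+)?)", completion)
    return match.group(1).replace(",", "") if match else None

def accuracy(completions: list[str], references: list[str]) -> float:
    correct = sum(extract_answer(c) == r for c, r in zip(completions, references))
    return correct / len(references)

# Invented toy data, just to show the mechanics.
outs = ["... so he paid 5 * 4 * 5.50 = 110. The answer is: 110",
        "... The answer is: 42"]
refs = ["110", "41"]
print(f"accuracy = {accuracy(outs, refs):.2%}")  # 50.00%
```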

Implications and Future Directions

The implications of this research extend across practical and theoretical dimensions:

  1. Practical Implications:
    • Enhanced Educational Tools: MetaMath could be integrated into educational systems to offer better automated tutoring and practice problem generation.
    • Improved Performance on Specialized Tasks: By focusing on mathematical reasoning, MetaMath can be utilized in domains requiring precise computational logic, such as financial modeling and scientific research.
  2. Theoretical Implications:
    • Question Diversity and LLM Training: The positive correlation between question diversity and model performance underscores the importance of diverse training sets in enhancing the generalization capabilities of LLMs.
    • Backward Reasoning: Incorporating backward reasoning into training data can alleviate problems related to the Reversal Curse, expanding the set of problems LLMs can solve efficiently; the worked example below illustrates a forward/backward pair.
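
As a concrete illustration of the backward-reasoning idea, the snippet below contrasts a forward computation with its FOBAR-style inversion, using numbers adapted from the paper's running example. A model trained only on forward problems sees 5 × 4 × 5.50 = 110; the backward variant makes recovering the masked quantity an explicit training target.

```python
# Forward question: "James buys 5 packs of beef that are 4 pounds each at
# $5.50 per pound. How much did he pay?"
packs, pounds_per_pack, price_per_pound = 5, 4, 5.50
total = packs * pounds_per_pack * price_per_pound      # forward: 110.0

# Backward (FOBAR-style) variant: mask the pack count as x and supply the
# total: "He paid 110. What is the value of unknown variable x?"
known_total = 110.0
x = known_total / (pounds_per_pack * price_per_pound)  # backward: 5.0
print(total, x)  # 110.0 5.0
```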

Conclusion

MetaMath and its corresponding dataset, MetaMathQA, mark a significant advancement in the mathematical problem-solving capabilities of LLMs. By bootstrapping questions in multiple ways and enhancing diversity, the authors have provided a robust methodology for training LLMs. Future research could explore further augmentations, different types of mathematical problems, and expanding the backward reasoning framework to other domains. The findings also pave the way for innovations in educational technology and specialized computational fields, pushing the envelope of what LLMs can achieve in mathematical reasoning.

Authors
  1. Longhui Yu
  2. Weisen Jiang
  3. Han Shi
  4. Jincheng Yu
  5. Zhengying Liu
  6. Yu Zhang
  7. James T. Kwok
  8. Zhenguo Li
  9. Adrian Weller
  10. Weiyang Liu
Citations (236)