
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct (2308.09583v1)

Published 18 Aug 2023 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs, such as GPT-4, have shown remarkable performance in NLP tasks, including challenging mathematical reasoning. However, most existing open-source models are only pre-trained on large-scale internet data, without math-related optimization. In this paper, we present WizardMath, which enhances the mathematical reasoning abilities of Llama-2 by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math. Through extensive experiments on two mathematical reasoning benchmarks, namely GSM8k and MATH, we reveal the extraordinary capabilities of our model. WizardMath surpasses all other open-source LLMs by a substantial margin. Furthermore, our model even outperforms ChatGPT-3.5, Claude Instant-1, PaLM-2 and Minerva on GSM8k, and simultaneously surpasses Text-davinci-002, PaLM-1 and GPT-3 on MATH. More details and model weights are public at https://github.com/nlpxucan/WizardLM and https://huggingface.co/WizardLM.

Essay on "WizardMath: Empowering Mathematical Reasoning for LLMs via Reinforced Evol-Instruct"

The paper "WizardMath: Empowering Mathematical Reasoning for LLMs via Reinforced Evol-Instruct" presents an innovative approach to enhancing the mathematical reasoning abilities of open-source LLMs, specifically using Llama-2 as the base. The authors propose the Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method, aimed at addressing the limitations of existing LLMs in performing complex, multi-step quantitative reasoning tasks.

Methodology

The core technical innovation of the paper lies in the RLEIF methodology, which integrates evolved instruction data and reinforcement learning techniques to improve the reasoning capabilities of LLMs in mathematical contexts. The approach consists of three primary steps:

  1. Supervised Fine-Tuning: The authors fine-tune Llama-2 using diverse instruction-response pairs derived from regenerated mathematical solutions and open-domain conversational data, ensuring the adaptability and coherence of the model.
  2. Evol-Instruct Principles: The Evol-Instruct method is tailored to generate mathematical instructions of varying complexity. Downward evolution simplifies questions for easier understanding, while upward evolution increases question complexity to demand deeper reasoning. This provides a robust framework for producing diverse questions that challenge the model’s reasoning capabilities (see the prompt-evolution sketch after this list).
  3. Reinforced Process Supervision: Employing two reward models, an Instruction Reward Model (IRM) and a Process-supervised Reward Model (PRM), the approach evaluates the quality of the evolved instructions and the correctness of each solution step. These reward signals are coupled with PPO training to refine the model’s ability to generate accurate step-by-step solutions (a reward-combination sketch also follows this list).
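
To make the instruction-evolution step more concrete, here is a minimal sketch of how downward and upward evolution might be driven by prompt templates. The template wording and the call_llm helper are illustrative assumptions, not the paper's exact prompts.

```python
# Sketch of Evol-Instruct-style question evolution. The prompt wording is an
# assumption for illustration; the paper's actual templates may differ.

DOWNWARD_TEMPLATE = (
    "Rewrite the following math problem so it is easier, e.g. with smaller "
    "numbers or fewer reasoning steps, while keeping it well-posed:\n\n{question}"
)

UPWARD_TEMPLATE = (
    "Rewrite the following math problem so it is harder, e.g. by adding "
    "constraints or extra reasoning steps:\n\n{question}"
)


def call_llm(prompt: str) -> str:
    """Hypothetical completion call; replace with a real LLM API."""
    raise NotImplementedError


def evolve(seed_question: str, rounds: int = 2) -> list[str]:
    """Grow a pool of easier and harder variants of a seed question."""
    pool = [seed_question]
    for _ in range(rounds):
        for q in list(pool):  # snapshot; variants added now are evolved next round
            pool.append(call_llm(DOWNWARD_TEMPLATE.format(question=q)))
            pool.append(call_llm(UPWARD_TEMPLATE.format(question=q)))
    return pool
```

Likewise, the two reward signals have to be collapsed into a single scalar for PPO. The aggregation below (mean of per-step PRM scores multiplied by the IRM score) is an assumption for illustration; the paper's exact combination rule may differ.

```python
from dataclasses import dataclass


@dataclass
class StepScore:
    text: str
    score: float  # PRM's per-step correctness score, assumed in [0, 1]


def combined_reward(instruction_score: float, steps: list[StepScore]) -> float:
    """Combine IRM and PRM signals into one PPO reward (illustrative scheme)."""
    if not steps:
        return 0.0
    process_score = sum(s.score for s in steps) / len(steps)
    return instruction_score * process_score


# Example with made-up scores:
steps = [StepScore("Let x be the number of apples.", 0.9),
         StepScore("Then 3x + 2 = 11, so x = 3.", 0.8)]
print(combined_reward(instruction_score=0.7, steps=steps))  # ~0.595
```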

Experimental Results

The paper reports significant performance improvements across two well-regarded mathematical reasoning benchmarks: GSM8k and MATH. The results indicate that WizardMath achieves state-of-the-art performance among open-source LLMs and, in some instances, outperforms notable closed-source models such as ChatGPT-3.5, Claude Instant-1, and PaLM-2.

  • On the GSM8k dataset, WizardMath achieves a pass@1 score of 81.6, surpassing prior open-source models by a substantial margin.
  • On the MATH dataset, WizardMath achieves a pass@1 score of 22.7, likewise an improvement over prior open-source models (a brief pass@1 scoring sketch follows this list).
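
For context, pass@1 here means one (typically greedy) generation per problem, counted correct when its final answer matches the reference. A minimal scoring sketch, with deliberately simplified answer extraction and normalization:

```python
# Minimal sketch of pass@1 scoring as commonly used on GSM8k/MATH-style sets.
# Real evaluations extract the final answer from a full chain-of-thought and
# apply much more careful normalization than shown here.

def normalize(ans: str) -> str:
    return ans.strip().rstrip(".").replace(",", "").replace("$", "")


def pass_at_1(predictions: list[str], references: list[str]) -> float:
    assert len(predictions) == len(references)
    correct = sum(normalize(p) == normalize(r)
                  for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


print(pass_at_1(["18", "7.5", "42"], ["18", "7", "42"]))  # 66.67 (2 of 3 correct)
```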

These results showcase the efficacy of RLEIF in enhancing the model’s reasoning through carefully constructed instructions and reinforcement learning feedback.

Implications and Future Directions

The development of WizardMath marks a significant step forward in the domain of mathematical reasoning within open-source LLMs. Practically, such advances extend the capabilities of these models to educational settings and complex problem-solving applications where precise reasoning is critical.

Theoretically, this work suggests opportunities for further exploration in instruction generation and the reinforcement training of LLMs. Future work could involve refining the RLEIF mechanism or exploring alternative methods to further bolster reasoning capabilities, particularly as the complexity and volume of mathematical datasets continue to grow.

The paper avoids sensationalism, remaining grounded in its presentation of results and contributions. By providing comprehensive details and context, the work encourages a thoughtful exploration of methodological refinements in LLM mathematical reasoning—a field ripe for ongoing research and development.

Authors (10)
  1. Haipeng Luo (99 papers)
  2. Qingfeng Sun (40 papers)
  3. Can Xu (98 papers)
  4. Pu Zhao (82 papers)
  5. Jianguang Lou (5 papers)
  6. Chongyang Tao (61 papers)
  7. Xiubo Geng (36 papers)
  8. Qingwei Lin (81 papers)
  9. Shifeng Chen (29 papers)
  10. Dongmei Zhang (193 papers)
Citations (327)