
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations (2312.08935v3)

Published 14 Dec 2023 in cs.AI, cs.CL, and cs.LG

Abstract: In this paper, we present an innovative process-oriented math process reward model called Math-Shepherd, which assigns a reward score to each step of math problem solutions. The training of Math-Shepherd is achieved using automatically constructed process-wise supervision data, breaking the bottleneck of heavy reliance on manual annotation in existing work. We explore the effectiveness of Math-Shepherd in two scenarios: 1) Verification: Math-Shepherd is utilized for reranking multiple outputs generated by LLMs; 2) Reinforcement Learning: Math-Shepherd is employed to reinforce LLMs with step-by-step Proximal Policy Optimization (PPO). With Math-Shepherd, a series of open-source LLMs demonstrates exceptional performance. For instance, the step-by-step PPO with Math-Shepherd significantly improves the accuracy of Mistral-7B (77.9% → 84.1% on GSM8K and 28.6% → 33.0% on MATH). The accuracy can be further enhanced to 89.1% and 43.5% on GSM8K and MATH with the verification of Math-Shepherd, respectively. We believe that automatic process supervision holds significant potential for the future evolution of LLMs.

Automatic Process Annotation for Enhancing Mathematical Reasoning in LLMs

Introduction

Accurately solving complex multi-step mathematical problems remains a substantial challenge for current LLMs. Despite their impressive capabilities across many tasks, the nuanced, sequential nature of mathematical reasoning poses a distinctive difficulty. Prior research has made strides in this domain through pre-training, fine-tuning, and verification, with verification recently taking center stage. In particular, Process Reward Models (PRMs) have emerged as a promising avenue because they assess a reasoning path step by step, akin to the human process of problem-solving. However, the lack of automated data annotation has remained a bottleneck. This paper introduces MATH-SHEPHERD, a framework that leverages automatic process annotation to significantly reduce the dependency on manual annotation, thereby enhancing LLMs' capability in mathematical reasoning.

Existing Limitations

The reliance on manual annotation for training PRMs is prohibitively expensive and difficult to scale, limiting the practical applicability and development pace of PRMs for mathematical reasoning tasks. Current verification models largely fall into two categories: Outcome Reward Models (ORMs), which judge only the final result of a solution, and PRMs, which judge every intermediate step. PRMs, despite their potential, have been hindered by the high cost and complexity of obtaining step-wise human annotations, especially for intricate multi-step reasoning tasks that demand advanced skill from annotators.
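To make the distinction concrete, here is a minimal, hypothetical sketch: an ORM returns one scalar for a whole solution, while a PRM returns a score for every intermediate step. The stubbed scoring functions and the numeric values are placeholders for illustration, not the paper's trained models.

```python
from typing import List

def orm_score(problem: str, solution: str) -> float:
    """ORM: a single scalar judging the complete solution (stub)."""
    return 0.9  # placeholder for a learned outcome reward model

def prm_scores(problem: str, steps: List[str]) -> List[float]:
    """PRM: one scalar per intermediate reasoning step (stub)."""
    return [0.95, 0.90, 0.40]  # placeholder per-step correctness scores

# A GSM8K-style example problem and a step-split solution.
problem = "Natalia sold clips to 48 friends in April, and half as many in May. How many in total?"
steps = [
    "Step 1: April sales are 48 clips.",
    "Step 2: May sales are 48 / 2 = 24 clips.",
    "Step 3: Total is 48 + 24 = 72 clips.",
]

print("ORM:", orm_score(problem, " ".join(steps)))  # one score for the whole solution
print("PRM:", prm_scores(problem, steps))           # one score per step
```

The per-step view is what lets a PRM localize the first erroneous step, which an ORM's single scalar cannot do.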

MATH-SHEPHERD Framework

MATH-SHEPHERD stands out by automating the annotation process, significantly enhancing the scalability and efficiency of training PRMs. Inspired by Monte Carlo Tree Search principles, it assesses the quality of each intermediate reasoning step by its potential to lead to the correct final answer: a fine-tuned LLM is used to automatically generate multiple subsequent reasoning paths from a given step, and these completions are validated against the known correct answer (a sketch of this scoring scheme follows the list below). Steps whose completions frequently reach the correct answer are assigned higher correctness scores. The key contributions of the framework include:

  • A method to automatically generate process supervision datasets for mathematical reasoning tasks without necessitating human annotations.
  • Demonstrable superior performance across benchmark datasets GSM8K and MATH, using a series of open-source LLMs ranging in size from 7B to 70B parameters.
  • Empirical analysis identifying the crucial factors in training an efficient verifier, thereby providing insights into future directions for enhancing reasoning capabilities in LLMs through intermediate supervision.
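As referenced above, here is a minimal Python sketch of the Monte Carlo labeling scheme. The `complete` callable stands in for the fine-tuned completer LLM and is a hypothetical interface; the soft label (fraction of correct completions) and hard label (whether any completion is correct) correspond to the paper's soft and hard estimation variants.

```python
import random
from typing import Callable, List, Tuple

def annotate_step(
    prefix_steps: List[str],
    gold_answer: str,
    complete: Callable[[List[str]], str],  # hypothetical: prefix -> final answer
    n_samples: int = 8,
) -> Tuple[float, int]:
    """Monte Carlo label for one reasoning step: sample N completions
    from the step and check how many reach the gold answer."""
    answers = [complete(prefix_steps) for _ in range(n_samples)]
    n_correct = sum(a == gold_answer for a in answers)
    soft_label = n_correct / n_samples      # soft estimation: fraction correct
    hard_label = 1 if n_correct > 0 else 0  # hard estimation: any correct
    return soft_label, hard_label

# Toy completer standing in for the fine-tuned completer LLM.
def toy_completer(prefix: List[str]) -> str:
    return random.choice(["72", "70"])

soft, hard = annotate_step(["Step: 48 + 24"], gold_answer="72", complete=toy_completer)
print(soft, hard)
```

In practice the prefix would contain the problem plus all steps up to and including the step being labeled, and the number of samples N trades annotation cost against label noise.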

Dataset and Methodology

The framework was evaluated on two benchmark datasets: GSM8K and MATH. Leveraging automatically constructed process-wise supervision data, MATH-SHEPHERD facilitated the training of PRMs across a spectrum of model sizes (from 7B to 70B). Notably, DeepSeek 67B, when coupled with MATH-SHEPHERD verification, reached 93.3% accuracy on GSM8K and 48.1% on MATH, without additional external aids.
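As a rough illustration of how a PRM is used for verification, the sketch below reranks N sampled solutions by an aggregated solution-level score. The aggregation rule (minimum over step scores here, with a product of step scores as a common alternative) and the example numbers are assumptions for illustration, not the paper's exact recipe.

```python
from typing import List

def solution_score(step_scores: List[float]) -> float:
    # Aggregate per-step PRM scores into one solution-level score.
    # Minimum penalizes a single weak step; product is another option.
    return min(step_scores)

def rerank(candidates: List[List[float]]) -> int:
    """Return the index of the best of N sampled solutions."""
    scores = [solution_score(s) for s in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])

# Per-step PRM scores for three sampled solutions (illustrative numbers).
candidates = [
    [0.95, 0.90, 0.30],  # a late step looks wrong
    [0.85, 0.88, 0.90],  # uniformly plausible
    [0.99, 0.40, 0.95],  # an early step looks wrong
]
print("best candidate:", rerank(candidates))  # -> 1
```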

Implications and Future Directions

MATH-SHEPHERD represents a significant step toward removing the bottleneck of manual process annotation in mathematical reasoning tasks for LLMs. The framework demonstrates that automatic process supervision is a scalable and efficient alternative, and it paves the way for future work combining LLMs with verification models such as PRMs. The marked improvements across benchmark datasets underscore the potential of automated process annotation for elevating the reasoning capabilities of LLMs. Going forward, integrating such frameworks into reinforcement learning to further boost top-1 accuracy, along with pursuing a generalized PRM for mathematics, marks a promising direction for future research in the domain.

Authors (9)
  1. Peiyi Wang (48 papers)
  2. Lei Li (1293 papers)
  3. Zhihong Shao (20 papers)
  4. R. X. Xu (80 papers)
  5. Damai Dai (38 papers)
  6. Yifei Li (92 papers)
  7. Deli Chen (20 papers)
  8. Y. Wu (640 papers)
  9. Zhifang Sui (89 papers)
Citations (91)