Math-Shepherd: Process-Oriented Reward Modeling

Updated 9 October 2025
  • Math-Shepherd is a process-oriented reward modeling system that evaluates intermediate reasoning steps in multi-stage math solutions, enabling precise feedback.
  • It uses automatic process supervision through continuation sampling and sigmoid-based scoring to eliminate manual annotations and boost accuracy.
  • The system integrates stepwise reinforcement learning via PPO and verification with self-consistency, significantly enhancing performance on GSM8K and MATH benchmarks.

Math-Shepherd is an advanced process-oriented reward modeling system designed to verify and reinforce mathematical reasoning in LLMs without requiring manual human annotation (Wang et al., 2023). Its architecture centers on assigning reward scores at the level of reasoning steps within multi-stage math solutions, enabling precise feedback and step-by-step guidance during both inference (verification) and training (reinforcement learning). Math-Shepherd thus represents a significant evolution in the development of autonomously supervised mathematical reasoning agents.

1. Process-Oriented Reward Model Architecture

Traditional outcome reward models (ORM) in LLM fine-tuning yield a scalar score only after an entire solution is produced. Math-Shepherd introduces a process reward model (PRM) that evaluates and guides the generator at each intermediate reasoning step. For a solution decomposed into steps $\{s_1, s_2, \dots, s_K\}$, the model outputs a sigmoid-based “goodness” score $r_{s_i}$ for each step $s_i$.
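The per-step scorer can be realized as a lightweight scoring head over the generator's hidden states. Below is a minimal sketch under that assumption; the class name, the use of PyTorch, and the convention of one hidden vector per step boundary are illustrative choices, not details from the original implementation.

```python
import torch
import torch.nn as nn

class ProcessRewardHead(nn.Module):
    """Minimal sketch of a PRM head: maps the hidden state at each
    step boundary to a sigmoid 'goodness' score r_{s_i} in (0, 1)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, step_hidden_states: torch.Tensor) -> torch.Tensor:
        # step_hidden_states: (num_steps, hidden_size), one vector per step s_i
        logits = self.score(step_hidden_states).squeeze(-1)  # (num_steps,)
        return torch.sigmoid(logits)                          # r_{s_1}, ..., r_{s_K}
```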

Training data labels for each step are constructed automatically, leveraging continuation sampling rather than human annotation. Given $N$ full continuations (completions) generated from a step $s_i$ by a completer model, the scoring functions are as follows (see the labeling sketch after this list):

  • Hard Estimation (HE): $y^{\text{HE}}_{s_i} = 1$ if any sampled continuation reaches the correct final answer $a^*$, and $0$ otherwise.
  • Soft Estimation (SE): $y^{\text{SE}}_{s_i} = \frac{1}{N}\sum_{j=1}^{N} \mathbb{I}(a_j = a^*)$, the fraction of continuations whose final answer $a_j$ matches $a^*$.
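As a concrete illustration, the hedged sketch below derives both labels from the final answers of the $N$ sampled continuations; the function name and the simple equality check on answers are assumptions for illustration, not the paper's exact implementation.

```python
def step_labels(sampled_final_answers, gold_answer):
    """Derive HE and SE labels for one step s_i from N sampled continuations.

    sampled_final_answers: final answers a_j extracted from the completer's
        N continuations of the solution prefix ending at s_i.
    gold_answer: the reference answer a*.
    """
    matches = [a == gold_answer for a in sampled_final_answers]
    y_hard = 1.0 if any(matches) else 0.0   # HE: any continuation is correct
    y_soft = sum(matches) / len(matches)    # SE: fraction of correct continuations
    return y_hard, y_soft

# Example: 3 of 8 continuations sampled from step s_i recover the gold answer "42".
y_he, y_se = step_labels(["42", "7", "42", "13", "42", "9", "8", "5"], "42")
# y_he == 1.0, y_se == 0.375
```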

The PRM is trained using cross-entropy loss:

$$\mathcal{L}_{\text{PRM}} = \sum_{i=1}^{K} \Big[\, y_{s_i} \log r_{s_i} + (1 - y_{s_i}) \log (1 - r_{s_i}) \,\Big],$$

where $y_{s_i}$ is obtained by either HE or SE and $r_{s_i}$ is the predicted reward for $s_i$.
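In code, this objective reduces to a per-step binary cross-entropy between predicted step rewards and the HE or SE labels. The sketch below uses PyTorch's built-in binary cross-entropy, which implements the negated form of the expression above (the quantity actually minimized during training), and assumes the step rewards come from a head like the one sketched earlier.

```python
import torch
import torch.nn.functional as F

def prm_loss(step_rewards: torch.Tensor, step_labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy over reasoning steps.

    step_rewards: (num_steps,) sigmoid outputs r_{s_i} in (0, 1).
    step_labels:  (num_steps,) targets y_{s_i}; 0/1 for HE, fractions for SE.
    """
    # Summed over steps, matching the summation over i = 1..K in the formula.
    return F.binary_cross_entropy(step_rewards, step_labels, reduction="sum")

# Example with three steps labeled by soft estimation.
rewards = torch.tensor([0.9, 0.6, 0.2])
labels = torch.tensor([1.0, 0.75, 0.0])
loss = prm_loss(rewards, labels)
```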

2. Verification as Stepwise Reranking

Math-Shepherd operates as a verifier by reranking candidate solutions from an LLM generator. Given a math problem $p$ and $N$ candidate solutions $\{S_1, \ldots, S_N\}$, Math-Shepherd scores each solution $S_i$ as $\min_{s_j \in S_i} r_{s_j}$, the minimum per-step reward over all steps in $S_i$.

To increase reliability, the verifier can be composed with self-consistency voting:

$$a_{\text{(sc+rm)}} = \arg\max_{a} \sum_{i=1}^{N} \mathbb{I}(a_i = a) \cdot RM(p, S_i),$$

where $RM(p, S_i) = \min_{s_j \in S_i} r_{s_j}$ and $a_i$ is the final answer of $S_i$. This compositional strategy favors answers that are both supported by high-quality intermediate steps and reached by many candidates.
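The sketch below combines both selection rules: plain best-of-N reranking by the minimum step reward, and the self-consistency vote weighted by the reward model. The data structure (a solution represented as a final answer plus its list of step rewards) is an assumption made for illustration.

```python
from collections import defaultdict

def solution_score(step_rewards):
    """RM(p, S_i): the minimum per-step reward across the solution."""
    return min(step_rewards)

def rerank_best_of_n(candidates):
    """candidates: list of (final_answer, step_rewards).
    Returns the answer of the single highest-scoring solution."""
    best = max(candidates, key=lambda c: solution_score(c[1]))
    return best[0]

def sc_plus_rm(candidates):
    """Self-consistency weighted by the process reward model: each candidate
    votes for its final answer with weight RM(p, S_i)."""
    votes = defaultdict(float)
    for final_answer, step_rewards in candidates:
        votes[final_answer] += solution_score(step_rewards)
    return max(votes, key=votes.get)

# Example: three candidates, two of which agree on "42".
cands = [("42", [0.9, 0.8, 0.7]), ("42", [0.6, 0.5]), ("13", [0.95, 0.9])]
print(rerank_best_of_n(cands))  # "13" (single best min-step score)
print(sc_plus_rm(cands))        # "42" (consensus weighted by step quality)
```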

3. Stepwise Reinforcement Learning via PPO

For reinforcement learning, Math-Shepherd supplies the stepwise reward within Proximal Policy Optimization (PPO). Rather than a single episode-level reward, Math-Shepherd delivers $r_{s_i}$ immediately after each generated step $s_i$. The PPO update then uses these dense rewards (a reward-shaping sketch follows the list below):

  • Policy loss is calculated to maximize expected reward at each reasoning transition.
  • The fine-tuned generator incrementally corrects and optimizes the chain of reasoning, improving stepwise logic and solution reliability.
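A minimal sketch of how the dense reward signal could be attached to each generated step before a standard PPO update; the step segmentation and the scoring callable are assumptions for illustration, not the paper's exact training setup.

```python
def stepwise_rewards(problem, steps, prm_score):
    """Assign a dense reward r_{s_i} at the end of each reasoning step,
    instead of a single scalar reward for the whole solution.

    problem:   the math problem p.
    steps:     the generated solution split into steps s_1..s_K.
    prm_score: callable scoring the latest step given the running prefix,
               returning r_{s_i} in (0, 1) (e.g. the PRM head sketched earlier).
    """
    rewards, prefix = [], problem
    for step in steps:
        prefix = prefix + "\n" + step
        rewards.append(prm_score(prefix))  # reward emitted right after s_i
    return rewards  # length K; used as per-step rewards in the PPO update
```

Because every reasoning transition carries its own reward, PPO advantages can be estimated step by step rather than only from the final answer, which is what makes the feedback dense.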

Empirically, PPO with Math-Shepherd produces significant performance gains: for example, Mistral-7B accuracy on GSM8K (a mathematical word problem dataset) rises from 77.9% to 84.1%, and on MATH (competition-level problems) from 28.6% to 33.0%. Adding verification with Math-Shepherd further lifts performance to 89.1% (GSM8K) and 43.5% (MATH).

4. Automatic Process Supervision via Completion Sampling

Eliminating the need for human labelers, Math-Shepherd exploits the generative capacity of LLMs to build its training set. By deploying a completer LLM to sample $N$ full solutions from each intermediate step $s_i$, it automatically infers ground-truth labels for step validity. This method leverages the notion that the local correctness of $s_i$ can be extrapolated from the solution trajectories it enables.
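Put together, the data-construction loop might look like the sketch below, which walks over each prefix of a generated solution, samples completions, and records HE/SE labels using the helper defined earlier; `sample_completions` is a hypothetical stand-in for whatever decoding interface the completer model exposes.

```python
def build_process_supervision(problem, solution_steps, gold_answer,
                              sample_completions, n_samples=8):
    """Automatically label each step of one solution.

    sample_completions(prefix, n) is assumed to return n final answers
    extracted from the completer LLM's continuations of `prefix`.
    Returns a list of (step_index, y_hard, y_soft) records.
    """
    records, prefix = [], problem
    for i, step in enumerate(solution_steps):
        prefix = prefix + "\n" + step
        answers = sample_completions(prefix, n_samples)
        y_hard, y_soft = step_labels(answers, gold_answer)  # defined earlier
        records.append((i, y_hard, y_soft))
    return records
```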

This “process-wise supervision” methodology scales seamlessly, enabling PRMs to be trained on large datasets for a variety of mathematical domains—without the prohibitive cost, latency, or inconsistency associated with manual stepwise annotation.

5. Quantitative Results and Comparative Evaluation

Math-Shepherd’s efficacy was validated on open-source models such as DeepSeek-67B, LLaMA2-70B, and Llemma-34B, evaluated on GSM8K and MATH. Table 1 presents representative results:

| Model (setting) | GSM8K accuracy | MATH accuracy |
| --- | --- | --- |
| DeepSeek-67B (verification, 256 candidates) | 93.3% | 48.1% |
| Mistral-7B (stepwise PPO) | 84.1% | 33.0% |
| Mistral-7B (stepwise PPO + verification) | 89.1% | 43.5% |

Math-Shepherd consistently outperforms verification baselines based on self-consistency voting or ORM reranking. The system is robust to hallucinated steps and error propagation, and reliably filters high-quality reasoning chains from diverse candidate pools.

6. Implications and Future Directions

Math-Shepherd offers a scalable, automatic process supervision pipeline that is adaptable to a wide array of mathematical reasoning regimes in LLMs. The ability to train and supervise without manual annotation removes a fundamental bottleneck, particularly for stepwise mathematical reasoning tasks where human annotation is otherwise impractical.

Looking forward, Math-Shepherd’s paradigm can be generalized to other domains involving multi-step reasoning, to multi-modal math problems (see Wang et al., 28 Feb 2025, for multi-visual scenarios), and can be further improved with richer completion models or more granular stepwise reward estimation. Its step-level feedback and reinforcement learning integration provide an extensible foundation for self-improving, interpretable mathematical AI systems.

7. Key Mathematical Formulas

Representative training losses:

  • Outcome Reward Model (ORM): $\mathcal{L}_{\text{ORM}} = y_s \log r_s + (1 - y_s)\log(1 - r_s)$
  • Process Reward Model (PRM): $\mathcal{L}_{\text{PRM}} = \sum_{i=1}^{K} \big[\, y_{s_i}\log r_{s_i} + (1 - y_{s_i})\log(1 - r_{s_i}) \,\big]$

Verifier selection score:

$$RM(p, S_i) = \min_{s_j \in S_i} r_{s_j}$$

Self-consistency composition with process verification:

$$a_{\text{(sc+rm)}} = \arg\max_{a} \sum_{i=1}^{N} \mathbb{I}(a_i = a) \cdot RM(p, S_i)$$

8. Summary

Math-Shepherd builds a process-level evaluation and training pipeline for mathematical reasoning in LLMs, with automatic, scalable construction of supervision, effective inference-time verification, and reinforcement learning using granular reward feedback. Its integration dramatically boosts solution accuracy and reliability in widely used LLMs, suggesting that process-oriented reward modeling is a critical advance for building robust and interpretable mathematical AI agents.
