- The paper systematically compares process- and outcome-based supervision for solving math word problems, highlighting trade-offs in label use and reasoning quality.
- Outcome-based methods achieve competitive final-answer error, while low trace error requires process-based supervision or a reward model that emulates it.
- Combining supervised fine-tuning, reinforcement learning, and reward models, the approach reduces final-answer error from 16.8% to 12.7% and trace error from 14.0% to 3.4%.
Evaluating Process- and Outcome-Based Supervision for Math Word Problems
This paper presents a detailed evaluation of process-based and outcome-based supervision for training language models (LMs) to solve math word problems, using the GSM8K dataset. The primary contribution is a systematic comparison of the two approaches along two distinct error metrics: final-answer error and trace error. The paper elucidates the trade-offs inherent in each supervisory approach, particularly the cost of label collection and the quality of the model's reasoning process.
Summary and Methodology
The authors train and evaluate LMs using combinations of supervised fine-tuning (SFT), reinforcement learning (RL) with expert iteration, and reward models (RMs). Process-based supervision relies on annotated reasoning traces with a correctness judgment for each step, while outcome-based supervision assesses only the correctness of the final answer; the latter is far more label-efficient, requiring only a reference answer per question. Reward models further improve the trained models by scoring sampled solutions and prioritizing sequences likely to yield correct outcomes.
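To make the distinction concrete, the minimal Python sketch below contrasts the two supervision signals. It is illustrative only, not the authors' code; the `Step` dataclass, the `human_label` field, and the helper names are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    text: str                           # one line of the model's reasoning trace
    human_label: Optional[bool] = None  # per-step correctness (process supervision only)


def outcome_label(steps: List[Step], predicted_answer: str, reference_answer: str) -> bool:
    """Outcome-based supervision: one cheap label per question.

    The reasoning trace is accepted but deliberately ignored; only the
    final answer is compared against the reference.
    """
    del steps  # the trace is never inspected under outcome-based supervision
    return predicted_answer.strip() == reference_answer.strip()


def process_labels(steps: List[Step]) -> List[bool]:
    """Process-based supervision: one human label per reasoning step.

    Far more expensive to collect, but it tells the trainer *where* a
    trace goes wrong, not just whether the final answer happened to match.
    """
    if any(s.human_label is None for s in steps):
        raise ValueError("process supervision requires every step to be annotated")
    return [bool(s.human_label) for s in steps]


if __name__ == "__main__":
    trace = [
        Step("48 / 2 = 24 clips sold in May", human_label=True),
        Step("48 + 24 = 72 clips sold altogether", human_label=True),
    ]
    print(outcome_label(trace, predicted_answer="72", reference_answer="72"))  # True
    print(process_labels(trace))                                               # [True, True]
```

The key practical difference is visible in the function signatures: the outcome label needs nothing beyond the final answer, while the process labels require an annotation on every step.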
Key Findings:
- Final-Answer Error: Models trained under outcome-based supervision using RL or RMs achieve competitive final-answer error rates compared to process-based SFT approaches, indicating that assessing the final result is often sufficient to drive improvements in answer accuracy.
- Trace Error: Models trained with process-based supervision, or with outcome-based reward models (ORMs) that emulate process feedback, demonstrate superior trace error rates, underscoring their efficacy in keeping the reasoning steps aligned with human expectations. A significant conclusion is that ORM-supervised models often approach the trace correctness of models trained with explicit process-based (PRM) supervision. (A sketch of how the two error metrics can be computed follows this list.)
- Quantitative Performance: The combination of supervised learning with reward-model-based reinforcement learning sets a new benchmark, reducing the final-answer error from 16.8% to 12.7% and trace error from 14.0% to 3.4%.
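The sketch below shows, under illustrative assumptions, how the two reported metrics can be computed from per-question evaluation records. The `EvalRecord` structure and the choice to measure trace error only among correct-final-answer samples are assumptions of this sketch, not taken verbatim from the paper.

```python
from typing import List, NamedTuple


class EvalRecord(NamedTuple):
    final_answer_correct: bool  # predicted answer == reference answer
    trace_correct: bool         # every reasoning step judged valid by an annotator


def final_answer_error(records: List[EvalRecord]) -> float:
    """Fraction of test questions whose final answer is wrong."""
    return sum(not r.final_answer_correct for r in records) / len(records)


def trace_error(records: List[EvalRecord]) -> float:
    """Fraction of correct-final-answer questions whose reasoning still contains
    at least one bad step (conditioning on correct answers is an assumption
    of this sketch)."""
    correct = [r for r in records if r.final_answer_correct]
    return sum(not r.trace_correct for r in correct) / max(len(correct), 1)


if __name__ == "__main__":
    records = [
        EvalRecord(final_answer_correct=True, trace_correct=True),
        EvalRecord(final_answer_correct=True, trace_correct=False),  # right answer, flawed reasoning
        EvalRecord(final_answer_correct=False, trace_correct=False),
    ]
    print(f"final-answer error: {final_answer_error(records):.2f}")  # 0.33
    print(f"trace error:        {trace_error(records):.2f}")         # 0.50
```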
Implications and Future Directions
One of the key insights from this paper is the nuanced role of RMs trained with outcome-based labels. Despite seeing only final-answer correctness, these models appear to implicitly learn process-level cues, achieving trace performance close to that of models trained with explicit process-based supervision. This finding suggests such models can bridge the gap between process and outcome information, leveraging the strengths of both approaches.
From a practical perspective, the research suggests that context should guide the choice of supervisory approach. Where final correctness suffices, outcome-based methods are preferable due to their label efficiency. Conversely, in domains where the interpretability of reasoning is critical, process-based supervision or its approximations are indispensable.
Theoretically, these results advocate for a refined understanding of how supervisory signals propagate through reinforcement learning and reward modeling architectures. This understanding could inform the development of more sophisticated algorithms that dynamically incorporate process and outcome feedback.
Conclusion
This paper provides a rigorous and insightful analysis of how different supervisory approaches influence the performance of LMs on complex reasoning tasks like math problem-solving. The results challenge the simplistic dichotomy between process and outcome approaches, showing that both can usefully inform model training, depending on the goals and constraints of the problem domain. Future research may explore this dual-feedback paradigm further, investigating the broader applicability to dynamic and less structured domains beyond mathematical reasoning.