An Analysis of "Let's Verify Step by Step"
LLMs have advanced significantly at tasks requiring complex, multi-step reasoning. However, they still regularly make logical mistakes, often termed hallucinations, which can derail an entire solution. The paper "Let's Verify Step by Step" tackles this issue by comparing two methods for training reward models: outcome supervision, which provides feedback only on the final result, and process supervision, which provides feedback on each intermediate reasoning step. The paper concludes that process supervision significantly outperforms outcome supervision in training models to solve problems from the challenging MATH dataset. The authors also introduce active learning as a technique to enhance the data efficiency of process supervision. This essay explores the methodology, results, implications, and potential future directions of this research.
Core Contributions
The paper makes several key contributions:
- Superior Performance of Process Supervision: Process supervision was shown to significantly outperform outcome supervision. The process-supervised reward model (PRM) solved 78.2% of problems from a representative subset of the MATH test set when used to select the best of many sampled solutions. Outcome-supervised models fared worse, largely because a reward tied only to the final answer can credit solutions that reach the right result through flawed reasoning.
- Data Efficiency through Active Learning: The authors demonstrated that active learning improves the data efficiency of process supervision by a factor of 2.6. This finding matters because it points to a way of reducing reliance on extensive human feedback, which is resource-intensive to collect.
- Dataset Release: The release of PRM800K, a dataset of 800,000 step-level human feedback labels, aims to facilitate further research on alignment and on training more reliable LLMs (see the sketch just after this list).
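To make the step-level feedback concrete, the sketch below shows one way such a record could be represented in Python. The LabeledSolution class and its field names are purely illustrative, not the released PRM800K schema; the only detail taken from the paper is that each step receives a positive, negative, or neutral human label.

```python
from dataclasses import dataclass
from typing import List, Literal

# Per the paper, each step gets one human label: positive (correct and
# reasonable), negative (incorrect or unreasonable), or neutral (ambiguous).
StepLabel = Literal["positive", "negative", "neutral"]

@dataclass
class LabeledSolution:
    problem: str                # the MATH problem statement
    steps: List[str]            # the model-generated solution, split into steps
    labels: List[StepLabel]     # one human label per step
    final_answer_correct: bool  # whether the final answer matches ground truth

# A toy record in this illustrative schema:
example = LabeledSolution(
    problem="What is 3 + 4 * 2?",
    steps=["4 * 2 = 8", "3 + 8 = 11", "The answer is 11."],
    labels=["positive", "positive", "positive"],
    final_answer_correct=True,
)
```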
Methodological Details
The research employs a rigorous methodological framework:
The experiments were conducted in two regimes, large-scale and small-scale. In the large-scale regime, models were fine-tuned from GPT-4; this setup aimed to advance the state of the art, but its outcome- and process-supervised reward models are not directly comparable because their training sets differ. To enable fair, controlled comparisons, the small-scale regime therefore used a large-scale model to provide synthetic supervision for smaller models.
All large-scale models were fine-tuned from the base GPT-4 model, which had not been trained with RLHF. Small-scale models were pretrained with substantially less compute and fine-tuned on MathMix, a specialized dataset of math-relevant text that strengthens mathematical reasoning.
Process supervision required human data-labelers to provide step-level labels for model-generated solutions. To get the most value from each label, labeling was deliberately biased toward convincing wrong-answer solutions, that is, solutions the current best PRM rates highly even though their final answer is wrong; this strategy was validated through several surrogate ablations.
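A rough sketch of that surfacing strategy appears below, assuming we can sample candidate solutions and score them with the current PRM. The sample_solutions and prm_score callables, the sampling counts, and the simple top-k selection are all hypothetical stand-ins, not the authors' actual pipeline.

```python
from typing import Callable, List, Tuple

def surface_convincing_wrong_answers(
    problems: List[str],
    sample_solutions: Callable[[str, int], List[Tuple[str, bool]]],
    prm_score: Callable[[str, str], float],
    samples_per_problem: int = 16,
    keep_per_problem: int = 2,
) -> List[Tuple[str, str]]:
    """Pick model-generated solutions worth sending to human labelers.

    sample_solutions(problem, n) is assumed to return n (solution, is_correct)
    pairs from the generator; prm_score(problem, solution) is the current
    reward model's score for a solution.
    """
    selected: List[Tuple[str, str]] = []
    for problem in problems:
        candidates = sample_solutions(problem, samples_per_problem)
        # Keep wrong-answer solutions and rank them by how convincing the
        # current PRM finds them (highest score first).
        wrong = [(sol, prm_score(problem, sol)) for sol, ok in candidates if not ok]
        wrong.sort(key=lambda pair: pair[1], reverse=True)
        selected.extend((problem, sol) for sol, _ in wrong[:keep_per_problem])
    return selected
```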
Results and Discussions
Performance Insights
The results were clear:
- Process-Supervised Models:
These models significantly outperformed outcome-supervised models in both the large-scale and small-scale setups, proving more reliable at selecting correct solutions when reranking best-of-N samples (see the sketch after this list).
- Outcome-Supervised Models:
Although the ORM performed only slightly better than a majority-voting baseline, it was overshadowed by the PRM. Outcome supervision was less effective largely because a single final-answer signal makes credit assignment across a multi-step solution difficult.
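The comparison above rests on best-of-N reranking: sample many solutions per problem and let the reward model choose one. A minimal sketch follows, assuming only a per-solution score function from whichever reward model is being evaluated; the paper scores a solution under the PRM as the probability that every step is correct, computed as the product of the per-step correctness probabilities.

```python
import math
from typing import Callable, List

def prm_solution_score(step_probs: List[float]) -> float:
    """Solution-level PRM score: the probability that every step is correct,
    taken here as the product of the per-step correctness probabilities."""
    return math.prod(step_probs)

def best_of_n(solutions: List[str], score: Callable[[str], float]) -> str:
    """Best-of-N selection: return the sampled solution the reward model
    ranks highest. An ORM supplies a single final-answer score; a PRM
    aggregates step-level probabilities, e.g. via prm_solution_score."""
    return max(solutions, key=score)
```

The stronger the reward model's ranking, the more reliably best-of-N surfaces a correct solution as N grows, which is exactly where the PRM's advantage over the ORM and majority voting shows up.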
Alignment and Safety
The paper discusses several advantages of process supervision in terms of AI alignment:
- Because process supervision explicitly rewards human-endorsed steps, the resulting models are more interpretable and their reasoning is easier to scrutinize.
- By directly rewarding an aligned chain-of-thought rather than relying on a potentially misleading outcome-based signal, process supervision mitigates the risks of optimizing a misaligned proxy objective.
Implications and Future Directions
The implications of this research are profound:
- For Training LLMs: The findings suggest that employing process supervision can yield more reliable and aligned models, particularly in domains requiring complex reasoning.
- For Data Efficiency: Active learning offers a significant avenue for reducing the cost of human feedback, making it more practical to scale such techniques.
- For Broader AI Research: The release of PRM800K is likely to enable subsequent research, potentially investigating the impact of process supervision in other domains.
Conclusion
"Let's Verify Step by Step" provides compelling evidence that process supervision can train more reliable reward models compared to outcome supervision in the context of mathematical reasoning tasks. This research emphasizes the importance of fine-grained feedback and presents a scalable way to collect valuable human data. The implications for AI alignment and model interpretability underscore the importance of these techniques in advancing safe and reliable AI systems. Future work should investigate the generalizability of process supervision to other domains and further refine active learning strategies.