- The paper introduces ThinkPRM, a generative process reward model that verifies step-by-step solutions via a detailed chain-of-thought, achieving significant data efficiency.
- The paper leverages lightweight fine-tuning on synthetic verification chains to reduce dependency on extensive human annotations and lower training costs.
- The paper demonstrates that ThinkPRM can dynamically scale verification compute, outperforming baseline models on benchmarks like ProcessBench and MATH-500.
The paper "Process Reward Models That Think" (2504.16828) introduces ThinkPRM, a novel approach to building process reward models (PRMs) that are data-efficient and highly effective for guiding reasoning in LLMs. Traditional PRMs are often discriminative classifiers requiring vast amounts of expensive step-level human annotations. In contrast, ThinkPRM is a generative PRM that verifies step-by-step solutions by generating a detailed verification chain-of-thought (CoT). This approach leverages the inherent reasoning capabilities of LLMs and allows for training with significantly fewer process labels.
ThinkPRM is built upon open-weight reasoning models, fine-tuned efficiently using a small dataset of synthetic verification CoTs. The core idea is to train the model to verbalize its step-by-step verification process, producing an extended CoT akin to a human thinking aloud. This verbalization provides interpretability and enables dynamic scaling of verification compute at test time.
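To make this concrete, the sketch below illustrates how such a generative verifier could be invoked: the verifier receives the problem and a candidate solution, emits a verification chain-of-thought, and per-step judgments are parsed out of that text. The `llm.generate` interface, the prompt template, and the regex-based parsing are illustrative assumptions rather than the paper's actual prompt or scoring code (which may, for instance, derive scores from token probabilities of the judgment).

```python
import re

# Minimal sketch of generative verification (illustrative, not the paper's code).
# `verifier_llm` is assumed to expose a simple `generate(prompt: str) -> str` interface.
VERIFY_TEMPLATE = """You are given a problem and a step-by-step solution.
Critique each step, then end each critique with "Step <i>: correct" or "Step <i>: incorrect".

Problem:
{problem}

Solution:
{steps}

Verification:
"""

def score_solution(verifier_llm, problem: str, steps: list[str]) -> list[float]:
    """Return one 0/1 score per step, parsed from the verifier's chain-of-thought."""
    prompt = VERIFY_TEMPLATE.format(
        problem=problem,
        steps="\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps)),
    )
    verification_cot = verifier_llm.generate(prompt)  # long, verbalized verification
    scores = []
    for i in range(len(steps)):
        m = re.search(rf"Step {i + 1}:\s*(correct|incorrect)", verification_cot, re.IGNORECASE)
        # Treat a missing or malformed judgment as incorrect (0.0) in this sketch.
        scores.append(1.0 if m and m.group(1).lower() == "correct" else 0.0)
    return scores
```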
The implementation of ThinkPRM involves lightweight fine-tuning of pre-trained reasoning models (such as R1-Distill-Qwen variants and QwQ-32B-Preview). The training data consists of synthetic verification chains collected by prompting a capable reasoning model (QwQ-32B-Preview) to critique step-by-step solutions from the PRM800K dataset [lightman2023let], which provides gold step-level labels. A crucial step in the data collection pipeline is filtering the generated CoTs: only those that follow a specific format, match the gold step labels, and are within a reasonable length (to avoid issues like overthinking or looping) are kept. This filtering process, based on process-level correctness rather than just the final outcome, is shown to be critical for training effective verifiers. The paper demonstrates that training on just 1K such filtered chains (corresponding to about 8K process labels) is sufficient for significant performance gains. The fine-tuning process itself is highly efficient, taking only a few hours on a single GPU for the largest models.
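A minimal sketch of the filtering logic described above, assuming a simple "Step i: correct/incorrect" judgment format, an arbitrary length cap, and exact agreement with the gold step labels; the paper's concrete format, thresholds, and matching criteria may differ:

```python
import re

MAX_VERIFICATION_TOKENS = 4096  # hypothetical cap to drop overthinking/looping chains

def parse_step_judgments(cot: str, num_steps: int) -> list[bool] | None:
    """Extract one correct/incorrect judgment per step; return None if the format is violated."""
    judgments = []
    for i in range(num_steps):
        m = re.search(rf"Step {i + 1}:\s*(correct|incorrect)", cot, re.IGNORECASE)
        if m is None:
            return None  # malformed chain: a step judgment is missing
        judgments.append(m.group(1).lower() == "correct")
    return judgments

def keep_chain(cot: str, gold_step_labels: list[bool], num_tokens: int) -> bool:
    """Process-level filter: well-formed, agrees with gold PRM800K step labels, reasonable length."""
    if num_tokens > MAX_VERIFICATION_TOKENS:
        return False
    judgments = parse_step_judgments(cot, len(gold_step_labels))
    if judgments is None:
        return False
    # Keep only chains whose per-step judgments match the gold labels,
    # not merely chains that reach the right final verdict.
    return judgments == gold_step_labels
```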
The paper evaluates ThinkPRM against two main baselines: LLM-as-a-Judge (the base reasoning model used zero-shot as a verifier) and discriminative PRMs trained on the full PRM800K dataset (orders of magnitude more data).
Key practical findings include:
- LLM-as-a-Judge limitations: Zero-shot reasoning models used as verifiers suffer from significant issues, including sensitivity to prompt wording, a tendency to generate invalid outputs (malformed responses, overthinking, infinite loops), and generally poor performance compared to specialized verifiers.
- Effectiveness of Fine-Tuning: Lightweight fine-tuning on the filtered synthetic data substantially improves the reliability and performance of generative verifiers, drastically reducing invalid outputs and boosting accuracy on verification tasks like ProcessBench [zheng2024processbench].
- Data Efficiency and Performance: ThinkPRM, trained on only about 8K process labels, consistently outperforms discriminative PRMs trained on 712K process labels (about 100x more data) across various evaluation scenarios. This highlights the advantage of the generative approach and of leveraging the base model's underlying language modeling capabilities.
- Test-Time Scaling: ThinkPRM is effective in test-time scaling techniques like Best-of-N selection and verifier-guided beam search, outperforming baselines on benchmarks like MATH-500 [hendrycks2021measuring] and AIME '24 (a Best-of-N sketch follows this list).
- Out-of-Domain Generalization: Despite being trained primarily on mathematical data, ThinkPRM generalizes strongly to out-of-domain tasks such as science QA (GPQA-Diamond [rein2024gpqa]) and code generation (LiveCodeBench [jain2024livecodebench]), surpassing discriminative verifiers that were trained on much larger datasets yet tend to be more sensitive to domain shift.
- Scaling Verifier Compute: Generative PRMs uniquely allow verification compute itself to be scaled. The paper explores two methods: parallel scaling (sampling multiple verification CoTs and averaging their scores) and sequential scaling (forcing the model to double-check its verification). Both improve performance, showing that ThinkPRM can effectively use additional compute and outperform LLM-as-a-Judge under similar compute budgets (a parallel-scaling sketch follows this list).
- Handling Difficulty: ThinkPRM demonstrates particular effectiveness on more challenging reasoning problems, where its ability to generate detailed reasoning for verification provides a distinct advantage over discriminative models.
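As referenced in the test-time scaling bullet above, here is a minimal Best-of-N sketch: sample N candidate solutions, score each with the verifier, and return the highest-scoring one. The `sampler_llm.generate` interface, the newline-based step split, and the min-aggregation over step scores are illustrative choices, and the sketch reuses the hypothetical `score_solution` helper from earlier.

```python
def best_of_n(sampler_llm, verifier_llm, problem: str, n: int = 8) -> str:
    """Pick the candidate solution with the highest verifier score (illustrative sketch)."""
    candidates = [sampler_llm.generate(problem) for _ in range(n)]  # N sampled solutions

    def solution_score(candidate: str) -> float:
        steps = [s for s in candidate.split("\n") if s.strip()]     # naive step split
        step_scores = score_solution(verifier_llm, problem, steps)  # per-step scores (sketch above)
        return min(step_scores) if step_scores else 0.0             # min-aggregation over steps

    return max(candidates, key=solution_score)
```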
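And, as referenced in the verifier-compute bullet, a sketch of parallel scaling: sample K independent verification chains for the same solution (assuming stochastic decoding) and average their scores. Sequential scaling is indicated only as a comment, since the paper's exact double-check cue is not reproduced here; this again reuses the hypothetical `score_solution` helper.

```python
def parallel_verify(verifier_llm, problem: str, steps: list[str], k: int = 4) -> float:
    """Average the solution-level score over k independently sampled verification CoTs."""
    solution_scores = []
    for _ in range(k):
        # Each call assumes non-zero decoding temperature, so the verification CoTs differ.
        step_scores = score_solution(verifier_llm, problem, steps)
        solution_scores.append(min(step_scores) if step_scores else 0.0)
    return sum(solution_scores) / len(solution_scores)

# Sequential scaling (sketch): after the verifier finishes its chain-of-thought, append a
# cue such as "Wait, let me double-check my verification." and let it continue reasoning
# before re-emitting its final step judgments.
```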
Implementation considerations include the computational overhead of generating long verification CoTs: while training is efficient, inference latency is higher than for discriminative PRMs, which score steps in a single forward pass. The paper also notes limitations like potential overconfidence in predicted scores and "step label interference," where errors in verifying early steps can negatively impact verification of later steps in the solution. Future work could explore better calibration methods and techniques to mitigate this autoregressive error propagation.
Overall, ThinkPRM presents a practical and promising direction for building high-quality, data-efficient PRMs by harnessing the generative power and reasoning capabilities of LLMs, enabling effective scaling of test-time compute for complex reasoning tasks. The code, data, and models are planned to be released to facilitate further research and application.