- The paper demonstrates that RL training for problem solving inherently induces strong PRM evaluation capabilities in LLMs.
- It shows that problem-solving accuracy and PRM capability co-evolve during training, with RL-trained models often outperforming explicit PRM approaches.
- Self-PRM, an inference-time technique leveraging a model’s internal scoring, significantly boosts performance despite challenges in precision on tough tasks.
This paper investigates the relationship between standard reinforcement learning (RL) training for problem-solving and the emergence of process reward model (PRM) capabilities in LLMs. Contrary to the common belief that explicit process supervision (like PRM training) is necessary for developing robust reasoning abilities and the capacity to evaluate reasoning steps, the authors demonstrate that pure RL training, focused on achieving correct final answers, implicitly fosters strong PRM capabilities.
The research uses several state-of-the-art models, including DeepSeek-R1 and QwQ-32B, which are known for strong performance on mathematical reasoning tasks achieved largely through RL training without explicit process-level supervision. The authors evaluate these models, alongside instruction-tuned models and models explicitly fine-tuned on PRM data, on ProcessBench (arXiv 2412.06559), a benchmark designed to assess the quality of intermediate reasoning steps.
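ProcessBench presents a problem together with a step-by-step solution and asks the judge model to return the index of the earliest erroneous step, or to state that no step is wrong; the reported score is the harmonic mean (F1) of accuracy on the erroneous and error-free subsets. A minimal sketch of that scoring, assuming the common convention that -1 marks an error-free solution:

```python
def processbench_f1(predictions, labels):
    """Harmonic mean of accuracy on erroneous vs. error-free samples.

    predictions / labels: lists of ints giving the index of the first wrong step,
    or -1 if the solution is judged fully correct (label convention assumed here).
    """
    err_hits = [p == l for p, l in zip(predictions, labels) if l != -1]
    ok_hits = [p == l for p, l in zip(predictions, labels) if l == -1]
    acc_err = sum(err_hits) / len(err_hits) if err_hits else 0.0
    acc_ok = sum(ok_hits) / len(ok_hits) if ok_hits else 0.0
    if acc_err + acc_ok == 0:
        return 0.0
    return 2 * acc_err * acc_ok / (acc_err + acc_ok)
```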
Key empirical findings presented in the paper include:
- RL Induces PRM Capability: Models trained primarily with RL on problem-solving objectives (like DeepSeek-R1 and QwQ-32B) exhibit high PRM capabilities, often outperforming models explicitly trained on PRM datasets (Table 1). This is measured by their ability to correctly judge the validity of reasoning steps or full solution traces. The results suggest that learning to solve complex problems through RL inherently leads to an understanding of how to solve them correctly, which translates into the ability to evaluate reasoning processes.
- Co-evolution of Problem Solving and PRM Capability: Statistical analysis using chi-square tests shows a significant association between a model's problem-solving accuracy and its process-judgment accuracy across different datasets (Table 2; a contingency-table sketch of this analysis appears after this list). Furthermore, tracking performance during RL training reveals that problem-solving accuracy and ProcessBench F1 score (a measure of PRM capability) improve in parallel, and gains in process judgment sometimes precede, or are more stable than, gains in final-answer accuracy (Figure 1). This indicates that these two aspects of reasoning are deeply linked and co-evolve during RL training.
- Limitations of External PRMs: The authors evaluate whether existing, separate PRMs can improve the performance of strong RL-trained models using Best-of-N (BoN) reranking. They find that an external PRM (Qwen2.5-Math-PRM-72B) provides little to no benefit over simple majority voting when applied to strong models like QwQ-32B and DeepSeek-R1 (Table 4, BoN w/ PRM vs. Majority Voting; see the reranking sketch after this list). This suggests that external PRMs, potentially trained on different data or objectives, are not well aligned with the internal reasoning processes of these highly capable RL-trained models.
- Effectiveness of Self-PRM: To leverage the intrinsic PRM capability of RL-trained models, the authors propose Self-PRM, where the model uses its own internal scoring mechanism (or is prompted to evaluate) to rerank its sampled outputs in a BoN setup. BoN with Self-PRM consistently outperforms both Pass@k and Majority Voting, especially at larger sampling sizes (Table 4). This indicates that the model's own 'judgment' or internal reward signal is better aligned with its problem-solving process than external PRMs, making Self-PRM a more effective technique for improving performance via reranking for these models.
- Limitations of Self-PRM Precision: Despite improving overall accuracy, a detailed analysis reveals that Self-PRM suffers from low precision on difficult problems (Table 5). Self-PRM tends to label a significant number of incorrect solutions as correct, especially on challenging benchmark instances. While stronger models like DeepSeek-R1 show better precision than weaker ones, the issue persists. This highlights that while RL induces PRM capability, the model's self-evaluation is not perfectly reliable, particularly when the task is beyond its current solving capacity.
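The co-evolution result (Table 2) boils down to a contingency-table test: for each problem, record whether the model solved it and whether it judged the reasoning process correctly, then test whether the two outcomes are independent. A minimal sketch with SciPy, using made-up per-problem outcomes (the 2x2 layout is an assumption about how the paper tabulates results):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical per-problem outcomes: 1 = success, 0 = failure.
solved = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])  # did the model solve the problem?
judged = np.array([1, 1, 0, 1, 1, 1, 0, 0, 1, 1])  # did it judge the process correctly?

# 2x2 contingency table: rows = solved / not solved, columns = judged correctly / not.
table = np.array([
    [np.sum((solved == 1) & (judged == 1)), np.sum((solved == 1) & (judged == 0))],
    [np.sum((solved == 0) & (judged == 1)), np.sum((solved == 0) & (judged == 0))],
])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p_value:.4f}")  # a small p-value indicates the two abilities are associated
```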
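The reranking comparisons (Tables 4 and 5) contrast majority voting over sampled answers, Best-of-N with an external PRM score, and Best-of-N with the model's own score (Self-PRM); the precision issue is then a question of how often a self-accepted solution is actually correct. A minimal sketch of these selection rules and the precision measurement, assuming each sampled solution carries an extracted final answer, an external PRM score, a self-assigned score, and a ground-truth flag (all field names are hypothetical):

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    answer: str        # final answer extracted from the sampled solution
    prm_score: float   # score from an external PRM (hypothetical field)
    self_score: float  # model's own evaluation of its solution (hypothetical field)
    correct: bool      # ground-truth correctness, used only for analysis

def majority_vote(samples: List[Sample]) -> str:
    return Counter(s.answer for s in samples).most_common(1)[0][0]

def bon_external_prm(samples: List[Sample]) -> str:
    return max(samples, key=lambda s: s.prm_score).answer

def bon_self_prm(samples: List[Sample]) -> str:
    return max(samples, key=lambda s: s.self_score).answer

def self_prm_precision(samples: List[Sample], threshold: float) -> float:
    """Of the solutions Self-PRM accepts (score >= threshold), what fraction are correct?"""
    accepted = [s for s in samples if s.self_score >= threshold]
    if not accepted:
        return 0.0
    return sum(s.correct for s in accepted) / len(accepted)
```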
Practical Implementation and Applications:
The findings have significant implications for developing and deploying high-capability reasoning LLMs. A minimal Python sketch of Best-of-N reranking with Self-PRM follows; the `model.generate` interface and the evaluation prompt are illustrative assumptions, not the paper's exact setup.
```python
import re

def solve_with_self_prm(model, problem, n_samples):
    """Best-of-N with Self-PRM: sample several solutions, let the model score
    its own outputs, and return the highest-scoring one. `model.generate` is an
    assumed text-generation interface; adapt it to the inference API in use."""
    solutions = []
    scores = []
    for _ in range(n_samples):
        # Generate a solution attempt (sample with nonzero temperature so attempts differ).
        solution_attempt = model.generate(problem, max_tokens=2048)
        solutions.append(solution_attempt)

        # Prompt the model to evaluate its own solution, e.g. checking the
        # reasoning step by step and ending with a confidence score from 1 to 10.
        evaluation_prompt = (
            f"Problem: {problem}\n"
            f"Solution: {solution_attempt}\n"
            "Evaluate the solution step by step, then give a confidence score from 1 to 10.\n"
            "Evaluation:"
        )
        evaluation_response = model.generate(evaluation_prompt, max_tokens=512)

        # Parse the evaluation into a numeric score; this parsing step is crucial
        # and depends on the prompt format.
        scores.append(parse_score_from_evaluation(evaluation_response))

    # Rerank the sampled solutions by the model's own scores and keep the best one.
    best_solution, _ = max(zip(solutions, scores), key=lambda pair: pair[1])
    return best_solution

def parse_score_from_evaluation(evaluation_response):
    """Extract a numeric score (higher is better) from the model's free-text
    evaluation. Minimal heuristic: take the last number in the response;
    keyword matching or a structured output format is more robust in practice."""
    numbers = re.findall(r"\d+(?:\.\d+)?", evaluation_response)
    return float(numbers[-1]) if numbers else 0.0
```
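In practice, the quality of this reranking hinges on two choices the sketch glosses over: sampling the N attempts at a nonzero temperature so the candidates actually differ, and parsing the self-evaluation robustly (a fixed answer format or structured output is more reliable than free-text number extraction). Given the precision limitation discussed above, the top-ranked solution on hard problems should be treated as the model's best guess rather than a verified answer.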
- Improving Self-PRM: The identified limitation of low precision on hard problems suggests that future work could focus on techniques to improve the reliability of the model's self-evaluation. This might involve training signals specifically aimed at improving introspective accuracy, perhaps by incorporating some form of weak or synthetic process supervision, or simply by continuing to scale RL training to achieve better internal consistency and reward alignment.
In conclusion, the paper provides strong evidence that pure RL training is a powerful approach for developing both problem-solving and process-evaluation abilities in LLMs, potentially diminishing the perceived necessity of explicit PRM training. Self-PRM emerges as a practical method to leverage this induced capability, although its limitations on challenging tasks highlight areas for further research in improving the introspection reliability of advanced reasoning models.