- The paper demonstrates that RL training for problem solving inherently induces strong PRM evaluation capabilities in LLMs.
- It shows that problem-solving accuracy and PRM capability co-evolve during training, with RL-trained models often outperforming explicit PRM approaches.
- Self-PRM, an inference-time technique leveraging a model’s internal scoring, significantly boosts performance despite challenges in precision on tough tasks.
This paper investigates the relationship between standard reinforcement learning (RL) training for problem-solving and the emergence of process reward model (PRM) capabilities in LLMs. Contrary to the common belief that explicit process supervision (like PRM training) is necessary for developing robust reasoning abilities and the capacity to evaluate reasoning steps, the authors demonstrate that pure RL training, focused on achieving correct final answers, implicitly fosters strong PRM capabilities.
The research uses several state-of-the-art models, including DeepSeek-R1 and QwQ-32B, which are known for strong performance on mathematical reasoning tasks achieved largely through RL training without explicit process-level supervision. The authors evaluate these models, alongside instruction-tuned models and models explicitly fine-tuned on PRM data, on ProcessBench (arXiv 2412.06559), a benchmark designed to assess the quality of intermediate reasoning steps.
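ProcessBench presents a problem together with a step-by-step solution and asks the judge model to return the index of the earliest erroneous step, or to state that no step is wrong; the reported score is the harmonic mean (F1) of accuracy on the erroneous and error-free subsets. A minimal sketch of that scoring, assuming the common convention that -1 marks an error-free solution:

```python
def processbench_f1(predictions, labels):
    """Harmonic mean of accuracy on erroneous vs. error-free samples.

    predictions / labels: lists of ints giving the index of the first wrong step,
    or -1 if the solution is judged fully correct (label convention assumed here).
    """
    err_hits = [p == l for p, l in zip(predictions, labels) if l != -1]
    ok_hits = [p == l for p, l in zip(predictions, labels) if l == -1]
    acc_err = sum(err_hits) / len(err_hits) if err_hits else 0.0
    acc_ok = sum(ok_hits) / len(ok_hits) if ok_hits else 0.0
    if acc_err + acc_ok == 0:
        return 0.0
    return 2 * acc_err * acc_ok / (acc_err + acc_ok)
```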
Key empirical findings presented in the paper include:
- RL Induces PRM Capability: Models trained primarily with RL on problem-solving objectives (like DeepSeek-R1 and QwQ-32B) exhibit high PRM capabilities, often outperforming models explicitly trained on PRM datasets (Table 1). This is measured by their ability to correctly judge the validity of reasoning steps or full solution traces. The results suggest that learning to solve complex problems through RL inherently leads to an understanding of how to solve them correctly, which translates into the ability to evaluate reasoning processes.
- Co-evolution of Problem Solving and PRM Capability: Statistical analysis using chi-square tests shows a significant association between a model's problem-solving accuracy and its process-judgment accuracy across different datasets (Table 2; a contingency-table sketch of this analysis appears after this list). Furthermore, tracking performance during RL training reveals that problem-solving accuracy and ProcessBench F1 score (a measure of PRM capability) improve in parallel, and gains in process judgment sometimes precede, or are more stable than, gains in final-answer accuracy (Figure 1). This indicates that these two aspects of reasoning are deeply linked and co-evolve during RL training.
- Limitations of External PRMs: The authors evaluate whether existing, separate PRMs can improve the performance of strong RL-trained models using Best-of-N (BoN) reranking. They find that an external PRM (Qwen2.5-Math-PRM-72B) provides little to no benefit over simple majority voting when applied to strong models like QwQ-32B and DeepSeek-R1 (Table 4, BoN w/ PRM vs. Majority Voting; see the reranking sketch after this list). This suggests that external PRMs, potentially trained on different data or objectives, are not well aligned with the internal reasoning processes of these highly capable RL-trained models.
- Effectiveness of Self-PRM: To leverage the intrinsic PRM capability of RL-trained models, the authors propose Self-PRM, where the model uses its own internal scoring mechanism (or is prompted to evaluate) to rerank its sampled outputs in a BoN setup. BoN with Self-PRM consistently outperforms both Pass@k and Majority Voting, especially at larger sampling sizes (Table 4). This indicates that the model's own 'judgment' or internal reward signal is better aligned with its problem-solving process than external PRMs, making Self-PRM a more effective technique for improving performance via reranking for these models.
- Limitations of Self-PRM Precision: Despite improving overall accuracy, a detailed analysis reveals that Self-PRM suffers from low precision on difficult problems (Table 5). Self-PRM tends to label a significant number of incorrect solutions as correct, especially on challenging benchmark instances. While stronger models like DeepSeek-R1 show better precision than weaker ones, the issue persists. This highlights that while RL induces PRM capability, the model's self-evaluation is not perfectly reliable, particularly when the task is beyond its current solving capacity.
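The co-evolution result (Table 2) boils down to a contingency-table test: for each problem, record whether the model solved it and whether it judged the reasoning process correctly, then test whether the two outcomes are independent. A minimal sketch with SciPy, using made-up per-problem outcomes (the 2x2 layout is an assumption about how the paper tabulates results):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical per-problem outcomes: 1 = success, 0 = failure.
solved = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])  # did the model solve the problem?
judged = np.array([1, 1, 0, 1, 1, 1, 0, 0, 1, 1])  # did it judge the process correctly?

# 2x2 contingency table: rows = solved / not solved, columns = judged correctly / not.
table = np.array([
    [np.sum((solved == 1) & (judged == 1)), np.sum((solved == 1) & (judged == 0))],
    [np.sum((solved == 0) & (judged == 1)), np.sum((solved == 0) & (judged == 0))],
])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p_value:.4f}")  # a small p-value indicates the two abilities are associated
```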
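The reranking comparisons (Tables 4 and 5) contrast majority voting over sampled answers, Best-of-N with an external PRM score, and Best-of-N with the model's own score (Self-PRM); the precision issue is then a question of how often a self-accepted solution is actually correct. A minimal sketch of these selection rules and the precision measurement, assuming each sampled solution carries an extracted final answer, an external PRM score, a self-assigned score, and a ground-truth flag (all field names are hypothetical):

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    answer: str        # final answer extracted from the sampled solution
    prm_score: float   # score from an external PRM (hypothetical field)
    self_score: float  # model's own evaluation of its solution (hypothetical field)
    correct: bool      # ground-truth correctness, used only for analysis

def majority_vote(samples: List[Sample]) -> str:
    return Counter(s.answer for s in samples).most_common(1)[0][0]

def bon_external_prm(samples: List[Sample]) -> str:
    return max(samples, key=lambda s: s.prm_score).answer

def bon_self_prm(samples: List[Sample]) -> str:
    return max(samples, key=lambda s: s.self_score).answer

def self_prm_precision(samples: List[Sample], threshold: float) -> float:
    """Of the solutions Self-PRM accepts (score >= threshold), what fraction are correct?"""
    accepted = [s for s in samples if s.self_score >= threshold]
    if not accepted:
        return 0.0
    return sum(s.correct for s in accepted) / len(accepted)
```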
Practical Implementation and Applications:
The findings have significant implications for developing and deploying high-capability reasoning LLMs. A minimal Python sketch of Best-of-N reranking with Self-PRM follows; the `model.generate` interface and the evaluation prompt are illustrative assumptions, not the paper's exact setup.
```python
import re

def solve_with_self_prm(model, problem, n_samples):
    """Best-of-N with Self-PRM: sample several solutions, let the model score
    its own outputs, and return the highest-scoring one. `model.generate` is an
    assumed text-generation interface; adapt it to the inference API in use."""
    solutions = []
    scores = []
    for _ in range(n_samples):
        # Generate a solution attempt (sample with nonzero temperature so attempts differ).
        solution_attempt = model.generate(problem, max_tokens=2048)
        solutions.append(solution_attempt)

        # Prompt the model to evaluate its own solution, e.g. checking the
        # reasoning step by step and ending with a confidence score from 1 to 10.
        evaluation_prompt = (
            f"Problem: {problem}\n"
            f"Solution: {solution_attempt}\n"
            "Evaluate the solution step by step, then give a confidence score from 1 to 10.\n"
            "Evaluation:"
        )
        evaluation_response = model.generate(evaluation_prompt, max_tokens=512)

        # Parse the evaluation into a numeric score; this parsing step is crucial
        # and depends on the prompt format.
        scores.append(parse_score_from_evaluation(evaluation_response))

    # Rerank the sampled solutions by the model's own scores and keep the best one.
    best_solution, _ = max(zip(solutions, scores), key=lambda pair: pair[1])
    return best_solution

def parse_score_from_evaluation(evaluation_response):
    """Extract a numeric score (higher is better) from the model's free-text
    evaluation. Minimal heuristic: take the last number in the response;
    keyword matching or a structured output format is more robust in practice."""
    numbers = re.findall(r"\d+(?:\.\d+)?", evaluation_response)
    return float(numbers[-1]) if numbers else 0.0
```
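In practice, the quality of this reranking hinges on two choices the sketch glosses over: sampling the N attempts at a nonzero temperature so the candidates actually differ, and parsing the self-evaluation robustly (a fixed answer format or structured output is more reliable than free-text number extraction). Given the precision limitation discussed above, the top-ranked solution on hard problems should be treated as the model's best guess rather than a verified answer.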
- Improving Self-PRM: The identified limitation of low precision on hard problems suggests that future work could focus on techniques to improve the reliability of the model's self-evaluation. This might involve training signals specifically aimed at improving introspective accuracy, perhaps by incorporating some form of weak or synthetic process supervision, or simply by continuing to scale RL training to achieve better internal consistency and reward alignment.
In conclusion, the paper provides strong evidence that pure RL training is a powerful approach for developing both problem-solving and process-evaluation abilities in LLMs, potentially diminishing the perceived necessity of explicit PRM training. Self-PRM emerges as a practical method to leverage this induced capability, although its limitations on challenging tasks highlight areas for further research in improving the introspection reliability of advanced reasoning models.