Process Reinforcement through Implicit Rewards (2502.01456v1)

Published 3 Feb 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of LLMs, particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.

Summary

  • The paper introduces PRIME, a framework that uses implicit rewards for dense, token-level feedback in LLM reinforcement learning.
  • It integrates online PRM updates using outcome labels, eliminating the need for explicit process supervision and mitigating reward hacking.
  • Experiments demonstrate that PRIME accelerates training and improves sample efficiency while outperforming outcome-only RL on complex reasoning benchmarks.

Introduction to PRIME

Process-level reinforcement learning for LLMs seeks to leverage dense, intermediate feedback signals rather than relying solely on sparse, outcome-level rewards. While dense rewards theoretically offer advantages in sample efficiency and credit assignment, practical application has been hindered by the prohibitive cost of acquiring fine-grained process labels and the challenges of updating Process Reward Models (PRMs) online to prevent reward hacking. The PRIME (Process Reinforcement through IMplicit rEwards) framework addresses these limitations by enabling online PRM updates using only readily available outcome labels through the mechanism of implicit process rewards (2502.01456). This approach avoids the need for explicit process supervision and dedicated PRM pre-training phases.

Implicit Process Rewards and Online Updates

The core innovation of PRIME lies in its use of an Implicit Process Reward Model $\pi_\phi$, which is trained similarly to a standard Outcome Reward Model (ORM). This model is trained to predict the final outcome reward $r_o(\mathbf{y})$ given the entire generated sequence $\mathbf{y}$. Specifically, it optimizes a loss function (e.g., cross-entropy for binary outcomes) based on pairs $(\mathbf{y}, r_o(\mathbf{y}))$. The reward function associated with this implicit PRM is defined relative to a reference model $\pi_{\text{ref}}$:

$$r_\phi(\mathbf{y}) := \beta \log\left(\frac{\pi_\phi(\mathbf{y})}{\pi_{\text{ref}}(\mathbf{y})}\right)$$

Although trained only on sequence-level outcome labels $r_o(\mathbf{y})$, the implicit PRM $\pi_\phi$ allows for the extraction of token-level implicit process rewards during the RL phase:

$$r_\phi(y_t) := \beta \log\left(\frac{\pi_\phi(y_t \mid \mathbf{y}_{<t})}{\pi_{\text{ref}}(y_t \mid \mathbf{y}_{<t})}\right)$$

Here, $r_\phi(y_t)$ represents the reward contribution of generating token $y_t$ at step $t$. This formulation provides dense, token-level feedback without requiring explicit intermediate annotations.
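
For concreteness, the dense reward above can be computed directly from the per-token log-probabilities of the implicit PRM and the reference model on a sampled response. The following PyTorch-style sketch is illustrative only; the tensor layout and the default value of $\beta$ are assumptions, not details of the paper's released implementation.

```python
import torch

def implicit_process_rewards(prm_logprobs: torch.Tensor,
                             ref_logprobs: torch.Tensor,
                             beta: float = 0.05) -> torch.Tensor:
    """Token-level implicit process rewards r_phi(y_t).

    prm_logprobs, ref_logprobs: (seq_len,) tensors holding
    log pi_phi(y_t | y_<t) and log pi_ref(y_t | y_<t) for one rollout.
    beta is the scaling coefficient from the reward definition; the
    default here is purely illustrative.
    """
    # r_phi(y_t) = beta * (log pi_phi(y_t | y_<t) - log pi_ref(y_t | y_<t))
    return beta * (prm_logprobs - ref_logprobs)
```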

Crucially, because the training objective for $\pi_\phi$ relies only on outcome labels $r_o(\mathbf{y})$ (which can be obtained from the environment, rule-based verifiers, or a static ORM during RL), the implicit PRM $\pi_\phi$ can be updated online alongside the policy $\pi_\theta$. This online update process involves collecting rollouts $(\mathbf{y}, r_o(\mathbf{y}))$ from the current policy $\pi_\theta$ and using them to fine-tune $\pi_\phi$. This continuous adaptation of the reward model helps mitigate distribution shift between the policy and the reward model, thereby reducing the risk of reward hacking, a common issue when using static reward models.
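
Because the online PRM update needs only outcome labels, it reduces to an ORM-style objective over fresh policy rollouts. A minimal sketch, assuming the binary cross-entropy formulation mentioned above; the function name, shapes, and $\beta$ are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def implicit_prm_loss(prm_logprobs: torch.Tensor,
                      ref_logprobs: torch.Tensor,
                      outcome_label: float,
                      beta: float = 0.05) -> torch.Tensor:
    """Online update loss for the implicit PRM on a single rollout.

    prm_logprobs, ref_logprobs: (seq_len,) per-token log-probs of the rollout
    under pi_phi and pi_ref. outcome_label: 1.0 if a rule-based verifier marks
    the rollout correct, else 0.0.
    """
    # Sequence-level implicit reward r_phi(y) = beta * log(pi_phi(y) / pi_ref(y)),
    # treated as a logit for the binary outcome label.
    seq_reward = beta * (prm_logprobs.sum() - ref_logprobs.sum())
    target = torch.tensor(outcome_label, dtype=seq_reward.dtype)
    return F.binary_cross_entropy_with_logits(seq_reward, target)
```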

Advantage Estimation and Policy Optimization

PRIME integrates both the dense implicit process rewards $r_\phi(y_t)$ and the sparse outcome reward $r_o(\mathbf{y})$ into the advantage estimation. The framework is compatible with various advantage estimators. The paper finds particular success using Monte Carlo estimation with a leave-one-out baseline (RLOO), which computes the baseline for sample $i$ as the average return of the other $K-1$ samples in the batch. The advantage for sample $i$ at timestep $t$ is calculated by summing the advantages derived from the process and outcome rewards:

$$A^i_t = \left( \sum_{k=t}^{T} \gamma^{k-t}\, r_\phi(y_k^i) - \text{Baseline}_{\text{RLOO}}(r_\phi, t) \right) + \left( \gamma^{T-t}\, r_o(\mathbf{y}^i) - \text{Baseline}_{\text{RLOO}}(r_o) \right)$$

where $\gamma$ is the discount factor. The policy $\pi_\theta$ is then updated using a policy gradient method, such as the PPO clipped surrogate objective:

$$\mathcal{L}_{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( \rho_t(\theta)\, A_t,\ \text{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon)\, A_t \right) \right]$$

where $\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio and $A_t$ is the estimated advantage.
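
The advantage combination and clipped update can be sketched as follows for a group of $K$ rollouts of the same prompt. This is a hedged reconstruction of the two equations above, not the released implementation: the leave-one-out baselines and function names are assumptions, and the $\gamma^{T-t}$ factor on the outcome term is dropped (exact when $\gamma = 1$).

```python
import torch

def prime_advantages(process_rewards: torch.Tensor,
                     outcome_rewards: torch.Tensor,
                     gamma: float = 1.0) -> torch.Tensor:
    """RLOO-style advantages combining dense and outcome rewards.

    process_rewards: (K, T) token-level implicit rewards for K rollouts of
    the same prompt; outcome_rewards: (K,) verifier scores.
    Returns an advantage tensor of shape (K, T).
    """
    K, T = process_rewards.shape

    # Discounted return-to-go of the dense rewards for each rollout.
    returns = torch.zeros_like(process_rewards)
    running = torch.zeros(K)
    for t in reversed(range(T)):
        running = process_rewards[:, t] + gamma * running
        returns[:, t] = running

    def loo_mean(x: torch.Tensor) -> torch.Tensor:
        # Leave-one-out baseline: mean over the other K - 1 rollouts.
        return (x.sum(dim=0, keepdim=True) - x) / (K - 1)

    process_adv = returns - loo_mean(returns)
    # Sequence-level outcome advantage, broadcast over timesteps.
    outcome_adv = outcome_rewards - loo_mean(outcome_rewards)
    return process_adv + outcome_adv.unsqueeze(1)

def ppo_clip_loss(logprobs: torch.Tensor,
                  old_logprobs: torch.Tensor,
                  advantages: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    """Standard PPO clipped surrogate objective averaged over tokens."""
    ratio = torch.exp(logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```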

Implementation Considerations and Efficiency

A significant practical advantage of PRIME is its reduced overhead compared to traditional PRM approaches. It eliminates the need for a separate, potentially costly, PRM pre-training phase. Instead, the implicit PRM $\pi_\phi$ and the reference model $\pi_{\text{ref}}$ can be initialized directly from the Supervised Fine-Tuned (SFT) model used as the starting point for RL. Experiments show this initialization strategy is effective, potentially benefiting from reduced initial distribution shift between the policy and reward models compared to using independently trained reward models (2502.01456).

To further enhance training efficiency and stability, PRIME incorporates Online Prompt Filtering. This technique dynamically selects prompts for policy rollouts and PRM updates based on the policy's current performance (e.g., measured by outcome reward). It aims to focus training on prompts within an appropriate difficulty range and maintain a balanced dataset for the implicit PRM updates, preventing the reward model from being skewed by excessively easy or hard examples.
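
A toy illustration of such a filter, assuming accuracy-band thresholds; the specific bounds and bookkeeping are illustrative, not the paper's exact configuration.

```python
def filter_prompts(prompt_accuracies: dict[str, float],
                   low: float = 0.2,
                   high: float = 0.8) -> list[str]:
    """Keep prompts whose rollout accuracy under the current policy lies
    inside a target difficulty band, discarding prompts the policy already
    solves (or fails) almost always."""
    return [p for p, acc in prompt_accuracies.items() if low <= acc <= high]

# Example: prompts solved in 0% or 100% of rollouts are filtered out.
accs = {"p1": 0.0, "p2": 0.5, "p3": 1.0}
print(filter_prompts(accs))  # ['p2']
```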

Experimental Validation

Experiments were conducted primarily using the Qwen2.5-Math-7B-Base model, fine-tuned with PRIME on mathematical reasoning (AIME, AMC, MATH, MinervaMath, OlympiadBench) and coding (LeetCode, LiveCodeBench) tasks. Outcome rewards were provided by rule-based verifiers.

  • Performance Gains: Starting from a lightly SFT-warmed model (Eurus-2-7B-SFT), the PRIME-trained model (Eurus-2-7B-PRIME) achieved substantial improvements, averaging 15.1% across key benchmarks over the SFT baseline. Gains were particularly pronounced on competitive math datasets like AMC (+20.3%) and AIME (+27.6%) (2502.01456). Notably, Eurus-2-7B-PRIME surpassed the larger Qwen2.5-Math-7B-Instruct model on seven reasoning benchmarks while using approximately 10% of its training data.
  • Comparison to Outcome-Only RL: Compared to RLOO using only outcome rewards (OV Only), PRIME demonstrated superior sample efficiency, reaching comparable training reward levels approximately 2.5x faster. It also achieved higher final training rewards (+6.9%) and consistently better generalization performance on test benchmarks (2502.01456).
  • Ablation Studies:
    • Online vs. Offline PRM: Ablations confirmed the critical importance of online updates for the implicit PRM $\pi_\phi$. Using a static (offline) PRM led to significant performance degradation, attributed to reward hacking arising from the distribution shift between the evolving policy and the fixed reward model (2502.01456). Online updates effectively mitigated this issue.
    • Initialization: Initializing $\pi_\phi$ from the SFT model proved as effective, if not slightly better, than using a separately pre-trained PRM, validating the cost-saving initialization strategy (2502.01456).
    • Algorithm Generality: PRIME successfully enhanced the performance and sample efficiency of various underlying RL algorithms, including REINFORCE, GRPO, and PPO, highlighting its broad applicability (2502.01456).
    • Implicit Reward vs. Value Function: Using the implicit PRM to provide dense rewards (for return calculation) was found to be more effective than using it as a baseline (value function approximator) within the advantage estimation (2502.01456).
    • "Zero" RL: Experiments initializing RL directly from the base LLM (without SFT) showed feasibility, particularly for larger models (32B), suggesting potential for further reducing pre-training requirements, although convergence characteristics may differ (2502.01456).

Conclusion

The PRIME framework provides a practical and effective method for incorporating dense reward signals into the online reinforcement learning of LLMs for complex reasoning tasks. By leveraging implicit process rewards derived from an outcome-label-trained implicit PRM that permits online updates, PRIME overcomes key limitations related to annotation cost and reward model staleness. It demonstrably improves sample efficiency and final task performance compared to outcome-only RL methods and offers significant savings in development overhead by obviating the need for explicit process labels and dedicated PRM pre-training.
