- The paper introduces PRIME, a framework that uses implicit rewards for dense, token-level feedback in LLM reinforcement learning.
- It integrates online PRM updates using outcome labels, eliminating the need for explicit process supervision and mitigating reward hacking.
- Experiments demonstrate that PRIME accelerates training and improves sample efficiency while outperforming outcome-only RL on complex reasoning benchmarks.
Introduction to PRIME
Process-level reinforcement learning for LLMs seeks to leverage dense, intermediate feedback signals rather than relying solely on sparse, outcome-level rewards. While dense rewards theoretically offer advantages in sample efficiency and credit assignment, practical application has been hindered by the prohibitive cost of acquiring fine-grained process labels and the challenges of updating Process Reward Models (PRMs) online to prevent reward hacking. The PRIME (Process Reinforcement through IMplicit rEwards) framework addresses these limitations by enabling online PRM updates using only readily available outcome labels through the mechanism of implicit process rewards (2502.01456). This approach avoids the need for explicit process supervision and dedicated PRM pre-training phases.
Implicit Process Rewards and Online Updates
The core innovation of PRIME lies in its use of an Implicit Process Reward Model ($\pi_\phi$), which is trained in the same way as a standard Outcome Reward Model (ORM): it learns to predict the final outcome reward $r_o(y)$ given the entire generated sequence $y$, optimizing a loss function (e.g., cross-entropy for binary outcomes) over pairs $(y, r_o(y))$. The reward function associated with this implicit PRM is defined relative to a reference model $\pi_{\mathrm{ref}}$:

$$r_\phi(y) := \beta \log \frac{\pi_\phi(y)}{\pi_{\mathrm{ref}}(y)}$$
Although trained only on sequence-level outcome labels $r_o(y)$, the implicit PRM $\pi_\phi$ allows for the extraction of token-level implicit process rewards during the RL phase:

$$r_\phi(y_t) := \beta \log \frac{\pi_\phi(y_t \mid y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid y_{<t})}$$

Here, $r_\phi(y_t)$ represents the reward contribution of generating token $y_t$ at step $t$. This formulation provides dense, token-level feedback without requiring explicit intermediate annotations.
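As a concrete illustration, the sketch below (not from the paper's codebase; the function names, the Hugging Face-style `.logits` interface, and the value of $\beta$ are assumptions) computes these token-level implicit rewards from per-token log-probabilities of $\pi_\phi$ and $\pi_{\mathrm{ref}}$ scored with teacher forcing:

```python
import torch

def implicit_token_rewards(logprobs_phi: torch.Tensor,
                           logprobs_ref: torch.Tensor,
                           beta: float = 0.05) -> torch.Tensor:
    """Token-level implicit process rewards r_phi(y_t).

    Both inputs have shape (batch, seq_len - 1) and hold log pi(y_t | y_<t)
    for the sampled tokens under the implicit PRM and the frozen reference.
    """
    # r_phi(y_t) = beta * [log pi_phi(y_t | y_<t) - log pi_ref(y_t | y_<t)]
    return beta * (logprobs_phi - logprobs_ref)


def per_token_logprobs(model, input_ids: torch.Tensor,
                       attention_mask: torch.Tensor) -> torch.Tensor:
    """Log-probabilities of the observed tokens under a causal LM."""
    with torch.no_grad():
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    # Shift so position t scores token t+1, as in standard causal LM training.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    return logprobs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
```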
Crucially, because the training objective for $\pi_\phi$ relies only on outcome labels $r_o(y)$ (obtainable from the environment, rule-based verifiers, or a static ORM during RL), the implicit PRM $\pi_\phi$ can be updated online alongside the policy $\pi_\theta$. This online update process involves collecting rollouts $(y, r_o(y))$ from the current policy $\pi_\theta$ and using them to fine-tune $\pi_\phi$. This continuous adaptation of the reward model helps mitigate distribution shift between the policy and the reward model, thereby reducing the risk of reward hacking, a common issue when using static reward models.
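A minimal sketch of one such online update step is shown below, assuming a binary cross-entropy objective that uses the sequence-level implicit reward $r_\phi(y)$ as the logit for the outcome label; the model interfaces, masking details, and hyperparameters are illustrative rather than taken from the paper:

```python
import torch
import torch.nn.functional as F

def prm_update_step(pi_phi, pi_ref, optimizer, input_ids, attention_mask,
                    response_mask, outcome_labels, beta: float = 0.05) -> float:
    """One online update of the implicit PRM pi_phi on fresh policy rollouts.

    outcome_labels: (batch,) tensor of r_o(y) in {0, 1} from a verifier.
    response_mask:  (batch, seq_len - 1) mask selecting response tokens only.
    """
    # Per-token log-probs of the sampled tokens under pi_phi (trainable).
    logits = pi_phi(input_ids=input_ids, attention_mask=attention_mask).logits
    logprobs_phi = torch.log_softmax(logits[:, :-1], dim=-1).gather(
        -1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

    # Reference log-probs stay frozen.
    with torch.no_grad():
        ref_logits = pi_ref(input_ids=input_ids, attention_mask=attention_mask).logits
        logprobs_ref = torch.log_softmax(ref_logits[:, :-1], dim=-1).gather(
            -1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

    # Sequence-level implicit reward r_phi(y), used as the logit for r_o(y).
    seq_reward = beta * ((logprobs_phi - logprobs_ref) * response_mask).sum(dim=-1)

    # Cross-entropy against the outcome label.
    loss = F.binary_cross_entropy_with_logits(seq_reward, outcome_labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```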
Advantage Estimation and Policy Optimization
PRIME integrates both the dense implicit process rewards $r_\phi(y_t)$ and the sparse outcome reward $r_o(y)$ into the advantage estimation. The framework is compatible with various advantage estimators; the paper finds particular success with Monte-Carlo estimators using a leave-one-out baseline (RLOO), which builds the baseline for sample $i$ from the average return of the other $K-1$ samples generated for the same prompt. The advantage for sample $i$ at timestep $t$ sums the advantages derived from the process and outcome rewards:

$$A_t^i = \left(\sum_{k=t}^{T} \gamma^{k-t}\, r_\phi(y_k^i) - \mathrm{Baseline}_{\mathrm{RLOO}}(r_\phi, t)\right) + \left(\gamma^{T-t}\, r_o(y^i) - \mathrm{Baseline}_{\mathrm{RLOO}}(r_o)\right)$$

where $\gamma$ is the discount factor. The policy $\pi_\theta$ is then updated using a policy gradient method, such as the PPO clipped surrogate objective:

$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(\rho_t(\theta)\, A_t,\ \mathrm{clip}\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t\right)\right]$$

where $\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$ is the probability ratio and $A_t$ is the estimated advantage.
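The sketch below ties these two steps together under simplifying assumptions ($\gamma = 1$, $K$ rollouts of a single prompt, per-token implicit rewards already computed, and a leave-one-out baseline built from whole-sequence returns rather than the timestep-dependent baseline in the formula above); names such as `prime_advantages` are hypothetical:

```python
import torch

def rloo_baseline(values: torch.Tensor) -> torch.Tensor:
    """Leave-one-out mean over the K samples of one prompt. values: (K,)."""
    k = values.shape[0]
    return (values.sum() - values) / (k - 1)


def prime_advantages(proc_rewards: torch.Tensor, outcome_rewards: torch.Tensor,
                     mask: torch.Tensor) -> torch.Tensor:
    """Combine dense implicit rewards and the sparse outcome reward (gamma = 1).

    proc_rewards:    (K, T) token-level implicit rewards r_phi(y_t) per rollout.
    outcome_rewards: (K,)   outcome rewards r_o(y) from a verifier.
    mask:            (K, T) 1 for response tokens, 0 for padding.
    """
    # Return-to-go of the process rewards: sum_{k >= t} r_phi(y_k).
    proc_returns = (proc_rewards * mask).flip(-1).cumsum(-1).flip(-1)
    # Leave-one-out baselines, one per reward type, across the K rollouts.
    proc_baseline = rloo_baseline((proc_rewards * mask).sum(-1))   # (K,)
    outcome_baseline = rloo_baseline(outcome_rewards)              # (K,)
    advantages = (proc_returns - proc_baseline.unsqueeze(-1)
                  + (outcome_rewards - outcome_baseline).unsqueeze(-1))
    return advantages * mask


def ppo_clip_loss(logprobs_new: torch.Tensor, logprobs_old: torch.Tensor,
                  advantages: torch.Tensor, mask: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    """PPO clipped surrogate objective averaged over response tokens."""
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Maximize the surrogate, i.e. minimize its negative.
    return -(torch.min(unclipped, clipped) * mask).sum() / mask.sum()
```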
Implementation Considerations and Efficiency
A significant practical advantage of PRIME is its reduced overhead compared to traditional PRM approaches. It eliminates the need for a separate, potentially costly PRM pre-training phase. Instead, the implicit PRM $\pi_\phi$ and the reference model $\pi_{\mathrm{ref}}$ can be initialized directly from the Supervised Fine-Tuned (SFT) model used as the starting point for RL. Experiments show this initialization strategy is effective, potentially benefiting from reduced initial distribution shift between the policy and reward models compared to using independently trained reward models (2502.01456).
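A minimal sketch of this initialization, assuming the Hugging Face Transformers API and a hypothetical SFT checkpoint path: the policy, the implicit PRM, and the frozen reference all start from the same weights.

```python
from transformers import AutoModelForCausalLM

SFT_CKPT = "path/to/sft-checkpoint"  # hypothetical checkpoint path

policy = AutoModelForCausalLM.from_pretrained(SFT_CKPT)        # pi_theta, trained by RL
implicit_prm = AutoModelForCausalLM.from_pretrained(SFT_CKPT)  # pi_phi, updated online
reference = AutoModelForCausalLM.from_pretrained(SFT_CKPT)     # pi_ref, kept frozen
reference.requires_grad_(False)
```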
To further enhance training efficiency and stability, PRIME incorporates Online Prompt Filtering. This technique dynamically selects prompts for policy rollouts and PRM updates based on the policy's current performance (e.g., measured by outcome reward). It aims to focus training on prompts within an appropriate difficulty range and maintain a balanced dataset for the implicit PRM updates, preventing the reward model from being skewed by excessively easy or hard examples.
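One plausible way to implement such a filter (the thresholds and names here are illustrative, not taken from the paper) is to keep only prompts whose rollout accuracy under the current policy falls inside a target band:

```python
from typing import Dict, List

def filter_prompts(prompt_accuracies: Dict[str, float],
                   low: float = 0.2, high: float = 0.8) -> List[str]:
    """Keep prompts whose current rollout accuracy lies in [low, high].

    prompt_accuracies maps each prompt to the fraction of its K rollouts
    that received a positive outcome reward under the current policy.
    """
    return [p for p, acc in prompt_accuracies.items() if low <= acc <= high]
```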
Experimental Validation
Experiments were conducted primarily using the Qwen2.5-Math-7B-Base model, fine-tuned with PRIME on mathematical reasoning (AIME, AMC, MATH, MinervaMath, OlympiadBench) and coding (LeetCode, LiveCodeBench) tasks. Outcome rewards were provided by rule-based verifiers.
- Performance Gains: Starting from a lightly SFT-warmed model (Eurus-2-7B-SFT), the PRIME-trained model (Eurus-2-7B-PRIME) achieved substantial improvements, averaging 15.1% over the SFT baseline across key benchmarks. Gains were particularly pronounced on competition math datasets such as AMC (+20.3%) and AIME (+27.6%) (2502.01456). Notably, Eurus-2-7B-PRIME surpassed Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks while using approximately 10% of its training data.
- Comparison to Outcome-Only RL: Compared to RLOO using only outcome rewards (OV Only), PRIME demonstrated superior sample efficiency, reaching comparable training reward levels approximately 2.5x faster. It also achieved higher final training rewards (+6.9%) and consistently better generalization performance on test benchmarks (2502.01456).
- Ablation Studies:
  - Online vs. Offline PRM: Ablations confirmed the critical importance of online updates for the implicit PRM $\pi_\phi$. Using a static (offline) PRM led to significant performance degradation, attributed to reward hacking arising from the distribution shift between the evolving policy and the fixed reward model (2502.01456). Online updates effectively mitigated this issue.
  - Initialization: Initializing $\pi_\phi$ from the SFT model proved as effective as, if not slightly better than, using a separately pre-trained PRM, validating the cost-saving initialization strategy (2502.01456).
  - Algorithm Generality: PRIME successfully enhanced the performance and sample efficiency of various underlying RL algorithms, including REINFORCE, GRPO, and PPO, highlighting its broad applicability (2502.01456).
  - Implicit Reward vs. Value Function: Using the implicit PRM to provide dense rewards (for return calculation) was found to be more effective than using it as a baseline (value function approximator) within the advantage estimation (2502.01456).
- "Zero" RL: Experiments initializing RL directly from the base LLM (without SFT) showed feasibility, particularly for larger models (32B), suggesting potential for further reducing pre-training requirements, although convergence characteristics may differ (2502.01456).
Conclusion
The PRIME framework provides a practical and effective method for incorporating dense reward signals into the online reinforcement learning of LLMs for complex reasoning tasks. By leveraging implicit process rewards derived from an outcome-label-trained implicit PRM that permits online updates, PRIME overcomes key limitations related to annotation cost and reward model staleness. It demonstrably improves sample efficiency and final task performance compared to outcome-only RL methods and offers significant savings in development overhead by obviating the need for explicit process labels and dedicated PRM pre-training.