- The paper introduces PRIME, a framework that uses implicit rewards for dense, token-level feedback in LLM reinforcement learning.
- It integrates online PRM updates using outcome labels, eliminating the need for explicit process supervision and mitigating reward hacking.
- Experiments demonstrate that PRIME accelerates training and improves sample efficiency while outperforming outcome-only RL on complex reasoning benchmarks.
Introduction to PRIME
Process-level reinforcement learning for LLMs seeks to leverage dense, intermediate feedback signals rather than relying solely on sparse, outcome-level rewards. While dense rewards theoretically offer advantages in sample efficiency and credit assignment, practical application has been hindered by the prohibitive cost of acquiring fine-grained process labels and the challenges of updating Process Reward Models (PRMs) online to prevent reward hacking. The PRIME (Process Reinforcement through IMplicit rEwards) framework addresses these limitations by enabling online PRM updates using only readily available outcome labels through the mechanism of implicit process rewards (2502.01456). This approach avoids the need for explicit process supervision and dedicated PRM pre-training phases.
Implicit Process Rewards and Online Updates
The core innovation of PRIME lies in its use of an Implicit Process Reward Model ($\pi_\phi$), which is trained in the same way as a standard Outcome Reward Model (ORM): it learns to predict the final outcome reward $r_o(y)$ given the entire generated sequence $y$, optimizing a loss function (e.g., cross-entropy for binary outcomes) over pairs $(y, r_o(y))$. The reward function associated with this implicit PRM is defined relative to a reference model $\pi_{\mathrm{ref}}$:

$$r_\phi(y) := \beta \log \frac{\pi_\phi(y)}{\pi_{\mathrm{ref}}(y)}$$
Although trained only on sequence-level outcome labels $r_o(y)$, the implicit PRM $\pi_\phi$ allows for the extraction of token-level implicit process rewards during the RL phase:

$$r_\phi(y_t) := \beta \log \frac{\pi_\phi(y_t \mid y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid y_{<t})}$$

Here, $r_\phi(y_t)$ represents the reward contribution of generating token $y_t$ at step $t$. This formulation provides dense, token-level feedback without requiring explicit intermediate annotations.
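As a concrete illustration, the sketch below (not from the paper's codebase; the function names, the Hugging Face-style `.logits` interface, and the value of $\beta$ are assumptions) computes these token-level implicit rewards from per-token log-probabilities of $\pi_\phi$ and $\pi_{\mathrm{ref}}$ scored with teacher forcing:

```python
import torch

def implicit_token_rewards(logprobs_phi: torch.Tensor,
                           logprobs_ref: torch.Tensor,
                           beta: float = 0.05) -> torch.Tensor:
    """Token-level implicit process rewards r_phi(y_t).

    Both inputs have shape (batch, seq_len - 1) and hold log pi(y_t | y_<t)
    for the sampled tokens under the implicit PRM and the frozen reference.
    """
    # r_phi(y_t) = beta * [log pi_phi(y_t | y_<t) - log pi_ref(y_t | y_<t)]
    return beta * (logprobs_phi - logprobs_ref)


def per_token_logprobs(model, input_ids: torch.Tensor,
                       attention_mask: torch.Tensor) -> torch.Tensor:
    """Log-probabilities of the observed tokens under a causal LM."""
    with torch.no_grad():
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    # Shift so position t scores token t+1, as in standard causal LM training.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    return logprobs.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
```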
Crucially, because the training objective for $\pi_\phi$ relies only on outcome labels $r_o(y)$ (obtainable from the environment, rule-based verifiers, or a static ORM during RL), the implicit PRM $\pi_\phi$ can be updated online alongside the policy $\pi_\theta$. This online update process involves collecting rollouts $(y, r_o(y))$ from the current policy $\pi_\theta$ and using them to fine-tune $\pi_\phi$. This continuous adaptation of the reward model helps mitigate distribution shift between the policy and the reward model, thereby reducing the risk of reward hacking, a common issue when using static reward models.
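A minimal sketch of one such online update step is shown below, assuming a binary cross-entropy objective that uses the sequence-level implicit reward $r_\phi(y)$ as the logit for the outcome label; the model interfaces, masking details, and hyperparameters are illustrative rather than taken from the paper:

```python
import torch
import torch.nn.functional as F

def prm_update_step(pi_phi, pi_ref, optimizer, input_ids, attention_mask,
                    response_mask, outcome_labels, beta: float = 0.05) -> float:
    """One online update of the implicit PRM pi_phi on fresh policy rollouts.

    outcome_labels: (batch,) tensor of r_o(y) in {0, 1} from a verifier.
    response_mask:  (batch, seq_len - 1) mask selecting response tokens only.
    """
    # Per-token log-probs of the sampled tokens under pi_phi (trainable).
    logits = pi_phi(input_ids=input_ids, attention_mask=attention_mask).logits
    logprobs_phi = torch.log_softmax(logits[:, :-1], dim=-1).gather(
        -1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

    # Reference log-probs stay frozen.
    with torch.no_grad():
        ref_logits = pi_ref(input_ids=input_ids, attention_mask=attention_mask).logits
        logprobs_ref = torch.log_softmax(ref_logits[:, :-1], dim=-1).gather(
            -1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

    # Sequence-level implicit reward r_phi(y), used as the logit for r_o(y).
    seq_reward = beta * ((logprobs_phi - logprobs_ref) * response_mask).sum(dim=-1)

    # Cross-entropy against the outcome label.
    loss = F.binary_cross_entropy_with_logits(seq_reward, outcome_labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```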
Advantage Estimation and Policy Optimization
PRIME integrates both the dense implicit process rewards $r_\phi(y_t)$ and the sparse outcome reward $r_o(y)$ into the advantage estimation. The framework is compatible with various advantage estimators; the paper finds particular success with Monte-Carlo estimators using a leave-one-out baseline (RLOO), which builds the baseline for sample $i$ from the average return of the other $K-1$ samples generated for the same prompt. The advantage for sample $i$ at timestep $t$ sums the advantages derived from the process and outcome rewards:

$$A_t^i = \left(\sum_{k=t}^{T} \gamma^{k-t}\, r_\phi(y_k^i) - \mathrm{Baseline}_{\mathrm{RLOO}}(r_\phi, t)\right) + \left(\gamma^{T-t}\, r_o(y^i) - \mathrm{Baseline}_{\mathrm{RLOO}}(r_o)\right)$$

where $\gamma$ is the discount factor. The policy $\pi_\theta$ is then updated using a policy gradient method, such as the PPO clipped surrogate objective:

$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(\rho_t(\theta)\, A_t,\ \mathrm{clip}\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) A_t\right)\right]$$

where $\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$ is the probability ratio and $A_t$ is the estimated advantage.
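The sketch below ties these two steps together under simplifying assumptions ($\gamma = 1$, $K$ rollouts of a single prompt, per-token implicit rewards already computed, and a leave-one-out baseline built from whole-sequence returns rather than the timestep-dependent baseline in the formula above); names such as `prime_advantages` are hypothetical:

```python
import torch

def rloo_baseline(values: torch.Tensor) -> torch.Tensor:
    """Leave-one-out mean over the K samples of one prompt. values: (K,)."""
    k = values.shape[0]
    return (values.sum() - values) / (k - 1)


def prime_advantages(proc_rewards: torch.Tensor, outcome_rewards: torch.Tensor,
                     mask: torch.Tensor) -> torch.Tensor:
    """Combine dense implicit rewards and the sparse outcome reward (gamma = 1).

    proc_rewards:    (K, T) token-level implicit rewards r_phi(y_t) per rollout.
    outcome_rewards: (K,)   outcome rewards r_o(y) from a verifier.
    mask:            (K, T) 1 for response tokens, 0 for padding.
    """
    # Return-to-go of the process rewards: sum_{k >= t} r_phi(y_k).
    proc_returns = (proc_rewards * mask).flip(-1).cumsum(-1).flip(-1)
    # Leave-one-out baselines, one per reward type, across the K rollouts.
    proc_baseline = rloo_baseline((proc_rewards * mask).sum(-1))   # (K,)
    outcome_baseline = rloo_baseline(outcome_rewards)              # (K,)
    advantages = (proc_returns - proc_baseline.unsqueeze(-1)
                  + (outcome_rewards - outcome_baseline).unsqueeze(-1))
    return advantages * mask


def ppo_clip_loss(logprobs_new: torch.Tensor, logprobs_old: torch.Tensor,
                  advantages: torch.Tensor, mask: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    """PPO clipped surrogate objective averaged over response tokens."""
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Maximize the surrogate, i.e. minimize its negative.
    return -(torch.min(unclipped, clipped) * mask).sum() / mask.sum()
```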
Implementation Considerations and Efficiency
A significant practical advantage of PRIME is its reduced overhead compared to traditional PRM approaches. It eliminates the need for a separate, potentially costly PRM pre-training phase. Instead, the implicit PRM $\pi_\phi$ and the reference model $\pi_{\mathrm{ref}}$ can be initialized directly from the Supervised Fine-Tuned (SFT) model used as the starting point for RL. Experiments show this initialization strategy is effective, potentially benefiting from reduced initial distribution shift between the policy and reward models compared to using independently trained reward models (2502.01456).
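A minimal sketch of this initialization, assuming the Hugging Face Transformers API and a hypothetical SFT checkpoint path: the policy, the implicit PRM, and the frozen reference all start from the same weights.

```python
from transformers import AutoModelForCausalLM

SFT_CKPT = "path/to/sft-checkpoint"  # hypothetical checkpoint path

policy = AutoModelForCausalLM.from_pretrained(SFT_CKPT)        # pi_theta, trained by RL
implicit_prm = AutoModelForCausalLM.from_pretrained(SFT_CKPT)  # pi_phi, updated online
reference = AutoModelForCausalLM.from_pretrained(SFT_CKPT)     # pi_ref, kept frozen
reference.requires_grad_(False)
```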
To further enhance training efficiency and stability, PRIME incorporates Online Prompt Filtering. This technique dynamically selects prompts for policy rollouts and PRM updates based on the policy's current performance (e.g., measured by outcome reward). It aims to focus training on prompts within an appropriate difficulty range and maintain a balanced dataset for the implicit PRM updates, preventing the reward model from being skewed by excessively easy or hard examples.
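One plausible way to implement such a filter (the thresholds and names here are illustrative, not taken from the paper) is to keep only prompts whose rollout accuracy under the current policy falls inside a target band:

```python
from typing import Dict, List

def filter_prompts(prompt_accuracies: Dict[str, float],
                   low: float = 0.2, high: float = 0.8) -> List[str]:
    """Keep prompts whose current rollout accuracy lies in [low, high].

    prompt_accuracies maps each prompt to the fraction of its K rollouts
    that received a positive outcome reward under the current policy.
    """
    return [p for p, acc in prompt_accuracies.items() if low <= acc <= high]
```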
Experimental Validation
Experiments were conducted primarily using the Qwen2.5-Math-7B-Base model, fine-tuned with PRIME on mathematical reasoning (AIME, AMC, MATH, MinervaMath, OlympiadBench) and coding (LeetCode, LiveCodeBench) tasks. Outcome rewards were provided by rule-based verifiers.
- Performance Gains: Starting from a lightly SFT-warmed model (Eurus-2-7B-SFT), the PRIME-trained model (Eurus-2-7B-PRIME) achieved substantial improvements, averaging 15.1% over the SFT baseline across key benchmarks. Gains were particularly pronounced on competition math datasets such as AMC (+20.3%) and AIME (+27.6%) (2502.01456). Notably, Eurus-2-7B-PRIME surpassed Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks while using approximately 10% of its training data.
- Comparison to Outcome-Only RL: Compared to RLOO using only outcome rewards (OV Only), PRIME demonstrated superior sample efficiency, reaching comparable training reward levels approximately 2.5x faster. It also achieved higher final training rewards (+6.9%) and consistently better generalization performance on test benchmarks (2502.01456).
- Ablation Studies:
  - Online vs. Offline PRM: Ablations confirmed the critical importance of online updates for the implicit PRM $\pi_\phi$. Using a static (offline) PRM led to significant performance degradation, attributed to reward hacking arising from the distribution shift between the evolving policy and the fixed reward model (2502.01456). Online updates effectively mitigated this issue.
  - Initialization: Initializing $\pi_\phi$ from the SFT model proved as effective as, if not slightly better than, using a separately pre-trained PRM, validating the cost-saving initialization strategy (2502.01456).
  - Algorithm Generality: PRIME successfully enhanced the performance and sample efficiency of various underlying RL algorithms, including REINFORCE, GRPO, and PPO, highlighting its broad applicability (2502.01456).
  - Implicit Reward vs. Value Function: Using the implicit PRM to provide dense rewards (for return calculation) was found to be more effective than using it as a baseline (value function approximator) within the advantage estimation (2502.01456).
- "Zero" RL: Experiments initializing RL directly from the base LLM (without SFT) showed feasibility, particularly for larger models (32B), suggesting potential for further reducing pre-training requirements, although convergence characteristics may differ (2502.01456).
Conclusion
The PRIME framework provides a practical and effective method for incorporating dense reward signals into the online reinforcement learning of LLMs for complex reasoning tasks. By leveraging implicit process rewards derived from an outcome-label-trained implicit PRM that permits online updates, PRIME overcomes key limitations related to annotation cost and reward model staleness. It demonstrably improves sample efficiency and final task performance compared to outcome-only RL methods and offers significant savings in development overhead by obviating the need for explicit process labels and dedicated PRM pre-training.