
On-Policy RL with Optimal Reward Baseline (2505.23585v2)

Published 29 May 2025 in cs.LG and cs.CL

Abstract: Reinforcement learning algorithms are fundamental to align LLMs with human preferences and to enhance their reasoning capabilities. However, current reinforcement learning algorithms often suffer from training instability due to loose on-policy constraints and computational inefficiency due to auxiliary models. In this work, we propose On-Policy RL with Optimal reward baseline (OPO), a novel and simplified reinforcement learning algorithm designed to address these challenges. OPO emphasizes the importance of exact on-policy training, which empirically stabilizes the training process and enhances exploration. Moreover, OPO integrates a practically feasible formulation of the optimal reward baseline that minimizes gradient variance. We evaluate OPO on mathematical reasoning benchmarks. The results demonstrate its superior performance and training stability without additional models or regularization terms. Furthermore, OPO achieves lower policy shifts and higher output entropy, encouraging more diverse and less repetitive responses. These results highlight OPO as a promising direction for stable and effective reinforcement learning in LLM alignment and reasoning tasks. The implementation is merged into the verl library at https://verl.readthedocs.io/en/latest/algo/opo.html.

Summary

  • The paper introduces OPO, which enforces exact on-policy training to mitigate policy shifts and enhance exploration.
  • It derives an optimal reward baseline that minimizes gradient variance, eliminating the need for auxiliary value models.
  • Experimental results show improved pass@k performance and stable training on LLM reasoning tasks.

The paper "On-Policy RL with Optimal Reward Baseline" (2505.23585) introduces OPO (On-Policy RL with Optimal reward baseline), a simplified reinforcement learning algorithm designed to address training instability and computational inefficiency in methods like PPO when applied to LLMs, particularly in alignment and reasoning tasks. OPO proposes two key ideas: enforcing exact on-policy training and using a theoretically derived optimal reward baseline to minimize gradient variance.

The authors argue that current RL algorithms, often using loose on-policy or off-policy updates (e.g., reusing data for multiple steps as in PPO), can suffer from unstable training, large policy shifts, and reduced sample diversity. Additionally, methods like PPO require training an auxiliary value model, adding computational overhead. While GRPO (2402.03300) removes the value model by using group-wise reward normalization, it can still be prone to instability due to loose on-policy constraints. OPO aims to overcome these issues by strictly adhering to on-policy data collection for each update and using a more theoretically grounded baseline.

Method: On-Policy RL with Optimal Reward Baseline (OPO)

OPO's core is built upon two principles:

  1. Exact On-Policy Training: Instead of collecting a batch of data and performing multiple gradient updates on it (so that later updates use data that is off-policy with respect to the current parameters), OPO samples data directly from the current policy for every single gradient update. This ensures that the objective being optimized accurately reflects the expected reward under the policy being updated. The authors report that, empirically, this leads to more stable training, better exploration (maintained through higher policy entropy), and a reduced "alignment tax" (undesirable drift from the initial supervised fine-tuned policy).
  2. Optimal Reward Baseline for Variance Reduction: Policy gradient methods benefit from subtracting a baseline from the reward, which reduces variance without changing the expected gradient. The paper derives the optimal baseline $b^*$ that minimizes the variance of the policy gradient estimate:

    $b^* = \frac{\mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)} \left[ \left\| \nabla_{\theta} \log \pi_{\theta}(y \mid x) \right\|^2 \cdot r(x, y) \right]}{\mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)} \left[ \left\| \nabla_{\theta} \log \pi_{\theta}(y \mid x) \right\|^2 \right]}$

    This baseline is a weighted average of rewards, where the weights are the squared magnitudes of the score function gradients. For sequence generation tasks, they propose a simplification: assuming gradients for different tokens are approximately orthogonal and have similar norms, the squared gradient magnitude is proportional to the sequence length ($\|\nabla_{\theta}\log \pi_{\theta}(y \mid x)\|^2 \propto l_y$). This simplifies the optimal baseline to a length-weighted average of rewards:

    $b^* = \frac{\mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)} \left[ l_y \cdot r(x, y) \right]}{\mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)} \left[ l_y \right]}$

The OPO algorithm (Algorithm 1 in the paper) integrates these two ideas. For each batch of prompts, it samples $K$ responses from the current policy, computes an empirical estimate of the optimal baseline using these $K$ samples (specifically, the length-weighted average of rewards), calculates the advantage for each sampled response as $A_i = r(x, y_i) - b^*(x)$, and then updates the policy using the standard policy gradient objective:

$\mathcal{J}_{\text{OPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, \{y_i\}_{i=1}^{K} \sim \pi_{\theta}(\cdot \mid x)} \left[ \frac{1}{K} \sum_{i=1}^{K} \log \pi_{\theta}(y_i \mid x) \cdot A_i(x, y_i) \right]$

Crucially, OPO does not include common regularization terms like KL divergence from a reference policy or entropy bonuses, simplifying the objective and hyperparameter tuning.
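To make the length-weighted baseline concrete, below is a minimal sketch (plain Python with NumPy; the function name and the toy numbers are illustrative assumptions, not taken from the paper) of how the empirical baseline and advantages could be computed from $K$ sampled responses:

import numpy as np

def length_weighted_advantages(rewards, lengths):
    # Empirical OPO-style baseline: a length-weighted average of rewards.
    # rewards: shape (K,) scalar rewards r(x, y_i)
    # lengths: shape (K,) token lengths l_{y_i}
    rewards = np.asarray(rewards, dtype=np.float64)
    lengths = np.asarray(lengths, dtype=np.float64)
    baseline = (lengths * rewards).sum() / lengths.sum()
    return rewards - baseline  # A_i = r(x, y_i) - b*

# Toy example with K = 3 samples: one long correct response, two shorter incorrect ones.
adv = length_weighted_advantages(rewards=[1.0, 0.0, 0.0], lengths=[300, 120, 80])
print(adv)  # baseline = 300/500 = 0.6 -> advantages [0.4, -0.6, -0.6]

For comparison, a plain group mean (the GRPO baseline, before its standard-deviation normalization) would give a baseline of 1/3 in this toy case, so the length-weighted baseline assigns less credit to the long correct response than a simple average would.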

Practical Implementation and Applications

The core OPO implementation involves a standard LLM fine-tuning loop with a modified objective function.

Here's a pseudocode outline based on the paper's description and Algorithm 1:

import torch
from torch.optim import AdamW

# Pseudocode: policy_model.generate, policy_model.get_sequence_log_prob,
# reward_function, and sample_batch are placeholders for the corresponding
# components of an actual LLM RL training stack.
optimizer = AdamW(policy_model.parameters(), lr=learning_rate)

for step in range(N_steps):
    # 1. Sample a batch of prompts
    batch_prompts = sample_batch(dataset, batch_size)

    total_loss = 0.0
    for prompt in batch_prompts:
        # 2. Sample K responses from the *current* policy.
        #    No gradients are needed during generation.
        with torch.no_grad():
            sampled_responses = [
                policy_model.generate(prompt, temperature=0.6, top_p=1.0)
                for _ in range(K)
            ]

        # 3. Compute rewards and token lengths for the sampled responses
        rewards = [reward_function(prompt, response) for response in sampled_responses]
        lengths = [len(response.tokens) for response in sampled_responses]

        # 4. Compute the empirical optimal baseline (length-weighted average of rewards)
        weighted_rewards_sum = sum(l * r for l, r in zip(lengths, rewards))
        lengths_sum = sum(lengths)
        optimal_baseline = weighted_rewards_sum / lengths_sum if lengths_sum > 0 else 0.0

        # 5. Compute advantages
        advantages = [r - optimal_baseline for r in rewards]

        # 6. Compute the policy gradient objective (to be maximized).
        #    Log probabilities are computed with gradients enabled, under the same
        #    parameters that generated the samples (exact on-policy).
        log_probs = [
            policy_model.get_sequence_log_prob(prompt, response)  # trajectory-level log-prob
            for response in sampled_responses
        ]

        # Following Eq. 6: trajectory log-prob times advantage, averaged over the K responses.
        objective_terms = [lp * adv for lp, adv in zip(log_probs, advantages)]
        prompt_objective = sum(objective_terms) / K

        total_loss = total_loss - prompt_objective  # maximize objective = minimize its negative

    # 7. Single gradient update per sampling phase (exact on-policy)
    optimizer.zero_grad()
    (total_loss / batch_size).backward()  # average over the batch of prompts
    optimizer.step()

Key Implementation Details:

  • Exact On-Policy: The critical part is that policy_model.generate and subsequent policy_model.get_sequence_log_prob (or computing log_probs during a forward pass on the sampled sequences) must use the parameters before the gradient step is taken. This implies a single gradient step per data collection phase (sampling K responses per prompt).
  • Sampling K Responses: The number of samples $K$ per prompt is a hyperparameter (8 or 16 in the experiments). It affects both the quality of the empirical baseline estimate and the computational cost: more samples give a better baseline estimate but require more compute.
  • Reward Function: The paper uses a simple binary rule-based reward for math problems; OPO is compatible with any scalar reward function $r(x, y)$ (a minimal sketch of such a rule-based reward appears after this list). Using a learned reward model would add a separate model evaluation for each sampled response.
  • Batching: Training processes batches of prompts, with $K$ responses sampled per prompt, so a batch effectively consists of batch_size * K sequences, although the objective is averaged per prompt (or potentially per token across all sampled sequences, depending on the exact implementation).
  • Computation: Sampling $K$ sequences per prompt and computing their rewards and log probabilities for a batch of prompts can be computationally intensive. Efficient parallel sampling and forward passes are crucial.
  • No Auxiliary Models: The absence of a separate value model simplifies the architecture and reduces memory requirements compared to actor-critic methods like PPO.
  • No Regularization: OPO removes KL and entropy regularization terms. This simplifies hyperparameter tuning but relies on the exact on-policy mechanism and optimal baseline for stability and exploration.
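As an illustration of the binary rule-based reward mentioned above, the following sketch checks an extracted final answer against a reference string; the \boxed{...} extraction pattern and the normalization rules are assumptions for the example, not the paper's exact grader:

import re

def rule_based_math_reward(response_text, ground_truth):
    # Binary reward: 1.0 if the final \boxed{...} answer matches the reference, else 0.0.
    # The extraction pattern and whitespace normalization are illustrative only.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response_text)
    if not matches:
        return 0.0
    predicted = matches[-1].strip().replace(" ", "")
    reference = ground_truth.strip().replace(" ", "")
    return 1.0 if predicted == reference else 0.0

# Example: rule_based_math_reward("... so the answer is \\boxed{42}.", "42") -> 1.0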

Experiments and Results

The paper evaluates OPO primarily on mathematical reasoning benchmarks (MATH-500, AIME 2024, AIME 2025) using a fine-tuned DeepSeek-R1-Distill-Qwen model.

  1. Exact On-Policy vs. Off-Policy GRPO:
    • Comparison shows exact on-policy training leads to better pass@1 performance and significantly lower KL divergence and higher entropy throughout training, even without explicit regularization, suggesting better exploration and reduced policy shift (alignment tax). Off-policy training shows signs of potential overfitting to the training reward without transferring well to the evaluation task.
  2. OPO vs. On-Policy GRPO:
    • Both methods use exact on-policy training, differing only in their baseline calculation (OPO uses optimal length-weighted average, GRPO uses simple average).
    • OPO generally outperforms GRPO on pass@k metrics, especially at higher $k$, and improves upon the initial SFT model's performance.
    • OPO maintains similar or slightly higher entropy and lower KL divergence compared to on-policy GRPO, indicating more stable training and less policy shift.
    • Analysis of output quality (Rep-5 and Self-BLEU) shows OPO produces more diverse and less repetitive outputs, correlating with the higher entropy observed during training.
  3. OPO vs. Reinforce++ (Appendix):
    • A preliminary experiment shows OPO achieving higher training rewards and maintaining higher entropy compared to an on-policy Reinforce++ variant (which uses batch average as baseline), further supporting the benefit of the optimal baseline.

Overall, the experimental results demonstrate that OPO achieves superior performance and training stability compared to baselines like GRPO, without requiring auxiliary models or standard regularization terms. The exact on-policy approach contributes to stability and exploration, while the optimal reward baseline helps in variance reduction, leading to more effective learning.
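The pass@1 and pass@k numbers above are typically computed from $n$ sampled completions per problem using the standard unbiased estimator; the sketch below shows that estimator as a general formula, not necessarily the exact evaluation script used in the paper:

from math import comb

def pass_at_k(n, c, k):
    # Unbiased pass@k estimate given n samples per problem, c of which are correct:
    # pass@k = 1 - C(n - c, k) / C(n, k)
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: n = 16 samples, c = 4 correct -> pass@1 = 0.25, pass@8 ≈ 0.96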

Practical Considerations

  • Computational Cost: While OPO removes auxiliary models, the requirement to sample $K$ responses from the current model for every gradient step implies a potentially high sampling cost. Efficient LLM inference and parallelization are critical. The effective batch of batch_size * K sequences needs to be manageable in memory and compute.
  • Reward Function Dependency: The effectiveness of OPO (and any RL method) depends heavily on the quality of the reward signal. The paper uses a simple rule-based reward; performance with learned reward models might introduce other complexities.
  • Generalizability: OPO was evaluated on math reasoning with a specific model family. Its performance on diverse tasks (summarization, dialogue) and with different LLM architectures and reward types needs further validation, as noted by the authors.
  • Hyperparameters: While KL/entropy regularization is removed, tuning $K$ (number of samples per prompt), learning rate, batch size, and sampling parameters (temperature, top-p) remains important.

In summary, OPO presents a compelling case for revisiting exact on-policy training and using a theoretically optimal baseline in RL for LLMs. Its simplicity (single policy model, no regularization) and empirical effectiveness on challenging reasoning tasks make it a promising direction for developing more stable and performant RLHF methods. The primary practical consideration for adoption is managing the computational cost of per-step on-policy sampling.
