OCRM: Off-Policy Corrected Reward Modeling
- OCRM is a reinforcement learning framework that corrects reward models using off-policy data and importance weighting to tackle distribution shifts.
- It recalibrates the reward model’s loss based on the evolving policy distribution, preventing overoptimization and statistical inconsistencies.
- Empirical results show OCRM improves policy alignment and win rates in RLHF tasks, demonstrating enhanced robustness under distribution shift.
Off-Policy Corrected Reward Modeling (OCRM) is a methodological framework within reinforcement learning (RL) that addresses the statistical inconsistency and performance degradation caused by distribution shift between the policy that generated the reward model's training data and the evolving policy being optimized against it. The core principle of OCRM is to correct the reward model (or its loss) using off-policy data, typically via importance weighting or other distributional adjustments, so as to maintain consistent estimation of reward model parameters and policy gradients as the policy changes. This approach is particularly relevant to practical RL pipelines such as Reinforcement Learning from Human Feedback (RLHF), where a reward model trained on static, human-labeled data is applied to sequentially updated RL policies whose generated data distributions differ significantly from those seen in supervised training (Ackermann et al., 21 Jul 2025).
1. Motivation and Conceptual Framework
OCRM is motivated by the observation that, in RLHF and related pipelines, optimizing a policy against a fixed reward model often leads to distribution shift: as the policy evolves, it generates candidate trajectories or responses increasingly dissimilar to those observed during reward model (RM) training. This shift causes two pathologies: “overoptimization” or Goodharting, in which the RM’s scores continue to improve while diverging from genuine human preference, and statistical inconsistency of reward model parameter estimation, which leads to policy gradients that do not accurately reflect the intended reward structure.
OCRM formalizes the insight that the RM's objective should not be evaluated under the distribution of its original training data, but must instead be corrected to reflect the data distribution induced by the current policy. This is achieved by minimizing the reward model's loss over the current policy's distribution, with importance weighting applied to the original RM training data (collected under the initial policy) to estimate the correct risk. The OCRM framework thus updates or retrains the RM as the policy diverges, continually correcting for off-policy effects.
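This correction rests on the standard importance-sampling identity: an expectation under the current policy can be rewritten as a reweighted expectation under the data-generating policy. Using the notation introduced in the next section (π_red for the data-generating policy, π_blue for the current policy, ℓ_θ for the per-pair loss), the identity reads

$$\mathbb{E}_{(s,\,a_w,\,a_\ell)\sim \pi_{\mathrm{blue}}}\big[\ell_\theta(s, a_w, a_\ell)\big] \;=\; \mathbb{E}_{(s,\,a_w,\,a_\ell)\sim \pi_{\mathrm{red}}}\!\left[\frac{\pi_{\mathrm{blue}}(a_w, a_\ell \mid s)}{\pi_{\mathrm{red}}(a_w, a_\ell \mid s)}\,\ell_\theta(s, a_w, a_\ell)\right],$$

which holds wherever π_red places positive probability on the pairs that π_blue can generate. This is the sense in which the original RM dataset can still provide an unbiased estimate of the loss under the shifted policy.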
2. Formal Methodology and Loss Correction
A prototypical OCRM pipeline in RLHF comprises the following steps:
- Initial Reward Model Training: The RM, parameterized by θ, is trained on data D_RM = {(s, a_w, a_ℓ)} (contexts and pairwise action preferences) generated by a fixed reference policy (often the supervised fine-tuned, or SFT, model, denoted π_red). The standard loss is a pairwise cross-entropy (Bradley–Terry) objective:

  $$\mathcal{L}_{\mathrm{RM}}(\theta) = -\,\mathbb{E}_{(s,\,a_w,\,a_\ell)\sim D_{\mathrm{RM}}}\big[\log\sigma\big(r_\theta(s, a_w) - r_\theta(s, a_\ell)\big)\big].$$
- Policy Optimization and Distribution Shift: Over RL updates (using PPO or related algorithms), the policy π (denoted π_blue as it evolves) generates new candidate outputs increasingly different from those in D_RM.
- Off-Policy Correction via Importance Weighting: The OCRM objective replaces the "on-policy" loss with a risk that targets the current policy's data distribution. Since new labeled data from the shifted policy's support are unavailable, the loss is estimated by reweighting the original RM dataset (see the code sketch at the end of this section):

  $$\mathcal{L}_{\mathrm{OCRM}}(\theta) = \mathbb{E}_{(s,\,a_w,\,a_\ell)\sim D_{\mathrm{RM}}}\big[\,w(s, a_w, a_\ell)\,\ell_\theta(s, a_w, a_\ell)\,\big],$$

  where the importance weight is

  $$w(s, a_w, a_\ell) = \frac{\pi_{\mathrm{blue}}(a_w \mid s)\,\pi_{\mathrm{blue}}(a_\ell \mid s)}{\pi_{\mathrm{red}}(a_w \mid s)\,\pi_{\mathrm{red}}(a_\ell \mid s)}$$

  and $\ell_\theta(s, a_w, a_\ell) = -\log\sigma\big(r_\theta(s, a_w) - r_\theta(s, a_\ell)\big)$ is the cross-entropy loss on the preference pair.
- Iterative Retraining and Policy Regularization: Rather than retraining the RM after every policy update, OCRM typically performs RM retraining in stages, each using the most recent policy, while the RL objective includes a KL regularization term anchoring the policy to where the RM remains accurate.
This procedure ensures consistency of gradient estimates and prevents the accumulation of bias arising from optimizing against a reward model detached from the evolving policy distribution.
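As a concrete illustration, below is a minimal PyTorch-style sketch of the importance-weighted pairwise loss described above. The reward-model interface (rm), the batch keys, and the eta flattening exponent are illustrative assumptions, not the reference implementation:

```python
import torch
import torch.nn.functional as F

def ocrm_rm_loss(rm, batch, logp_blue, logp_red, eta=1.0, eps=1e-8):
    """Importance-weighted Bradley-Terry reward-model loss (sketch).

    rm(prompt, response) -> per-example scalar reward (assumed interface).
    logp_blue / logp_red: summed log-probabilities of each completion under
    the current policy (pi_blue) and the data-generating policy (pi_red).
    """
    # Bradley-Terry cross-entropy on the preference pair
    r_w = rm(batch["prompt"], batch["chosen"])
    r_l = rm(batch["prompt"], batch["rejected"])
    per_pair_loss = -F.logsigmoid(r_w - r_l)

    # Importance weight of the pair: likelihood ratio under pi_blue vs. pi_red
    log_w = (logp_blue["chosen"] + logp_blue["rejected"]
             - logp_red["chosen"] - logp_red["rejected"])
    w = torch.exp(eta * log_w)        # eta < 1 "flattens" weights to reduce variance
    w = w / (w.mean() + eps)          # self-normalization keeps the loss scale stable

    # Weights are treated as constants with respect to the RM parameters
    return (w.detach() * per_pair_loss).mean()
```

Self-normalizing the weights is one of several variance controls; Section 4 discusses the flattening and mixture-denominator variants in more detail.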
3. Theoretical Properties and Estimation Error
OCRM's statistical consistency relies on standard importance weighting properties. Assuming the support of the new policy π_blue overlaps sufficiently with that of the data-generating policy π_red, importance weighting ensures that the risk minimized by the RM is unbiased with respect to the intended (current) distribution. The estimation error of the off-policy corrected RM can be quantitatively bounded: given a per-sample loss bounded by B and importance weights bounded in [0, W], with n labeled samples, the excess risk satisfies, with high probability, a bound of the form

$$\mathcal{R}_{\pi_{\mathrm{blue}}}(\hat{\theta}) - \min_{\theta}\mathcal{R}_{\pi_{\mathrm{blue}}}(\theta) \;\lesssim\; W\,\mathfrak{R}_n(\mathcal{F}) + B\,W\sqrt{\frac{\log(1/\delta)}{n}},$$

where $\mathfrak{R}_n(\mathcal{F})$ is the Rademacher complexity of the margin function class and B bounds the per-sample loss. This guarantee clarifies that OCRM's statistical accuracy improves (the excess risk shrinks) as more labeled samples become available and as the variance of the importance weights remains bounded.
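A common practical proxy for the "bounded variance" condition is the effective sample size (ESS) of the importance weights; the diagnostic below is a generic sketch, not a procedure prescribed by the OCRM paper:

```python
import numpy as np

def effective_sample_size(weights: np.ndarray) -> float:
    """ESS = (sum w)^2 / sum(w^2); equals n for uniform weights and
    collapses toward 1 when a few samples dominate the weighted estimate."""
    w = np.asarray(weights, dtype=np.float64)
    return float(w.sum() ** 2 / (w ** 2).sum())

# Example: a heavily shifted policy concentrates weight on few samples
print(effective_sample_size(np.ones(1000)))               # 1000.0 (no shift)
print(effective_sample_size(np.r_[np.ones(999), 50.0]))   # ~314, far below 1000
```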
Empirical evidence supports these theoretical conclusions: in low-dimensional diagnostic experiments, importance-weighted OCRM retraining of the RM preserves the accuracy of reward gradients even under significant distribution shift, permitting further policy improvement beyond the “overoptimized” regime encountered with a fixed RM.
4. Practical Implementation and Algorithmic Structure
A typical OCRM schedule consists of alternating stages of:
- Off-policy retraining of the RM using importance weighting, either until convergence or for a fixed number of steps.
- RL policy optimization (e.g., PPO updates) using the updated RM, typically with a KL regularizer to control the extent of policy shift.
- Resetting value network weights after RM update, since the value function is closely tied to the reward model used for optimization.
Practical algorithmic details include:
- Use of "flattened" or "relative" importance weights to control variance, e.g., raising the importance weights to a power η < 1 or blending the denominator with a mixture of the two policies governed by a ratio α (see the sketch after this list).
- Performing batches of RL updates between RM corrections rather than re-correcting after every policy step.
- Resetting the PPO value network and optimizer state when a new RM is introduced, as the prior value estimates become unreliable.
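The following sketch illustrates the two variance controls mentioned above. The flattening exponent eta and the mixture ratio alpha are generic hyperparameters here; the exact parameterization used in the paper may differ:

```python
import math
import torch

def relative_importance_weights(logp_blue, logp_red, eta=0.5, alpha=0.1):
    """Variance-controlled ("flattened"/"relative") importance weights (sketch).

    eta:   flattening exponent; raising weights to eta < 1 shrinks extreme ratios.
    alpha: mixture ratio; blending pi_blue into the denominator caps each
           weight at 1/alpha before flattening.
    """
    # Denominator: log(alpha * p_blue + (1 - alpha) * p_red), computed stably
    log_mix = torch.logaddexp(math.log(alpha) + logp_blue,
                              math.log(1.0 - alpha) + logp_red)
    log_w = logp_blue - log_mix
    return torch.exp(eta * log_w)
```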
Table: OCRM RM Retraining Schedule

| Step | Input | Loss Function | Distribution Correction | Frequency |
|---|---|---|---|---|
| RM Training | D_RM from π_red | Cross-entropy (pairwise) | None (initial RM) | 1 (initial) |
| OCRM Correction | D_RM from π_red | Cross-entropy | Importance weighting w(s, a_w, a_ℓ) | Each OCRM round |
| PPO Optimization | Model-generated data | PPO policy objective | KL-regularized to last RM/π | k steps per RM |
A plausible implication is that proper scheduling of RM retraining, careful tuning of the importance-weight hyperparameters (η, α), and adequate PPO regularization are jointly essential for effective and stable OCRM performance.
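Putting the pieces together, a compact sketch of the alternating schedule might look as follows; helper names such as score_pairs, retrain_reward_model, and ppo_update are placeholders for the components described above, not functions from a released codebase:

```python
def run_ocrm(policy, value_net, rm, d_rm, logp_red, num_rounds=3, ppo_steps=1000):
    """Alternating OCRM schedule (sketch): correct the RM off-policy, then run PPO."""
    for _ in range(num_rounds):
        # 1) Off-policy RM correction: reweight D_RM toward the current policy's distribution
        logp_blue = score_pairs(policy, d_rm)            # log-probs of preference pairs under pi_blue
        retrain_reward_model(rm, d_rm, logp_blue, logp_red)

        # 2) Reset the value network (and its optimizer): old value estimates
        #    were fit to the previous reward model and are now stale
        value_net.reset_parameters()

        # 3) KL-regularized PPO against the freshly corrected RM
        for _ in range(ppo_steps):
            ppo_update(policy, value_net, rm, kl_coef=0.05)
    return policy
```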
5. Empirical Results and Performance Gains
OCRM has demonstrated consistent improvements in both synthetic tasks and large language model (LLM) alignment tasks. For instance, on the TL;DR summarization task (using the setup of [Stiennon et al., 2020]), OCRM run for m = 2 or 3 RM correction iterations raises the win rate (as measured by a "gold" RM or synthetic GPT-4.1 judgments) from a baseline of ~63% for PPO alone to ~73–74%. Comparable improvements are observed in AlpacaFarm chatbot scenarios.
Key empirical findings:
- OCRM is robust to overoptimization and continues to improve policy alignment with human preferences as distribution shift accumulates.
- Resetting the value network after each RM retraining avoids accuracy degradation from outdated value estimates.
- The benefits persist across both low-dimensional and high-dimensional tasks and are not restricted to any particular LM architecture.
- OCRM consistently outperforms standard RLHF, DPO, WPO, and RLP-SPG baselines under comparable compute budgets and training schedules.
6. Implementation Requirements and Limitations
OCRM requires:
- Access to the data-generating policy π_red for each RM training sample to evaluate the importance weights. This is usually satisfied when the RM dataset is fully traceable to policies used during supervised or earlier RLHF stages.
- Computational resources sufficient to retrain or fine-tune the RM iteratively throughout RLHF, although the retraining frequency can be tuned for efficiency.
- Proper control of variance in importance weights, especially for long training runs or when the support of π_blue diverges from π_red.
Limitations include:
- Increased engineering and compute burden relative to fixed-RM RLHF pipelines.
- Applying OCRM to settings with multiple, mixed-origin preference datasets may require further methodological adaptation.
- The need to verify that the empirical support of the RM training set covers the policy’s outputs throughout training; extrapolation beyond support remains a challenge.
Future work, as suggested in the source, includes improving computational efficiency, generalizing to heterogeneous or nonstationary RM datasets, and integrating advanced reward modeling strategies (such as ensembles or weight-averaging).
7. Significance and Outlook
OCRM establishes a statistically principled foundation for reward modeling in off-policy RL. By explicitly correcting for distribution shift via importance weighting, it restores consistent RM estimation and gradient computation even as the policy evolves far from initial supervised data distributions. This significantly mitigates overoptimization, leading to policies better aligned with true (human) preferences and improved empirical metrics across both classic and LM alignment tasks.
These advances are backed by both finite-sample statistical theory (estimation error and excess risk bounds) and by strong empirical improvements (notably, robust win-rate gains on LM alignment benchmarks). OCRM represents a direct and theoretically sound intervention for one of the central pathologies of RLHF, and its open-source implementation promotes reproducibility and further investigation of its practical nuances (Ackermann et al., 21 Jul 2025).