Conditional Reward Modeling (CRM)

Updated 2 June 2026

Conditional Reward Modeling (CRM) is a framework that systematically conditions rewards on both current states and desired outcomes, ensuring temporal and structural coherence.
It integrates techniques like Monte Carlo and temporal-difference regularization, potential-based shaping, and variational Bayesian inference to resolve credit assignment and enhance model performance.
CRM has demonstrated significant improvements in reinforcement learning from human feedback, large language model reasoning, and generative modeling, offering robustness and efficient resource utilization.

Conditional Reward Modeling (CRM) is a general framework in which the reward assigned at each point in a structured generation or decision process is systematically conditioned on both the present state (e.g., the current partial generation in language or trajectory in control) and the final or desired outcome. The goal is to enforce temporal and structural coherence, resolve ambiguous credit assignment, and enable adaptation to multifaceted or user-specified objectives. CRM methodologies have been developed in reinforcement learning from human feedback (RLHF), LLM reasoning, generative modeling (diffusion for images or video), and subjective preference aggregation for model selection. This article reviews the core principles, theoretical formulations, empirical results, and applications of CRM across these domains.

1. Core Principles and Mathematical Foundations

At its heart, Conditional Reward Modeling formalizes the reward as a conditional expectation or probability, interlinking intermediate steps and final outcomes via probabilistically consistent rules:

Token-level CRM in RLHF: For sequence models, the token-level reward model $r_\theta(h_{<t}, x_t)$ must satisfy

$r_\theta\bigl(h_{<t}, x_t\bigr) = \mathbb{E}\bigl[r_{\rm final}\mid h_{<t}, x_t\bigr],$

with $r_{\rm final} = r_\theta(h_{<T}, x_T)$ the terminal preference score assigned to the complete sequence of length $T$. Every intermediate score is thereby required to predict the expected final preference, so the reward model behaves like a value function ("reward as value function") (Nikulkov, 24 Apr 2026).
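To make the conditional-expectation requirement concrete, the following minimal sketch computes $\mathbb{E}[r_{\rm final} \mid \text{prefix}]$ exactly by enumeration in a toy setting. The two-token vocabulary, the sequence length, the terminal score `final_reward`, and the uniform sampling of continuations are all hypothetical choices made for illustration; none of them come from the cited work.

```python
from itertools import product

VOCAB = (0, 1)   # hypothetical two-token vocabulary
T = 3            # hypothetical sequence length

def final_reward(seq):
    """Toy terminal preference score: 1.0 if the sequence has at least two 1s."""
    return 1.0 if sum(seq) >= 2 else 0.0

def conditional_reward(prefix):
    """E[r_final | prefix], assuming the remaining tokens are sampled uniformly."""
    remaining = T - len(prefix)
    completions = list(product(VOCAB, repeat=remaining))
    return sum(final_reward(prefix + c) for c in completions) / len(completions)

# Token-level rewards along one trajectory, from the empty prefix to the full sequence.
trajectory = (1, 0, 1)
rewards = [conditional_reward(trajectory[:t]) for t in range(T + 1)]
print(rewards)  # [0.5, 0.75, 0.5, 1.0]
```

In this toy, the score of a prefix equals the average score of its one-token extensions, which is the same consistency that a value function satisfies.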

Step-wise CRM in reasoning: For process modeling in LLMs, CRM assigns shaped rewards via the log-probability of correctness at each step, conditioned on correct prior steps:

$r_t = \log \operatorname{Pr}(\text{step } t \text{ is correct} \mid \text{all previous correct}),$

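so that, by the chain rule of probability, $\sum_{t=1}^{T} r_t = \log \operatorname{Pr}(\text{steps } 1, \dots, T \text{ are all correct})$, the log-probability that the whole trajectory is correct. This ties intermediate rewards to the eventual outcome in a temporally causal fashion (Zhang et al., 30 Sep 2025).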

Conditional generative CRM: In diffusion-based models (images or video), the reward is the negative conditional entropy (or a related measure) of a generative model conditioned on the observed history or a target reward value:

$r_k^{\rm ce} = -H\left(p_\theta(\cdot \mid z_{0:k-1})\right),$

and generation is directed by conditioning the diffusion process on a chosen outcome or reward (Huang et al., 2023, Yuan et al., 2023).
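As a rough illustration of the entropy-based reward, the sketch below evaluates $-H(p)$ for a discrete predictive distribution. Treating $p_\theta(\cdot \mid z_{0:k-1})$ as a small categorical distribution is a simplifying assumption made here; the cited diffusion-based methods estimate the entropy from a generative model rather than from an explicit probability table, and the function name and example distributions are hypothetical.

```python
import math

def negative_entropy(probs):
    """Return -H(p) = sum_i p_i * log(p_i) for a categorical distribution."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p_i = 0 contribute nothing to the entropy.
    return sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical predictive distributions over the next latent frame.
confident = [0.97, 0.01, 0.01, 0.01]   # model is nearly certain what comes next
uncertain = [0.25, 0.25, 0.25, 0.25]   # model cannot predict what comes next

r_confident = negative_entropy(confident)
r_uncertain = negative_entropy(uncertain)
```

Under this reading, a history that the model can predict confidently receives a reward close to zero, while an unpredictable history receives a strongly negative one.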

Criterion- and context-conditioned CRM: For subjective model evaluation, CRM treats preferences as functions of user-specified criteria, so that scoring and model ranking are dynamic with respect to high-level instructions (e.g., “concise” versus “in-depth”), providing a bijective mapping between stated criteria and the induced preference order (Jia et al., 13 Aug 2025).

2. Optimization Objectives and Training Methodologies

CRM frameworks typically augment or reparameterize standard loss functions to guarantee conditional consistency:

Monte Carlo (Lookahead) and Temporal-Difference (TD) Regularization: In RLHF, Temporally Coherent Reward Modeling (TCRM) adds two quadratic losses on top of the standard Bradley-Terry pairwise final-token loss:
- Lookahead: Pulls each intermediate score toward the final score of the same response (a Monte Carlo target).
- TD: Pulls each score toward the score at the next token via one-step bootstrapping, analogous to value function learning.
The joint minimizer of these objectives is provably the token-level conditional expectation of the final reward (Nikulkov, 24 Apr 2026).
Potential-based shaping in reasoning: CRM applies potential-based reward shaping with potential $\Phi(s_t) = \log S(t)$, where $S(t)$ is the probability that all steps up to $t$ are correct. The reward at each transition is the difference in potentials, $\Phi(s_t) - \Phi(s_{t-1})$, which amounts to a log-probability increment and makes credit assignment explicit and temporally causal (Zhang et al., 30 Sep 2025).
Variational Bayesian inference: Contextually steerable models (ICRM) represent reward probabilities as Beta distributions, fitting an ELBO objective that balances preference matching and prior regularization. At test time, in-context preference demonstrations adjust the posterior, yielding steerable preference prediction (Hong et al., 9 Feb 2026).
Entropy and uncertainty regularization: For conditional image/video generation, CRM frameworks (e.g., Diffusion Reward, Ctrl-U) use negative conditional entropy or uncertainty-weighted consistency losses to shape and stabilize rewards, respectively, especially when discriminators are unreliable off-distribution (Huang et al., 2023, Zhang et al., 2024).
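The potential-based shaping item above can be checked numerically. The sketch below assumes that $S(t)$ denotes the probability that all steps up to $t$ are correct, with $S(0) = 1$; the per-step probabilities are hypothetical values chosen for illustration.

```python
import math

# Hypothetical per-step probabilities Pr(step t correct | all previous correct).
step_probs = [0.9, 0.8, 0.5, 0.95]

# S(t): probability that all steps up to t are correct, with S(0) = 1.
survival = [1.0]
for p in step_probs:
    survival.append(survival[-1] * p)

# Potential Phi(s_t) = log S(t).
potentials = [math.log(s) for s in survival]

# Shaped reward r_t = Phi(s_t) - Phi(s_{t-1}), i.e. the log-probability increment.
rewards = [potentials[t] - potentials[t - 1] for t in range(1, len(potentials))]

# The potentials telescope, so the rewards sum to log S(T).
total = sum(rewards)
```

Each shaped reward recovers the log of the corresponding step probability, and the weakest step (the third one here) receives the most negative reward, which is what makes the credit assignment explicit.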

3. Applications Across Domains

CRM has been instantiated in multiple domains:

Domain	CRM Instantiation	Key Mechanism
RLHF for LLMs/process modeling	Token-level reward as final expectation	TCRM: Monte Carlo + TD regularization
LLM step-wise reasoning	Step reward via log-prob of correctness	Potential-based shaping, error labeling
Vision-based control	Dense reward via negative cond. entropy	Video diffusion, conditional entropy
Conditional generation (images)	Uncertainty-aware reward weighting	Doubled passes, KL or L1 uncertainty
Subjective model evaluation	User-specified criterion conditioning	Explicit criterion input, pairwise cls

Empirical results confirm CRM’s advantages:

On RLHF, TCRM improves middle-token pairwise accuracy from 50% to 88.9% (Qwen3-32B), preserves final accuracy, and enables unified value/reward modeling in PPO with up to 27% memory savings (Nikulkov, 24 Apr 2026).
For LLM reasoning, CRM dominates in best-of-N and beam search, e.g., 68.7% accuracy on GSM-Plus@128 (Qwen-2.5-3B) and 77.8% Pass@1 on MATH500 (LLM, no verifier) (Zhang et al., 30 Sep 2025).
In visual RL, Diffusion Reward achieves 75% and 60% final success rates on MetaWorld (7 tasks) and Adroit (3 tasks), over 38% better than the strongest baselines (Huang et al., 2023).
Criterion-conditioned CRM (CRM-4B) maintains >90% accuracy even when user criteria are adversarially perturbed, and outperforms state-of-the-art reward models on subjective LLM leaderboards (Jia et al., 13 Aug 2025).

4. Algorithmic Implementations and Pseudocode

CRM methods are engineered for minimal architectural disruption:

RLHF (TCRM):

for i in 1…N:
  (x, y^w, y^l) = sample_pair()
  Rw = [r_theta(x, y^w_{0..k}) for k in 0..K^w]
  Rl = [r_theta(x, y^l_{0..k}) for k in 0..K^l]
  L_bt = -logsigmoid(Rw[K^w] - Rl[K^l])
  L_mc = sum((Rw[k] - stopgrad(Rw[K^w]))^2 for k in 0..K^w-1) + ...
  L_td = sum((Rw[k-1] - stopgrad(Rw[k]))^2 for k in 1..K^w) + ...
  L_total = L_bt + λ_MC * L_mc + λ_TD * L_td
  accumulate_gradients(L_total)
optimizer.step()
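For readers who want to evaluate the objective on concrete numbers, here is a minimal plain-Python sketch of the loss values in the pseudocode above. It is illustrative only and is not taken from the original implementation: it works on precomputed prefix scores, it omits the stop-gradient because that only affects gradients and not loss values, it assumes the elided "+ ..." terms are the analogous penalties for the rejected response, and the weights `lam_mc` and `lam_td` are hypothetical defaults.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def coherence_terms(scores):
    """Lookahead (MC) and TD squared errors for one response's prefix scores."""
    final = scores[-1]
    mc = sum((s - final) ** 2 for s in scores[:-1])
    td = sum((scores[k - 1] - scores[k]) ** 2 for k in range(1, len(scores)))
    return mc, td

def tcrm_loss(chosen, rejected, lam_mc=0.1, lam_td=0.1):
    """Bradley-Terry loss on final scores plus weighted MC and TD penalties."""
    l_bt = -log_sigmoid(chosen[-1] - rejected[-1])
    mc_w, td_w = coherence_terms(chosen)
    mc_l, td_l = coherence_terms(rejected)  # assumed form of the elided "+ ..." terms
    return l_bt + lam_mc * (mc_w + mc_l) + lam_td * (td_w + td_l)

# Same final scores, but only the first pair is temporally coherent.
coherent = tcrm_loss([1.0, 1.0, 1.0], [-1.0, -1.0, -1.0])
incoherent = tcrm_loss([-2.0, 3.0, 1.0], [2.0, -3.0, -1.0])
```

With identical final scores the Bradley-Terry term is the same in both calls, so the difference between `incoherent` and `coherent` comes entirely from the two regularizers.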

ICRM test-time inference:

def ICRM_Infer(x, y_plus, y_minus, Context_C):
    # aggregate evidence
    # posterior mean/concentration computation
    # return sigmoid(u_plus - u_minus)
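The "posterior mean/concentration" step can be pictured with a textbook conjugate Beta-Bernoulli update. This is offered only as intuition for how in-context evidence might shift a Beta-distributed preference probability; it is not the ICRM inference procedure, and the function name, the evidence counts, and the prior values are hypothetical.

```python
def beta_posterior(mean, concentration, n_agree, n_disagree):
    """Conjugate Beta-Bernoulli update, parameterized by mean and concentration."""
    # Convert (mean, concentration) to the usual Beta(alpha, beta) parameters,
    # then add the counts of evidence for and against the preference.
    alpha = mean * concentration + n_agree
    beta = (1.0 - mean) * concentration + n_disagree
    new_concentration = alpha + beta
    return alpha / new_concentration, new_concentration

# Hypothetical weak prior, then three demonstrations for and one against.
prior_mean, prior_conc = 0.5, 2.0
post_mean, post_conc = beta_posterior(prior_mean, prior_conc, n_agree=3, n_disagree=1)
```

More demonstrations raise the concentration, so the posterior mean moves further from the prior and becomes harder to shift with any single additional example.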

Diffusion Reward (RL):

Collect expert videos, train latent conditional diffusion.
At each RL step, estimate conditional entropy-based reward, combine with exploration (RND) and environment signals.
Use off-policy RL (e.g., DrQv2) with this composite reward (Huang et al., 2023).

The TCRM, ICRM, and Diffusion Reward outlines in this section are condensed from the implementations described in the original works.

5. Empirical Findings and Theoretical Guarantees

CRM delivers the following empirical and theoretical advances:

Interpretability and process-level coherence: CRM-trained RMs provide interpretable, coherent reward trajectories. Errors at intermediate steps or tokens are penalized consistently, facilitating error diagnosis and process evaluation (Nikulkov, 24 Apr 2026, Zhang et al., 30 Sep 2025).
Robustness to reward hacking: By explicitly conditioning on history and outcomes, CRM prevents degenerate reward escalation on pathological sequences (e.g., long, repetitive chains that degrade true accuracy yet appear high-reward under naive models) (Zhang et al., 30 Sep 2025).
Probabilistic comparability: Outcome-linked CRM rewards enable direct cross-sample comparison, supporting more reliable model or trajectory selection via best-of-N or beam search (Zhang et al., 30 Sep 2025, Jia et al., 13 Aug 2025).
Steerability and customization: Parametric CRM architectures (e.g., ICRM, CRM-4B) offer test-time customization via in-context demonstrations or explicit user-supplied criteria, with demonstrable improvements in both single- and multi-objective settings (Hong et al., 9 Feb 2026, Jia et al., 13 Aug 2025).
Theoretical optimality: Minima of key loss objectives correspond to conditional expectations (TCRM), or non-degenerate variational posteriors (ICRM), with convexity ensuring unique, well-behaved solutions (Nikulkov, 24 Apr 2026, Hong et al., 9 Feb 2026).
Resource efficiency: CRM enables, e.g., unified reward/value sharing in RL loops, leading to significant compute and memory savings (Nikulkov, 24 Apr 2026).

6. Limitations and Ongoing Challenges

While CRM advances both the statistical and practical foundations of reward modeling, several limitations remain:

Annotation costs: Some CRM variants require trajectory-level labels and error-localization annotations. Partial annotation (10-50%) can suffice, but scaling beyond this or leveraging self-supervision remains an open direction (Zhang et al., 30 Sep 2025).
Computational cost: In conditional diffusion or uncertainty-aware settings, repeated forward passes (e.g., dual sampling for uncertainty) increase compute requirements (Huang et al., 2023, Zhang et al., 2024).
Uncertainty quantification: CRM frameworks that rely on cognitive uncertainty do not always separate model and data/annotation uncertainty, potentially leading to suboptimal regularization (Zhang et al., 2024).
Generalization: While empirical evidence is strong, extending CRM to settings with truly open-ended, ambiguous, or partially specified objectives may require new training signal sources and further improvements in reward calibration (Jia et al., 13 Aug 2025).

A plausible implication is that future research will focus on scaling CRM to open-ended, multi-modal, or multi-agent environments, developing semi/self-supervised error labeling, and refining uncertainty regularization.

7. Outlook and Future Directions

Emerging directions in CRM research include:

Unified modeling: Further unification of value and reward prediction, joint online policy and reward model training, and “plug-and-play” CRM as a backbone for diverse RL and generation tasks (Nikulkov, 24 Apr 2026).
Rich user interaction: Expanding CRM frameworks to accept rich, multi-criteria, and even natural language descriptions of objectives, enabling “subjective leaderboards” and real-time preference steering (Jia et al., 13 Aug 2025, Hong et al., 9 Feb 2026).
Zero-shot and domain-adaptive rewards: Extending CRM to work with in-the-wild expert data or natural environment videos for efficient zero-shot RL, and distilling expensive generative rewards into lightweight surrogates (Huang et al., 2023).
Theoretical refinement: Tighter bounds on reward/behavior optimality gaps (e.g., via bandit regret), more precise quantification of trade-offs between fidelity and extrapolation, and improvements in latent space recoverability (Yuan et al., 2023).

CRM constitutes a principled, theoretically grounded, and rapidly expanding paradigm for reward modeling, unifying preference learning, causal process modeling, conditional generation, and subjective evaluation under a single probabilistic and conditional expectation-based umbrella.