Annotation-Conditioned Reward Modeling

Updated 4 July 2026

Auxiliary annotations such as correct answers, quality regions, and rubrics can refine reward signals beyond what pooled binary preference labels provide.
The surveyed methods use a range of conditioning mechanisms to encode evaluative criteria explicitly, with reported improvements in calibration, interpretability, and resistance to reward hacking.
Reported empirical results include lower annotation cost (for example, an 84% reduction in dataset construction time for process-reward labeling) and greater robustness across reasoning, vision-language, diffusion, and manipulation tasks.

Annotation-conditioned reward modeling denotes a family of reward-learning frameworks in which the reward signal, reward distribution, or reward-generating objective is explicitly shaped by auxiliary annotations rather than by a single pooled preference label alone. The conditioning variable may be a correct answer and hard negatives, coarse quality regions, fine-grained evaluation dimensions, concept comparisons, local transition labels, rubrics, or a modeled annotation mechanism for noisy observational feedback. Across recent work, the motivating diagnosis is consistent: scalar reward compression obscures credit assignment and absolute quality, pooled preferences can become non-rationalizable under annotator heterogeneity, and annotation-trained rewards remain vulnerable to reward hacking under optimization pressure (Chen et al., 13 Feb 2026, Dong et al., 29 May 2026, Pang et al., 2022).

1. Scope and motivating deficiencies

A common starting point is the observation that standard reward modeling pipelines are structurally narrow. Discriminative reward models efficiently fit pairwise preferences but yield relative and uncalibrated scores; generative reward models are more interpretable but often require more expensive pointwise supervision; scalar RLHF objectives compress heterogeneous judgments into one latent axis; and process reward models usually require costly step annotation or Monte Carlo rollouts (Chen et al., 13 Feb 2026, Lee et al., 12 Apr 2026). Annotation-conditioned methods respond by exposing additional evaluative structure to the learning problem rather than forcing the model to infer all structure from binary wins and losses.

Representative conditioning signals span several distinct forms:

Conditioning signal	Representative mechanism	Representative paper
Correct answer $A$ and incorrect answers $\tilde A$	Contrastive pointwise mutual information for step rewards	(Lee et al., 12 Apr 2026)
Good / normal / bad quality regions	Ordinal probabilistic reward with Region Flooding Tuning	(Chen et al., 13 Feb 2026)
Top-3 relevant dimensions and per-dimension preferences	Dynamic dimension selection and aggregation	(Chen et al., 7 Apr 2026)
Relative concept labels	Concept bottleneck reward decomposition	(Laguna et al., 7 Jul 2025)
Preference outcome vector $\gamma$	Conditional implicit reward in diffusion DPO	(Jang et al., 11 Dec 2025)
Rubric criteria $r=\{c_1,\dots,c_K\}$	Rubric-conditioned teacher distillation	(Gu et al., 17 Jun 2026)
Observed feedback and observability $o_i$	Noise-aware and propensity-corrected loss	(Wang et al., 19 Mar 2026)

The conditioning can enter at different stages. Some systems use annotations only during training and predict the relevant structure at inference. VL-MDR is annotation-supervised during training but annotation-predictive at inference, selecting relevant dimensions from the multimodal query itself; CB-RM uses concept annotations to shape an intermediate bottleneck but does not require concepts at deployment; RCSD conditions a privileged teacher on rubrics during training while the student reasoner receives only the question at test time. By contrast, MCDPO conditions the implicit reward itself on a preference outcome vector and then exposes inference-time axis control through classifier-free guidance (Chen et al., 7 Apr 2026, Laguna et al., 7 Jul 2025, Gu et al., 17 Jun 2026, Jang et al., 11 Dec 2025).

2. Answer- and outcome-conditioned process supervision

One direct form of annotation-conditioned reward modeling appears in process reward modeling for reasoning. In "Efficient Process Reward Modeling via Contrastive Mutual Information" (Lee et al., 12 Apr 2026), the annotation signal is the known correct final answer $A$ together with incorrect answers $\tilde A$. Instead of estimating step value by repeated Monte Carlo continuation success,

$r^{(i)}_{\text{MC}} = \frac{1}{T}\sum_{t=1}^{T} \mathbf{1}\!\left\{ \mathrm{Ans}\!\big (\tau^{(t)} \mid s_{\le i}\big) = A \right\},$

the paper defines a contrastive pointwise mutual information reward

$r_{\text{CPMI}}^{i} = \big[\log p_{\theta}(A \mid q, s_i) - \log p_{\theta}(A \mid q)\big] - \frac{1}{M}\sum_{m=1}^{M} \big[\log p_{\theta}(\tilde A \mid q, s_i) - \log p_{\theta}(\tilde A \mid q)\big].$

The score is therefore answer-conditioned at the label-construction stage: a step is rewarded insofar as it increases the likelihood of the gold answer and decreases the likelihood of hard negatives. The paper further averages over diversified prompt templates, uses answer-sequence length normalization, and interprets the resulting signal as a sample-based proxy for the Jeffreys divergence between the answer distributions with and without the step. Empirically, CPMI-based labeling reduces dataset construction time by $84\%$ and also reduces token generation relative to Monte Carlo estimation, while improving ROC-AUC, ProcessBench, PRMBench, and MATH500 best-of-8 accuracy. The paper also shows that the contrastive negative term is essential: the ablated gold-only variant performs poorly and is described as vulnerable to reward hacking (Lee et al., 12 Apr 2026).
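As a concrete reading of the CPMI formula above, the following Python sketch computes the step reward from precomputed log-probabilities. It is a minimal illustration, not the paper's implementation: the function name and argument names are hypothetical, the log-probabilities are assumed to come from some scoring model, and prompt-template averaging and length normalization are omitted.

```python
def cpmi_step_reward(logp_gold_with, logp_gold_without,
                     logp_neg_with, logp_neg_without):
    """Contrastive PMI reward for one reasoning step.

    logp_gold_with / logp_gold_without: log p(A | q, s_i) and log p(A | q).
    logp_neg_with / logp_neg_without: lists of the same two quantities for
    each of the M incorrect answers (hypothetical inputs for illustration).
    """
    if not logp_neg_with or len(logp_neg_with) != len(logp_neg_without):
        raise ValueError("need matching, non-empty negative log-prob lists")
    # How much the step raises the log-likelihood of the gold answer.
    gold_gain = logp_gold_with - logp_gold_without
    # Average amount by which the step raises the incorrect answers.
    neg_gains = [w - wo for w, wo in zip(logp_neg_with, logp_neg_without)]
    neg_gain = sum(neg_gains) / len(neg_gains)
    return gold_gain - neg_gain


# A step that makes the gold answer more likely and both negatives less likely.
reward = cpmi_step_reward(-1.0, -2.0, [-3.0, -4.0], [-2.5, -3.5])
print(reward)
```

A step that raises the likelihood of every answer equally receives a reward of zero under this definition, which is the role of the contrastive term.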

A neighboring but distinct line is "Conditional Reward Modeling for LLM Reasoning" (Zhang et al., 30 Sep 2025). That work is not annotation-conditioned in the strict sense of taking annotation content as input, but it is outcome-conditioned and uses process annotations in training. It introduces a latent first-wrong-step variable, a survival probability that the trajectory has not yet committed its first error, and a dense process reward.

Training uses final correctness labels and first-error supervision, so the reward semantics are explicitly linked to whether the trajectory remains on a path capable of reaching the correct final answer. The paper argues that this resolves ambiguous credit assignment by making each step's contribution reconstruct the final correctness probability. It reports improvements in Best-of-$N$, beam search, and RL, and emphasizes stronger robustness to repetitive reward-hacking behavior than vanilla PRM or PQM (Zhang et al., 30 Sep 2025).

Taken together, these reasoning papers establish a central pattern: annotations can be used not merely as targets for scalar regression, but as conditioning variables that define what it means for a local step to be useful. In answer-grounded tasks, this yields especially direct formulations because the annotation can be a checkable final answer rather than a free-form judgment.

3. Probabilistic, ordinal, and absolute-quality formulations

A second major direction conditions reward learning on coarse quality annotations so that the model represents absolute quality rather than only relative preference. "Learning Ordinal Probabilistic Reward from Preferences" (Chen et al., 13 Feb 2026) introduces the Probabilistic Reward Model (PRM, not to be confused with the process reward models of Section 2, which share the acronym) and its discrete realization, the Ordinal Probabilistic Reward Model (OPRM). Instead of a deterministic scalar reward, OPRM predicts a probability mass function over a discrete set of ordinal ratings, and pairwise preference becomes a probability of superiority of one response's rating over the other's. The annotation-conditioned component enters through quality-level labels (good, normal, bad), each mapped to a sub-region of the rating scale. Region Tuning constrains support to annotation-consistent regions, while Region Flooding Tuning replaces rigid rectangular support with a lower-triangular flooded region so that order-sensitive gradients are preserved within the coarse labels. The paper reports Overall and Overall* scores for OPRM-Qwen2.5-32B and finds that OPRM-RgFT improves calibration further: RewardBench ECE-10 decreases from the Qwen-2.5-32B baseline to OPRM-32B, and decreases again for OPRM-RgFT-32B (Chen et al., 13 Feb 2026).

The same paper makes the conditioning mechanism explicit: the annotations do not act as a separate classifier target but determine where probability mass should lie in ordinal reward space. This is a strong form of annotation-conditioning because it changes the support of the objective itself. A related implication is that individual absolute scores acquire semantic meaning through alignment with good/normal/bad regions, rather than being only latent ranks.

"Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling" (Duan et al., 11 Feb 2026) addresses a different limitation of annotation-trained scalar rewards: their susceptibility to noise and systematic bias such as length or style. BNRM keeps the Bradley–Terry likelihood but replaces the dense scalar head with a non-negative latent-factor model in which local latent variables capture instance-specific factor activations and global factors provide sparse population-level weighting. The paper interprets this as disentanglement-then-debiasing: sparsity in the local variables yields parts-based reward structure, while sparsity in the global factors suppresses spurious factors that explain annotations without generalizing as reward. Empirically, on RM-Bench Hard the Pearson correlation between response length and reward is lower for BNRM than for vanilla BT, and the model shows improved resistance to over-optimization in Best-of-$N$ and PPO settings (Duan et al., 11 Feb 2026).

These probabilistic formulations do not all require external annotations at inference, but they make annotation structure explicit at training time. A plausible implication is that annotation-conditioned reward models become more useful when the intended deployment requires thresholding, filtering, or calibrated uncertainty, not merely pairwise reranking.
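To make the probability-of-superiority and thresholding ideas concrete, the following Python sketch works with two ordinal rating distributions. It is an illustration under stated assumptions rather than the OPRM implementation: the two ratings are treated as independent, ties count for neither response, and the five-level scale and the choice of the top two levels as a "good" region are invented for the example.

```python
def prob_superior(pmf_a, pmf_b):
    """P(rating of A > rating of B), assuming independent ratings.

    pmf_a[i] and pmf_b[i] are the probabilities of ordinal level i.
    Ties are excluded, which is an assumption of this sketch.
    """
    total = 0.0
    for i, pa in enumerate(pmf_a):
        for j, pb in enumerate(pmf_b):
            if i > j:
                total += pa * pb
    return total


def region_mass(pmf, low, high):
    """Probability mass on ordinal levels low..high (inclusive)."""
    return sum(pmf[low:high + 1])


# Hypothetical five-level rating distributions for two responses.
pmf_a = [0.0, 0.1, 0.2, 0.4, 0.3]
pmf_b = [0.2, 0.4, 0.3, 0.1, 0.0]

print(prob_superior(pmf_a, pmf_b))   # pairwise use: is A rated above B?
print(region_mass(pmf_a, 3, 4))      # pointwise use: mass in the assumed "good" region
```

The second call is what a scalar pairwise model cannot offer directly: a filter such as "keep responses whose mass in the good region exceeds a threshold" needs an absolute, calibrated distribution.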

4. Dimension-, concept-, and rubric-conditioned decomposition

A third family decomposes reward into interpretable subcomponents and conditions aggregation on annotation structure. "Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward Modeling" (Chen et al., 7 Apr 2026) is exemplary. VL-MDR uses a dataset of 321.3k retained preference pairs annotated with top-3 relevant dimensions drawn from 21 fine-grained dimensions under 7 core capabilities. Each sample carries three annotations: an indicator of which dimensions are relevant, sparse per-dimension preferences, and the overall preference. The model predicts dimension relevance from the multimodal query, per-dimension scores from the response, and adaptive weights over the selected dimensions; the final reward aggregates the selected per-dimension scores using those weights.
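The select-then-aggregate computation described above can be sketched in a few lines of Python. This is a hypothetical illustration of the general pattern only: the exact functional form used by VL-MDR is not reproduced here, and the softmax over relevance logits, the function name, and the toy numbers are all assumptions.

```python
import math


def decomposed_reward(relevance_logits, dimension_scores, k=3):
    """Select the k most relevant dimensions and aggregate their scores.

    relevance_logits: one relevance logit per dimension (assumed to be
        predicted from the query).
    dimension_scores: one score per dimension (assumed to be predicted
        from the response).
    Returns the selected indices, their weights, and the scalar reward.
    """
    order = sorted(range(len(relevance_logits)),
                   key=lambda d: relevance_logits[d], reverse=True)
    selected = order[:k]
    # Softmax over the selected logits only (an illustrative weighting choice).
    peak = max(relevance_logits[d] for d in selected)
    exps = [math.exp(relevance_logits[d] - peak) for d in selected]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    reward = sum(w * dimension_scores[d] for w, d in zip(weights, selected))
    return selected, weights, reward


selected, weights, reward = decomposed_reward(
    relevance_logits=[2.0, 0.0, 2.0, -1.0, 2.0],
    dimension_scores=[1.0, 9.0, 2.0, 9.0, 3.0],
)
print(selected, weights, reward)
```

Because unselected dimensions receive zero weight, a high score on an irrelevant dimension (indices 1 and 3 in the toy input) cannot inflate the reward, which is the interpretability argument for routing before aggregation.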

The annotation-conditioned mechanism is two-level: the relevance annotations supervise routing and the per-dimension preferences supervise the dimension-specific reward components. On VL-RewardBench, the scalar baseline reaches 64.55 overall / 60.87 macro, whereas VL-MDR reaches 70.81 / 69.96; the Top-$k$ sweep peaks at $k=3$, matching the top-3 annotation schema (Chen et al., 7 Apr 2026).

"Interpretable Reward Modeling with Active Concept Bottlenecks" (Laguna et al., 7 Jul 2025) makes the decomposition explicit at the concept level. CB-RM predicts a concept vector and prompt-conditioned concept weights, and computes the reward from the two. The concepts are dataset-defined human-interpretable attributes (helpfulness, correctness, coherence, complexity, verbosity, instruction following, truthfulness, honesty, safety, and readability), and the supervision consists of relative binary concept labels for pairwise comparisons. The model is probabilistic at the concept level through a diagonal Gaussian concept head, so active concept acquisition can target uncertainty in individual concept-instance pairs. This is annotation-conditioned reward modeling in a bottleneck form: the reward depends on interpretable concept predictions that are themselves trained against selective concept annotations (Laguna et al., 7 Jul 2025).

"Multi-dimensional Preference Alignment by Conditioning Reward Itself" (Jang et al., 11 Dec 2025) moves the same logic into diffusion alignment. MCDPO observes that a globally preferred image may still lose on a specific dimension such as aesthetics or semantic alignment, so scalar Bradley–Terry aggregation creates reward conflict. It therefore constructs a preference outcome vector $\gamma$ that records the outcome on each axis and defines an implicit reward conditioned on $\gamma$. The conditioning signal comprises five axes (the original human preference label, PickScore, Aesthetic, HPSv2, and CLIP), and dimensional reward dropout drops selected components of $\gamma$ during training to prevent easy dimensions from dominating optimization. The resulting conditional framework supports inference-time axis control through classifier-free guidance and outperforms scalar DPO-style baselines on Stable Diffusion 1.5 and SDXL (Jang et al., 11 Dec 2025).

"Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation" (Gu et al., 17 Jun 2026) replaces scalar reward optimization entirely with annotation-conditioned teacher guidance. A rubric is defined as

$r=\{c_1,\dots,c_K\}$

where each criterion $c_k$ has a title, description, and weight/category such as Essential, Important, Optional, or Pitfall. The student reasoner is trained on its own sampled rollouts, while a privileged teacher conditioned on the question, the rubric $r$, and the rollout prefix supplies token-level supervision. The paper frames this as a structured alternative to both scalar reward RL and single-rationale distillation, and reports average gains of 1.0 points over GRPO and 0.9 points over OPSD on science reasoning benchmarks (Gu et al., 17 Jun 2026).

5. Annotation efficiency, active acquisition, and alternative feedback channels

Because richer annotations are costly, a substantial subliterature focuses on which annotations to collect and how to replace them when possible. "Reviving The Classics: Active Reward Modeling in LLM Alignment" (Shen et al., 4 Feb 2025) formulates reward-model annotation as an active experimental-design problem. For linearized Bradley–Terry reward models, the Fisher information matrix takes the standard logistic form, written here in generic notation with $\Delta\phi_i$ the feature difference of comparison $i$ and $p_i$ the predicted preference probability:

$\mathcal{I}(\theta) = \sum_{i} p_i (1 - p_i)\, \Delta\phi_i \Delta\phi_i^{\top}, \qquad p_i = \sigma\!\big(\theta^{\top} \Delta\phi_i\big),$

which decomposes informativeness into geometric diversity of feature differences and uncertainty through $p_i(1-p_i)$. D-optimal and past-aware D-optimal selection therefore prioritize comparisons that are both nontrivial and nonredundant. The paper reports that these strategies outperform entropy sampling, maxdiff, BatchBALD, coreset, and random selection across Gemma-2B, Gemma-7B, and LLaMA3-8B, and that cross-prompt comparisons significantly enhance labeling efficiency (Shen et al., 4 Feb 2025).
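The interaction between diversity and uncertainty in D-optimal selection can be illustrated with a small pure-Python sketch. It assumes a linear Bradley–Terry model in which each candidate comparison is summarized by a feature-difference vector, so the Fisher information is a sum of outer products weighted by p(1 - p). The greedy loop, the ridge term that keeps the matrix invertible, and all function names are illustrative choices and are not taken from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fisher_information(theta, diffs, ridge=1e-3):
    """ridge * I + sum_i p_i (1 - p_i) d_i d_i^T for a linear BT model."""
    dim = len(theta)
    info = [[ridge if r == c else 0.0 for c in range(dim)] for r in range(dim)]
    for x in diffs:
        p = sigmoid(sum(t * v for t, v in zip(theta, x)))
        weight = p * (1.0 - p)  # largest when the comparison is uncertain
        for r in range(dim):
            for c in range(dim):
                info[r][c] += weight * x[r] * x[c]
    return info


def log_det_spd(matrix):
    """Log-determinant of a symmetric positive-definite matrix by elimination."""
    a = [row[:] for row in matrix]
    total = 0.0
    for k in range(len(a)):
        total += math.log(a[k][k])
        for r in range(k + 1, len(a)):
            factor = a[r][k] / a[k][k]
            for c in range(k, len(a)):
                a[r][c] -= factor * a[k][c]
    return total


def greedy_d_optimal(theta, candidates, budget, ridge=1e-3):
    """Greedily pick comparisons that maximize log det of the information."""
    chosen, remaining = [], list(range(len(candidates)))
    for _ in range(min(budget, len(candidates))):
        def score(i):
            picked = [candidates[j] for j in chosen + [i]]
            return log_det_spd(fisher_information(theta, picked, ridge))
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Feature differences for four hypothetical candidate comparisons.
candidates = [[1.0, 0.0], [1.01, 0.0], [0.0, 1.0], [0.001, 0.0]]
print(greedy_d_optimal([0.0, 0.0], candidates, budget=2))
print(greedy_d_optimal([5.0, 0.0], candidates, budget=1))
```

With a zero parameter vector every comparison is equally uncertain, so the second pick avoids the near-duplicate direction; with a confident model along the first axis, the first pick moves to the comparison whose outcome is still uncertain.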

Active acquisition also appears at finer granularity. CB-RM chooses not only which instances to annotate but which concept within which instance, using Expected Information Gain as the acquisition score, and shows faster concept learning than random or variance-only querying without sacrificing preference accuracy (Laguna et al., 7 Jul 2025).

In sequential control, "Advantage Reward Modeling for Long-Horizon Manipulation" (Mao et al., 3 Apr 2026) replaces dense scalar progress annotation with tri-state labels over short observation pairs: Progressive, Regressive, and Stagnant. ARM trains a multimodal temporal transformer with interval classification and completion heads, reconstructs a dense progress signal, and then derives chunk-level gains for advantage-weighted behavior cloning. The annotation protocol is explicitly cheaper: human tri-state labeling yields 250 samples per 8-hour shift versus 100 for manual subtask segmentation, while ARM auto tri-state labeling exceeds 2000 samples per 8-hour shift on one A100. On the long-horizon towel-folding task, AW-BC with ARM reaches 99.4% success, compared with 78.5% for RA-BC using SARM and 62.1% for the BC baseline (Mao et al., 3 Apr 2026).

Several neighboring methods replace explicit annotations with weaker surrogates. GAN-RM treats reward modeling for visual generation as binary discrimination between a small set of preferred target samples and ordinary generator outputs, requiring only a few hundred proxy samples rather than pairwise human labels; on SD1.5 and SDXL it matches or exceeds annotation-heavy baselines in image quality, safety, and video experiments (Liu et al., 16 Jun 2025). ARF-RLHF converts free-form user follow-up language into continuous rewards via a RoBERTa-mini emotion analyzer trained on Emotion3, then adapts those scores online with replay and rescoring; it reports more than 70% accuracy on GoEmotions, Sentiment140, and DailyDialog and downstream gains of 3.3% over PPO and 7.6% over DPO across several small LLM backbones (Zhang, 3 Jul 2025). These methods are not annotation-conditioned in the strictest sense, but they broaden the topic by showing how reward can be conditioned on proxy distributions or on transformed free-form feedback rather than on canonical pairwise labels.

6. Robustness, causal correction, and theoretical limits

Annotation-conditioned reward modeling is partly motivated by negative results about scalar reward learning. "The Representation-Rationalizability Tradeoff in Reward Learning" (Dong et al., 29 May 2026) formalizes RLHF with heterogeneous annotators and shows that any reward built on a given representation incurs an exact excess-loss decomposition into an embedding term and an aggregation term. A richer representation reduces embedding loss but can expose more preference cycles, increasing aggregation cost; pooled preferences from heterogeneous annotators need not be rationalizable by any single scalar reward. The paper presents this as a theoretical motivation for conditional or annotator-aware reward models, while also warning that conditioning merely relocates, rather than abolishes, the tradeoff (Dong et al., 29 May 2026).

"Reward Gaming in Conditional Text Generation" (Pang et al., 2022) provides the complementary failure taxonomy under optimization. It identifies three classes of errors in reward models learned from annotations: noise-induced spurious correlation, naturally occurring spurious correlation, and covariate shift. The synthetic Sudoku examples are especially stark: a reward model with 99.3% i.i.d. test accuracy can still drive RL toward invalid completions when a tiny mislabeled pattern is amplified, and a model with 96.5% accuracy can assign high reward to out-of-support invalid outputs. The same phenomena appear in MT with MQM-derived token rewards and in summarization with a faithfulness classifier. The central lesson is that held-out accuracy on the annotation distribution is insufficient once the learned reward becomes an optimization target (Pang et al., 2022).

"CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks" (Wang et al., 19 Mar 2026) makes the annotation mechanism explicit. It distinguishes latent true preference, observed noisy feedback, and observability $o_i$, then combines a class-conditional noise model with propensity correction. The resulting IPS and doubly robust objectives correct both annotation corruption and selective observation.

The paper proves unbiasedness under its assumptions and reports substantial downstream improvements in safety alignment, including a 49.2% gain on WildGuardMix and a 32.7% improvement on HarmBench relative to naive reward modeling (Wang et al., 19 Mar 2026).
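The two corrections named above, for noisy labels and for selective observation, can be illustrated with textbook estimators. The Python sketch below combines the standard unbiased loss correction for class-conditional label noise with inverse propensity scoring (IPS). It is not CausalRM's objective: the flip rates, propensities, and function names are assumed inputs for illustration, and the doubly robust variant is omitted.

```python
import math


def bce(p, y):
    """Binary cross-entropy of predicted probability p against label y."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


def noise_corrected_bce(p, y_obs, rho0, rho1):
    """Loss whose expectation over label flips equals the clean-label loss.

    rho0 = P(observe 1 | true 0), rho1 = P(observe 0 | true 1); both are
    assumed known here and must satisfy rho0 + rho1 < 1.
    """
    rho_obs = rho1 if y_obs == 1 else rho0    # flip rate of the observed class
    rho_other = rho0 if y_obs == 1 else rho1  # flip rate of the opposite class
    numerator = (1.0 - rho_other) * bce(p, y_obs) - rho_obs * bce(p, 1 - y_obs)
    return numerator / (1.0 - rho0 - rho1)


def ips_corrected_loss(preds, feedback, observed, propensities, rho0, rho1):
    """IPS-weighted, noise-corrected loss averaged over all n items.

    observed[i] is 1 if feedback was recorded for item i; propensities[i]
    is the assumed probability that it would be recorded.
    """
    total = 0.0
    for p, y, o, e in zip(preds, feedback, observed, propensities):
        if o:  # unobserved items contribute nothing; observed ones are upweighted
            total += noise_corrected_bce(p, y, rho0, rho1) / e
    return total / len(preds)


loss = ips_corrected_loss(
    preds=[0.8, 0.3, 0.6],
    feedback=[1, 0, 1],
    observed=[1, 1, 0],
    propensities=[0.5, 0.9, 0.2],
    rho0=0.1,
    rho1=0.2,
)
print(loss)
```

Dividing by the propensity gives rarely observed feedback more weight, and the noise correction subtracts the share of the loss attributable to flipped labels; both steps rely on the noise rates and propensities being estimated well, which is where a doubly robust estimator is usually motivated.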

The present landscape therefore does not support a single canonical design. Instead, it suggests a set of recurring design choices. Reward models can be conditioned on answers, quality regions, dimensions, concepts, rubrics, local transitions, or observational feedback mechanisms; they can expose that conditioning in the reward parameterization itself, in a privileged teacher, or only in label construction; and they trade off calibration, interpretability, annotation cost, and robustness in different ways. A plausible synthesis is that annotation-conditioned reward modeling is most compelling when the annotation captures structure that a scalar preference label discards—especially absolute quality levels, decomposed criteria, stepwise causality, or known biases in the annotation channel—and least compelling when that structure cannot be identified reliably or is too costly to obtain.