Reward Models in ML: Fundamentals and Advances

Updated 30 May 2026

Reward models are learned functions, typically neural networks, that score candidate outputs according to human preferences, domain objectives, and process-level feedback.
They span discriminative, generative, and probabilistic paradigms and are used to guide optimization in language and vision domains.
Recent advances focus on robustness, interpretability, and efficiency by leveraging ensembles, modular architectures, and process supervision to mitigate reward hacking and bias.

A reward model is a parameterized function—typically neural, often deep—that scores candidate outputs (e.g., language completions, action sequences, reasoning steps) in a manner intended to align with human preferences, domain-specific objectives, or other externally supplied desiderata. In contemporary machine learning, reward models serve as proxies for “true” or idealized objectives that may be expensive, ambiguous, or otherwise impractical to specify or evaluate directly. They are central to alignment methodologies such as Reinforcement Learning from Human Feedback (RLHF), policy optimization, preference learning, and preference-based reinforcement learning in both language and vision-language domains. Recent research emphasizes the multidimensional nature of reward modeling—spanning discrete and probabilistic paradigms, process- or outcome-level supervision, interpretability, robustness, bias, data efficiency, and integration with pipeline or multi-stage architectures.

1. Mathematical Foundations and Paradigms

Reward models are generally defined as learned functions

$r_\theta : (x, y) \to \mathbb{R}$

where $x$ is a context (prompt, state, or input) and $y$ a candidate output (response, action, trajectory). The principal training objective is to fit human preference judgments on pairwise or listwise comparisons, most commonly under the Bradley–Terry model,

$P(y_1 \succ y_2 \mid x) = \sigma(r_\theta(x, y_1) - r_\theta(x, y_2)),$

where $\sigma(z) = 1 / (1 + e^{-z})$ is the logistic sigmoid, and training is performed by minimizing the negative log-likelihood across annotated preference pairs $(x, y_1 \succ y_2)$ (Zhong et al., 12 Apr 2025, Yu et al., 18 Jun 2025).
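
The pairwise objective above can be illustrated with a minimal sketch in plain Python. It assumes the scalar scores $r_\theta(x, y)$ have already been computed by some model; the function names are illustrative and are not taken from any cited work.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def preference_probability(r_chosen: float, r_rejected: float) -> float:
    """P(chosen > rejected | x) = sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def bradley_terry_nll(score_pairs) -> float:
    """Mean negative log-likelihood over (r_chosen, r_rejected) score pairs.

    Uses the identity -log(sigmoid(m)) = softplus(-m) to avoid log(0).
    """
    losses = [softplus(-(rc - rr)) for rc, rr in score_pairs]
    return sum(losses) / len(losses)


# Hypothetical scores for three annotated pairs (chosen, rejected).
example_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
example_loss = bradley_terry_nll(example_pairs)
```

In an actual training loop the same expression would be evaluated on batched tensors and differentiated with respect to $\theta$; the scalar version here is only meant to make the formula concrete.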

Three major families of reward models have emerged:

Discriminative Reward Models (DRMs): Learn scalar reward functions via MLP heads attached to pre-trained LLMs or encoders. These are prevalent in both language (Zhong et al., 12 Apr 2025, Chen et al., 13 Feb 2026) and vision-language tasks (Zang et al., 21 Jan 2025).
Generative Reward Models (GRMs): Cast reward estimation as generative modeling—e.g., predicting next tokens corresponding to qualitative assessments, or generating chain-of-thought explanations before numerical scoring. GRAM demonstrates that unsupervised pre-training of comparisons yields improved generalization, with label smoothing interpreted as regularization of pairwise ranking loss (Wang et al., 17 Jun 2025).
Implicit Reward Models: Eschew explicit reward heads in favor of training policies directly to increase the log-probability of preferred responses relative to dispreferred ones, measured against a reference policy, as in Direct Preference Optimization (DPO) (Zhong et al., 12 Apr 2025, Lambert et al., 2024).
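
As a rough illustration of the implicit family, the sketch below computes the standard DPO loss for one preference pair. It assumes the summed log-probabilities of each response under the policy and under a frozen reference model are already available; the argument names and the default `beta` are illustrative choices.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """Implicit reward beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(implicit reward of chosen - implicit reward of rejected)."""
    margin = (implicit_reward(logp_chosen, ref_logp_chosen, beta)
              - implicit_reward(logp_rejected, ref_logp_rejected, beta))
    return softplus(-margin)
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss decreases as the policy raises the chosen response's log-probability relative to the rejected one.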

Extensions include:

Probabilistic Reward Models (PRMs/OPRM): Output full distributions over ordinal rating scales (e.g., 1–9), modeling reward as a random variable $S$ with conditional density $p_\theta(s|x, y)$ and learning via closed-form likelihoods. OPRM applies this to discrete K-way ratings, yielding better-calibrated and distributionally informative outputs (Chen et al., 13 Feb 2026).
Step-Level (Process-Supervised) Reward Models: Assign rewards at each intermediate reasoning step rather than only at the outcome, enabling fine-grained navigation for reasoning or code tasks (Ma et al., 2023).
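
To make the probabilistic view concrete, the following sketch treats the reward as a categorical distribution over ordinal ratings 1 to 9 and derives an expected rating and a per-example negative log-likelihood. The softmax parameterization and all names are assumptions made for illustration; this is not the OPRM implementation.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rating_distribution(logits, ratings=range(1, 10)):
    """Map each ordinal rating s to p(s | x, y)."""
    ratings = list(ratings)
    if len(logits) != len(ratings):
        raise ValueError("need exactly one logit per rating")
    return dict(zip(ratings, softmax(logits)))


def expected_rating(dist):
    """Scalar summary: E[S] under the predicted distribution."""
    return sum(s * p for s, p in dist.items())


def rating_nll(dist, observed_rating):
    """Negative log-likelihood of an annotated rating."""
    return -math.log(dist[observed_rating])
```

Keeping the full distribution, rather than only its mean, is what allows calibration and uncertainty to be inspected downstream.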

2. Preference Data, Supervision Schemes, and Training

Reward model fidelity depends critically on the quality and granularity of preference data (Zhong et al., 12 Apr 2025). Major collection modalities:

Human Preference: Pairwise comparisons (“Which of these two responses do you prefer?”), rating scales, or coarse quality levels (e.g., “bad,” “normal,” “good”) for region-based or ordinal supervision.
Synthetic/AI Preference: Feedback or pairings generated by “judge LLMs,” surrogate models, or verifier functions, especially for data augmentation or automated scaling (Liu et al., 29 Sep 2025, Zang et al., 21 Jan 2025).
Process-Level Annotations: Marking intermediate states or reasoning steps as positive, neutral, or negative; typically automated in code (e.g., test-case outcomes per step) or labeled in reasoning tasks (Ma et al., 2023).
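
A small sketch of how step-level labels might be turned into numeric signals is given below. The numeric mapping and the `min`/`mean` aggregation rules are assumptions chosen for illustration; the cited works may use different conventions.

```python
# Hypothetical numeric values for the three step labels.
STEP_LABEL_VALUES = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}


def step_rewards(labels):
    """Translate per-step labels into per-step rewards."""
    return [STEP_LABEL_VALUES[label] for label in labels]


def first_negative_step(labels):
    """Index of the first step marked negative, or None if there is none."""
    for index, label in enumerate(labels):
        if label == "negative":
            return index
    return None


def trajectory_score(labels, how="min"):
    """Aggregate step rewards into one score for the whole trajectory."""
    rewards = step_rewards(labels)
    if how == "min":    # a chain is only as good as its weakest step
        return min(rewards)
    if how == "mean":   # average quality across steps
        return sum(rewards) / len(rewards)
    raise ValueError(f"unknown aggregation: {how}")
```

The `min` rule penalizes any single faulty step, which matches the intuition that one wrong step can invalidate a derivation, while `mean` is more forgiving.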

Training objectives include:

Bradley–Terry Loss: Standard for pairwise preference, as above.
Cross-Entropy or Regression Loss: Used in non-pairwise settings, e.g., when supervising on rating scales.
Label Smoothing and Regularization: Used to limit overfitting, especially in generative reward models; GRAM, for example, employs label smoothing to improve generalization (Wang et al., 17 Jun 2025).
Semi-Supervised/Region Tuning: Leveraging small amounts of quality-annotated data to polarize probability mass in ordinal PRM/OPRM via Region Flooding Tuning (RgFT), effectively partitioning the rating space to encode (bad, normal, good) structure (Chen et al., 13 Feb 2026).
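
The effect of label smoothing on a pairwise loss can be sketched as follows. This is the generic smoothed binary cross-entropy applied to the Bradley–Terry margin, with an illustrative `epsilon`; it is not claimed to be the exact GRAM formulation.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def smoothed_pairwise_loss(r_chosen: float, r_rejected: float,
                           epsilon: float = 0.1) -> float:
    """Label-smoothed pairwise loss.

    With probability mass (1 - epsilon) on "chosen wins" and epsilon on
    "rejected wins":
        L = -(1 - eps) * log sigmoid(m) - eps * log sigmoid(-m),
    where m = r_chosen - r_rejected.
    """
    margin = r_chosen - r_rejected
    return (1.0 - epsilon) * softplus(-margin) + epsilon * softplus(margin)
```

Unlike the unsmoothed loss, which keeps decreasing as the margin grows, this loss is minimized at the finite margin log((1 - epsilon) / epsilon), so the model is not rewarded for pushing scores arbitrarily far apart.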

Recent advances focus on data efficiency (e.g., achieving strong calibration with minimal coarse labeling), robust training against spurious features (verbosity, response length), and preference elicitation methods such as rubric-based multi-dimensional judgments (Chen et al., 13 Feb 2026, Gehrmann, 3 Oct 2025).

3. Robustness, Reward Hacking, and Ensembles

A central challenge for reward models is robustness under distribution shift—specifically, the phenomenon of reward hacking. Policies optimized against imperfect RMs can exploit spurious patterns (e.g., verbosity bias, format artifacts) to receive high reward without genuinely aligning with human intent (Eisenstein et al., 2023, Fu, 30 Nov 2025). Key findings:

Overoptimization and Underspecification: Reward models with matching in-distribution performance (preference accuracy) may diverge dramatically when used to align policy generation in the wild, indicating that the preference-learning objective leaves the reward function highly underspecified (Eisenstein et al., 2023).
Ensembles as Mitigation: Ensembles of reward models, particularly those differing by pre-training (not just fine-tuning) seeds, show improved generalization and dampen overoptimization, but shared inductive biases across the ensemble leave some vulnerabilities unresolved (Eisenstein et al., 2023).
Mixture-of-Experts Reward Models: Upcycled-MoE architectures introduce expert diversity with shared and instruction-specific experts, then merge back to maintain performance at minimal inference cost (Fu, 30 Nov 2025).
Adversarial Patching: Reward-guided controlled decoding (REFORM) generates adversarial examples—false positives or negatives—revealing and patching model failure modes with minimal retraining (Pathmanathan et al., 8 Jul 2025).
Process-based and Chain-of-Thought RMs: Reward Reasoning Models (RRM) and step-level process supervision improve judgment on complex tasks, leveraging structured deliberation and explicit reasoning for robust, interpretable rating (Guo et al., 20 May 2025, Ma et al., 2023).
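
The ensemble idea can be sketched with a few common aggregation rules. The specific rules shown (mean, worst case, mean minus standard deviation) and the scores are illustrative assumptions; the sketch only shows how a conservative aggregate can change which candidate is preferred when ensemble members disagree.

```python
import statistics


def aggregate(scores, method="mean"):
    """Combine one candidate's scores from several reward models."""
    if method == "mean":
        return statistics.fmean(scores)
    if method == "min":              # worst-case member
        return min(scores)
    if method == "mean_minus_std":   # penalize disagreement
        return statistics.fmean(scores) - statistics.pstdev(scores)
    raise ValueError(f"unknown method: {method}")


def pick_best(candidate_scores, method="mean"):
    """Return the candidate whose aggregated ensemble score is highest."""
    return max(candidate_scores,
               key=lambda name: aggregate(candidate_scores[name], method))


# Hypothetical scores from a three-member ensemble.
scores = {
    "A": [2.0, 2.0, 2.0],    # members agree
    "B": [4.0, 3.5, -1.0],   # one member strongly disagrees
}
```

Here the mean favors candidate B, whereas the two conservative rules favor A, reflecting the intuition that a response exploiting one model's blind spot is less likely to fool every member.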

4. Evaluation Benchmarks and Metrics

RewardBench: A reference suite comprising >2,500 manually verified prompt–chosen–rejected triplets across chat, instruction, safety, reasoning, and code, directly comparing RM predictions to ground-truth preference (Lambert et al., 2024).
Preference Proxy Evaluations (PPE): A large-scale evaluation linking proxy metrics (pairwise accuracy, ROC-AUC, best-of-K, calibration) to actual downstream RLHF performance as measured by human votes in Chatbot Arena deployments; demonstrates that simple pairwise accuracy most strongly predicts RLHF outcomes (Frick et al., 2024).
Overoptimization Sensitivity (γ): Degree of overoptimization, defined as the area between gold standard and proxy reward curves over increasing KL-divergence between policy and reference distributions (Kim et al., 19 May 2025).
Calibration Error (ECE): Measures alignment of RM scores with empirical preference frequencies, critical for thresholding (Chen et al., 13 Feb 2026).
Domain-Specific Metrics: For specialized domains (e.g., translation, code, vision), dedicated metrics (COMETKiwi, mT5-SEAHORSE, accuracy under verifiers) may outperform generic reward models; empirical studies caution against conflation of metrics and reward models (Gehrmann, 3 Oct 2025).
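
Two of the metrics listed above, pairwise accuracy and expected calibration error, can be sketched in a few lines. The equal-width binning scheme and the default of ten bins are common choices assumed here, not details taken from the cited benchmarks.

```python
def pairwise_accuracy(score_pairs):
    """Fraction of (r_chosen, r_rejected) pairs ranked correctly."""
    correct = sum(1 for rc, rr in score_pairs if rc > rr)
    return correct / len(score_pairs)


def expected_calibration_error(predicted_probs, outcomes, n_bins=10):
    """ECE with equal-width bins over [0, 1].

    predicted_probs: model probability that the first response is preferred.
    outcomes: 1 if annotators preferred the first response, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(predicted_probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, o))

    total = len(predicted_probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(p for p, _ in members) / len(members)
        empirical_rate = sum(o for _, o in members) / len(members)
        ece += (len(members) / total) * abs(mean_confidence - empirical_rate)
    return ece
```

A model can have high pairwise accuracy and still be poorly calibrated, which is why the two quantities are usually reported separately.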

Advances in benchmark design include multi-pairwise comparisons, style and domain diversity, listwise scoring, and ablation on irrelevant distributional gaps to more reliably evaluate reward models’ systematic generalization and resistance to Goodhart’s Law (Kim et al., 19 May 2025).
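
The area-based notion of overoptimization described above can be approximated with the trapezoidal rule. The sketch assumes proxy and gold rewards have been measured at the same KL checkpoints and placed on a comparable scale; both the normalization and the sign convention (proxy minus gold) are assumptions made here and may differ from the cited definition.

```python
def area_between_curves(kl_values, proxy_rewards, gold_rewards):
    """Signed trapezoidal area between proxy and gold curves over KL.

    A positive result means the proxy reward overstates the gold reward
    on average across the measured KL range.
    """
    if not (len(kl_values) == len(proxy_rewards) == len(gold_rewards)):
        raise ValueError("all inputs must have the same length")
    area = 0.0
    for i in range(1, len(kl_values)):
        gap_prev = proxy_rewards[i - 1] - gold_rewards[i - 1]
        gap_curr = proxy_rewards[i] - gold_rewards[i]
        width = kl_values[i] - kl_values[i - 1]
        area += 0.5 * (gap_prev + gap_curr) * width
    return area
```

In a typical overoptimization curve the proxy keeps rising with KL while the gold reward plateaus or falls, so the gap, and therefore the area, grows at larger KL.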

5. Extensions: Multimodal, Structural, and Pipeline-Aware Reward Models

Reward modeling increasingly supports multimodal formats, process decomposition, and pipeline-adapted alignment:

Multimodal RMs: Integrate visual, textual, and video features and are trained on cross-domain preference data (Zang et al., 21 Jan 2025, Wang et al., 12 May 2025). Examples include InternLM-XComposer2.5-Reward and Skywork-VL Reward, both of which match or exceed proprietary baselines on VL-RewardBench.
Structural/Modular RMs: Structural architectures combine main-branch reward heads with auxiliary side-branches trained on interpretable features (e.g., facts, style, semantic relevance). Dimension-specific scores support diagnosis and debugging, especially in industrial pipelines (Liu et al., 29 Sep 2025).
Pipeline-Adaptive RMs: In settings with multi-stage LLM pipelines—e.g., formulation then code-generation—reward models are adapted for stage-wise feedback, using pipeline-internal data collection and direct preference optimization losses to maximize global execution objectives (e.g., code passes all verifications) rather than solely local fidelity (Fan et al., 20 Apr 2026).
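
One way pipeline-internal data collection could yield preference pairs is sketched below: intermediate outputs whose downstream result passed verification are paired against those that failed. The record layout, the function name, and the pass/fail pairing rule are hypothetical and are not taken from the cited work.

```python
def build_stage_pairs(prompt, candidates):
    """Build preference pairs for an early pipeline stage.

    candidates: list of (stage_output, passed) tuples, where `passed`
    records whether the final pipeline result passed all verifications.
    Every passing output is paired against every failing output.
    """
    passing = [output for output, ok in candidates if ok]
    failing = [output for output, ok in candidates if not ok]
    return [
        {"prompt": prompt, "chosen": good, "rejected": bad}
        for good in passing
        for bad in failing
    ]


# Hypothetical stage-one formulations and their end-to-end outcomes.
pairs = build_stage_pairs(
    "minimize cost subject to capacity limits",
    [("formulation_1", True), ("formulation_2", False), ("formulation_3", False)],
)
```

Pairs built this way reward an intermediate output for what it enables downstream rather than for its local plausibility, which is the sense in which the objective is global.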

6. Societal Considerations, Value Bias, and Interpretability

Research reveals that RMs inherit value biases from their base model pretraining (e.g., “agency” vs. “communion” preferences along psychological axes), persistent across model families and data scales (Christian et al., 28 Jan 2026). Alignment with sociodemographic groups is often poor, and reward models may reinforce existing stereotypes and social biases (Elle, 7 Oct 2025). Prompt-based steering is largely ineffective compared to data or architecture changes.

Recommendations call for:

Pretraining-level safety audits and intervention.
Data collection aimed at coverage of underrepresented groups and values.
Auditing and bias quantification of RMs prior to downstream alignment.
Investment in interpretable, uncertainty-aware reward models and calibration tools.

Interpretability mechanisms (e.g., step-level, chain-of-thought justifications, modular side-branches) and auxiliary metrics (calibration, proxy alignment) are critical for safe, transparent RM deployment (Liu et al., 29 Sep 2025, Zhong et al., 12 Apr 2025, Chen et al., 13 Feb 2026).

7. Practical Deployment, Data Efficiency, and Computational Cost

Modern reward models demonstrate substantial advances in efficiency:

OPRM achieves state-of-the-art performance with 2.6× inference speedup over standard DRMs and no extra model parameters; compatible with existing inference infrastructures (Chen et al., 13 Feb 2026).
Lightweight models (e.g., ELHSR) leverage hidden-state extraction to achieve competitive best-of-N performance with <0.005% of baseline parameter count and orders of magnitude faster inference (Guo et al., 18 May 2025).
Semi-supervised and pipeline-adapted models can scale using only modest manual annotation by exploiting automated verification or unlabeled preference pairings (Chen et al., 13 Feb 2026, Fan et al., 20 Apr 2026).
Ensemble and MoE-merging architectures maintain robust generalization at a small increase in pretraining cost (often just 1.2×) and with negligible inference overhead once merged (Fu, 30 Nov 2025, Eisenstein et al., 2023).
Trade-offs include interpretability vs. throughput (GRMs, chain-of-thought RMs), rating tokenization flexibility vs. speed (OPRM), and dependence on preference data granularity vs. absolute calibration (RgFT).
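
Best-of-N selection, mentioned above as a setting in which lightweight reward models are compared, reduces to scoring each sampled candidate and keeping the top one. In the sketch below `toy_reward` is a deliberately crude stand-in for a learned reward model (keyword overlap with a small length penalty) and is purely hypothetical.

```python
def best_of_n(prompt, candidates, reward_fn):
    """Return the candidate with the highest reward under reward_fn."""
    return max(candidates, key=lambda response: reward_fn(prompt, response))


def toy_reward(prompt, response):
    """Hypothetical scorer: prompt-word overlap minus a small length penalty."""
    prompt_words = set(prompt.lower().split())
    response_words = response.lower().split()
    overlap = sum(1 for word in set(response_words) if word in prompt_words)
    return overlap - 0.01 * len(response_words)


candidates = [
    "use sorted on the list",
    "python has a built-in sorted function for a list",
    "I do not know",
]
selected = best_of_n("sort a list in python", candidates, toy_reward)
```

Because the reward model is called once per candidate, inference cost scales linearly with N, which is why per-call efficiency matters in this setting.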

Careful attention to data curation, regularization, modularity, and benchmark-driven validation is essential for achieving both practical scalability and reliable human-value alignment.