Reward-Based Fine-Tuning

Updated 12 June 2026

Reward-based fine-tuning is a technique that adjusts pretrained generative models using a scalar reward function, integrating human feedback, domain metrics, or proxy evaluations.
It employs diverse methodologies such as gradient-based backpropagation, RL-style policy gradients, and reward-weighted supervised learning to effectively steer model outputs.
The approach uses regularization strategies like KL divergence and entropy penalties to prevent reward overoptimization, maintaining balance between fidelity, diversity, and domain alignment.

Reward-based fine-tuning is a class of post-training procedures that adjust the parameters of a pretrained model—LLM, diffusion model, flow, or other generative architecture—using an explicit scalar reward function. The reward is typically learned from human feedback, domain-specific metrics, proxy evaluations, or scientific objectives. Across domains, the goal of reward-based fine-tuning is to steer a pretrained model to produce outputs with higher expected reward, subject to fidelity, diversity, or regularization constraints.

1. Theoretical Foundations and Objectives

Reward-based fine-tuning modifies a pretrained generative model $p_\theta$ to maximize the expected value of a scalar reward function $r(y, c)$, where $y$ is the generated output and $c$ is an optional conditioning signal (e.g., prompt, context). The optimization objective is

$J(\theta) = \mathbb{E}_{y \sim p_\theta(y|c)}[r(y, c)]$

subject to constraints—often via regularization—that prevent divergence from the base model. In many settings, this is formulated as an entropy- or KL-regularized reinforcement learning (RL) objective:

$J_\lambda(\theta) = \mathbb{E}_{y \sim p_\theta(y|c)}[r(y, c)] - \lambda\, \mathrm{KL}\left(p_\theta(\cdot|c)\,\|\,p_{\rm ref}(\cdot|c)\right)$

where $p_{\rm ref}$ is the reference (often base) model and $\lambda$ controls the reward–divergence trade-off (Ziegler et al., 2019, Kim et al., 2024, Lee et al., 19 Apr 2026).
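For a small discrete output space, the KL-regularized objective can be evaluated exactly, and its maximizer is the reward-tilted distribution $p^*(y|c) \propto p_{\rm ref}(y|c)\exp(r(y, c)/\lambda)$. The following is a minimal sketch under that assumption; the three-outcome distributions, reward values, and function names are illustrative and not taken from the cited works.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularized_objective(p, p_ref, rewards, lam):
    """J_lambda = E_p[r] - lam * KL(p || p_ref)."""
    expected_reward = sum(pi * ri for pi, ri in zip(p, rewards))
    return expected_reward - lam * kl(p, p_ref)

def tilted_distribution(p_ref, rewards, lam):
    """Closed-form maximizer: p*(y) proportional to p_ref(y) * exp(r(y) / lam)."""
    m = max(rewards)  # subtract the max reward for numerical stability
    unnorm = [q * math.exp((r - m) / lam) for q, r in zip(p_ref, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Illustrative toy problem (assumed values).
p_ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
lam = 1.0

p_star = tilted_distribution(p_ref, rewards, lam)
greedy = [1e-9, 1e-9, 1.0 - 2e-9]  # nearly all mass on the top-reward output

j_ref = regularized_objective(p_ref, p_ref, rewards, lam)
j_star = regularized_objective(p_star, p_ref, rewards, lam)
j_greedy = regularized_objective(greedy, p_ref, rewards, lam)
print(j_ref, j_star, j_greedy)
```

In this toy case the tilted distribution scores higher than both the unchanged reference and the near-greedy policy, which illustrates how $\lambda$ trades reward against divergence from $p_{\rm ref}$.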

Reward-based fine-tuning is distinguished by evaluating the reward on the model's own outputs and by weighting updates in a reinforcement-style manner, even when it is implemented with supervised or maximum-likelihood parameter updates.

2. Core Methodologies and Algorithms

A variety of algorithms implement reward-based fine-tuning in practice. The following categories capture the principal methods:

2.1. Gradient-based (Direct Backpropagation)

For differentiable $r(y, c)$, the reward gradient can be backpropagated through the model's generative chain, allowing efficient fine-tuning. In the diffusion literature, DRaFT (Clark et al., 2023) and its variants (DRaFT-K, DRaFT-LV) compute

$\nabla_\theta J(\theta) = \mathbb{E}_\epsilon[\nabla_\theta r(x_0(\theta, \epsilon))]$

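by chaining backward through all or part of the sampling procedure, where $x_0(\theta, \epsilon)$ denotes the final sample produced from initial noise $\epsilon$. Truncating the backward pass to the last $K$ sampling steps (DRaFT-K) and reducing estimator variance (DRaFT-LV) are key to computational stability and sample efficiency.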
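The effect of truncating the backward pass can be illustrated on a toy scalar sampler where the chain rule can be written by hand. The sketch below assumes a hypothetical linear update $x \leftarrow a x + \theta$ shared across steps and a quadratic reward; neither is a real diffusion model, and `draft_gradient` is an illustrative name rather than the DRaFT implementation.

```python
import random

def sample_chain(theta, eps, a=0.9, steps=10):
    """Toy deterministic sampler: start from noise eps, apply x <- a*x + theta."""
    x = eps
    for _ in range(steps):
        x = a * x + theta
    return x

def reward(x0, target=1.0):
    """Differentiable toy reward, maximal when the sample hits the target."""
    return -(x0 - target) ** 2

def draft_gradient(theta, eps, k, a=0.9, steps=10, target=1.0):
    """d r(x_0) / d theta, backpropagating through only the last k steps.

    Earlier steps are treated as constants (a stop-gradient), so each of the
    last k steps contributes a**i to d x_0 / d theta.
    """
    x0 = sample_chain(theta, eps, a, steps)
    dr_dx0 = -2.0 * (x0 - target)
    dx0_dtheta = sum(a ** i for i in range(min(k, steps)))
    return dr_dx0 * dx0_dtheta

# Gradient ascent on the reward using the cheapest truncation (k = 1).
rng = random.Random(0)
theta = 0.0
for _ in range(200):
    eps = rng.gauss(0.0, 1.0)
    theta += 0.01 * draft_gradient(theta, eps, k=1)
print(theta, reward(sample_chain(theta, 0.0)))
```

In this linear toy the truncated gradient keeps the sign of the full gradient and only shrinks its magnitude, so ascent still improves the reward; in real samplers truncation introduces bias whose impact has to be checked empirically.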

2.2. RL-Style Policy Gradient

When $r(y, c)$ is non-differentiable (e.g., when it comes from human feedback or black-box scientific metrics), REINFORCE-style estimators and their variants apply. The model is treated as a policy, and the update follows the gradient estimate

$\nabla_\theta J(\theta) = \mathbb{E}_{y \sim p_\theta(y|c)}\left[\left(r(y, c) - b(c)\right)\nabla_\theta \log p_\theta(y|c)\right]$

for an appropriate baseline $b(c)$ (Ziegler et al., 2019, Hou et al., 10 Nov 2025). In flow- and diffusion-based models, the per-step chain is typically interpreted as a finite-horizon MDP, with rewards often given at the terminal state (Li et al., 12 Aug 2025, Jia et al., 14 Feb 2026).
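A score-function (REINFORCE) update with a baseline can be sketched on a three-outcome categorical policy, where the reward is only ever queried as a black box. The softmax parametrization, the running-average baseline, and all hyperparameters below are assumptions chosen for illustration.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce(rewards, steps=2000, lr=0.1, seed=0):
    """Train a softmax policy over len(rewards) outcomes with REINFORCE."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    baseline = 0.0  # running average of observed rewards
    for _ in range(steps):
        probs = softmax(logits)
        y = rng.choices(range(len(rewards)), weights=probs)[0]
        r = rewards[y]  # black-box reward lookup; no gradient of r is used
        advantage = r - baseline
        # Gradient of log softmax w.r.t. the logits is one_hot(y) - probs.
        for k in range(len(logits)):
            grad_logp = (1.0 if k == y else 0.0) - probs[k]
            logits[k] += lr * advantage * grad_logp
        baseline += 0.05 * (r - baseline)
    return softmax(logits)

final_probs = reinforce([0.0, 0.5, 1.0])
print(final_probs)
```

Subtracting the baseline leaves the expected gradient unchanged while reducing its variance, which is why most practical variants include one.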

2.3. Reward-Weighted Supervised Learning

Several frameworks interpolate between supervised and RL-based objectives using reward-weighted log-likelihoods. Weighted SFT (as in CRAFT (Sun et al., 19 Mar 2026)) or reward-weighted regression (RWR, as in (Kim et al., 2024)) assign per-sample weights according to reward signals, retaining computational efficiency and reducing pathological variance.
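One common way to turn rewards into per-sample weights is exponential weighting, $w_i \propto \exp(r_i/\beta)$, applied to the usual log-likelihood loss. The sketch below assumes that form; the temperature `beta` and the function names are illustrative, and the exact weighting used by CRAFT or other cited methods may differ.

```python
import math

def reward_weights(rewards, beta=1.0):
    """Normalized exponential weights, w_i proportional to exp(r_i / beta)."""
    m = max(rewards)  # subtract the max reward for numerical stability
    unnorm = [math.exp((r - m) / beta) for r in rewards]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def weighted_nll(log_probs, weights):
    """Reward-weighted negative log-likelihood: -sum_i w_i * log p_theta(y_i | c_i)."""
    return -sum(w * lp for w, lp in zip(weights, log_probs))

# Illustrative batch: rewards and model log-probabilities of three samples.
rewards = [0.1, 0.9, 0.5]
log_probs = [-1.2, -2.0, -0.7]

weights = reward_weights(rewards, beta=0.5)
loss = weighted_nll(log_probs, weights)
print(weights, loss)
```

As `beta` grows the weights approach uniform and the loss reduces to ordinary supervised fine-tuning; as `beta` shrinks the update concentrates on the highest-reward samples.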

2.4. Surrogate or Latent-Space Rewards

For models with few sampling steps (e.g., step-distilled consistency models), learning a differentiable surrogate reward in the model's latent space enables effective gradient-based updates even for non-differentiable targets (Jia et al., 2024). The surrogate is trained to reproduce the target reward on off-policy generated latents.

2.5. Policy Distillation and Iterative Soft-Optimality

Iterative schemes emulate soft-optimal (entropy-regularized) policies by collecting off-policy rollouts, simulating reward-based targets, and distilling these via (forward) KL divergence minimization (Su et al., 1 Jul 2025). This approach is especially effective for stability in scientific applications with arbitrary domain-specific rewards.

2.6. Reward Score Matching (RSM) – Unified Objective

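A recent synthesis, Reward Score Matching (RSM), reframes reward-based fine-tuning for both flow and diffusion models as regularized score matching toward a reward-tilted target distribution. Schematically (the notation in the cited work may differ):

$\mathcal{L}_{\rm RSM}(\theta) = \mathbb{E}_{t, x_t}\left[w(t)\left\| s_\theta(x_t, t) - s_{\rm ref}(x_t, t) - g_r(x_t, t) \right\|^2\right] + \Omega(\theta)$

where $s_\theta$ and $s_{\rm ref}$ are the scores (or velocity fields) of the fine-tuned and reference models, $w(t)$ is a temporal weighting, $g_r(x_t, t)$ is a value-guidance correction term derived from the reward, and $\Omega(\theta)$ optionally enforces trust-region constraints (Lee et al., 19 Apr 2026).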
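The link between a reward-tilted target and a score correction can be made explicit with a short derivation. It assumes a continuous output space and a differentiable reward, and it uses the KL-regularized objective of Section 1; the soft value function $V_t$ is a symbol introduced here for illustration.

```latex
% Maximizer of the KL-regularized objective J_\lambda (Section 1):
p^*(y \mid c) = \frac{1}{Z(c)}\, p_{\rm ref}(y \mid c)\,
                \exp\!\left(\frac{r(y, c)}{\lambda}\right),
\qquad
Z(c) = \mathbb{E}_{y \sim p_{\rm ref}(\cdot \mid c)}
       \left[\exp\!\left(\frac{r(y, c)}{\lambda}\right)\right]

% Taking \nabla_y \log of both sides (Z(c) does not depend on y):
\nabla_y \log p^*(y \mid c)
  = \nabla_y \log p_{\rm ref}(y \mid c) + \frac{1}{\lambda}\, \nabla_y r(y, c)

% At an intermediate noise level t, the reward is replaced by a soft value:
V_t(x_t) = \lambda \log \mathbb{E}_{p_{\rm ref}}
           \left[\exp\!\left(\frac{r(x_0, c)}{\lambda}\right) \,\middle|\, x_t\right],
\qquad
\nabla_{x_t} \log p^*_t(x_t)
  = \nabla_{x_t} \log p_{{\rm ref}, t}(x_t) + \frac{1}{\lambda}\, \nabla_{x_t} V_t(x_t)
```

Under these assumptions the target score is the reference score plus a reward-derived guidance term, and methods differ mainly in how they estimate that term.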

3. Reward Model Construction

In reward-based fine-tuning, the reward function is central and can be constructed via various means:

Supervised human preference modeling: Pairwise or listwise human judgments are distilled into scalar-valued reward models (Ziegler et al., 2019, Li et al., 12 Aug 2025, Chijiwa et al., 18 Feb 2025).
Process-level reward for dense feedback: Custom models such as Process Reward Models (PRM) evaluate each step in a trajectory rather than only the final output, enabling dense trajectory-level supervision (e.g., StepPRM-RTL (Vijayaraghavan et al., 2 Jun 2026)).
Physics-informed and scientific rewards: Direct supervision is provided by domain equations (e.g., PDE residuals in physics; (Yuan et al., 24 Sep 2025)), simulators, or mesh interactions (Jia et al., 14 Feb 2026).
Composite and confidence-calibrated rewards: Ensembles or combinations of proxy models (e.g., CRAFT composite filtering (Sun et al., 19 Mar 2026); TextNorm (Kim et al., 2024)) and confidence-aware normalization mitigate reward hacking and overoptimization.
Frequency-domain or feature-space rewards: Custom metrics such as Video Consistency Distance quantify aspects of temporal or structural consistency via spectral analysis (Aoshima et al., 22 Oct 2025).

Reward model selection and calibration are crucial to prevent misalignment, reward hacking, or overoptimization (Kim et al., 22 Mar 2026, Kim et al., 2024).
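Pairwise human preferences are commonly distilled into a scalar reward model with a Bradley-Terry style loss, $-\log \sigma(r_{\rm chosen} - r_{\rm rejected})$. The sketch below assumes that loss and operates on already-computed scalar scores; the function names are illustrative and no particular cited reward model is implied.

```python
import math

def preference_probability(r_chosen, r_rejected):
    """Bradley-Terry probability that the chosen output is preferred."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)

def bradley_terry_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected), computed in a stable form."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, log(1 + exp(-m)) = -m + log(1 + exp(m)).
    return -margin + math.log1p(math.exp(margin))

# A correctly ordered pair incurs a small loss, a misordered pair a large one.
print(bradley_terry_loss(2.0, 0.0), bradley_terry_loss(0.0, 2.0))
```

Minimizing this loss over a dataset of comparisons pushes the reward model to score preferred outputs above rejected ones; only score differences matter, so the reward scale and offset are not identified by the loss alone.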

4. Optimization Strategies, Regularization, and Stability

Reward-based fine-tuning algorithms can amplify pathologies such as mode collapse, loss of diversity, or reward exploitation, motivating a wide range of optimization safeguards:

KL, entropy, or Wasserstein regularization: Penalizing divergence from the reference model, typically via KL (Kim et al., 2024, Ziegler et al., 2019) or a weighted L2 penalty in flow matching (Fan et al., 9 Feb 2025), or directly imposing Wasserstein-2 distance penalties to preserve distributional support.
Advantage normalization and reward weighting: Trajectory-level or group-wise normalization ensures updates focus on genuine high-reward improvements (e.g., CRAFT (Sun et al., 19 Mar 2026), StepPRM-RTL (Vijayaraghavan et al., 2 Jun 2026)).
Sharpness-aware and robustification techniques: Flattening the reward functional landscape via parameter or image perturbations mitigates reward hacking (Kim et al., 22 Mar 2026).
Truncation and selective updating: Limiting the backward-pass depth in time or parameter space controls stability and memory usage (Clark et al., 2023, Yuan et al., 24 Sep 2025).
Pruning and auxiliary network removal: Unnecessary complexity can be eliminated, as in RSM's demonstration that auxiliary correction networks add no benefit (Lee et al., 19 Apr 2026).

Algorithmic design must balance reward maximization, computational tractability, and distributional diversity, with adjustments made to gradient estimation strategy (first-order vs zeroth-order, branching depth, temporal reweighting) and regularization hyperparameters.
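Group-wise advantage normalization, one of the safeguards listed above, can be sketched as standardizing the rewards of several samples drawn for the same prompt. The z-score form and the `eps` constant below are assumptions for illustration; the cited methods may normalize differently.

```python
def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    # eps avoids division by zero when every sample received the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Samples above the group mean get positive advantages, those below negative.
advantages = group_normalized_advantages([1.0, 2.0, 3.0])
print(advantages)
```

Because the advantages are centered within each group, a prompt on which every sample scores equally contributes no update, which keeps easy or uninformative prompts from dominating training.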

5. Applications Across Domains

Reward-based fine-tuning is applied across multiple domains:

LLMs: Aligning model behavior to human (or user-specific) values, with techniques including standard RLHF (Ziegler et al., 2019), reward-informed SFT (RIFT (Liu et al., 14 Jan 2026)), and process-level reward learning for code synthesis (Vijayaraghavan et al., 2 Jun 2026).
Text-to-image and image-generation diffusion models: Improving human preference scores, semantic alignment, or physical plausibility with variants of DRaFT (Clark et al., 2023), RSA-FT (Kim et al., 22 Mar 2026), PIRF (Yuan et al., 24 Sep 2025), and physics-informed RLFT (Jia et al., 14 Feb 2026).
Video and sequential models: Fine-tuning temporal consistency via spectral-domain rewards (Aoshima et al., 22 Oct 2025).
Personalized and conditional generation: Personalized LLM alignment leveraging "small-shot" synthetic contrast sets and reasoning-augmented rewards (Li et al., 12 Aug 2025).
Biomolecular/Scientific generative modeling: Reward-guided design of proteins, small molecules, and regulatory DNA using policy distillation or iterative reward-guided distillation (Su et al., 1 Jul 2025).
Recommender systems: Fine-tuning diffusion-based recommenders via REINFORCE-style MDPs with top-K collaborative reward functions (Hou et al., 10 Nov 2025).

A summary of key methods, with references:

Methodology	Reward Type	Example Reference
DRaFT / Direct Backpropagation	Differentiable	(Clark et al., 2023)
REINFORCE / PPO	Arbitrary	(Ziegler et al., 2019, Hou et al., 10 Nov 2025)
Reward-Weighted SFT	Arbitrary (scalar weights)	(Sun et al., 19 Mar 2026, Kim et al., 2024)
Surrogate Reward	Black-box	(Jia et al., 2024)
Iterative Distillation	Black-box	(Su et al., 1 Jul 2025)
Reward Score Matching	Generalized	(Lee et al., 19 Apr 2026)

6. Limitations, Challenges, and Practical Recommendations

Reward-based fine-tuning faces several technical and methodological challenges:

Reward Overoptimization and Hacking: Excessive optimization against imperfect or misaligned rewards (notably learned proxies) tends to degrade actual objective performance—e.g., generating outputs that maximize the score without real improvement in human-judged quality (Kim et al., 22 Mar 2026, Kim et al., 2024).
Variance–Bias–Compute Trade-offs: Zeroth-order estimators (for black-box rewards) are unbiased but potentially high-variance; first-order estimators (for differentiable rewards) may be biased, especially on low-SNR steps in diffusion. Branching strategies or value-based RL analogues mitigate some weaknesses (Lee et al., 19 Apr 2026, Jia et al., 2024).
Stability and Data Efficiency: Direct reward weighting can cause instability or collapse (RIFT (Liu et al., 14 Jan 2026) addresses this via a stabilized linear surrogate for negative rewards). Off-policy/iterative distillation approaches (VI-DD (Su et al., 1 Jul 2025)) can yield higher sample efficiency and training stability.
Reward Model Calibration and Confidence: Confidence-normalized rewards (e.g., TextNorm (Kim et al., 2024)) are necessary to avoid over-trusting unreliable scores.

Recommended practices for robust reward-based fine-tuning include:

Regular evaluation against human-aligned or scientifically valid metrics, not just proxy reward.
Confidence-aware reward normalization and ensemble-based filtering.
KL or Wasserstein regularization to control distributional drift and preserve diversity.
Empirical ablation of auxiliary complexities for efficiency.
Truncation, branching-budget allocation, and sharpness-aware (landscape-flattening) techniques, chosen according to reward smoothness.
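Ensemble-based filtering, as recommended above, can be sketched as keeping only samples on which several proxy reward models agree. The disagreement threshold, the toy scoring functions, and the name `ensemble_filter` are hypothetical and do not reproduce the CRAFT or TextNorm procedures.

```python
def ensemble_filter(samples, reward_fns, max_std=0.2):
    """Keep (sample, mean_score) pairs whose ensemble disagreement is small."""
    kept = []
    for s in samples:
        scores = [f(s) for f in reward_fns]
        mean = sum(scores) / len(scores)
        std = (sum((x - mean) ** 2 for x in scores) / len(scores)) ** 0.5
        if std <= max_std:
            kept.append((s, mean))
    return kept

# Two toy proxy rewards; the second can be exploited by adding "!".
reward_fns = [
    lambda s: len(s) / 10,
    lambda s: len(s) / 10 + (0.5 if "!" in s else 0.0),
]
kept = ensemble_filter(["good", "hack!"], reward_fns)
print(kept)
```

A sample that only one proxy scores highly is a candidate for reward hacking, so discarding high-disagreement samples trades some data for more trustworthy reward signals.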

7. Unification and Future Research Directions

The emergence of unified frameworks such as Reward Score Matching (Lee et al., 19 Apr 2026) illustrates that a broad spectrum of previously disparate reward-based fine-tuning algorithms, including RLHF, score-based reinforcement learning, GFlowNets, and policy distillation, can be formulated under a common score-matching objective with a reward-tilted target. This identifies value-guidance estimation, estimator variance, temporal weighting, and regularization as the principal axes of the algorithmic design space.

Emerging directions include:

Process-level and stepwise reward modeling: Dense trajectory-level feedback, as demonstrated by StepPRM-RTL (Vijayaraghavan et al., 2 Jun 2026), enhances reasoning fidelity and correctness in long-horizon tasks.
Portable reward tuning: Explicit separation of reward learning and model fine-tuning enables compact, reusable tuning across model backbones (Chijiwa et al., 18 Feb 2025).
Physics-informed and scientific generative modeling: Harnessing domain-specific reward signals expands the role of fine-tuning into science and engineering optimization (Yuan et al., 24 Sep 2025, Jia et al., 14 Feb 2026).
Efficient exploration and sample efficiency: Dynamically modulating diversity and focusing optimization resources leads to improved convergence and generalization (Chae et al., 19 Feb 2025).
Robustification against reward mis-specification: Techniques for detecting and mitigating overoptimization and reward exploitation are critical for trustworthy deployment.

Reward-based fine-tuning is likely to remain a central methodology for model alignment, generative synthesis, and aligning model output distributions with human or domain-specific desiderata, driven by ongoing advances in reward model construction, estimator theory, and optimization strategies.