Reward-Guided Diffusion Models

Updated 3 July 2026

Reward-Guided Diffusion is a generative modeling technique that incorporates reward functions into the denoising process to bias outputs toward user-specified objectives.
It leverages methods such as reward-gradient denoising, Metropolis–Hastings sampling, and iterative refinement to balance model fidelity with task-specific rewards.
Empirical evaluations demonstrate enhanced performance in motion synthesis, molecular design, and feature engineering, while addressing challenges like reward hacking and diversity tradeoffs.

Reward-Guided Diffusion is a class of generative modeling techniques that explicitly incorporate reward functions (quantitative measures of sample quality, alignment, or performance) into the diffusion process. Standard diffusion models stochastically transform noise into data through a sequence of denoising steps, learning to match the target data distribution. Reward-guided diffusion augments this process with differentiable or non-differentiable reward signals, biasing sampling or learning so that the resulting outputs both resemble the data and optimize for user-specified objectives. Techniques in this paradigm have demonstrated substantial improvements in motion synthesis, molecular and sequence design, behavioral policy learning, and feature engineering. Methods span both inference-time algorithms and training-time fine-tuning frameworks; many are designed to be plug-and-play, requiring no retraining of the diffusion backbone.

1. Mathematical Principles of Reward-Guided Diffusion

Central to reward-guided diffusion is the idea of sampling from the reward-tilted distribution: $p^{(\alpha)}(x_0) \propto p_{\text{pre}}(x_0) \exp\big(r(x_0)/\alpha\big)$, where $p_{\text{pre}}$ is the distribution defined by the pretrained diffusion model, $r(x_0)$ is the (possibly non-differentiable) scalar reward function, and $\alpha > 0$ (or sometimes $\lambda$) controls the strength of the reward bias relative to model fidelity: a smaller $\alpha$ tilts more strongly toward high reward, while $\alpha \to \infty$ recovers $p_{\text{pre}}$ (Uehara et al., 16 Jan 2025, Jiao et al., 4 Dec 2025, Dandapanthula et al., 1 Jun 2026).
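As a minimal sketch of the tilt, assume a finite sample space so that the normalizer can be computed exactly. The function name `tilt` and the toy numbers are illustrative assumptions, not taken from the cited works.

```python
import math

def tilt(p_pre, rewards, alpha):
    """Return p_alpha(x) proportional to p_pre(x) * exp(r(x) / alpha) on a finite support."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    # Subtract the max reward before exponentiating for numerical stability;
    # the shift cancels in the normalization.
    r_max = max(rewards)
    unnorm = [p * math.exp((r - r_max) / alpha) for p, r in zip(p_pre, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

p_pre = [0.5, 0.3, 0.2]      # toy pretrained distribution (assumed)
rewards = [0.0, 1.0, 2.0]    # toy rewards (assumed)

weak = tilt(p_pre, rewards, alpha=100.0)   # large alpha: stays close to p_pre
strong = tilt(p_pre, rewards, alpha=0.1)   # small alpha: mass concentrates on the best reward
```

With a large $\alpha$ the result stays near `p_pre`, and with a small $\alpha$ almost all mass moves to the highest-reward element, which is the fidelity versus reward tradeoff that $\alpha$ controls.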

For continuous models (e.g., DDPM or SDE-based models), sampling the reward-tilted distribution can be achieved by modifying the reverse diffusion stochastic differential equation (SDE): $dx = \big[f(x,t) - g^2(t)\,\nabla_x\big(\log p_t(x) + R_\varphi(x,t,y)\big)\big]\,dt + g(t)\,dw$, where $R_\varphi$ is the reward function, potentially time- and condition-dependent (Weng et al., 8 May 2025, Weng et al., 24 Nov 2025).
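The modified reverse SDE can be illustrated with a one-dimensional Euler–Maruyama sketch. Everything concrete here is an assumption chosen so the score is known in closed form: data $x_0 \sim \mathcal{N}(0,1)$, drift $f = 0$, diffusion $g = 1$ (so $p_t = \mathcal{N}(0, 1+t)$), and a quadratic reward whose gradient is applied unchanged at every noise level. The function names are hypothetical.

```python
import math
import random

def reverse_sde_sample(n_samples, n_steps, grad_reward, seed=0):
    """Euler-Maruyama integration of the reward-guided reverse SDE with f = 0, g = 1.

    Toy setting (assumed): x_0 ~ N(0, 1) and forward SDE dx = dw, so
    p_t = N(0, 1 + t) and the score is -x / (1 + t).
    """
    rng = random.Random(seed)
    T = 1.0
    dt = T / n_steps
    out = []
    for _ in range(n_samples):
        x = rng.gauss(0.0, math.sqrt(1.0 + T))       # x_T ~ p_T
        for k in range(n_steps):
            t = T - k * dt
            score = -x / (1.0 + t)                    # grad_x log p_t(x)
            drift = -(score + grad_reward(x, t))      # f - g^2 * (score + grad R)
            # Step backward in time: x_{t-dt} = x_t - drift * dt + g * sqrt(dt) * z
            x = x - drift * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def no_reward(x, t):
    return 0.0

def quadratic_reward_grad(x, t, target=2.0):
    # Gradient of R(x) = -(x - target)^2 / 2, used as-is at every t (a simplification).
    return target - x

def mean(xs):
    return sum(xs) / len(xs)

unguided = reverse_sde_sample(2000, 100, no_reward)
guided = reverse_sde_sample(2000, 100, quadratic_reward_grad)
```

The guided samples have a mean shifted toward the reward target relative to the unguided ones. In this toy case the exact tilted target (with $\alpha = 1$) is $\mathcal{N}(1, 1/2)$; the naive scheme only approximates it because the clean-sample reward is evaluated on noisy states.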

In discrete state spaces (e.g., sequence or SMILES generation), the target is often approximated via Metropolis–Hastings techniques or importance-sampling-based guidance (Phunyaphibarn et al., 10 Feb 2026, Uehara et al., 20 Feb 2025).

Reward information is injected in several algorithmic loci:

Score adjustment: Directly adding reward model gradients to denoising steps (classifier guidance, reward-guided mean shift) (Jiao et al., 4 Dec 2025, Weng et al., 8 May 2025).
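Reverse kernel modification: Forming reward-aligned reverse transitions $p(x_{t-1} \mid x_t) \propto p_{\theta}(x_{t-1} \mid x_t) \exp(\gamma R(x_{t-1}))$, where the reward is evaluated at the candidate $x_{t-1}$ so that it reweights the transition and $\gamma$ sets the guidance strength (Weng et al., 8 May 2025).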
Proposal and resampling: Proposals are drawn from standard backward kernels and then reweighted or accepted according to reward differentials (Metropolis–Hastings acceptance, SMC) (Phunyaphibarn et al., 10 Feb 2026, Uehara et al., 20 Feb 2025, Keramati et al., 2 Aug 2025).
Value-function based stochastic policies: Soft value approximators are used as look-ahead proxies for non-myopic reward optimization, either trained explicitly or approximated on-the-fly (Uehara et al., 16 Jan 2025, Keramati et al., 2 Aug 2025).
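The proposal-and-resampling locus can be sketched with a single multinomial importance-resampling step. This is a simplified stand-in for the SMC procedures in the cited works: the Gaussian "model samples", the reward, and the helper name `reward_resample` are all assumptions for illustration.

```python
import math
import random

def reward_resample(particles, rewards, alpha, rng):
    """Resample particles with probability proportional to exp(r / alpha)."""
    r_max = max(rewards)                               # stabilize the exponentials
    weights = [math.exp((r - r_max) / alpha) for r in rewards]
    return rng.choices(particles, weights=weights, k=len(particles))

rng = random.Random(0)
particles = [rng.gauss(0.0, 1.0) for _ in range(5000)]   # stand-in for samples from p_pre
rewards = [-(x - 1.0) ** 2 for x in particles]            # toy reward peaked at x = 1
resampled = reward_resample(particles, rewards, alpha=1.0, rng=rng)
```

Because resampling only reuses existing proposals, it needs no reward gradient, but the resampled set contains duplicates; this is one source of the diversity loss discussed for resampling-based methods.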

Reward-guided diffusion admits multiple theoretical underpinnings:

Entropy-regularized control: The optimal reward-aligned denoising process is the solution to an entropy-regularized Markov Decision Process—with KL regularization keeping trajectories close to the pretrained model (Su et al., 1 Jul 2025, Zhang et al., 2023, Keramati et al., 2 Aug 2025).
Doob $h$-transform: Sampling from the reward-tilted distribution corresponds to a Doob $h$-transformed SDE, with $h_t(x_t) = \mathbb{E}_{p_{\text{pre}}}\big[\exp\big(r(x_0)/\alpha\big) \mid x_t\big]$ (Dandapanthula et al., 1 Jun 2026, Jiao et al., 4 Dec 2025).
Score-difference guidance: Classifier-free guidance and reward guidance can both be seen as injecting the difference between the original and reward-reweighted score functions into the reverse SDE drift (Jiao et al., 4 Dec 2025).
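The three views above can be connected in a short derivation sketch. It assumes the forward noising kernel $q(x_t \mid x_0)$ is shared by the pretrained and reward-tilted processes; the symbols $q$, $h_t$, and $v_t$ follow common usage and are notational assumptions rather than quotations from the cited works.

```latex
\begin{aligned}
p_t^{(\alpha)}(x_t)
  &\propto \int q(x_t \mid x_0)\, p_{\text{pre}}(x_0)\, \exp\big(r(x_0)/\alpha\big)\, dx_0
   = p_t(x_t)\, h_t(x_t), \\
h_t(x_t)
  &= \mathbb{E}_{p_{\text{pre}}}\big[\exp\big(r(x_0)/\alpha\big) \,\big|\, x_t\big], \\
\nabla_{x_t} \log p_t^{(\alpha)}(x_t)
  &= \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log h_t(x_t), \\
v_t(x_t)
  &= \alpha \log h_t(x_t).
\end{aligned}
```

The third line is the score difference injected into the reverse drift, and the last line is the soft value function of the entropy-regularized control view, so the guidance term can equivalently be written as $\nabla_{x_t} v_t(x_t)/\alpha$.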

2. Reward Model and Step-Aware Guidance Architectures

Reward-guided diffusion relies on surrogate models that measure sample quality. Architectures depend on the domain:

Step-aware reward models are crucial in text-to-motion and bilingual motion synthesis. They are typically transformer-based and accept noised samples concatenated with timestep tokens. These networks decompose the reward into text-alignment (cosine similarity in semantic embedding space) and motion-alignment with reference trajectories (Weng et al., 8 May 2025, Weng et al., 24 Nov 2025).
In motion, the reward output can combine a score for semantic alignment with a score for motion realism as a weighted sum. The weights can be time-varying, allowing early sampling steps to emphasize global structure, with later steps refining alignment (Weng et al., 8 May 2025).
In discrete diffusion for molecules or sequences, reward functions are typically non-differentiable (e.g., molecule validity, binding affinity). Approaches either avoid using intermediate (noisy) rewards due to their high variance, instead evaluating only on final "clean" samples (Phunyaphibarn et al., 10 Feb 2026), or employ process reward models in language reasoning (Miles et al., 26 Feb 2026).
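A time-varying weighted combination of reward terms, as described for motion rewards above, might look like the following sketch. The linear schedule, the choice of which term is emphasized early, and the name `combined_reward` are hypothetical; the cited works may use different schedules.

```python
def combined_reward(text_score, motion_score, t, num_steps):
    """Weighted sum of two reward terms with a linear, time-varying schedule.

    Hypothetical convention: t counts down from num_steps (pure noise) to 0 (clean sample).
    """
    progress = 1.0 - t / num_steps       # 0 at the first sampling step, 1 at the last
    w_motion = 1.0 - 0.5 * progress      # assumed: motion term weighted more early on
    w_text = 0.5 + 0.5 * progress        # assumed: text term weighted more late
    return w_text * text_score + w_motion * motion_score
```

For example, `combined_reward(1.0, 0.0, t=10, num_steps=10)` weights the text score by 0.5 at the first step, while the same call with `t=0` weights it by 1.0.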

3. Reward-Guided Sampling Algorithms

Reward-guided sampling modifies the denoising loop to favor high-reward outcomes. Representative algorithms include:

Reward-Gradient Denoising: In each step, the standard denoising output is shifted along the gradient of the reward model with respect to the current noisy sample, allowing plug-in use with any pretrained model (Weng et al., 8 May 2025, Weng et al., 24 Nov 2025).
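A single reward-gradient denoising step can be sketched as follows. The scaling of the shift by the kernel variance is an assumption borrowed from classifier guidance, not necessarily the form used in the cited works, and `guided_reverse_step` is a hypothetical name.

```python
import math
import random

def guided_reverse_step(x_t, mu, sigma, reward_grad, gamma, rng):
    """One reverse step with the denoising mean shifted along the reward gradient.

    mu, sigma: mean and std of the pretrained reverse kernel p_theta(x_{t-1} | x_t).
    reward_grad: callable returning dR/dx at x_t (assumes a differentiable reward model).
    gamma: guidance scale; gamma = 0 recovers the unguided step.
    """
    shifted_mu = mu + gamma * sigma ** 2 * reward_grad(x_t)
    return shifted_mu + sigma * rng.gauss(0.0, 1.0)

# With identical noise, the guided and unguided steps differ exactly by the mean shift.
unguided = guided_reverse_step(0.0, 0.0, 0.5, lambda x: 2.0, 0.0, random.Random(0))
guided = guided_reverse_step(0.0, 0.0, 0.5, lambda x: 2.0, 3.0, random.Random(0))
```

Because only the mean is modified, the pretrained denoiser is called exactly as in unguided sampling, which is what makes this style of guidance plug-in.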

Metropolis–Hastings and Clean-Sample MCMC: For discrete models where gradient information is unavailable or unstable, reward alignment is achieved via proposals from the forward-backward chain and an accept–reject step based solely on final-sample rewards, guaranteeing a stationary distribution proportional to $p_{\text{pre}}(x_0)\exp\big(r(x_0)/\alpha\big)$ (Phunyaphibarn et al., 10 Feb 2026).
Soft-value and importance-sampling guidance: Value-based policies are estimated, and SMC or importance resampling is applied throughout the diffusion trajectory (Uehara et al., 16 Jan 2025, Uehara et al., 20 Feb 2025).
Iterative Refinement: Reward-guided iterative refinement alternates partial noising and denoising phases, correcting errors of prior passes and enabling hard constraints or improved value approximations (Uehara et al., 20 Feb 2025).
Noise-Tilted Reverse Kernels (NTRK): For high speed and stability, the NoiseTilt approach injects reward-gradient directions through the noise term, after applying a custom whitening operator, rather than modifying the mean. This is crucial to preserve sample fidelity and avoid out-of-support points (Hwang et al., 16 Jun 2026).
Search and Stitching: In reasoning and code generation, reward-guided diffusion is implemented by generating a large pool of partial candidates, scoring them with a process reward model, and then "stitching" the best steps into a composite solution (e.g., for chain-of-thought in math problems) (Miles et al., 26 Feb 2026).
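The accept–reject idea behind clean-sample MCMC can be sketched with an independence Metropolis–Hastings sampler. This is a deliberate simplification: each proposal is a fresh clean sample from the pretrained model (as if the current sample were fully renoised and then denoised), so the proposal density cancels and only the reward difference enters the acceptance probability. The cited method uses partial forward-backward proposals, whose acceptance rule may differ. The three-token "model" and all names are assumptions.

```python
import math
import random

def clean_sample_mh(sample_pre, reward, alpha, n_steps, rng):
    """Independence Metropolis-Hastings targeting p_pre(x) * exp(r(x) / alpha).

    sample_pre: draws a clean sample x_0 from the pretrained model (the proposal).
    reward: black-box scalar reward, evaluated only on clean samples.
    """
    x = sample_pre(rng)
    r_x = reward(x)
    chain = []
    for _ in range(n_steps):
        x_new = sample_pre(rng)
        r_new = reward(x_new)
        # The proposal equals p_pre, so it cancels in the MH ratio and
        # only the reward difference remains.
        accept_prob = math.exp(min(0.0, (r_new - r_x) / alpha))
        if rng.random() < accept_prob:
            x, r_x = x_new, r_new
        chain.append(x)
    return chain

# Toy stand-in for a pretrained discrete model (assumed): three tokens.
TOKENS = ["a", "b", "c"]
P_PRE = [0.5, 0.3, 0.2]
REWARD = {"a": 0.0, "b": 1.0, "c": 2.0}

def sample_pre(rng):
    return rng.choices(TOKENS, weights=P_PRE, k=1)[0]

chain = clean_sample_mh(sample_pre, REWARD.get, alpha=1.0, n_steps=20000,
                        rng=random.Random(0))
freq_c = chain.count("c") / len(chain)
```

The reward function is only ever called on clean samples, so it can be non-differentiable or a black box; the cost is one full model sample per proposal.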

4. Training and Optimization Regimes

Reward guidance can be used either at test time with a frozen pretrained model, or for fine-tuning:

Test-time guidance: Most methods, such as ReAlign or Clean-Sample Markov Chain (CSMC) guidance, steer pretrained models toward the reward-tilted distribution with no retraining, providing plug-and-play compatibility (Weng et al., 8 May 2025, Phunyaphibarn et al., 10 Feb 2026, Hwang et al., 16 Jun 2026).
Fine-tuning and distillation: Iterative distillation optimizes the model for explicit reward alignment, minimizing KL divergence between a soft-optimal reward policy and the network policy, and can tackle non-differentiable rewards robustly (Su et al., 1 Jul 2025, Zhang et al., 2023). Reward-weighted supervised loss and off-policy distillation mechanisms improve stability over traditional RL-based objectives (Su et al., 1 Jul 2025).
Entropy-regularized and value-weighted losses: Training objectives often combine log-likelihoods reweighted by reward, maximum-entropy regularization, or reward-weighted MLE (Zhang et al., 2023, Su et al., 1 Jul 2025, Keramati et al., 2 Aug 2025).
Hierarchical variational policies: Fast reward-guided sampling has been achieved by amortizing control into latent policies using transformer-based architectures, supporting few-step sampling with high quality (Pandey et al., 20 May 2026).
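Reward-weighted maximum likelihood, one of the training objectives listed above, can be illustrated on a categorical model, where the maximizer of $\sum_i w_i \log p(x_i)$ with $w_i = \exp(r_i/\alpha)$ has a closed form. This is only a toy sketch: the cited frameworks train neural denoisers with gradient-based versions of such objectives, and the three-token model and function name are assumptions.

```python
import math
import random
from collections import defaultdict

def reward_weighted_mle(samples, rewards, alpha):
    """Closed-form maximizer of sum_i w_i * log p(x_i) for a categorical model,
    with weights w_i = exp(r_i / alpha)."""
    r_max = max(rewards)                      # stabilize the exponentials
    mass = defaultdict(float)
    for x, r in zip(samples, rewards):
        mass[x] += math.exp((r - r_max) / alpha)
    total = sum(mass.values())
    return {x: m / total for x, m in mass.items()}

rng = random.Random(0)
tokens, p_pre = ["a", "b", "c"], [0.5, 0.3, 0.2]        # toy pretrained model (assumed)
reward = {"a": 0.0, "b": 1.0, "c": 2.0}                 # toy rewards (assumed)
samples = rng.choices(tokens, weights=p_pre, k=20000)   # samples drawn from p_pre
fitted = reward_weighted_mle(samples, [reward[x] for x in samples], alpha=1.0)
```

With samples drawn from the pretrained model, the fitted distribution approaches the reward-tilted distribution as the sample size grows, which is the sense in which fine-tuning can amortize test-time guidance into the model.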

5. Empirical Evaluation and Applications

Reward-guided diffusion demonstrates significant advantages in a range of application domains:

Text-to-Motion Generation: The ReAlign method achieves R-Precision@3 improvements of +2.8% to +7.4% and FID reductions of up to 55% over state-of-the-art baselines. Motions exhibit sharper alignment with input text and improved realism, without retraining the backbone (Weng et al., 8 May 2025, Weng et al., 24 Nov 2025).
Molecule and Sequence Design: Clean-Sample Markov Chain guidance nearly doubles or triples reward scores on QED, ring count, and synthetic accessibility over best-of-N and SMC baselines, maintaining high diversity while handling non-differentiable rewards (Phunyaphibarn et al., 10 Feb 2026).
Behavioral and Policy Generation: DIDI policies trained with diffusion guidance and reward incorporation surpass standard offline RL in returns, with 20% higher diversity and substantially higher fine-tuning success rates (Liu et al., 2024).
Feature Transformation: Reward-guided latent diffusion models produce feature sets that improve classifier F1 scores by 3–10% and regression metrics (1 − RAE) by up to 10% across 14 benchmarks (Gong et al., 21 May 2025).
Image and Video Tasks: LatSearch leverages reward signals at the latent level, yielding a 6.4% boost in perceptual video scores at a 2.1$\times$ time cost (an order of magnitude less than evolutionary search), and NTRK achieves state-of-the-art reward alignment on aesthetic and alignment tasks at a fraction of the computational cost (Zhao et al., 15 Mar 2026, Hwang et al., 16 Jun 2026).
Design Optimization: Reward-directed diffusion for airfoil and hull optimization achieves a 10% lift-to-drag ratio improvement and 25% resistance reduction over training data, extrapolating to previously unattainable designs (Keramati et al., 2 Aug 2025).

Representative empirical results and metric improvements:

Domain	Reward-guided Gain vs. Baseline	Notable Score/Metric
Motion Synthesis	+7.4% R-Prec@3, −55% FID	BiMD+ReAlign: R-Prec@3: 84.7%, FID: 0.178
Molecule Generation	+0.31 QED, +6.1 rings	CSMC: QED=0.91 vs Baseline=0.60
Feature Engineering	+3–10% F1, +4–5% regression RAE	DIFFT: <0.5s per inference
Image/Video Generation	+0.71 Aesthetic, +0.017 PickScore, +3.5 VR	NTRK or LatSearch, <1/20 baseline compute

6. Limitations and Theoretical Insights

Reward-guided diffusion faces fundamental challenges:

Reward Hacking: Empirical works demonstrate that plug-in estimators (finite-sample approximations of the optimal Doob $h$-function) can induce two biases: excessive contraction within modes (reward hacking) and failure to select between separated high-reward modes (Dandapanthula et al., 1 Jun 2026). Closed-form reward damping and best-of-N strategies address these but do not fully resolve theoretical limitations.
Diversity–Optimality Tradeoff: Aggressive reward scaling can collapse sample diversity; tuning of guidance scales and hyperparameters is essential (Jiao et al., 4 Dec 2025).
Computation and Memory: Resampling- and SMC-based algorithms, as well as lookahead pools, can incur significant computational and memory cost, though advanced sampling techniques such as LiDAR, NTRK, and hierarchical variational policies mitigate these overheads (Kim et al., 3 Feb 2026, Hwang et al., 16 Jun 2026, Pandey et al., 20 May 2026).
Reward Model Quality: The reliability of empirical gains depends on the calibration and alignment of reward models, particularly for non-differentiable or extrinsic signals.
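The diversity–optimality tradeoff can be made concrete by measuring the entropy of a reward-tilted distribution as the temperature $\alpha$ shrinks. The sketch below assumes a uniform four-element base distribution and toy rewards; the helper name is hypothetical.

```python
import math

def tilted_entropy(p_pre, rewards, alpha):
    """Shannon entropy (in nats) of the normalized p_pre(x) * exp(r(x) / alpha)."""
    r_max = max(rewards)                      # stabilize the exponentials
    unnorm = [p * math.exp((r - r_max) / alpha) for p, r in zip(p_pre, rewards)]
    z = sum(unnorm)
    probs = [u / z for u in unnorm]
    return -sum(q * math.log(q) for q in probs if q > 0.0)

p_pre = [0.25, 0.25, 0.25, 0.25]          # toy uniform base model (assumed)
rewards = [0.0, 0.5, 1.0, 1.5]            # toy rewards (assumed)
entropies = {a: tilted_entropy(p_pre, rewards, a) for a in (10.0, 1.0, 0.1)}
```

In this toy setting the entropy falls as $\alpha$ decreases: stronger reward scaling concentrates mass on the top-reward element, which is why guidance scales need tuning.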

7. Outlook and Extensions

Reward-guided diffusion now offers a unified set of tools for both training-time and test-time conditional generation, with support for black-box, non-differentiable, and differentiable rewards. Recent advances include:

Unified Theory: Score-difference guidance frameworks synthesize classifier-free and general reward guidance (Jiao et al., 4 Dec 2025).
Algorithmic Innovations: Step-aware architectures, hierarchical variational amortization, iterative refinement, clean-sample MCMC, and noise-tilted sampling represent major advances across tasks and modalities.
Generalization: Techniques have been adapted across imagery, motion, chemistry, biology, language, and behavior, supporting both plug-and-play and fine-tuned deployments.
Open Directions: Ongoing work targets principled value approximators, scalable search-based guidance, integration with RL fine-tuning, adaptive or learned guidance schedules, and broader support for multi-objective and constrained optimization scenarios (Uehara et al., 16 Jan 2025, Dandapanthula et al., 1 Jun 2026, Uehara et al., 20 Feb 2025).

Collectively, reward-guided diffusion provides a mathematically grounded, empirically validated, and highly general framework for controlled generative modeling across domains (Weng et al., 8 May 2025, Phunyaphibarn et al., 10 Feb 2026, Jiao et al., 4 Dec 2025, Dandapanthula et al., 1 Jun 2026, Hwang et al., 16 Jun 2026, Liu et al., 2024, Uehara et al., 20 Feb 2025).