
Reward-Guided Diffusion & Stochastic Control

Updated 15 January 2026
  • The paper presents a unified framework that reformulates diffusion model sampling as an entropy-regularized stochastic optimal control problem to maximize terminal rewards.
  • It details a mathematical formulation using SDEs and Hamilton–Jacobi–Bellman equations to derive optimal Gaussian policies for controlled diffusion.
  • Empirical insights demonstrate enhanced sample detail, robust reward maximization, and improved efficiency in applications like image editing and biomolecular design.

Reward-guided diffusion as stochastic optimal control (SOC) is a framework wherein the sampling or training dynamics of diffusion models are formulated as an entropy-regularized continuous-time control problem. The control aims to maximize specific terminal rewards (e.g., downstream utility), while retaining fidelity to the original data-generating distribution. SOC unifies several previously distinct schemes for guided and conditional diffusion modeling, providing a principled foundation for both fine-tuning and inference-time alignment with non-differentiable objectives, including black-box or user-provided rewards. SOC-based diffusion methods encompass both continuous stochastic differential equations (SDEs) and their discrete analogues in the form of Markov decision processes (MDPs) and continuous-time Markov chains (CTMCs).

1. Mathematical Formulation: SDEs and Controlled Diffusion

In the continuous-time SOC framework for diffusion modeling, the forward SDE is typically an Ornstein–Uhlenbeck or variance-preserving (VP) process: $dx_t = -f(t)x_t\,dt + g(t)\,dB_t$, $x_0 \sim p_0$, where $f(t)$ and $g(t)$ characterize the drift and diffusion schedule, and $p_0$ is the data distribution. The time-reversed dynamics (reverse SDE) are given by

$$dz_t = \left[f(T-t)z_t + g(T-t)^2\,\nabla_x\log p_{T-t}(z_t)\right]dt + g(T-t)\,dW_t$$

with the "score" $\nabla_x \log p_{T-t}(z_t)$ being, in practice, unknown.

Reward-guided control introduces an additive control field $a_t$ (or, more generally, $u_t$) in place of the unknown score, yielding the controlled reverse process: $dy_t = \left[f(T-t)y_t + g(T-t)^2\,a_t\right]dt + g(T-t)\,dW_t$. The objective is to design a policy for $a_t$, akin to a stochastic policy in RL, which steers the denoising trajectory to maximize the expected terminal reward (e.g., task-specific utility at $y_T$), subject to regularization encouraging proximity to the original diffusion measure (Gao et al., 2024).
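As a concrete illustration, the controlled reverse process above can be discretized with an Euler–Maruyama scheme. The sketch below assumes a linear VP-style schedule for $f$ and $g$ and a placeholder control field (a simple pull toward a target point, standing in for a learned policy mean); all constants and names here are illustrative, not those of any cited method.

```python
import numpy as np

def simulate_controlled_reverse(a, T=1.0, n_steps=1000, d=2, seed=0):
    """Euler-Maruyama discretization of the controlled reverse SDE
    dy = [f(T-t) y + g(T-t)^2 a(t, y)] dt + g(T-t) dW."""
    rng = np.random.default_rng(seed)
    beta = lambda t: 0.1 + (20.0 - 0.1) * t      # VP-style linear beta schedule (assumed)
    f = lambda t: 0.5 * beta(t)
    g = lambda t: np.sqrt(beta(t))
    dt = T / n_steps
    y = rng.standard_normal(d)                   # y_0 ~ N(0, I), the reference prior
    for k in range(n_steps):
        t = k * dt
        drift = f(T - t) * y + g(T - t) ** 2 * a(t, y)
        y = y + drift * dt + g(T - t) * np.sqrt(dt) * rng.standard_normal(d)
    return y

# Placeholder control: pull samples toward a target (stands in for a learned policy mean).
target = np.array([1.0, -1.0])
sample = simulate_controlled_reverse(lambda t, y: target - y)
```

Swapping the lambda for a trained network mean recovers the policy-driven sampler described above.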

2. Stochastic Optimal Control Objective and HJB Characterization

The reward-guided SOC problem seeks to maximize a value functional of the form: $J^\pi(t, y) = \mathbb{E}\left[\int_t^T \left[r(s, y_s, a_s) - \theta \log \pi(a_s \mid s, y_s)\right] ds + \beta h(y_T) \,\middle|\, y_t = y \right]$, where $r$ penalizes deviations between the proposed control and the unknown score, $\theta$ weights exploration (entropy bonus), and $h$ is the terminal reward (Gao et al., 2024).

The value function $V(t,x)$ satisfies a Hamilton–Jacobi–Bellman (HJB) PDE with a log-partition term due to entropy regularization: $V_t(t, x) + \theta \log\left(\int_{\mathbb{R}^d} \exp\left(\frac{1}{\theta} H(t, x, a, \nabla V, \nabla^2 V)\right) da\right) = 0$, with the generalized Hamiltonian

$$H(t, x, a, p, q) = r(t, x, a) + \left[f(T-t)x + g(T-t)^2 a\right]\cdot p + \frac{1}{2}g(T-t)^2 : q$$

This leads to an optimal Gaussian policy for the control, and, crucially, the control mean involves both the current score and the gradient of the value function (Gao et al., 2024, Jiao et al., 4 Dec 2025).
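To see where the Gaussian policy comes from, suppose the running reward has a quadratic form such as $r(t,x,a) = -g(T-t)^2\,\|a - \nabla_x\log p_{T-t}(x)\|^2$ (a squared penalty on the control's deviation from the score; the precise constants are an assumption made here for illustration). Writing $g = g(T-t)$, $s = \nabla_x\log p_{T-t}(x)$, and $p = \nabla V$, the $a$-dependent part of the exponent in the HJB log-partition integral is

$$\frac{1}{\theta}\left(-g^2\|a - s\|^2 + g^2\,a\cdot p\right) = -\frac{g^2}{\theta}\left\|a - \left(s + \tfrac{1}{2}p\right)\right\|^2 + \text{const},$$

by completing the square. The integrand is therefore an unnormalized Gaussian density in $a$ with mean $s + \tfrac{1}{2}\nabla V$ and covariance $\tfrac{\theta}{2g^2}I_d$, which matches the closed-form policy stated in Section 3.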

3. Policy Structure and Actor–Critic Algorithms

Given the quadratic nature of the SOC Hamiltonian in the control $a$, the entropy-regularized SOC yields a closed-form Gaussian optimal policy: $\pi^*(a \mid t, x) = \mathcal{N}\left(\mu^*(t, x),\,\frac{\theta}{2g(T-t)^2}I_d\right)$, with mean

$$\mu^*(t,x) = \nabla_x\log p_{T-t}(x) + \frac{1}{2}\nabla_x V(t,x)$$

The reward and policy structure allow for actor–critic q-learning implementations, where the critic approximates the value function and the actor parameterizes the Gaussian policy mean. Data-driven noisy estimates of the score can be constructed through ratio estimation based on forward process densities (Gao et al., 2024).

Training proceeds via temporal-difference learning using the martingale property of the value function, with stochastic updates of both actor and critic. Extensions accommodate probability flow ODEs and context-conditioned diffusion by augmenting the state to include auxiliary variables (Gao et al., 2024).
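The actor–critic loop can be sketched on a one-dimensional toy problem with linear function approximation. Everything below — the constant schedule, the running reward, the features, and the step sizes — is an illustrative assumption, not the algorithm of the cited work; it only shows the shape of the TD-based updates.

```python
import numpy as np

# Toy actor-critic sketch for an entropy-regularized SOC problem (1-D, linear
# function approximation; all schedules and step sizes are illustrative).
rng = np.random.default_rng(0)
theta, g, dt, T = 0.1, 1.0, 0.01, 1.0
w_v = np.zeros(2)    # critic weights: V(t, x) ~ w_v . [x, x^2]
w_mu = 0.0           # actor weight: policy mean mu(x) ~ w_mu * x

def features(x):
    return np.array([x, x * x])

for episode in range(200):
    x = rng.standard_normal()
    for k in range(int(T / dt)):
        mu = w_mu * x
        a = mu + np.sqrt(theta / (2 * g**2)) * rng.standard_normal()  # Gaussian policy
        r = -(x**2) * dt                                              # illustrative running reward
        x_next = x + g**2 * a * dt + g * np.sqrt(dt) * rng.standard_normal()
        # TD(0) error, exploiting the (approximate) martingale condition on V
        delta = r + features(x_next) @ w_v - features(x) @ w_v
        w_v += 1e-2 * delta * features(x)          # critic update
        w_mu += 1e-3 * delta * (a - mu) * x        # actor update (likelihood-ratio style)
        x = x_next
```

In the full method, the critic and actor would be neural networks over the augmented state $(t, x)$, and the score would enter the actor mean via ratio estimation as described above.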

4. Connection to KL-Control, HJB, and Path-Integral SOC

Reward-guided diffusion is fundamentally a path-integral control problem: the optimal controlled measure over trajectories is a KL-tilted exponential reweighting of the uncontrolled process,

$$p^*(\tau) \propto p^{\text{uncontrolled}}(\tau)\,\exp\left(\frac{1}{\alpha} r(x_T)\right)$$

where $r$ is the terminal reward (Zhang et al., 2023, Su et al., 1 Jul 2025). The solution to the HJB PDE (or its Feynman–Kac representation) provides the desirability function or soft value, whose gradient gives the optimal feedback control. In discrete time, analogous KL-tilted distributions enable low-variance off-policy algorithms via the exponential pay-off distribution or forward-KL distillation (Zhang et al., 2023, Su et al., 1 Jul 2025).
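A minimal sketch of this pathwise tilting, assuming a driftless reference process and a toy quadratic terminal reward: sample uncontrolled trajectories, then reweight terminal states by $\exp(r(x_T)/\alpha)$ via self-normalized importance sampling, the simplest approximation to the Gibbs path measure.

```python
import numpy as np

# Self-normalized importance sampling against the KL-tilted path measure:
# sample uncontrolled trajectories, weight each by exp(r(x_T) / alpha).
rng = np.random.default_rng(1)
alpha, n_paths, n_steps, dt = 0.5, 5000, 100, 0.01

x = rng.standard_normal(n_paths)                        # x_0 ~ N(0, 1)
for _ in range(n_steps):
    x = x + np.sqrt(dt) * rng.standard_normal(n_paths)  # driftless reference dynamics (assumed)

reward = -(x - 1.0) ** 2        # illustrative terminal reward peaking at x_T = 1
w = np.exp(reward / alpha)
w /= w.sum()                    # normalize the pathwise Gibbs weights
tilted_mean = np.sum(w * x)     # estimate of E_{p*}[x_T]

# The tilted mean is pulled from 0 (reference) toward 1 (reward mode).
```

SMC methods improve on this by resampling along the trajectory instead of only reweighting at the terminal time.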

Inference-time guidance methods, such as classifier guidance and Sequential Monte Carlo (SMC) approaches, can be interpreted as practical approximations to these SOC-driven optimal processes; they seek to sample from or emulate the pathwise Gibbs measure defined above (Uehara et al., 16 Jan 2025, Jiao et al., 4 Dec 2025). The same principles extend to discrete settings, where MDP/CTMC analogues and off-policy corrections via Monte Carlo Tree Search or replay buffers become central (Tang et al., 29 Sep 2025).

5. Loss Functions, Variance Reduction, and Memoryless Schedules

Optimization of reward-guided diffusion via SOC involves estimators spanning REINFORCE (score-function), pathwise/adjoint, matching, and KL-based losses (Domingo-Enrich, 2024, Domingo-Enrich et al., 2024). Variance reduction is a critical issue: adjoint-matching losses based on lean adjoint ODEs and memoryless noise schedules provide favorable statistical properties, matching the expected gradient at reduced variance while avoiding the bias that dependence on initial or terminal conditions would otherwise introduce (Domingo-Enrich et al., 2024).

Explicitly, memoryless schedules (e.g., $\sigma(t) = \sqrt{2\eta_t}$) ensure that tilting by the reward yields exact marginal alignment at terminal time, whereas naive KL-regularized objectives introduce a residual bias. This property underpins the theoretical correctness of reward-guided tilting in both diffusion and flow-matching models (Domingo-Enrich et al., 2024).
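The variance gap between the estimator families can be seen in a scalar toy problem: estimating $\partial_\mu\,\mathbb{E}_{x\sim\mathcal{N}(\mu,1)}[h(x)]$ with $h(x)=x^2$ by the score-function (REINFORCE) route versus the pathwise (reparameterization) route. This is a generic illustration of the variance issue, not the adjoint-matching estimator itself.

```python
import numpy as np

# Compare REINFORCE vs pathwise gradient estimators of d/dmu E[h(x)],
# x ~ N(mu, 1), h(x) = x^2.  True gradient is 2 * mu = 1.0 at mu = 0.5.
rng = np.random.default_rng(2)
mu, n = 0.5, 100_000
eps = rng.standard_normal(n)
x = mu + eps                      # reparameterized sample: x = mu + eps

h = x ** 2
reinforce = h * (x - mu)          # h(x) * d/dmu log N(x; mu, 1)
pathwise = 2 * x                  # dh/dx * dx/dmu, with dx/dmu = 1

# Both estimators are unbiased (sample means near 1.0), but the pathwise
# estimator has far lower variance -- the motivation for adjoint-based losses.
print(reinforce.mean(), reinforce.var())
print(pathwise.mean(), pathwise.var())
```

The same contrast scales to diffusion fine-tuning, where the "pathwise" role is played by adjoint ODEs differentiated through the sampler.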

6. Extensions and Application Scenarios

SOC-based reward-guided diffusion generalizes to:

  • Probability flow ODEs, where drift control applies to deterministic generative flows.
  • Conditional diffusion, augmenting the control and value function with context variables (Gao et al., 2024).
  • Discrete state spaces (e.g., sequence generation), with controlled CTMCs and trajectory replay via MCTS in discrete diffusion (Tang et al., 29 Sep 2025).
  • Training-free, inference-time guidance, and adaptive schedule selection (e.g., adaptive classifier-free guidance or training-free image editing), relying on Pontryagin’s Maximum Principle and efficient approximation-by-iteration in high-dimensional domains (Chang et al., 30 Sep 2025, Azangulov et al., 25 May 2025).

7. Empirical Insights and Practical Considerations

Empirically, reward-guided SOC reformulations yield substantial benefits over prior hard-constrained or heuristic approaches:

  • Improved sample sharpness and detail in diffusion bridges and image restoration tasks by tuning control/penalty weights away from strict endpoint constraints (Zhu et al., 9 Feb 2025).
  • Robust reward maximization and fidelity to source distributions in image editing, as demonstrated by trajectory control optimizing both reward alignment and content faithfulness (Chang et al., 30 Sep 2025).
  • Enhanced stability and sample efficiency in biomolecular design and other scientific domains, via forward-KL iterative distillation and off-policy data collection (Su et al., 1 Jul 2025).
  • Superior coverage and controllability in discrete settings, demonstrated by trajectory-aware fine-tuning with MCTS (Tang et al., 29 Sep 2025).

The choice of control cost, loss function, and optimizer is guided by trade-offs among bias, variance, and computational tractability, as summarized in the taxonomy of SOC loss functions (Domingo-Enrich, 2024). Memoryless fine-tuning and adjoint matching currently provide state-of-the-art consistency, stability, and sample diversity in reward-based diffusion fine-tuning (Domingo-Enrich et al., 2024).

