
Reward-Guided Diffusion & Stochastic Control

Updated 15 January 2026
  • The paper presents a unified framework that reformulates diffusion model sampling as an entropy-regularized stochastic optimal control problem to maximize terminal rewards.
  • It details a mathematical formulation using SDEs and Hamilton–Jacobi–Bellman equations to derive optimal Gaussian policies for controlled diffusion.
  • Empirical insights demonstrate enhanced sample detail, robust reward maximization, and improved efficiency in applications like image editing and biomolecular design.

Reward-guided diffusion as stochastic optimal control (SOC) is a framework wherein the sampling or training dynamics of diffusion models are formulated as an entropy-regularized continuous-time control problem. The control aims to maximize specific terminal rewards (e.g., downstream utility), while retaining fidelity to the original data-generating distribution. SOC unifies several previously distinct schemes for guided and conditional diffusion modeling, providing a principled foundation for both fine-tuning and inference-time alignment with non-differentiable objectives, including black-box or user-provided rewards. SOC-based diffusion methods encompass both continuous stochastic differential equations (SDEs) and their discrete analogues in the form of Markov decision processes (MDPs) and continuous-time Markov chains (CTMCs).

1. Mathematical Formulation: SDEs and Controlled Diffusion

In the continuous-time SOC framework for diffusion modeling, the forward SDE is typically an Ornstein–Uhlenbeck or variance-preserving (VP) process: $dx_t = -f(t)x_t\,dt + g(t)\,dB_t$, $x_0 \sim p_0$, where $f(t)$ and $g(t)$ characterize the drift and diffusion schedule, and $p_0$ is the data distribution. The time-reversed dynamics (reverse SDE) are given by

$$dz_t = \left[f(T-t)z_t + g(T-t)^2\,\nabla_x\log p_{T-t}(z_t)\right]dt + g(T-t)\,dW_t$$

with the "score" $\nabla_x \log p_{T-t}(z_t)$ being, in practice, unknown.

Reward-guided control introduces an additive control field $a_t$ (or, more generally, $u_t$) in place of the unknown score, yielding the controlled reverse process: $dy_t = \left[f(T-t)y_t + g(T-t)^2\,a_t\right]dt + g(T-t)\,dW_t$. The objective is to design a policy for $a_t$, akin to a stochastic policy in RL, which steers the denoising trajectory to maximize the expected terminal reward (e.g., task-specific utility at $y_T$), subject to regularization encouraging proximity to the original diffusion measure (Gao et al., 2024).
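As a concrete illustration, the controlled reverse process above can be discretized with an Euler–Maruyama scheme. The sketch below assumes a linear VP-style schedule for $f$ and $g$ and a placeholder control field (a simple pull toward a target point, standing in for a learned policy mean); all constants and names here are illustrative, not those of any cited method.

```python
import numpy as np

def simulate_controlled_reverse(a, T=1.0, n_steps=1000, d=2, seed=0):
    """Euler-Maruyama discretization of the controlled reverse SDE
    dy = [f(T-t) y + g(T-t)^2 a(t, y)] dt + g(T-t) dW."""
    rng = np.random.default_rng(seed)
    beta = lambda t: 0.1 + (20.0 - 0.1) * t      # VP-style linear beta schedule (assumed)
    f = lambda t: 0.5 * beta(t)
    g = lambda t: np.sqrt(beta(t))
    dt = T / n_steps
    y = rng.standard_normal(d)                   # y_0 ~ N(0, I), the reference prior
    for k in range(n_steps):
        t = k * dt
        drift = f(T - t) * y + g(T - t) ** 2 * a(t, y)
        y = y + drift * dt + g(T - t) * np.sqrt(dt) * rng.standard_normal(d)
    return y

# Placeholder control: pull samples toward a target (stands in for a learned policy mean).
target = np.array([1.0, -1.0])
sample = simulate_controlled_reverse(lambda t, y: target - y)
```

Swapping the lambda for a trained network mean recovers the policy-driven sampler described above.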

2. Stochastic Optimal Control Objective and HJB Characterization

The reward-guided SOC problem seeks to maximize a value functional of the form: $J^\pi(t, y) = \mathbb{E}\left[\int_t^T \left[r(s, y_s, a_s) - \theta \log \pi(a_s \mid s, y_s)\right] ds + \beta h(y_T) \,\middle|\, y_t = y \right]$, where $r$ penalizes deviations between the proposed control and the unknown score, $\theta$ weights exploration (entropy bonus), and $h$ is the terminal reward (Gao et al., 2024).

The value function $V(t,x)$ satisfies a Hamilton–Jacobi–Bellman (HJB) PDE with a log-partition term due to entropy regularization: $V_t(t, x) + \theta \log\left(\int_{\mathbb{R}^d} \exp\left(\frac{1}{\theta} H(t, x, a, \nabla V, \nabla^2 V)\right) da\right) = 0$, with the generalized Hamiltonian

$$H(t, x, a, p, q) = r(t, x, a) + \left[f(T-t)x + g(T-t)^2 a\right]\cdot p + \frac{1}{2}g(T-t)^2 : q$$

This leads to an optimal Gaussian policy for the control, and, crucially, the control mean involves both the current score and the gradient of the value function (Gao et al., 2024, Jiao et al., 4 Dec 2025).
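To see where the Gaussian policy comes from, suppose the running reward has a quadratic form such as $r(t,x,a) = -g(T-t)^2\,\|a - \nabla_x\log p_{T-t}(x)\|^2$ (a squared penalty on the control's deviation from the score; the precise constants are an assumption made here for illustration). Writing $g = g(T-t)$, $s = \nabla_x\log p_{T-t}(x)$, and $p = \nabla V$, the $a$-dependent part of the exponent in the HJB log-partition integral is

$$\frac{1}{\theta}\left(-g^2\|a - s\|^2 + g^2\,a\cdot p\right) = -\frac{g^2}{\theta}\left\|a - \left(s + \tfrac{1}{2}p\right)\right\|^2 + \text{const},$$

by completing the square. The integrand is therefore an unnormalized Gaussian density in $a$ with mean $s + \tfrac{1}{2}\nabla V$ and covariance $\tfrac{\theta}{2g^2}I_d$, which matches the closed-form policy stated in Section 3.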

3. Policy Structure and Actor–Critic Algorithms

Given the quadratic nature of the SOC Hamiltonian in the control $a$, the entropy-regularized SOC yields a closed-form Gaussian optimal policy: $\pi^*(a \mid t, x) = \mathcal{N}\left(\mu^*(t, x),\,\frac{\theta}{2g(T-t)^2}I_d\right)$, with mean

$$\mu^*(t,x) = \nabla_x\log p_{T-t}(x) + \frac{1}{2}\nabla_x V(t,x)$$

The reward and policy structure allow for actor–critic q-learning implementations, where the critic approximates the value function and the actor parameterizes the Gaussian policy mean. Data-driven noisy estimates of the score can be constructed through ratio estimation based on forward process densities (Gao et al., 2024).

Training proceeds via temporal-difference learning using the martingale property of the value function, with stochastic updates of both actor and critic. Extensions accommodate probability flow ODEs and context-conditioned diffusion by augmenting the state to include auxiliary variables (Gao et al., 2024).
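The actor–critic loop can be sketched on a one-dimensional toy problem with linear function approximation. Everything below — the constant schedule, the running reward, the features, and the step sizes — is an illustrative assumption, not the algorithm of the cited work; it only shows the shape of the TD-based updates.

```python
import numpy as np

# Toy actor-critic sketch for an entropy-regularized SOC problem (1-D, linear
# function approximation; all schedules and step sizes are illustrative).
rng = np.random.default_rng(0)
theta, g, dt, T = 0.1, 1.0, 0.01, 1.0
w_v = np.zeros(2)    # critic weights: V(t, x) ~ w_v . [x, x^2]
w_mu = 0.0           # actor weight: policy mean mu(x) ~ w_mu * x

def features(x):
    return np.array([x, x * x])

for episode in range(200):
    x = rng.standard_normal()
    for k in range(int(T / dt)):
        mu = w_mu * x
        a = mu + np.sqrt(theta / (2 * g**2)) * rng.standard_normal()  # Gaussian policy
        r = -(x**2) * dt                                              # illustrative running reward
        x_next = x + g**2 * a * dt + g * np.sqrt(dt) * rng.standard_normal()
        # TD(0) error, exploiting the (approximate) martingale condition on V
        delta = r + features(x_next) @ w_v - features(x) @ w_v
        w_v += 1e-2 * delta * features(x)          # critic update
        w_mu += 1e-3 * delta * (a - mu) * x        # actor update (likelihood-ratio style)
        x = x_next
```

In the full method, the critic and actor would be neural networks over the augmented state $(t, x)$, and the score would enter the actor mean via ratio estimation as described above.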

4. Connection to KL-Control, HJB, and Path-Integral SOC

Reward-guided diffusion is fundamentally a path-integral control problem: the optimal controlled measure over trajectories is a KL-tilted exponential reweighting of the uncontrolled process,

$$p^*(\tau) \propto p^{\text{uncontrolled}}(\tau)\,\exp\left(\frac{1}{\alpha} r(x_T)\right)$$

where $r$ is the terminal reward (Zhang et al., 2023, Su et al., 1 Jul 2025). The solution to the HJB PDE (or its Feynman–Kac representation) provides the desirability function or soft value, whose gradient gives the optimal feedback control. In discrete time, analogous KL-tilted distributions enable low-variance off-policy algorithms via the exponential pay-off distribution or forward-KL distillation (Zhang et al., 2023, Su et al., 1 Jul 2025).
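A minimal sketch of this pathwise tilting, assuming a driftless reference process and a toy quadratic terminal reward: sample uncontrolled trajectories, then reweight terminal states by $\exp(r(x_T)/\alpha)$ via self-normalized importance sampling, the simplest approximation to the Gibbs path measure.

```python
import numpy as np

# Self-normalized importance sampling against the KL-tilted path measure:
# sample uncontrolled trajectories, weight each by exp(r(x_T) / alpha).
rng = np.random.default_rng(1)
alpha, n_paths, n_steps, dt = 0.5, 5000, 100, 0.01

x = rng.standard_normal(n_paths)                        # x_0 ~ N(0, 1)
for _ in range(n_steps):
    x = x + np.sqrt(dt) * rng.standard_normal(n_paths)  # driftless reference dynamics (assumed)

reward = -(x - 1.0) ** 2        # illustrative terminal reward peaking at x_T = 1
w = np.exp(reward / alpha)
w /= w.sum()                    # normalize the pathwise Gibbs weights
tilted_mean = np.sum(w * x)     # estimate of E_{p*}[x_T]

# The tilted mean is pulled from 0 (reference) toward 1 (reward mode).
```

SMC methods improve on this by resampling along the trajectory instead of only reweighting at the terminal time.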

Inference-time guidance methods, such as classifier guidance and Sequential Monte Carlo (SMC) approaches, can be interpreted as practical approximations to these SOC-driven optimal processes; they seek to sample from or emulate the pathwise Gibbs measure defined above (Uehara et al., 16 Jan 2025, Jiao et al., 4 Dec 2025). The same principles extend to discrete settings, where MDP/CTMC analogues and off-policy corrections via Monte Carlo Tree Search or replay buffers become central (Tang et al., 29 Sep 2025).

5. Loss Functions, Variance Reduction, and Memoryless Schedules

Optimization of reward-guided diffusion via SOC involves estimators spanning REINFORCE (score-function), pathwise/adjoint, matching, and KL-based losses (Domingo-Enrich, 2024, Domingo-Enrich et al., 2024). Variance reduction is a critical issue: adjoint-matching losses based on lean adjoint ODEs and memoryless noise schedules provide favorable statistical properties, matching the expected gradient at reduced variance while avoiding the bias that dependence on initial or terminal conditions would otherwise introduce (Domingo-Enrich et al., 2024).

Explicitly, memoryless schedules (e.g., $\sigma(t) = \sqrt{2\eta_t}$) ensure that tilting by the reward yields exact marginal alignment at terminal time, whereas naive KL-regularized objectives introduce a residual bias. This property underpins the theoretical correctness of reward-guided tilting in both diffusion and flow-matching models (Domingo-Enrich et al., 2024).
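The variance gap between the estimator families can be seen in a scalar toy problem: estimating $\partial_\mu\,\mathbb{E}_{x\sim\mathcal{N}(\mu,1)}[h(x)]$ with $h(x)=x^2$ by the score-function (REINFORCE) route versus the pathwise (reparameterization) route. This is a generic illustration of the variance issue, not the adjoint-matching estimator itself.

```python
import numpy as np

# Compare REINFORCE vs pathwise gradient estimators of d/dmu E[h(x)],
# x ~ N(mu, 1), h(x) = x^2.  True gradient is 2 * mu = 1.0 at mu = 0.5.
rng = np.random.default_rng(2)
mu, n = 0.5, 100_000
eps = rng.standard_normal(n)
x = mu + eps                      # reparameterized sample: x = mu + eps

h = x ** 2
reinforce = h * (x - mu)          # h(x) * d/dmu log N(x; mu, 1)
pathwise = 2 * x                  # dh/dx * dx/dmu, with dx/dmu = 1

# Both estimators are unbiased (sample means near 1.0), but the pathwise
# estimator has far lower variance -- the motivation for adjoint-based losses.
print(reinforce.mean(), reinforce.var())
print(pathwise.mean(), pathwise.var())
```

The same contrast scales to diffusion fine-tuning, where the "pathwise" role is played by adjoint ODEs differentiated through the sampler.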

6. Extensions and Application Scenarios

SOC-based reward-guided diffusion generalizes to:

  • Probability flow ODEs, where drift control applies to deterministic generative flows.
  • Conditional diffusion, augmenting the control and value function with context variables (Gao et al., 2024).
  • Discrete state spaces (e.g., sequence generation), with controlled CTMCs and trajectory replay via MCTS in discrete diffusion (Tang et al., 29 Sep 2025).
  • Training-free, inference-time guidance, and adaptive schedule selection (e.g., adaptive classifier-free guidance or training-free image editing), relying on Pontryagin’s Maximum Principle and efficient approximation-by-iteration in high-dimensional domains (Chang et al., 30 Sep 2025, Azangulov et al., 25 May 2025).

7. Empirical Insights and Practical Considerations

Empirically, reward-guided SOC reformulations yield substantial benefits over prior hard-constrained or heuristic approaches:

  • Improved sample sharpness and detail in diffusion bridges and image restoration tasks by tuning control/penalty weights away from strict endpoint constraints (Zhu et al., 9 Feb 2025).
  • Robust reward maximization and fidelity to source distributions in image editing, as demonstrated by trajectory control optimizing both reward alignment and content faithfulness (Chang et al., 30 Sep 2025).
  • Enhanced stability and sample efficiency in biomolecular design and other scientific domains, via forward-KL iterative distillation and off-policy data collection (Su et al., 1 Jul 2025).
  • Superior coverage and controllability in discrete settings, demonstrated by trajectory-aware fine-tuning with MCTS (Tang et al., 29 Sep 2025).

The choice of control cost, loss function, and optimizer is guided by trade-offs among bias, variance, and computational tractability, as summarized in the taxonomy of SOC loss functions (Domingo-Enrich, 2024). Memoryless fine-tuning and adjoint matching currently provide state-of-the-art consistency, stability, and sample diversity in reward-based diffusion fine-tuning (Domingo-Enrich et al., 2024).

