
Guided Diffusion with Control-Theoretic Rewards

Updated 17 December 2025
  • The paper introduces a control-theoretic guided diffusion framework that integrates optimal control, reinforcement learning, and gradient-based guidance to optimize reward functions without sacrificing sample quality.
  • The methodology reformulates the reverse diffusion process as a controlled stochastic dynamical system using score-based SDEs and HJB equations to incorporate reward potentials.
  • Practical implementations span inference-time guidance, RL fine-tuning, and conditional score learning, achieving notable gains in continuous control, image generation, and graph-based tasks.

Guided diffusion with control-theoretic rewards is a framework for steering diffusion-based generative models to produce samples that optimize explicit reward objectives while maintaining fidelity to a foundation model's learned distribution. By integrating optimal control, reinforcement learning (RL), and gradient-based guidance, this approach endows diffusion models with the ability to satisfy task-specific criteria—ranging from maximizing downstream utility functions in sequential decision-making to enforcing constraints in graph or protein design—without sacrificing sample quality or mode coverage.

1. Mathematical Formalism of Control-Theoretic Guidance

At the core of guided diffusion models is the interpretation of the reverse denoising process as a controlled stochastic dynamical system. In continuous time, consider the forward SDE:

dX_t = f(t, X_t)dt + g(t)dW_t, \quad X_0 \sim p_0,

and its corresponding score-based reverse-time SDE:

dY_t = \left[ f(T-t, Y_t) + g^2(T-t)s_t(Y_t)\right]dt + g(T-t)d\overline{W}_t,

where s_t(x) = \nabla_x \log p_t(x) is the score function at noise level t (Jiao et al., 4 Dec 2025). The guidance objective is to modify the generative process to sample from a reward-weighted target measure, such that for a terminal reward r^{\mathrm{ext}}(x), the new data distribution is

p^{\mathrm{rw}}_0(x) \propto r^{\mathrm{ext}}(x)\, p_0(x).

The induced control law for the reverse SDE is to inject a drift

g_t(x) = \nabla_x \log p_t^{\mathrm{rw}}(x) - \nabla_x \log p_t(x),

yielding the guided SDE:

dY_t^w = \left[ f(T-t, Y_t^w) + g^2(T-t)\left( (1-w)\,s_t(Y_t^w) + w\, s_t^{\mathrm{rw}}(Y_t^w) \right) \right] dt + g(T-t)d\overline{W}_t,

where w interpolates between the unguided model (w = 0) and the fully reward-weighted drift (w = 1) (Jiao et al., 4 Dec 2025).

Guidance can be implemented in discrete-time DDPMs analogously, by convexly combining the predicted scores of the base and reward-weighted models at each step (Jiao et al., 4 Dec 2025, Nuti et al., 2023).
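
To make the discrete-time recipe concrete, here is a minimal, self-contained sketch of a DDPM-style ancestral step that convexly combines a base score and a reward-weighted score with weight w. The functions score_base and score_rw and the noise schedule are toy placeholders, not the trained models or schedules of the cited papers.

```python
# Minimal sketch of weight-w guidance in a discrete-time (DDPM-style) sampler.
# `score_base` and `score_rw` stand in for the pretrained and reward-weighted
# score networks; here they are toy analytic Gaussian scores.
import numpy as np

def score_base(x, t):
    return -x                      # score of N(0, I) (placeholder)

def score_rw(x, t):
    return -(x - 2.0)              # score of N(2, I) (placeholder)

def guided_ancestral_step(x, t, beta_t, w, rng):
    """One reverse step using the convex combination (1-w)*s_base + w*s_rw."""
    s = (1.0 - w) * score_base(x, t) + w * score_rw(x, t)
    mean = (x + beta_t * s) / np.sqrt(1.0 - beta_t)   # DDPM posterior-mean approximation
    noise = rng.standard_normal(x.shape) if t > 0 else 0.0
    return mean + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                 # start from the prior
betas = np.linspace(0.02, 1e-4, 50)        # toy noise schedule, largest noise first
for step, beta_t in enumerate(betas):
    x = guided_ancestral_step(x, len(betas) - 1 - step, beta_t, w=0.7, rng=rng)
print(x)
```
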

2. Extraction and Use of Reward Potentials

When explicit reward functions are not given, relative reward functions can be extracted by comparing the scores of an expert (high-reward) diffusion model s_\Theta(x, t) and a base (low-reward) model s_\phi(x, t):

h_t(x) = s_\Theta(x, t) - s_\phi(x, t).

The existence and uniqueness of a conservative field, i.e., a potential function \rho(x, t) such that \nabla_x \rho(x, t) \approx h_t(x), is established as the L²-projection onto smooth gradients (Nuti et al., 2023). This potential can be learned via gradient alignment:

\min_\theta \, \mathbb{E}_{t, x_t} \left\| \nabla_x \rho_\theta(x_t, t) - [s_\Theta(x_t, t) - s_\phi(x_t, t)] \right\|_2^2.

Once learned, \nabla_x \rho_\theta can be injected as a guidance term via classifier-like gradient shaping in the reverse diffusion process (Nuti et al., 2023).
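
A minimal sketch of the gradient-alignment objective above, assuming access to the two score models. PotentialNet, the analytic placeholder scores, and the sampling of (t, x_t) are illustrative stand-ins rather than the setup of the cited work.

```python
# Sketch of learning a reward potential rho_theta by gradient alignment.
import torch
import torch.nn as nn

class PotentialNet(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(),
                                 nn.Linear(128, 128), nn.SiLU(),
                                 nn.Linear(128, 1))

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1)).squeeze(-1)

def score_expert(x, t):  # placeholder for s_Theta
    return -(x - 1.0)

def score_base(x, t):    # placeholder for s_phi
    return -x

dim, rho = 2, PotentialNet(2)
opt = torch.optim.Adam(rho.parameters(), lr=1e-3)
for _ in range(200):
    x = torch.randn(256, dim, requires_grad=True)
    t = torch.rand(256, 1)
    grad_rho = torch.autograd.grad(rho(x, t).sum(), x, create_graph=True)[0]
    target = (score_expert(x, t) - score_base(x, t)).detach()
    loss = ((grad_rho - target) ** 2).sum(dim=-1).mean()   # gradient-alignment loss
    opt.zero_grad(); loss.backward(); opt.step()
```
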

3. Reward-Guided Diffusion as Stochastic Optimal Control

Guided diffusion can be reframed as an entropy-regularized stochastic optimal control problem. The objective is to maximize the expected terminal reward r(x_T) while penalizing deviation of the controlled drift from the pretrained drift (e.g., via a KL divergence between path measures):

\max_{g} \, \mathbb{E}[r(x_T)] - \frac{1}{2}\int_0^T \| f^\mathrm{pre}(t, x_t) - g(t, x_t)\|^2 / \sigma^2(t)\, dt,

subject to

dx_t = g(t, x_t)dt + \sigma(t)dw_t,

with optimal control drift given by

g^\ast(t, x) = f^\mathrm{pre}(t, x) + \sigma^2(t)\nabla_x v_t^\ast(x),

where v_t^\ast solves the corresponding Hamilton-Jacobi-Bellman (HJB) equation with boundary condition v_T^\ast(x) = r(x) (Zhao et al., 17 Jun 2024, Gao et al., 7 Sep 2024). This connection appears explicitly both in the sampling dynamics and in the derivation of the guidance term.
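
The optimal drift can be assembled directly from a learned value function by automatic differentiation. The sketch below assumes a hypothetical value_net approximating v_t^* and placeholder choices for f^pre and sigma; it only illustrates how \sigma^2(t)\nabla_x v_t^\ast(x) enters the controlled drift.

```python
# Sketch of the controlled drift g*(t, x) = f_pre(t, x) + sigma^2(t) * grad_x v(t, x).
import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 1))

def f_pre(t, x):                     # pretrained reverse drift (placeholder)
    return -x

def sigma(t):                        # diffusion coefficient (placeholder schedule)
    return 1.0 - 0.9 * t

def guided_drift(t, x):
    x = x.detach().requires_grad_(True)
    t_col = torch.full((x.shape[0], 1), float(t))
    v = value_net(torch.cat([x, t_col], dim=-1)).sum()
    grad_v = torch.autograd.grad(v, x)[0]          # approximates grad_x v_t^*(x)
    return f_pre(t, x) + sigma(t) ** 2 * grad_v

x = torch.randn(8, 2)
print(guided_drift(0.5, x).shape)    # torch.Size([8, 2])
```
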

4. Algorithmic Implementations and Practical Workflows

Practical guided diffusion implementations fall into three main paradigms:

| Paradigm | Key Principle | Example Works |
|---|---|---|
| Inference-time Guidance | Add a reward gradient or value function to the reverse process without retraining | (Jiao et al., 4 Dec 2025, Uehara et al., 16 Jan 2025, Nuti et al., 2023, Tenorio et al., 26 May 2025) |
| Fine-tuning via RL | Optimize the diffusion score model using policy gradient or actor-critic with KL regularization | (Zhao et al., 17 Jun 2024, Zhang et al., 2023, Huh et al., 16 Feb 2025, Gao et al., 7 Sep 2024) |
| Conditional Score Learning | Train conditional scores or reward networks | (Yuan et al., 2023, Cheng et al., 29 Sep 2025) |

Inference-time guidance (classifier guidance, SMC, SVDD, value-based) applies a reward or cost gradient at each step, e.g.,

s_{\mathrm{guided}}(x_t, t) = s_{\mathrm{base}}(x_t, t) + w \nabla_{x_t} R(x_t),

or uses value functions as lookahead/importance weights in SMC or beam search (Uehara et al., 16 Jan 2025, Tenorio et al., 26 May 2025).
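
As a concrete illustration of value-based inference-time guidance without gradients, the following sketch reweights and resamples a population of denoising states by an exponentiated value estimate, in the spirit of SMC-style guidance. The weighting exp(lambda * V(x)) and the toy value function are assumptions, not the exact scheme of the cited works.

```python
# Sketch of value-weighted resampling at one denoising step: particles are
# reweighted by exp(lambda * V(x)) and resampled rather than moved along a gradient.
import numpy as np

def resample_by_value(particles, value, lam, rng):
    logw = lam * value(particles)                   # log importance weights
    w = np.exp(logw - logw.max())
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]

rng = np.random.default_rng(0)
particles = rng.standard_normal((64, 2))            # current denoising states
value = lambda x: -np.sum((x - 1.0) ** 2, axis=-1)  # toy value/lookahead estimate
particles = resample_by_value(particles, value, lam=2.0, rng=rng)
```
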

Policy fine-tuning re-optimizes the generative model itself, either via RL analogs (REINFORCE, actor-critic, PPO) or direct preference optimization (DPO), always with explicit KL divergence regularization towards the foundation model (Huh et al., 16 Feb 2025).
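
A minimal sketch of one such KL-regularized surrogate loss, assuming per-step log-probabilities of a sampled trajectory under the fine-tuned and frozen base reverse kernels are available. The REINFORCE-style estimator and mean baseline are illustrative choices, not the exact objective of any cited method.

```python
# Sketch of a REINFORCE-style surrogate with KL regularization toward the base model.
import torch

def reinforce_kl_loss(logp_theta, logp_base, reward, beta):
    """logp_theta, logp_base: (batch, T) per-step log-probs of sampled transitions
    under the fine-tuned and base reverse kernels; reward: (batch,) terminal rewards."""
    advantage = reward - reward.mean()                       # simple baseline
    pg = -(advantage.detach() * logp_theta.sum(dim=1)).mean()
    kl = (logp_theta - logp_base).sum(dim=1).mean()          # MC estimate of the path KL
    return pg + beta * kl

# Example with random stand-in log-probabilities and rewards.
logp_theta = torch.randn(32, 20, requires_grad=True)
logp_base = torch.randn(32, 20)
print(reinforce_kl_loss(logp_theta, logp_base, reward=torch.rand(32), beta=0.1))
```
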

Conditional score learning encodes rewards or constraints as conditioning variables, learning a score network s_\theta(x, r, t) directly via reward- or value-weighted denoising objectives (Yuan et al., 2023, Cheng et al., 29 Sep 2025).
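
The sketch below illustrates one way such a conditional, reward-weighted denoising objective can look: the network receives the reward as an extra input and the per-sample loss is weighted by exp(eta * r). The architecture, noise schedule, and weighting are assumptions for illustration only.

```python
# Sketch of a reward-conditioned, reward-weighted denoising objective.
import torch
import torch.nn as nn

class CondScoreNet(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 2, 128), nn.SiLU(),
                                 nn.Linear(128, dim))

    def forward(self, x_t, r, t):
        return self.net(torch.cat([x_t, r, t], dim=-1))

def weighted_denoising_loss(model, x0, r, eta=1.0):
    t = torch.rand(x0.shape[0], 1)
    noise = torch.randn_like(x0)
    alpha_bar = torch.exp(-5.0 * t)                    # toy noise schedule
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
    pred = model(x_t, r, t)                            # predicts the added noise
    w = torch.exp(eta * r).squeeze(-1)                 # reward weighting
    return (w * ((pred - noise) ** 2).sum(dim=-1)).mean()

model = CondScoreNet(dim=2)
x0, r = torch.randn(128, 2), torch.rand(128, 1)
print(weighted_denoising_loss(model, x0, r))
```
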

5. Extensions: Non-differentiability, Generalization, and Beyond Trajectories

Guidance with control-theoretic rewards naturally accommodates non-differentiable and black-box rewards. For discrete spaces or graphs, zero-order (ZO), best-of-N, and smoothed gradient estimators allow gradient-free optimization (Tenorio et al., 26 May 2025). Empirically, GGDiff demonstrates that gradient-based and ZO approaches can be unified in a plug-and-play framework for conditional graph diffusion, with improved constraint satisfaction and diversity over baselines.
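
For intuition, here is a standard two-point Gaussian-smoothing zero-order estimator that can stand in for \nabla_x R(x_t) when R is black-box or non-differentiable; the smoothing radius and number of directions are illustrative, and this is not claimed to be the exact estimator used in GGDiff.

```python
# Sketch of a two-point zero-order gradient estimate for a black-box reward.
import numpy as np

def zo_reward_grad(reward_fn, x, num_dirs=32, mu=0.05, rng=None):
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(x)
    for _ in range(num_dirs):
        u = rng.standard_normal(x.shape)
        grad += (reward_fn(x + mu * u) - reward_fn(x - mu * u)) / (2 * mu) * u
    return grad / num_dirs

# Example with a non-differentiable (thresholded) black-box reward.
reward = lambda x: float(np.sum(x > 0.5))
x = np.zeros(8)
print(zo_reward_grad(reward, x))
```
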

Safety and stability constraints can be incorporated using Lyapunov-type energy functions. In S²Diff, a control Lyapunov barrier function V is learned to satisfy both positivity and decrease constraints; its sublevel sets enforce trajectory safety, while soft penalties regularize deviations (Cheng et al., 29 Sep 2025).
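
A minimal sketch of how positivity and decrease conditions on a candidate Lyapunov function can be turned into soft penalties on sampled state pairs; the hinge form, margin, and decay rate are assumptions, not the exact construction used in S²Diff.

```python
# Sketch of soft Lyapunov penalties: positivity of V and decrease along a rollout.
import torch
import torch.nn.functional as F

def lyapunov_penalties(V, x_curr, x_next, alpha=0.1, margin=1e-3):
    v_curr, v_next = V(x_curr), V(x_next)
    positivity = F.relu(margin - v_curr).mean()              # V(x) >= margin
    decrease = F.relu(v_next - (1 - alpha) * v_curr).mean()  # V shrinks step to step
    return positivity + decrease

V = lambda x: (x ** 2).sum(dim=-1)           # toy quadratic candidate
x_curr = torch.randn(16, 3)
x_next = 0.9 * x_curr                        # toy "next state" from a rollout
print(lyapunov_penalties(V, x_curr, x_next))
```
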

Generalization to image and structured domains is demonstrated by learning reward potentials to distinguish safe vs. unsafe generations or rare mode emergence (e.g., in Stable Diffusion). The control-theoretic formalism ensures that guidance transfers to previously unseen base models, driving alignment under new reward signals (Nuti et al., 2023).

6. Sample Complexity, Robustness, and Limitations

The reward-guided framework provides tunable trade-offs between reward maximization and divergence from the original distribution, controlled via guidance hyperparameters (e.g., w, KL penalty) or temperature terms. Empirical studies reveal:

  • On continuous control benchmarks (Maze2D, D4RL), guided diffusion with reward or Lyapunov shaping yields 17–43% gains in average trajectory reward over base diffusion models (Nuti et al., 2023, Huh et al., 16 Feb 2025).
  • In conditional generation (e.g., image compressibility or aesthetics), RL-based guidance matches or surpasses reconstruction-based baselines with higher coverage of rare modes (Zhao et al., 17 Jun 2024).
  • On graph tasks, unified guidance improves constraint alignment (motif, fairness) and link prediction accuracy versus non-guided priors (Tenorio et al., 26 May 2025).

Nonetheless, limitations include computational overhead of fine-tuning (e.g., backprop through denoising chains), Monte Carlo variance for score estimators in RL-guided methods, and potential mode collapse if guidance is not properly regularized. For high-dimensional or non-smooth rewards, careful design of guidance strength and reward shaping is critical to maintain sample quality (Gao et al., 7 Sep 2024).

7. Unified Theoretical Perspective and Empirical Impact

Recent theoretical development shows that classifier guidance, classifier-free guidance, and generic reward-guided samplers are all manifestations of entropy-regularized optimal control, or equivalently, Doob h-transforms of diffusion processes (Jiao et al., 4 Dec 2025). The main theorem of (Jiao et al., 4 Dec 2025) shows that reward-guided SDEs strictly improve the expected reward, with an explicit integral characterization of the improvement. These guarantees hold for general reward signals and validate empirical findings across multiple fields.

Error bounds and subspace-fidelity theorems further link distributional shift, reward improvement, and off-support regret, providing clear performance metrics and stability criteria for guided diffusion models (Yuan et al., 2023).

Empirical highlights span real-world humanoid control (e.g., zero-shot joystick navigation, obstacle avoidance) (Liao et al., 11 Aug 2025), image generation with controlled attributes, and protein design and structural biology for optimizing affinity and stability (Uehara et al., 16 Jan 2025). Across domains, guided diffusion with control-theoretic rewards represents a general toolkit for sample-efficient, robust, and adaptable conditional generative modeling.
