Diffusion Controller (DiffCon)

Updated 4 July 2026

DiffCon is a control-theoretic framework for controllable diffusion generation that reformulates reverse sampling as a state-only stochastic control problem in a generalized linearly-solvable Markov decision process (LS-MDP).
It reweights pretrained reverse-time kernels under an f-divergence penalty, with KL regularization as a special case, to optimize a terminal objective such as a reward.
Under KL divergence, the framework yields a linear Bellman recursion and a structured score decomposition, which motivates a gray-box adapter parameterization for fine-tuning.

Diffusion Controller, usually abbreviated DiffCon, denotes a control-theoretic framework for controllable diffusion generation that interprets reverse diffusion sampling as state-only stochastic control in a generalized linearly-solvable Markov decision process (LS-MDP) (Yang et al., 7 Mar 2026). In this formulation, a pretrained diffusion model supplies the passive reverse-time dynamics, while control acts by reweighting those reverse transition kernels to optimize a terminal objective under an $f$-divergence penalty relative to the pretrained model (Yang et al., 7 Mar 2026). The framework was introduced to unify methods that had often been treated as separate heuristics (classifier or reward guidance, reinforcement-learning fine-tuning, reward-weighted regression, and lightweight adapters) under one stochastic-control language (Yang et al., 7 Mar 2026).

1. Definition and conceptual scope

DiffCon is motivated by the observation that controllable diffusion generation is typically implemented through seemingly heterogeneous mechanisms: some methods alter the score during inference, some fine-tune the model with rewards, some regularize toward the pretrained model, and some attach lightweight adapters to a frozen backbone (Yang et al., 7 Mar 2026). DiffCon recasts these operations as instances of a single problem: modifying a pretrained reverse diffusion process so that final samples optimize an external objective while remaining close to the original generator.

The framework uses a reverse indexing convention aligned with reinforcement-learning notation: $x_T$ is the clean sample and $x_1$ is pure noise. The forward noising process is written as

$x_t = \sqrt{\alpha_t}\,x_{t+1} + \sqrt{\beta_t}\,\xi_t = \sqrt{\overline{\alpha}_t}\,x_T + \sqrt{1-\overline{\alpha}_t}\,\xi,$

with $\alpha_t=1-\beta_t$ and $\overline{\alpha}_t=\prod_{i=t}^{T-1}\alpha_i$ (Yang et al., 7 Mar 2026). A pretrained score or noise predictor $\epsilon_0(x_t,c,t)$ induces a reverse-time Gaussian kernel

$p_{0,t}(x_{t+1}\mid x_t,c)=\mathcal N\!\left(x_{t+1}\mid \mu_0(x_t,c,t),\,\widetilde\beta_t I_d\right),$

where $p_{0,t}$ is treated as the passive dynamics of the control problem (Yang et al., 7 Mar 2026).
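The reverse indexing convention is easy to get backwards, so a small numerical sketch may help. It computes $\overline{\alpha}_t=\prod_{i=t}^{T-1}\alpha_i$ and draws $x_t$ from a clean $x_T$ using the closed form above. The schedule values, array layout, and function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def alpha_bar(betas):
    """Return abar_t = prod_{i=t}^{T-1} alpha_i for t = 1..T-1.

    betas[k] holds beta_{k+1}, so index 0 is the noisiest step (t = 1)
    and the last index is the step adjacent to the clean sample x_T.
    """
    alphas = 1.0 - betas
    # Reverse cumulative product: entry k multiplies alphas[k], ..., alphas[T-2].
    return np.cumprod(alphas[::-1])[::-1]

def noise_to_step(x_T, t, betas, rng):
    """Sample x_t from clean x_T via x_t = sqrt(abar_t) x_T + sqrt(1 - abar_t) xi."""
    abar_t = alpha_bar(betas)[t - 1]
    xi = rng.standard_normal(x_T.shape)
    return np.sqrt(abar_t) * x_T + np.sqrt(1.0 - abar_t) * xi

rng = np.random.default_rng(0)
betas = np.linspace(0.02, 1e-4, 9)  # hypothetical schedule for T = 10 (beta_1..beta_9)
x_T = np.ones(4)                    # toy clean sample
abar = alpha_bar(betas)
x_1 = noise_to_step(x_T, 1, betas, rng)
```

In this layout `abar` increases with `t`: fewer factors remain in the product as the index approaches the clean sample, so the signal coefficient grows toward 1.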

In that sense, DiffCon is not a new diffusion backbone. It is a framework for controlling an existing pretrained reverse process. The central object is the controlled transition law, not a standalone sampling heuristic.

2. Reverse diffusion as state-only stochastic control

The LS-MDP induced by DiffCon is defined over the diffusion trajectory, with state effectively $(x_t,c)$, explicit time index $t$, and fixed condition $c$ along the trajectory (Yang et al., 7 Mar 2026). Control is a measurable function $u_t$ that acts by exponential tilting of the pretrained reverse kernel:

$p_{u,t}(x_{t+1}\mid x_t,c)=p_{0,t}(x_{t+1}\mid x_t,c)\,\exp\!\big(u_t(x_{t+1}\mid x_t,c)\big),$

subject to normalization, $\int p_{0,t}(x_{t+1}\mid x_t,c)\,\exp\!\big(u_t(x_{t+1}\mid x_t,c)\big)\,dx_{t+1}=1$ (Yang et al., 7 Mar 2026).
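Exponential tilting with a normalization constraint can be illustrated on a toy discrete kernel. The sketch below is an assumption-laden simplification: the diffusion state is continuous, whereas here the passive kernel is a small row-stochastic matrix, and the names `tilt_kernel` and `kl_rows` are hypothetical helpers. It tilts each row by a control, rescales so the row remains a probability distribution, and evaluates the KL deviation cost of the tilted kernel.

```python
import numpy as np

def tilt_kernel(p0, u):
    """Tilt a row-stochastic kernel p0[s, s'] by exp(u[s, s']).

    The control is shifted per row so that sum_{s'} p0[s, s'] * exp(u[s, s']) = 1,
    which is the normalization constraint on the control.
    """
    log_norm = np.log(np.sum(p0 * np.exp(u), axis=1, keepdims=True))
    u_normalized = u - log_norm
    return p0 * np.exp(u_normalized), u_normalized

def kl_rows(p, q):
    """Row-wise KL(p || q) for strictly positive kernels."""
    return np.sum(p * (np.log(p) - np.log(q)), axis=1)

# Toy passive kernel over three states (illustrative numbers only).
p0 = np.array([[0.7, 0.2, 0.1],
               [0.3, 0.4, 0.3],
               [0.1, 0.2, 0.7]])
u = np.array([[0.0, 0.5, 1.0]] * 3)  # control that favours the last state
p_u, u_norm = tilt_kernel(p0, u)
cost = kl_rows(p_u, p0)              # per-state deviation cost under KL
```

For a tilted kernel the KL cost equals the expected normalized control under the controlled kernel, which is why the penalty can be read as a price paid per denoising step.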

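The regularized Bellman objective is

$V_t(x_t,c)=\max_{u_t}\Big\{r_t(x_t,c)+\mathbb E_{x_{t+1}\sim p_{u,t}(\cdot\mid x_t,c)}\big[V_{t+1}(x_{t+1},c)\big]-\lambda\,D_f\big(p_{u,t}(\cdot\mid x_t,c)\,\|\,p_{0,t}(\cdot\mid x_t,c)\big)\Big\},$

with terminal value $V_T(x_T,c)=r_T(x_T,c)$ (Yang et al., 7 Mar 2026). Here, $p_{u,t}$ is the controlled reverse kernel, $p_{0,t}$ is the pretrained one, and $\lambda>0$ controls the trade-off between optimizing the terminal objective and remaining close to the pretrained generator.

For reward-driven fine-tuning, the reward is terminal only:

$r_t(x_t,c)=0\ \text{for } t<T,\qquad r_T(x_T,c)=r(x_T,c).$

This produces a diffusion-control problem in which every reverse denoising step pays a deviation cost relative to the pretrained kernel, but only the terminal sample receives external reward (Yang et al., 7 Mar 2026).

DiffCon also introduces a path-space formulation. If $P_0$ is the pretrained reverse trajectory law on $x_{1:T}$, then the controlled objective is

$\max_{P}\ \mathbb E_{x_{1:T}\sim P}\big[r(x_T,c)\big]-\lambda\,D_f(P\,\|\,P_0).$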

This pathwise view is central for the reward-weighted regression derivation (Yang et al., 7 Mar 2026).

3. Optimality conditions and the KL-specialized LS-MDP structure

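Under general $f$-divergence regularization, DiffCon remains a state-only stochastic-control problem. Under KL regularization, however, the framework acquires the exact linear structure associated with classical LS-MDPs (Yang et al., 7 Mar 2026). Write $V_t$ for the optimal value function, $\lambda>0$ for the regularization weight, and $r$ for the terminal reward. In that case, the optimal control has the twisting form

$u_t^*(x_{t+1}\mid x_t,c)=\frac{V_{t+1}(x_{t+1},c)-V_t(x_t,c)}{\lambda},$

which yields the optimal controlled kernel

$p_t^*(x_{t+1}\mid x_t,c)=p_{0,t}(x_{t+1}\mid x_t,c)\,\exp\!\left(\frac{V_{t+1}(x_{t+1},c)-V_t(x_t,c)}{\lambda}\right).$

The associated Bellman equation becomes

$V_t(x_t,c)=\lambda\log\mathbb E_{x_{t+1}\sim p_{0,t}(\cdot\mid x_t,c)}\!\left[\exp\!\left(\frac{V_{t+1}(x_{t+1},c)}{\lambda}\right)\right],\qquad t<T.$

Defining the desirability function

$z_t(x_t,c)=\exp\!\left(\frac{V_t(x_t,c)}{\lambda}\right),$

with terminal condition

$z_T(x_T,c)=\exp\!\left(\frac{r(x_T,c)}{\lambda}\right),$

linearizes the recursion:

$z_t(x_t,c)=\mathbb E_{x_{t+1}\sim p_{0,t}(\cdot\mid x_t,c)}\big[z_{t+1}(x_{t+1},c)\big].$

This linearity in the desirability is the property from which linearly-solvable MDPs take their name (Yang et al., 7 Mar 2026).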
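The linear recursion can be checked numerically on a toy finite chain. The sketch below assumes the standard KL-regularized LS-MDP relations with a terminal-only reward: desirability `z_T = exp(r / lam)`, backward pass `z_t = P_t z_{t+1}`, value `V_t = lam * log z_t`, and optimal kernel `p0 * z_{t+1} / z_t`. The symbol `lam` for the regularization weight, the discrete state space, and all function names are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

def desirability_backward(kernels, reward, lam):
    """Backward pass z_t = P_t @ z_{t+1}, starting from z_T = exp(reward / lam).

    kernels[k] is the passive row-stochastic kernel for the k-th reverse step.
    Returns a list ordered from the earliest step to the terminal one.
    """
    z = [np.exp(reward / lam)]
    for P in reversed(kernels):
        z.append(P @ z[-1])  # one matrix-vector product per step: the recursion is linear
    return z[::-1]

def optimal_kernels(kernels, z):
    """Optimal controlled kernel p*_t(s'|s) = p0_t(s'|s) * z_{t+1}(s') / z_t(s)."""
    return [P * z[k + 1][None, :] / z[k][:, None] for k, P in enumerate(kernels)]

rng = np.random.default_rng(0)
n_states, n_steps, lam = 4, 5, 0.5
kernels = [rng.dirichlet(np.ones(n_states), size=n_states) for _ in range(n_steps)]
reward = np.array([0.0, 0.2, 0.5, 1.0])  # hypothetical terminal reward per state

z = desirability_backward(kernels, reward, lam)
V = [lam * np.log(zt) for zt in z]
p_star = optimal_kernels(kernels, z)

# Right-hand side of the soft (log-sum-exp) Bellman equation, for comparison with V.
bellman_rhs = [lam * np.log(P @ np.exp(V[k + 1] / lam)) for k, P in enumerate(kernels)]
```

The nonlinear Bellman equation in `V` is thereby replaced by plain matrix-vector products in `z`, with no maximization inside the loop.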

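The KL case is also the regime in which DiffCon establishes its strongest equivalence between path-space optimization and terminal-law optimization. Specifically,

$p^*(x_T\mid c)\ \propto\ p_0(x_T\mid c)\,\exp\!\left(\frac{r(x_T,c)}{\lambda}\right),$

where $p_0(\cdot\mid c)$ is the terminal marginal of the pretrained reverse process, $r$ is the terminal reward, and $\lambda$ is the KL regularization weight, so the optimal terminal law is an exponentially tilted version of the pretrained terminal law (Yang et al., 7 Mar 2026). This result underlies the KL minimizer-preservation guarantee for reward-weighted regression.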
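The link between path-space and terminal-law optimization under KL can be verified by brute force on a chain small enough to enumerate. The sketch below builds the full trajectory law of a toy discrete process, tilts it by `exp(r(x_T) / lam)`, and compares its terminal marginal with the tilted pretrained terminal marginal. The discrete setting, the symbol `lam`, and the helper names are assumptions made for illustration.

```python
import numpy as np

def path_law(init, kernels):
    """Joint law over all trajectories of a small discrete chain (one axis per step)."""
    P = init
    for K in kernels:
        # Append a time axis: P[..., s] * K[s, s'] -> P[..., s, s'].
        P = P[..., :, None] * K
    return P

def kl(p, q):
    """KL(p || q) for strictly positive arrays of the same shape."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(1)
n, lam = 3, 0.7
init = rng.dirichlet(np.ones(n))
kernels = [rng.dirichlet(np.ones(n), size=n) for _ in range(3)]
reward = np.array([0.0, 0.5, 1.0])      # hypothetical terminal reward per state

P0 = path_law(init, kernels)            # pretrained trajectory law
tilt = np.exp(reward / lam)             # depends on the terminal state only
P_star = P0 * tilt / np.sum(P0 * tilt)  # KL-optimal trajectory law

axes = tuple(range(P0.ndim - 1))        # all axes except the terminal one
p0_T = P0.sum(axis=axes)
p_star_T = P_star.sum(axis=axes)
q_star = p0_T * tilt / np.sum(p0_T * tilt)  # tilted pretrained terminal law

def path_objective(P):
    """E_P[r(x_T)] - lam * KL(P || P0) on path space."""
    return float(np.sum(P * reward)) - lam * kl(P, P0)
```

In this toy setting the optimal path-space value equals `lam * log(sum(p0_T * tilt))`, which is also the optimal value of the terminal-law problem, so the two formulations agree.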

A common misconception is that these exact linearity and minimizer-preservation results extend uniformly to arbitrary $f$-divergences. They do not. The clean desirability recursion and exact coincidence between the tractable reward-weighted target and the true controlled marginal are stated as special to KL (Yang et al., 7 Mar 2026).

4. Derived algorithms for fine-tuning and control

DiffCon derives two main algorithmic families from the control formulation: $f$-divergence-regularized policy gradients and reward-weighted regression (Yang et al., 7 Mar 2026).

For a learnable reverse kernel $p_{\theta,t}(x_{t+1}\mid x_t,c)$, the regularized objective yields the policy-gradient theorem

[…]

where […] is the generalized advantage (Yang et al., 7 Mar 2026). Under KL regularization this simplifies to

[…]

with […] (Yang et al., 7 Mar 2026).

DiffCon also supplies a PPO-style update rule in which clipping is applied to reverse-transition density ratios along denoising trajectories, rather than to action probabilities in a conventional MDP (Yang et al., 7 Mar 2026). The paper explicitly relates the unregularized limit, in which the divergence penalty vanishes, to DDPO, and under KL connects the formulation to DPOK while noting a different regularization structure (Yang et al., 7 Mar 2026).
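Clipping on reverse-transition density ratios can be sketched for isotropic Gaussian kernels, which is the kernel family used for the reverse process in this article. The code below is a generic PPO clipped surrogate under stated assumptions: the advantage values, the clip range `clip_eps`, and the function names are hypothetical, and the paper's exact update rule, including how the divergence penalty enters, is not reproduced.

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    """Log-density of N(mean, var * I) at x, reduced over the last axis."""
    d = x.shape[-1]
    sq = np.sum((x - mean) ** 2, axis=-1)
    return -0.5 * (sq / var + d * np.log(2.0 * np.pi * var))

def clipped_surrogate(x_next, mean_new, mean_old, var, advantage, clip_eps=0.2):
    """PPO clipped objective where the ratio is a reverse-transition density ratio."""
    log_ratio = gaussian_logpdf(x_next, mean_new, var) - gaussian_logpdf(x_next, mean_old, var)
    ratio = np.exp(log_ratio)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return float(np.mean(np.minimum(unclipped, clipped))), ratio

rng = np.random.default_rng(0)
batch, dim, var = 8, 4, 0.1
mean_old = rng.standard_normal((batch, dim))
# Transitions x_{t+1} sampled from the old (behaviour) kernel.
x_next = mean_old + np.sqrt(var) * rng.standard_normal((batch, dim))
# Slightly updated kernel means, standing in for one optimizer step.
mean_new = mean_old + 0.05 * rng.standard_normal((batch, dim))
advantage = rng.standard_normal(batch)  # hypothetical per-step advantages

objective, ratio = clipped_surrogate(x_next, mean_new, mean_old, var, advantage)
same_objective, same_ratio = clipped_surrogate(x_next, mean_old, mean_old, var, advantage)
```

Because both kernels share the same variance, the ratio depends only on how far the updated mean has moved relative to the sampled transition, and identical kernels give a ratio of exactly one.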

The second family is reward-weighted regression. DiffCon shows that the path-space $f$-regularized problem induces a denoising loss of the form

[…]

with weight

[…]

For KL divergence this becomes the exponential weight

$w(x_T,c)\ \propto\ \exp\!\left(\frac{r(x_T,c)}{\lambda}\right),$

with $r$ the terminal reward and $\lambda$ the regularization weight, while for […]-divergence the paper gives the polynomial form

[…]

Under KL, DiffCon proves that this tractable reward-weighted loss preserves the correct minimizer associated with the true controlled marginal (Yang et al., 7 Mar 2026).
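A reward-weighted denoising loss with exponential weights can be sketched as follows. This is a generic illustration, not the paper's loss: the self-normalization over the batch, the mean-squared noise-prediction error, the symbol `lam` for the regularization weight, and all names are assumptions. The maximum logit is subtracted before exponentiating so that small `lam` does not overflow.

```python
import numpy as np

def exponential_weights(rewards, lam):
    """Self-normalized weights proportional to exp(r / lam), computed stably."""
    logits = rewards / lam
    logits = logits - np.max(logits)  # shift by the max to avoid overflow in exp
    w = np.exp(logits)
    return w / np.sum(w)

def weighted_denoising_loss(eps_pred, eps_true, weights):
    """Weighted mean-squared denoising error with one weight per sample."""
    feature_axes = tuple(range(1, eps_pred.ndim))
    per_sample = np.mean((eps_pred - eps_true) ** 2, axis=feature_axes)
    return float(np.sum(weights * per_sample))

def effective_sample_size(weights):
    """1 / sum(w^2): equals the batch size for uniform weights, 1 for a single spike."""
    return float(1.0 / np.sum(weights ** 2))

rng = np.random.default_rng(0)
rewards = np.array([0.1, 0.4, 0.2, 0.9])  # hypothetical terminal rewards
eps_true = rng.standard_normal((4, 3))
eps_pred = eps_true + 0.1 * rng.standard_normal((4, 3))

w_soft = exponential_weights(rewards, lam=10.0)
w_sharp = exponential_weights(rewards, lam=0.05)
w_tiny = exponential_weights(rewards, lam=1e-4)
loss = weighted_denoising_loss(eps_pred, eps_true, w_sharp)
```

As `lam` shrinks, the weights concentrate on the highest-reward samples and the effective sample size falls toward one, which is the usual trade-off for exponential weighting.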

5. Structured score decomposition and gray-box parameterization

A distinctive contribution of DiffCon is that it does not treat the controlled score as an arbitrary residual. Under KL regularization, the paper derives a structured decomposition of the optimal score into a pretrained baseline plus a control-dependent correction (Yang et al., 7 Mar 2026). This motivates the practical model form

[…]

where the side network is a structured function of the pretrained reverse mean $\mu_0(x_t,c,t)$, the condition $c$, and time $t$ (Yang et al., 7 Mar 2026).

This is the basis for DiffCon’s gray-box adaptation strategy. The backbone remains frozen, but exposed denoising outputs (especially […] or […]) are used to drive a lightweight controller network (Yang et al., 7 Mar 2026). The side network is therefore not merely a parameter-efficient add-on; it is presented as a parameterization implied by the LS-MDP analysis.

In the implementation reported for Stable Diffusion v1.4, the side network is a lightweight latent-space U-Net operating on […] latents with […] channels, using standard DownBlock2D, UpBlock2D, and CrossAttnMidBlock2D blocks (Yang et al., 7 Mar 2026). Its final output is split into one channel for the scalar gate and the remaining channels for […], and the final layer is zero-initialized so that the initial model exactly reproduces the pretrained backbone (Yang et al., 7 Mar 2026).
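The effect of zero-initializing the final layer can be shown with a minimal stand-in. In the sketch below a per-pixel linear head replaces the real U-Net output layer, its output is split into one gate channel and the remaining channels, and the correction is added to a frozen output. How the gate and the remaining channels are combined is an assumption made for illustration; the paper's structured form is not reproduced, and all names are hypothetical.

```python
import numpy as np

class ZeroInitHead:
    """Final per-pixel linear layer whose weights and bias start at zero."""

    def __init__(self, in_dim, out_channels):
        self.weight = np.zeros((in_dim, out_channels))
        self.bias = np.zeros(out_channels)

    def __call__(self, h):
        return h @ self.weight + self.bias

def split_gate(out):
    """Split the last axis into one gate channel and the remaining channels."""
    return out[..., :1], out[..., 1:]

def controlled_output(eps_pretrained, features, head):
    gate, direction = split_gate(head(features))
    # ASSUMPTION: the correction is gate * direction added to the frozen output.
    # Any combination that vanishes when the head outputs zero behaves the same at init.
    return eps_pretrained + gate * direction

rng = np.random.default_rng(0)
H, W, C, F = 2, 2, 4, 8
eps_pretrained = rng.standard_normal((H, W, C))  # stand-in for the frozen backbone output
features = rng.standard_normal((H, W, F))        # stand-in for side-network features
head = ZeroInitHead(F, C + 1)                    # one gate channel plus C remaining channels

eps_init = controlled_output(eps_pretrained, features, head)
head.weight = 0.1 * rng.standard_normal((F, C + 1))  # pretend training has moved the weights
eps_trained = controlled_output(eps_pretrained, features, head)
```

At initialization the controlled output equals the frozen output exactly, so fine-tuning starts from the pretrained model's behaviour rather than from a perturbed version of it.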

The empirical baseline called DiffCon-Naive is important for interpretation. It uses a similarly sized side network but predicts an unstructured additive residual,

[…]

rather than the LS-MDP-derived structured correction (Yang et al., 7 Mar 2026). The performance gap between DiffCon and DiffCon-Naive is used to argue that the framework contributes a model form, not only a small adapter.

6. Empirical behavior on Stable Diffusion v1.4

DiffCon is evaluated on Stable Diffusion v1.4 for both supervised fine-tuning and reward-driven fine-tuning (Yang et al., 7 Mar 2026). In supervised fine-tuning, the target data are winner images from HPD-v2 with prompts. In reward-driven experiments, the reward is HPS-v2, and two training regimes are considered: reward-weighted loss and KL-regularized PPO (Yang et al., 7 Mar 2026).

The main compared systems are DiffCon, DiffCon-Naive, LoRA, DiffCon-J, and DiffCon-S. DiffCon and DiffCon-Naive use […] trainable parameters, LoRA uses […], and DiffCon-J and DiffCon-S use […] (Yang et al., 7 Mar 2026).

The headline metric is HPS-v2 win rate. At the end of training, the reported win rates against the pretrained model are as follows (Yang et al., 7 Mar 2026):

Setting	DiffCon	DiffCon-Naive	LoRA / best combined
SFT, step 1000	[…]	[…]	LoRA […]; DiffCon-J/S […]
RWL, step 2000	[…]	[…]	LoRA […]; DiffCon-S […]
PPO, step 2400	[…]	[…]	LoRA […]; DiffCon-J […]
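A win rate against a baseline can be computed from paired scores as sketched below. The pairing by prompt, the strict-inequality treatment of ties, and the toy score values are all assumptions made for illustration; the paper's exact evaluation protocol is not reproduced.

```python
import numpy as np

def win_rate(scores_model, scores_baseline):
    """Fraction of paired prompts on which the model's score strictly beats the baseline's."""
    scores_model = np.asarray(scores_model, dtype=float)
    scores_baseline = np.asarray(scores_baseline, dtype=float)
    if scores_model.shape != scores_baseline.shape:
        raise ValueError("scores must be paired per prompt")
    # Ties count as losses under this convention.
    return float(np.mean(scores_model > scores_baseline))

# Hypothetical per-prompt scores for a fine-tuned model and the pretrained baseline.
tuned = [0.28, 0.25, 0.30, 0.27]
pretrained = [0.26, 0.26, 0.27, 0.27]
rate = win_rate(tuned, pretrained)  # two wins, one loss, one tie -> 0.5
```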

These results support three distinct claims. First, within the gray-box setting, DiffCon outperforms both DiffCon-Naive and LoRA in supervised fine-tuning and reward-weighted learning (Yang et al., 7 Mar 2026). Second, the structured side-network parameterization matters empirically; the naïve additive residual is markedly weaker (Yang et al., 7 Mar 2026). Third, in the white-box setting, combining DiffCon with LoRA yields the strongest PPO results, with win rates above […] for DiffCon-J and DiffCon-S (Yang et al., 7 Mar 2026).

The paper also reports supporting ablations (Yang et al., 7 Mar 2026). Increasing test-time side-network guidance strength improves supervised fine-tuning and reward-weighted learning. For PPO, […] is reported as best, and removing KL regularization causes reward to fail to improve. In reward-weighted learning, smaller […] helps until numerical instability appears, and conditioning the side network on […] works better than conditioning it on […].

7. Relation to adjacent diffusion-control literature and limitations

The broader literature uses diffusion in several controller-like roles that are related to, but distinct from, the named DiffCon framework. SteeringDiffusion defines a bottlenecked activation control interface for frozen diffusion backbones, with a runtime scalar that traverses a smooth content–style trade-off surface while preserving exact zero-scale equivalence to the base model (Wu et al., 3 May 2026). SLCD casts KL-regularized controllable generation as optimal classifier guidance, learns a lightweight reward-distribution predictor by online supervised learning, and proves convergence to the KL-regularized optimum under no-regret assumptions (Oertell et al., 27 May 2025). Constrained Diffusers and DPCC insert explicit constraints into reverse sampling or denoising-time projections, thereby turning pretrained diffusion models into safe constrained planners or predictive controllers at test time (Zhang et al., 14 Jun 2025, Römer et al., 2024). In robotics, diffusion has also been used as a direct single-step action policy, a receding-horizon planner, or a unified planner-controller over state-action futures rather than as a fine-tuning framework for image generation (Mothish et al., 2024, Wei et al., 2024, Wu et al., 17 Apr 2025, Wang et al., 15 Jun 2026).

This suggests that “diffusion controller” is polysemous in the recent literature. The named DiffCon of Yang et al. (7 Mar 2026) refers specifically to the LS-MDP view of controllable diffusion generation, together with its RL objectives and structured gray-box parameterization. It is not a synonym for every diffusion-based robotic controller, constrained sampler, or activation-steering interface.

Several limitations are explicit in the DiffCon paper. The strongest theoretical results are KL-specific: exact LS-MDP linearity, exponential twisting, and minimizer-preserving reward-weighted regression are established in that regime, whereas for general $f$-divergences the tractable reward-weighted target generally differs from the true controlled marginal (Yang et al., 7 Mar 2026). The score-representation theorem also assumes KL regularization, pretrained score optimality, and bounded clean samples (Yang et al., 7 Mar 2026). Empirically, the reported experiments are centered on Stable Diffusion v1.4 and preference-alignment-style objectives rather than on broader control tasks (Yang et al., 7 Mar 2026).

A second recurring misconception is that DiffCon’s gray-box adaptation is simply “another side network.” The framework’s claim is narrower and more technical: the side network is justified by a structured score decomposition implied by the LS-MDP formulation, and the comparison against DiffCon-Naive is intended to show that an unstructured residual does not recover the same behavior (Yang et al., 7 Mar 2026). A plausible implication is that DiffCon’s main contribution lies in the coupling of framework, algorithms, and parameterization, rather than in any one of those components taken alone.