Controlled Diffusion Models
- Controlled diffusion models are stochastic generative algorithms that apply explicit, time-dependent control signals to steer diffusion processes and optimize experimental outcomes.
- They extend standard SDEs by incorporating controlled drift and diffusion terms, enabling improved parameter estimation and robust operation under noisy observations.
- Their applications span guided image/video synthesis, molecular design, and experimental control, employing methods like particle filtering, reward-guided sampling, and spatial conditioning.
Controlled diffusion models constitute a class of stochastic generative and parameter estimation algorithms in which the dynamics of the diffusion process are subject to explicit, time-dependent control, often implemented for the purpose of optimization, regulation, or guided synthesis. Within this paradigm, control inputs are incorporated into the stochastic differential equations (SDEs) that govern system evolution, enabling the manipulation of sample trajectories, experimental outcomes, or generated data according to user-specified objectives in both theoretical and applied contexts. Applications range from closed-loop experimental design and scientific data acquisition to guided image and video synthesis, molecular and protein generation, image editing, data augmentation, and scene text removal.
1. Mathematical Foundations of Controlled Nonlinear Diffusions
Controlled diffusion models extend standard SDEs by introducing input controls into the drift and, in some cases, the diffusion terms, yielding

$$dX_t = f(X_t, \theta, u_t)\,dt + \Sigma^{1/2}(X_t, \theta, u_t)\,dW_t,$$

where $X_t$ is the state, $\theta$ is a (possibly unknown) system parameter, $u_t$ the control signal, $f$ the controlled drift, and $\Sigma$ the covariance. The control $u_t$ can be optimized according to experimental or generative objectives.
A canonical objective in system identification is maximization of the Fisher information about $\theta$,

$$\mathcal{I}(u) = \mathbb{E}\!\left[\int_0^T \partial_\theta f(X_t,\theta,u_t)^\top\, \Sigma(X_t,\theta,u_t)^{-1}\, \partial_\theta f(X_t,\theta,u_t)\, dt\right],$$

where the expectation is taken over trajectories of the controlled SDE. Optimal policies are computed via numerical dynamic programming, discretizing time and state and recursively solving the Bellman equation

$$V(x,t_k) = \max_{u}\Big\{ \partial_\theta f(x,\theta,u)^\top \Sigma(x,\theta,u)^{-1}\, \partial_\theta f(x,\theta,u)\,\Delta t \;+\; \mathbb{E}\big[V(X_{t_{k+1}}, t_{k+1}) \mid X_{t_k}=x,\, u\big] \Big\}.$$
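To make the recursion concrete, the following minimal sketch (not the implementation of Hooker et al. (2012); the scalar drift, noise level, grids, and horizon are illustrative assumptions) computes a Fisher-information-maximizing control law by backward value iteration, approximating the Euler transition kernel with Gaussian weights on the state grid:

```python
import numpy as np

# Illustrative dynamic-programming sketch: compute a control law that maximizes
# accumulated Fisher information for a scalar controlled diffusion
#   dX_t = f(X_t, theta, u_t) dt + sigma dW_t .
# All model choices (drift, noise, grids, horizon) are assumptions for the sketch.

sigma, dt, T, theta = 0.5, 0.05, 2.0, 1.0
x_grid = np.linspace(-3.0, 3.0, 121)            # discretized state
u_grid = np.linspace(-1.0, 1.0, 21)             # discretized control
n_steps = int(T / dt)

def drift(x, theta, u):
    return theta * x - x**3 + u                 # example controlled drift

def dfdtheta(x):
    return x                                    # sensitivity of the drift w.r.t. theta

V = np.zeros(len(x_grid))                       # terminal value function
policy = np.zeros((n_steps, len(x_grid)))       # optimal control per (time, state)

for k in reversed(range(n_steps)):
    V_new = np.full(len(x_grid), -np.inf)
    for i, x in enumerate(x_grid):
        for u in u_grid:
            # per-step information gain: (df/dtheta)^2 / sigma^2 * dt
            gain = dfdtheta(x)**2 / sigma**2 * dt
            # expected continuation value under a Gaussian Euler transition
            mean = x + drift(x, theta, u) * dt
            w = np.exp(-(x_grid - mean)**2 / (2 * sigma**2 * dt))
            total = gain + np.dot(w / w.sum(), V)
            if total > V_new[i]:
                V_new[i], policy[k, i] = total, u
    V = V_new
```

In this example the information rate itself does not depend on the control; the control earns its value by steering the state toward regions where the rate is high, which is exactly the mechanism exploited in the case studies below.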
2. Control Under Incomplete or Noisy Observations
In practice, the full state $X_t$ is rarely observable. The recommended methodology is a separation strategy: (a) precompute the optimal control law $u^*(x,t)$ for the full-observation case, (b) apply filtering (e.g., nonlinear particle filters) during real-time operation to infer an estimate $\hat{X}_t$ from the available observations, and (c) use $u^*(\hat{X}_t, t)$ as the control signal.
Particle filtering is particularly effective for nonlinear diffusions: one simulates trajectories for an ensemble of $N$ particles, resamples them according to the likelihood of each observation, and updates posterior state estimates for adaptive control and parameter estimation.
When the filtering error is small, this "plug-in" approach is near-optimal; for linear-quadratic cases the separation principle guarantees optimality, and for nonlinear diffusions the approximation is both tractable and empirically effective.
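A minimal sketch of the plug-in strategy is given below; the drift, noise levels, and the placeholder `control_law` are illustrative assumptions rather than the cited construction. A bootstrap particle filter tracks the hidden state, and its posterior mean is fed into the precomputed policy:

```python
import numpy as np

# Minimal sketch of the "plug-in" strategy: a bootstrap particle filter tracks
# the latent state of the controlled diffusion from noisy observations, and the
# posterior mean is fed into a precomputed full-observation control law.
# `drift`, `control_law`, and the noise levels are illustrative assumptions.

rng = np.random.default_rng(0)
sigma, obs_sigma, dt = 0.5, 0.2, 0.05
n_particles, n_steps = 500, 200

def drift(x, u):
    return x - x**3 + u                            # example controlled drift

def control_law(x_hat, k):
    return -np.tanh(x_hat)                         # placeholder precomputed policy

particles = rng.normal(0.0, 1.0, n_particles)      # prior over the initial state
x_true, u = 0.5, 0.0

for k in range(n_steps):
    # propagate the true (hidden) state and every particle one Euler step
    x_true += drift(x_true, u) * dt + sigma * np.sqrt(dt) * rng.normal()
    particles += drift(particles, u) * dt + sigma * np.sqrt(dt) * rng.normal(size=n_particles)

    # noisy observation, importance weights, and multinomial resampling
    y = x_true + obs_sigma * rng.normal()
    logw = -(y - particles)**2 / (2 * obs_sigma**2)
    w = np.exp(logw - logw.max()); w /= w.sum()
    particles = rng.choice(particles, size=n_particles, p=w)

    # plug the filtered estimate into the full-observation control law
    x_hat = particles.mean()
    u = control_law(x_hat, k)
```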
3. Experimentally Guided Controllable Diffusions
Central case studies (Hooker et al., 2012) demonstrate the practical implementation:
- Bistable (double-well) system: The control input "pushes" the system toward barrier crossings, maximizing transitions that encode key information about the barrier parameter (see the simulation sketch below).
- Morris–Lecar neuron model: Voltage is driven to regimes with maximal Fisher information about an ion-channel conductance parameter; only the membrane voltage is reliably measured, necessitating latent-variable estimation via particle filters.
- Chemostat ecological model: The dilution rate is dynamically modulated to guide the system near states most sensitive to the unknown half-saturation constant.
These approaches demonstrate how structured, state-dependent interventions yield maximally informative experimental outcomes.
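The bistable case study can be sketched as follows (drift, control gain, and noise level are illustrative assumptions, not those of the cited paper): an Euler–Maruyama simulation in which the control pushes the state back toward the barrier, increasing the number of informative well-to-well transitions.

```python
import numpy as np

# Illustrative Euler--Maruyama simulation of the bistable case study: the
# control pushes the state toward the barrier at x = 0, increasing the number
# of well-to-well transitions that carry information about the barrier height.
# Drift, control gain, and noise level are assumptions for the sketch.

rng = np.random.default_rng(1)
sigma, dt, n_steps = 0.4, 0.01, 50_000
barrier = 1.0                                   # plays the role of the unknown parameter

def drift(x, u):
    return -4 * barrier * x * (x**2 - 1) + u    # negative gradient of a double-well potential

def push_control(x, gain=0.8):
    return -gain * np.sign(x)                   # push back toward the barrier at x = 0

x, crossings, prev_well = 1.0, 0, 1
for _ in range(n_steps):
    u = push_control(x)
    x += drift(x, u) * dt + sigma * np.sqrt(dt) * rng.normal()
    well = 1 if x > 0 else -1
    if well != prev_well:                       # count well-to-well transitions
        crossings += 1
        prev_well = well

print(f"barrier crossings under control: {crossings}")
```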
4. Control in Generative Diffusion Models
Diffusion generative models extend the control paradigm to image, video, molecule, and sequence synthesis. Control is achieved through several mechanisms:
- Input noise preconditioning: Instead of conditioning during denoising, the process is guided by crafting input noise fields that encode object saliency or desired localization, using Inverting Gradients to imprint guidance in the initial latent vector (Singh et al., 2022). The generator then produces outputs that are spatially aligned with the structure encoded in that initial noise.
- MultiDiffusion and region fusion: Generation over large canvases uses local diffusion trajectories over spatial crops/masks, then fuses local outputs into a globally consistent scene using quadratic optimization with per-pixel weights for constraint satisfaction (Bar-Tal et al., 2023); a minimal fusion sketch follows this list.
- Iterative closed-loop generation with external reasoning: Recent frameworks treat generation as an iterative reasoning process, with an external LLM analyzing current outputs and proposing corrections for alignment errors (e.g., in object count or spatial relations). Post-generation latent edits are then performed for self-correction, all in a training-free manner (Wu et al., 2023).
- Modular control adapters and cross-modality constraints: The CMCM-DLM model applies structural (e.g., scaffold) constraints in early denoising steps and chemical or property constraints in later steps, using composable control modules for plug-and-play multi-objective generation without retraining (Zhang et al., 20 Aug 2025).
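A minimal sketch of the MultiDiffusion-style fusion step is shown below; `denoise_step`, the crop layout, and the uniform weights are placeholders rather than the published implementation. Per-pixel weighted averaging of the crop-wise predictions is the closed-form minimizer of the quadratic fusion objective:

```python
import numpy as np

# Minimal sketch of MultiDiffusion-style fusion: at every denoising step the
# model is run on overlapping crops, and the crop-wise predictions are fused
# into one canvas by per-pixel weighted averaging -- the closed-form solution
# of the quadratic fusion objective.  `denoise_step` is a stand-in for a real
# diffusion denoiser; canvas/crop sizes and uniform weights are illustrative.

H, W, crop, stride = 64, 160, 64, 48

def denoise_step(latent_crop, t):
    return latent_crop * 0.99                   # stand-in for one reverse-diffusion step

def fuse_step(canvas, t, weights=None):
    num = np.zeros_like(canvas)
    den = np.zeros_like(canvas)
    for x0 in range(0, W - crop + 1, stride):
        patch = canvas[:, x0:x0 + crop]
        pred = denoise_step(patch, t)
        w = np.ones_like(pred) if weights is None else weights[:, x0:x0 + crop]
        num[:, x0:x0 + crop] += w * pred        # accumulate weighted local predictions
        den[:, x0:x0 + crop] += w
    return num / np.maximum(den, 1e-8)          # per-pixel weighted average

canvas = np.random.randn(H, W)
for t in reversed(range(50)):
    canvas = fuse_step(canvas, t)
```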
5. Inference-Time Reward-Guided and Zigzag Sampling
Controlled diffusion generation has been advanced by inference-time alignment techniques:
- Reward-guided sampling: Diffusion sampling can be reframed as hill climbing over a reward landscape, with policies adjusting trajectories to maximize downstream utility (e.g., protein stability). Methods such as Sequential Monte Carlo, value-based sampling, and derivative-based guidance approximate the soft-optimal policy:
$$p^\star_{t-1}(x_{t-1}\mid x_t) \;\propto\; p^{\text{pre}}_{t-1}(x_{t-1}\mid x_t)\,\exp\!\big(v_{t-1}(x_{t-1})/\alpha\big),$$

where $v_{t-1}$ is a value function capturing expected future reward and $\alpha$ a temperature (Uehara et al., 16 Jan 2025). A minimal candidate-reweighting sketch follows this list.
- Controlled zigzag sampling (Ctrl-Z): When a reward model detects stagnation at a local maximum, the sampler injects noise and reverts to a previous, noisier latent state, evaluating multiple candidate trajectories for progress (Mao et al., 25 Jun 2025). Adaptive-depth inversion and reward-guided candidate selection allow robust escapes from optimization plateaus.
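The candidate-reweighting idea behind reward-guided sampling can be sketched as follows; `denoise_proposal`, `value_estimate`, and the temperature are illustrative placeholders rather than any published implementation. At each reverse step, candidate next latents are scored and resampled with weights proportional to $\exp(v/\alpha)$:

```python
import numpy as np

# Minimal sketch of reward-guided (soft-optimal) sampling: at each reverse step
# several candidate latents are proposed from the pretrained transition, scored
# by a value estimate of future reward, and resampled with weights proportional
# to exp(value / alpha).  `denoise_proposal` and `value_estimate` are
# placeholders; alpha controls how strongly the reward reshapes sampling.

rng = np.random.default_rng(2)
alpha, n_candidates, n_steps, dim = 0.1, 8, 50, 16

def denoise_proposal(x, t):
    # stand-in for sampling x_{t-1} ~ p_pre(. | x_t)
    return 0.95 * x + 0.1 * rng.normal(size=x.shape)

def value_estimate(x, t):
    # stand-in for v_{t-1}(x): expected downstream reward (e.g., protein stability)
    return -np.sum(x**2)

x = rng.normal(size=dim)
for t in reversed(range(n_steps)):
    candidates = [denoise_proposal(x, t) for _ in range(n_candidates)]
    values = np.array([value_estimate(c, t) for c in candidates])
    w = np.exp((values - values.max()) / alpha)
    w /= w.sum()
    x = candidates[rng.choice(n_candidates, p=w)]   # soft-optimal resampling
```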
6. Conditional, Spatial, and Semantic Grounding
For fine-grained spatial and semantic control, contemporary text-to-image models admit per-object and per-region conditioning:
- Grounded control via semantic tokens and spatial embeddings: ObjectDiffusion fuses CLIP-based text token embeddings with Fourier-encoded bounding box coordinates, transforming these into control tokens that are injected via gated self-attention at multiple network layers (Süleyman et al., 15 Jan 2025).
- Integrative architectural modifications: ControlNet branches, zero-initialized convolution connectors, and multi-layered conditional injections allow granular external control over both object content and placement, achieving state-of-the-art precision and recall in controlled image synthesis.
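A minimal PyTorch-style sketch of a zero-initialized convolution connector is shown below; the module name, channel sizes, and injection point are illustrative assumptions, not the published architecture. Because the 1×1 convolution starts at zero, the frozen backbone's output is unchanged at the start of training and control is learned gradually:

```python
import torch
import torch.nn as nn

# Minimal PyTorch-style sketch of a zero-initialized convolution connector in
# the ControlNet spirit: control-branch features enter the frozen backbone
# through a 1x1 convolution whose weights and bias start at zero, so the
# pretrained model's output is initially unchanged.  Names and channel sizes
# are illustrative assumptions.

class ZeroConvConnector(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.zero_conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)    # zero init: no effect at start of training
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, backbone_feat: torch.Tensor, control_feat: torch.Tensor) -> torch.Tensor:
        # at initialization this returns backbone_feat unchanged
        return backbone_feat + self.zero_conv(control_feat)

# usage: inject a spatial control feature into a frozen backbone block's activation
connector = ZeroConvConnector(channels=64)
backbone_feat = torch.randn(1, 64, 32, 32)
control_feat = torch.randn(1, 64, 32, 32)
out = connector(backbone_feat, control_feat)
```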
7. Applications Across Domains
Controlled diffusion models have enabled advances in diverse areas, including:
- Experiment design and system identification: Maximally informative sampling via closed-loop optimal control (Hooker et al., 2012).
- Guided image/video generation: Preconditioning and region-based spatial fusion yield adaptable synthesis for panoramas, region-specific edits, and virtual try-on video (Bar-Tal et al., 2023, He et al., 15 Jul 2024).
- Molecule and protein design: Modular cross-modality adapters allow multi-property guided generation, efficient optimization, and scaffold preservation (Zhang et al., 20 Aug 2025).
- Scene text removal and inpainting: High-fidelity reconstruction using ControlNet-based diffusion with segmentation-refined masks and mask autoencoders (Pathak et al., 29 Oct 2024).
- Training data augmentation: Controlled generation aligned by language prompts (GPT), detection maps, and model feedback for robust dataset expansion in weakly-supervised segmentation and classification (Wu et al., 2023, Yeo et al., 22 Mar 2024).
- Scientific imaging: Conditioning otherwise unconditional sampling (e.g., on porosity) enhances representativity in 3D porous media reconstruction and supports uncertainty quantification (Naiff et al., 31 Mar 2025).
- Camera-controlled video synthesis: CameraCtrl II fuses calibrated camera parameter embeddings to produce coherent, dynamic scenes with precise trajectory injection and classifier-free guidance (He et al., 13 Mar 2025).
Controlled diffusion is thus defined by the integration of control signals—be they time-dependent inputs to SDEs, spatial masks, semantic tokens, modular adapters, or reward functions—into the core trajectory or sampling mechanisms of diffusion models. These strategies yield algorithms and pipelines that both optimize information acquisition and enable guided generation with high precision, semantic alignment, and adaptability in complex, high-dimensional problem domains.