Flexible Diffusion Models Overview
- Flexible Diffusion Models are generative frameworks that incorporate modular designs, parameterized SDEs, and adaptive conditioning to handle diverse data types and tasks.
- They extend classical DDPMs by introducing dynamic mechanisms, such as variable tokenization and plug-and-play control, to optimize performance in complex applications.
- Applications span multitask robotics, variable-length sequence generation, and multimodal fusion, achieving improved metrics and robustness compared to fixed-structure diffusion models.
A Flexible Diffusion Model refers to a class of generative or conditional models in which the core diffusion or score-based process is endowed with architectural, algorithmic, or mathematical flexibility that enables adaptation to a broad range of tasks, data types, or conditioning scenarios. The paradigm extends beyond canonical, fixed-structure denoising diffusion probabilistic models (DDPMs) by introducing modularity, dynamic composition, parameterized SDEs or Markov chains, input-agnostic architectures, or adaptive conditioning strategies. This flexibility is pivotal for challenging domains such as multitask robotics, sequence modeling with varying lengths, multimodal medical image fusion, graph learning over heterophilic structures, and high-dimensional planning.
1. Mathematical Definitions and General Frameworks
At the heart of flexible diffusion models lies the stochastic process connecting noise and data distributions, typically formalized through SDEs, Markov chains, or discrete interpolants. The standard DDPM framework evolves data via a parameterized forward (noising) process and reverses this evolution via a neural network approximator. Generalizations introduce additional flexibility in both the forward process and the structure of the reverse process:
- Parameterization of the Forward SDE: Abstracted as
$$dx = f(x, t)\,dt + g(x, t)\,dw,$$
where $w$ is a standard Wiener process and $f$ (drift) and $g$ (diffusion) are allowed to depend on location and time. Notably, A Flexible Diffusion Model proposes to parameterize $f$ and $g$ with neural networks (e.g., via a spatially varying Riemannian metric and symplectic form), yielding a rich family capable of simulating classical VP, VE, sub-VP, and critically-damped Langevin SDEs (Du et al., 2022).
- Discrete and Masked Diffusion on Sequences: For structured discrete data, Any-Order Flexible Length Masked Diffusion presents a CTMC where insertions, deletions, and unmasking events are governed by dynamic schedules, enabling true variable-length and arbitrary-order autoregressive or noncausal generation (Kim et al., 31 Aug 2025).
- Modular and Product-of-Experts Compositions: In multitask action modeling, Flexible Multitask Learning with Factorized Diffusion Policy constructs the global action distribution as a product of component diffusion models,
$$p(a \mid o) \propto \prod_{i=1}^{K} p_i(a \mid o)^{w_i(o)},$$
with the weights $w_i(o)$ determined by a learned router. Each $p_i$ specializes to a behavioral submode, so the flexibility arises from modular composition and adaptive expert weighting (Liu et al., 26 Dec 2025).
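As a rough illustration of the parameterized forward SDE described above, the sketch below integrates a scalar SDE with user-supplied drift and diffusion functions using the Euler-Maruyama scheme, and instantiates it with the classical VP choice. The function names, the linear beta schedule, and its endpoint values (0.1 and 20.0) are illustrative assumptions rather than details from the cited papers; a learned model would replace `vp_drift` and `vp_diffusion` with neural networks.

```python
import math
import random


def euler_maruyama(x0, drift, diffusion, t_end=1.0, n_steps=200, rng=None):
    """Integrate dx = drift(x, t) dt + diffusion(x, t) dw from t=0 to t_end."""
    rng = rng or random.Random(0)
    dt = t_end / n_steps
    x, t = x0, 0.0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + drift(x, t) * dt + diffusion(x, t) * math.sqrt(dt) * noise
        t += dt
    return x


def beta(t, beta_min=0.1, beta_max=20.0):
    # Assumed linear noise schedule; the endpoint values are illustrative.
    return beta_min + t * (beta_max - beta_min)


def vp_drift(x, t):
    # Variance-preserving drift: pulls the sample toward zero.
    return -0.5 * beta(t) * x


def vp_diffusion(x, t):
    # Variance-preserving diffusion coefficient (state-independent here).
    return math.sqrt(beta(t))


# Push many copies of the same data point through the forward process.
rng = random.Random(0)
samples = [euler_maruyama(3.0, vp_drift, vp_diffusion, rng=rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"terminal mean={mean:.3f}, variance={var:.3f}")
```

Swapping in different `drift` and `diffusion` callables changes the noising process without touching the integrator, which is the sense in which the forward process is treated as a parameterized object.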
2. Architectural and Algorithmic Mechanisms for Flexibility
Architectural flexibility manifests as modularity, decoupled conditioning branches, dynamic tokenization, or selective control of model capacity and input structure:
- Modular Diffusion Policies: Each component (expert) is a distinct DDPM trained on a submode of behavior. A router network outputs observation-dependent softmax weights, ensuring all experts contribute proportionally during sampling and training. New experts can be introduced by "upcycling" existing module weights, freezing the rest to prevent catastrophic forgetting (Liu et al., 26 Dec 2025).
- Plug-and-Play Conditional Control: EasyControl implements condition-specific LoRA adapters, allowing new spatial, textual, or semantic controls to be injected into a pretrained Diffusion Transformer (DiT) backbone without modifying core weights. This strategy enables zero-shot multi-condition generalization and fine-grained, flexible control at minimal parameter overhead (Zhang et al., 10 Mar 2025).
- Variable Tokenization and Dynamic Positional Encoding: In imaging, FiT and FiTv2 process arbitrary aspect ratio, variable-resolution images by modeling latents as dynamic-length token sequences, using customized rotary positional encodings and masked attention. Training and inference both accommodate sequences up to a fixed maximum token length, with padding and masking strategies to ensure proper context handling (Wang et al., 2024, Lu et al., 2024).
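The router-weighted composition of experts can be sketched in one dimension. The example below assumes a weighted product of experts in which each expert density is raised to a softmax router weight, so that the composed score is the weighted sum of expert scores. The Gaussian experts, the router logits, and all function names are hypothetical stand-ins for trained diffusion experts and a learned router; they are chosen only because the Gaussian case can be checked in closed form.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_score(a, mean, std):
    """Score (gradient of the log-density) of N(mean, std^2) at a."""
    return -(a - mean) / std ** 2


def composed_score(a, experts, weights):
    """Score of prod_i p_i(a)^{w_i}: the weighted sum of expert scores."""
    return sum(w * gaussian_score(a, mu, sd) for w, (mu, sd) in zip(weights, experts))


def product_gaussian(experts, weights):
    """Closed-form mean and std of the weighted product of Gaussian experts."""
    precision = sum(w / sd ** 2 for w, (_, sd) in zip(weights, experts))
    mean = sum(w * mu / sd ** 2 for w, (mu, sd) in zip(weights, experts)) / precision
    return mean, math.sqrt(1.0 / precision)


# Two hypothetical experts, each given as (mean, std) over a scalar action.
experts = [(-1.0, 0.5), (2.0, 1.0)]
router_logits = [0.3, -0.2]  # stand-in for a router network's output
weights = softmax(router_logits)

a = 0.7
poe_mean, poe_std = product_gaussian(experts, weights)
print(composed_score(a, experts, weights), gaussian_score(a, poe_mean, poe_std))
```

The two printed values agree, which confirms that summing weighted scores is equivalent to sampling from the weighted product in this toy case. Adding an expert amounts to appending one entry to `experts` and one logit to the router output.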
3. Flexible Conditioning, Adaptation, and Input Modality
A hallmark of flexible diffusion models is the ability to condition on arbitrary or dynamically specified context, enforce spatial or logical constraints, or seamlessly adapt to new data modalities:
- Adaptive or Input-Agnostic Fusion: FlexiD-Fuse enables n-input medical image fusion by reformulating conditioned reverse diffusion as a maximum-likelihood estimation under a hierarchical Bayesian model, embedding an Expectation–Maximization (EM) inner loop at each step. Empirically, the same model accommodates both bi-modal and tri-modal cases (and beyond) without retraining, maintaining state-of-the-art performance (Xu et al., 11 Sep 2025).
- Flexible Sequence Modeling: FlexMDMs allow any-order, variable-length generation in masked diffusion for discrete sequences. The model supports both insertions (sampling where new masked tokens are created) and unmaskings, with rate functions parameterized by neural heads, so that generated lengths match the true data length statistics. This flexibility is essential for text generation, code infilling, and planning tasks (Kim et al., 31 Aug 2025).
- Representation Guidance and Curriculum: Learning Diffusion Models with Flexible Representation Guidance details a variational framework where feature representations from pre-trained or cross-modal encoders are injected at flexible reverse steps, with schedules and curriculum devised for optimal balance between denoising and semantic alignment. This brings flexibility over the injection point and the modality/task split, enabling faster convergence and improved sample quality for images, proteins, and molecules (Wang et al., 11 Jul 2025).
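To give a feel for how insertion and unmasking events together produce sequences of varying length, here is a deliberately simplified toy loop. It is not the sampler from the cited work: the constant `insert_prob` and `unmask_prob` stand in for rates that a trained model would predict with neural heads, tokens are drawn uniformly from a hypothetical vocabulary, and the function name is invented for this sketch.

```python
import random

MASK = "<mask>"


def toy_flexible_length_sample(vocab, n_steps=20, insert_prob=0.3,
                               unmask_prob=0.5, seed=0):
    """Grow a sequence by inserting masks, then reveal them in arbitrary order."""
    rng = random.Random(seed)
    seq = []
    for _ in range(n_steps):
        # Insertion event: a new masked slot may appear at any position.
        if rng.random() < insert_prob:
            pos = rng.randint(0, len(seq))  # randint is inclusive on both ends
            seq.insert(pos, MASK)
        # Unmasking events: each masked slot may be revealed independently.
        for i, tok in enumerate(seq):
            if tok == MASK and rng.random() < unmask_prob:
                seq[i] = rng.choice(vocab)
    # Reveal whatever is still masked when the process ends.
    return [rng.choice(vocab) if tok == MASK else tok for tok in seq]


vocab = ["a", "b", "c"]
for seed in range(3):
    print(seed, toy_flexible_length_sample(vocab, seed=seed))
```

Because the final length equals the number of insertion events, it is a random outcome of the process rather than a value fixed in advance, which is the property that fixed-length masked diffusion lacks.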
4. Test-Time Adaptation, Planning, and Control Applications
Flexible diffusion models have seen adoption in planning, predictive control, and reinforcement learning, where adaptation to new objectives, constraints, or environmental requirements is essential:
- Diffusion-MPC for Locomotion and Planning: In Flexible Locomotion Learning with Diffusion Model Predictive Control, planning is accomplished by sampling from a diffusion model prior, then introducing reward-based gradients and online constraint projections at each reverse step. Adaptation to new objectives is enabled by directly editing reward functions or constraints at test time, without additional retraining (Huang et al., 5 Oct 2025).
- Classifier/Gaussian Guidance and Energy-Based Steering: Autonomous driving and manipulation settings leverage classifier guidance not only for task optimality but also for safety, comfort, or multimodal intent. In Diffusion-Based Planning for Autonomous Driving with Flexible Guidance, an energy-based guidance vector is introduced into the score function, capable of enforcing collision avoidance, speed limits, drivable area constraints, or comfort through analytic or neural energy functionals (Zheng et al., 26 Jan 2025).
- Flexible Keyframe In-betweening and Motion Synthesis: For character animation, models such as CondMDI can accept arbitrarily sparse or dense keyframe constraints along time and joint axes (including partial pose information), and fuse them with soft text prompts. The masking-based conditioning design enables universal input patterns, outperforming rigid broadcasting or cross-attention layers (Cohan et al., 2024).
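Energy-based steering can be illustrated with a scalar toy problem. In the sketch below, a Langevin-style update stands in for a guided reverse diffusion step: the gradient of an energy penalty is subtracted from the prior score, which discourages samples that exceed a limit. The Gaussian prior, the quadratic penalty, the constants `SPEED_LIMIT` and `PENALTY`, and the step size are all assumptions made for illustration, not settings from the cited planners.

```python
import math
import random

SPEED_LIMIT = 1.0  # hypothetical constraint threshold
PENALTY = 50.0     # hypothetical weight of the energy term


def prior_score(x, mean=1.0, std=1.0):
    """Score of a Gaussian prior, standing in for a learned score network."""
    return -(x - mean) / std ** 2


def energy_grad(x):
    """Gradient of E(x) = 0.5 * PENALTY * max(0, x - SPEED_LIMIT)^2."""
    return PENALTY * max(0.0, x - SPEED_LIMIT)


def guided_sample(rng, guidance_scale, n_steps=300, step=0.01):
    """Run Langevin updates using the score minus the scaled energy gradient."""
    x = rng.gauss(1.0, 1.0)
    for _ in range(n_steps):
        score = prior_score(x) - guidance_scale * energy_grad(x)
        x = x + step * score + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    return x


def violation_rate(guidance_scale, n_samples=500, seed=0):
    """Fraction of samples that end above the limit."""
    rng = random.Random(seed)
    xs = [guided_sample(rng, guidance_scale) for _ in range(n_samples)]
    return sum(x > SPEED_LIMIT for x in xs) / n_samples


unguided_rate = violation_rate(0.0)
guided_rate = violation_rate(1.0)
print(f"violations without guidance: {unguided_rate:.2f}, with guidance: {guided_rate:.2f}")
```

Changing the constraint at test time only requires editing `energy_grad`; the prior score is left untouched, which mirrors the retraining-free adaptation described in this section.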
5. Empirical Performance, Limitations, and Comparative Results
Empirical results across domains indicate that flexibility benefits both modeling capacity and the efficiency of adaptation:
- In multitask robotics, Factorized Diffusion Policy outperforms monolithic and mixture-of-experts baselines by up to 18 percentage points in absolute success rate on RLBench after new-task adaptation; adaptation by adding a new module achieves 85% real-world success without degrading previously acquired skills (Liu et al., 26 Dec 2025).
- In masked discrete sequence modeling, FlexMDMs match baseline perplexity but achieve up to roughly 60% higher task success in complex planning (maze layout) tasks, while supporting genuinely variable lengths, which addresses a core limitation of fixed-length MDMs (Kim et al., 31 Aug 2025).
- For high-dimensional image generation, both FiT and FiTv2 deliver state-of-the-art FID across in-domain and out-of-distribution (OOD) resolutions and aspect ratios. FiTv2 additionally converges faster, which is attributed to its rectified flow scheduler and adaptive normalization (Wang et al., 2024, Lu et al., 2024).
- In multimodal fusion, FlexiD-Fuse achieves leading scores across 8–9 biomedical metrics and remains robust in non-medical (infrared-visible/exposure/focus fusion) settings (Xu et al., 11 Sep 2025).
6. Theoretical Properties and Guarantees
Flexible diffusion models are often supported by precise theoretical guarantees regarding expressivity and correctness:
- A Flexible Diffusion Model formalizes the stationary law of a generalized SDE parameterized by a neural Riemannian metric and/or symplectic form, showing stationarity at standard normal and demonstrating that classical VP/VE/sub-VP and critically-damped Langevin models are recoverable special cases (Du et al., 2022).
- FlexMDMs prove, via CTMC theory, that adaptive any-order inference yields exact marginal samples given accurate unmask and insertion heads, and that the variational cross-entropy loss upper-bounds the final KL divergence (Kim et al., 31 Aug 2025).
- In graph learning, Flexible Diffusion Scopes with Parameterized Laplacian provides an order-preserving theorem linking spectral and diffusion distance, showing that parameterized Laplacians control global/local mixing and can be tuned to maximize accuracy under varying homophily (Lu et al., 2024).
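For the simplest special case, the stationarity claim can be checked by hand. The short calculation below is a standard Fokker-Planck argument for the VP SDE, not the general proof from the cited paper; the noise schedule $\beta(t) > 0$ and the density notation $p_t$ are notational assumptions introduced here.

```latex
\begin{aligned}
&\text{VP SDE:} && dx = -\tfrac{1}{2}\beta(t)\,x\,dt + \sqrt{\beta(t)}\,dw, \\
&\text{Fokker-Planck:} && \partial_t p_t
  = \nabla\cdot\big(\tfrac{1}{2}\beta(t)\,x\,p_t\big) + \tfrac{1}{2}\beta(t)\,\Delta p_t
  = \tfrac{1}{2}\beta(t)\,\nabla\cdot\big(x\,p_t + \nabla p_t\big), \\
&\text{Standard normal:} && p(x) \propto e^{-\|x\|^2/2}
  \;\Rightarrow\; \nabla p = -x\,p
  \;\Rightarrow\; x\,p + \nabla p = 0
  \;\Rightarrow\; \partial_t p = 0.
\end{aligned}
```

The flux term vanishes identically for the standard normal, so the density does not change under the dynamics regardless of the schedule $\beta(t)$.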
7. Broader Context and Future Directions
Flexible diffusion modeling underpins a trend towards adaptive, modular, and input-agnostic generative frameworks. Essential directions include:
- Extending flexible compositionality to cross-domain scenarios (e.g., vision-language, trajectory-text fusion).
- Refinement of plug-and-play adaptation mechanisms, aiming for sublinear adaptation time and negligible interference.
- Fusion and aggregation in higher-order, non-Euclidean, or multi-relational data spaces, leveraging the theory of parameterized SDEs or CTMCs.
This rapidly developing ecosystem combines multitask modularity, variable-length sequence generation, input-agnostic fusion, and plug-and-play control. Together, these capabilities are increasingly important for deploying generative diffusion models in real-world, dynamically evolving, or multimodal applications (Liu et al., 26 Dec 2025, Xu et al., 11 Sep 2025, Wang et al., 2024, Zheng et al., 26 Jan 2025, Wang et al., 11 Jul 2025).