
Scalable Diffusion Models with Transformers

Updated 27 December 2025
  • Scalable diffusion models with transformers provide a unified framework that combines efficient attention variants with diffusion generative processes.
  • They leverage innovations like gated linear attention, sparse MoE layers, and convolutional frontends to reduce computation while maintaining high quality.
  • Adaptive architectural features, including μP parameterization and robust skip connections, enable high-resolution, multi-modal generative modeling.

Scalable diffusion models with transformer architectures constitute a class of generative models that couple the expressivity and scalability of transformers with the sample quality and training stability of diffusion processes. This synthesis addresses the quadratic cost bottleneck of standard self-attention, enables scaling to billions of parameters and high spatial resolutions, and supports adaptive computation and specialization through architectural innovations. Multiple research threads—linear/time-efficient attention, sparse/distributed MoEs, dimension-aware pre-training, alternative backbone paradigms, and principled scaling strategies—underlie the rapid evolution of this field.

1. Computational Bottlenecks and the Need for Scalable Attention Mechanisms

Standard diffusion transformers, such as DiT, leverage vision-transformer backbones to model denoising in latent or pixel space. However, self-attention's O(N²) complexity—where N is sequence length (e.g., image tokens)—hampers tractable scaling to high resolutions and long contexts. This barrier manifests as prohibitive GPU memory use, slow training/inference, and declining cost–quality tradeoffs beyond 512×512 images (Peebles et al., 2022, Zhu et al., 28 May 2024, Li et al., 16 Dec 2024).

To directly overcome this, several linear or sublinear attention replacements have been proposed:

  • Gated Linear Attention (GLA), central to DiG, replaces softmax(QKᵀ)V with a gated, kernelized variant—Q'=g(Q)⊙φ(Q) and K'=h(K)⊙φ(K)—enabling O(N) forward passes for fixed projection dimension d'≪N (Zhu et al., 28 May 2024); a minimal sketch of this pattern appears after this list.
  • Mamba state-space layers and variants, as exemplified by DiffuApriel and LaMamba-Diff, perform bidirectional or 2D directional linear scans to aggregate global context, with local detail restored via windowed or local self-attention, again at cost linear in sequence length (Singh et al., 19 Nov 2025, Fu et al., 5 Aug 2024).
  • RWKV-based recurrence replaces all-pairs attention with elementwise, global-statistic recurrence, reducing aggregation cost to O(N·D) for channel dimension D, with each token update involving only a small running-state accumulation and no windowing (Fei et al., 6 Apr 2024).
  • Token-free convolutional frontends and fixed-size core blocks (e.g., in STOIC) avoid explicit patchification or positional embeddings, using stride-1/2 2D convolutions to encode positional information efficiently and uniformly through the stack (Palit et al., 9 Nov 2024).
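A minimal sketch of the gated, kernelized linear-attention pattern described above, written in PyTorch. The feature map φ, the sigmoid gates, and all layer sizes are illustrative assumptions rather than the exact DiG implementation; the point is that the N×N attention matrix is never materialized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedLinearAttention(nn.Module):
    """Illustrative O(N) attention block in the spirit of GLA/DiG (sketch only)."""

    def __init__(self, dim: int, proj_dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, proj_dim)
        self.to_k = nn.Linear(dim, proj_dim)
        self.to_v = nn.Linear(dim, dim)
        self.gate_q = nn.Linear(dim, proj_dim)  # data-dependent gate g(.)
        self.gate_k = nn.Linear(dim, proj_dim)  # data-dependent gate h(.)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N tokens, dim)
        phi = lambda t: F.elu(t) + 1.0                          # positive kernel feature map
        q = torch.sigmoid(self.gate_q(x)) * phi(self.to_q(x))   # Q' = g(Q) ⊙ φ(Q)
        k = torch.sigmoid(self.gate_k(x)) * phi(self.to_k(x))   # K' = h(K) ⊙ φ(K)
        v = self.to_v(x)

        # Associativity: Q'(K'ᵀV) costs O(N · d' · d) instead of the O(N² · d)
        # needed to form (Q'K'ᵀ)V explicitly.
        kv = torch.einsum("bnd,bne->bde", k, v)                 # (batch, d', dim)
        norm = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
        return torch.einsum("bnd,bde,bn->bne", q, kv, norm)


# Example: a 64x64 latent grid (4096 tokens) at hidden size 512.
out = GatedLinearAttention(dim=512, proj_dim=64)(torch.randn(2, 4096, 512))
```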

These mechanisms enable previously intractable settings (e.g., 2K×2K images, 100K+ sequence lengths in language, on-device deployment), while preserving or improving FID and IS at constant or lower parameter and memory cost relative to canonical transformers.

2. Sparse Mixture-of-Expert Transformers and Adaptive Computation

Scaling transformers to billions or tens of billions of parameters is limited, when every parameter is densely activated, by compute and memory costs that grow with the full parameter count. Introducing sparse mixture-of-experts (MoE) within the diffusion transformer architecture provides a solution:

  • DiT-MoE integrates classical top-K sparse MoE layers into DiT, supplemented by “shared” experts for low-frequency/global information and an explicit expert-balance loss to enforce load balancing. Only the K routed experts plus n_s shared experts (out of E total) are activated per token, roughly halving inference GFLOPs relative to a dense network (Fei et al., 16 Jul 2024); a routing sketch follows this list.
  • DiffMoE generalizes this approach with a batch-level global token pool for expert selection and a dynamic inference-time capacity predictor, reducing average per-sample FLOPs below those of the dense equivalent while expanding the effective parameter count to nearly 2 B with only 458 M activated on ImageNet, beating dense transformers and classical MoEs in FID and IS (Shi et al., 18 Mar 2025).
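A compact sketch of top-K expert routing with an always-active shared expert and an auxiliary balance loss, in the spirit of the DiT-MoE description above. Expert sizes, the absence of capacity limits, and the exact form of the balance loss are simplifying assumptions, not the papers' recipes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEFFN(nn.Module):
    """Top-K routed feed-forward layer with one shared expert (illustrative sketch)."""

    def __init__(self, dim: int, hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)
        ffn = lambda: nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.experts = nn.ModuleList(ffn() for _ in range(num_experts))
        self.shared_expert = ffn()  # always active: low-frequency / global content

    def forward(self, x: torch.Tensor):
        # x: (tokens, dim) -- flatten batch and sequence dimensions before calling.
        probs = F.softmax(self.router(x), dim=-1)         # (T, E) routing probabilities
        top_p, top_idx = probs.topk(self.top_k, dim=-1)   # (T, K)

        out = self.shared_expert(x)
        for e, expert in enumerate(self.experts):
            hit = top_idx == e                                       # (T, K) bool
            rows = hit.any(dim=-1).nonzero(as_tuple=True)[0]         # tokens routed to e
            if rows.numel() == 0:
                continue
            gate = (top_p * hit.float()).sum(dim=-1)[rows].unsqueeze(-1)
            out = out.index_add(0, rows, gate * expert(x[rows]))     # sparse dispatch

        # Auxiliary balance loss: per-expert token load times mean router
        # probability, summed; uniform routing minimizes it.
        load = F.one_hot(top_idx, probs.size(-1)).float().sum(dim=1).mean(dim=0)
        importance = probs.mean(dim=0)
        balance_loss = probs.size(-1) * (load * importance).sum()
        return out, balance_loss


# Example: 512 flattened tokens of width 256 through 8 experts, top-2 routing.
y, aux = SparseMoEFFN(dim=256, hidden=1024)(torch.randn(512, 256))
```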

Both families show that, in the diffusion context, MoE routing and dynamic capacity allocation enable gigascale models to be trained and deployed efficiently, with performance matching much larger dense models but at substantially reduced compute cost.

3. Architectural Adaptations and Scaling Laws

Empirical work demonstrates that transformer-based diffusion models are highly scalable with respect to both model and data size, but scaling behavior can be qualitatively different across architectural choices:

  • Depth, width, and token scaling: Increasing transformer width, depth, and number of tokens (by reducing patch size) all lower FID, with overall GFLOPs being the dominant driver; for latent-space models, FID falls roughly as a power law in GFLOPs, i.e., a robust negative log–log trend (Peebles et al., 2022). In pixel space, hierarchical hourglass structures (HDiT) with global + local blocks restore O(N) scaling, enabling training at megapixel resolution (Crowson et al., 21 Jan 2024).
  • Parameterization strategy: Maximal Update Parametrization (μP) keeps learning rates, initializations, and other hyperparameters stable under width scaling, enabling zero-shot hyperparameter transfer from small to large diffusion transformers with up to 2.9× faster convergence. A tensor-program analysis confirms that μP applies unchanged to all DiT variants and diffusion objectives (Zheng et al., 21 May 2025); a schematic illustration follows this list.
  • Skip connections and input representation: Long U-shaped skips are critical in pure self-attention DiTs (U-ViT) for efficient gradient flow and convergence. Self-attention backbones (U-ViT) are shown to scale more smoothly than cross-attention backbones, matching or surpassing SDXL U-Nets at 2–3 B parameters (Li et al., 16 Dec 2024).
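A schematic illustration of the μP idea referenced above: per-layer learning rates (and, in the full recipe, initializations) are rescaled with width so that a learning rate tuned on a narrow proxy model can be reused at larger widths. The rule shown is the standard μP heuristic for matrix-like parameters in simplified form, not the exact procedure of the cited work.

```python
import torch
import torch.nn as nn


def mup_param_groups(model: nn.Module, base_width: int, width: int, base_lr: float):
    """Build optimizer parameter groups with a μP-style learning-rate rule (sketch).

    Assumption (simplified μP heuristic): hidden weight matrices have their learning
    rate divided by the width multiplier, while vector-like parameters (biases, norm
    gains) keep the base rate. The full recipe also adjusts initializations and
    output scaling, omitted here.
    """
    mult = width / base_width
    matrix_like = [p for p in model.parameters() if p.ndim >= 2]
    vector_like = [p for p in model.parameters() if p.ndim < 2]
    return [
        {"params": matrix_like, "lr": base_lr / mult},  # hidden matrices: lr ∝ 1/width
        {"params": vector_like, "lr": base_lr},         # biases / gains: lr unchanged
    ]


# Usage: tune base_lr on a narrow proxy DiT (e.g., width 256), then reuse it
# unchanged for the target model at width 1024:
# optimizer = torch.optim.AdamW(mup_param_groups(big_dit, 256, 1024, base_lr=3e-4))
```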

Empirical scaling curves (e.g., TIFA or FID plotted against log-scaled compute, model size, or dataset size) show clean power-law improvement up to saturation, with diminishing returns above roughly 3 B parameters and 600 M paired samples for text-to-image.
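Such power-law trends are typically summarized by a linear fit in log–log space, as in the short sketch below; the (compute, FID) pairs are placeholders for illustration, not measurements from the cited papers.

```python
import numpy as np

# Hypothetical (compute, FID) pairs for one model family; placeholder values only.
gflops = np.array([5.0, 20.0, 80.0, 120.0])
fid = np.array([45.0, 24.0, 12.0, 10.0])

# Fit log10(FID) = a * log10(GFLOPs) + b; a < 0 is the negative log-log slope
# described above.
a, b = np.polyfit(np.log10(gflops), np.log10(fid), deg=1)
print(f"FID ≈ {10**b:.1f} · GFLOPs^{a:.2f}")
```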

4. Conditioning and Modality Generalization

Scalable transformer diffusion models abstract and unify the diffusion process across diverse tasks and conditions by architectural design:

  • Conditional injection via adaLN-Zero (adaptive LayerNorm with a zero-initialized residual gate), cross-attention, or FiLM is crucial. In large-scale diffusion transformers for 4D fMRI, combining adaLN-Zero and cross-attention is empirically required for condition-specific fidelity in medical synthesis (Seo et al., 28 Nov 2025); a block-level sketch follows this list.
  • UniDiffuser demonstrates joint training for all distributions over multi-modal data (e.g., image, text, cross-modal), leveraging a single transformer backbone handling multiple conditioning modalities and timestep perturbations. This design enables conditional, marginal, and joint sampling for all data types with minimal architectural modification (Bao et al., 2023).
  • RAE latents and dimension-aware DiT upscaling: Replacing the VAE with frozen, semantically rich representation encoders (e.g., DINO, CLIP, SigLIP) and widening the transformer to match the high-dimensional latents greatly improves sample quality and convergence; the “DDT head” further improves FID at negligible computational cost (Zheng et al., 13 Oct 2025).
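A minimal sketch of adaLN-Zero conditioning in a DiT-style block: a conditioning embedding (e.g., timestep plus class) regresses per-block shift, scale, and residual-gate parameters, with the regression head initialized to zero so the block starts as the identity map. Layer sizes and the MLP structure are illustrative choices.

```python
import torch
import torch.nn as nn


class AdaLNZeroBlock(nn.Module):
    """Transformer block with adaLN-Zero conditioning (illustrative sketch)."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Regress (shift, scale, gate) for both sub-blocks from the conditioning vector.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.ada[-1].weight)  # "Zero": gates start at 0, so the block
        nn.init.zeros_(self.ada[-1].bias)    # is initially an identity mapping.

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); cond: (batch, dim), e.g. timestep + class embedding.
        shift1, scale1, gate1, shift2, scale2, gate2 = self.ada(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale1.unsqueeze(1)) + shift1.unsqueeze(1)
        x = x + gate1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2.unsqueeze(1)) + shift2.unsqueeze(1)
        return x + gate2.unsqueeze(1) * self.mlp(h)


# Example: condition 256 tokens of width 384 on a pooled timestep/class embedding.
y = AdaLNZeroBlock(dim=384, num_heads=6)(torch.randn(2, 256, 384), torch.randn(2, 384))
```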

A recurring empirical result is that careful conditioning—via strong normalization, effective residual/skip structure, and explicit cross-attentional injectors—enables both vertical (parameter/data) and horizontal (task/modality) scaling.

5. Theoretical and Practical Efficiency Analysis

Efficiency at high resolution and sequence length is validated through both asymptotic and empirical benchmarks across methodologies:

  • GLA (DiG): At fixed d′, GLA costs O(N·d·d′) compute per block, empirically yielding a 4.2× speedup over Mamba-based models and 2.5× over DiT at 1792×1792 resolution with no FID loss (Zhu et al., 28 May 2024); see the back-of-the-envelope estimate after this list.
  • State-space/SSM backbones (Mamba, RWKV): Linear O(N) complexity is preserved in both language and vision, with bidirectional/2D scans supplementing or replacing explicit attention. Throughput speedups (up to 4.4×) are measured in masked language modeling and image synthesis, with identical or better perplexity and FID (Singh et al., 19 Nov 2025, Fu et al., 5 Aug 2024, Fei et al., 6 Apr 2024).
  • Token-free and convolutional block designs: Fixed-size, stride-1/2 initial convolution modules with reusable core blocks (STOIC) offer tokenization-free, positional-embedding-free inference with low memory and resource cost, reaching a state-of-the-art FID of 1.6 on CelebA for on-device deployment (Palit et al., 9 Nov 2024).
  • Sparse MoE (DiT-MoE, DiffMoE): FLOPs and memory at inference are pruned to a fraction of dense baselines (<50% in some settings) by activating only routing-selected experts, with negligible router overhead (Fei et al., 16 Jul 2024, Shi et al., 18 Mar 2025).
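The asymptotic comparison above can be made concrete with a back-of-the-envelope count of multiply-accumulates per attention block. The token count, hidden size, and projection dimension below are illustrative choices, not values taken from the cited benchmarks.

```python
def attention_flops(n_tokens: int, dim: int) -> float:
    """Approximate MACs for softmax attention: forming QKᵀ and applying it to V."""
    return 2 * n_tokens * n_tokens * dim


def linear_attention_flops(n_tokens: int, dim: int, proj_dim: int) -> float:
    """Approximate MACs for kernelized linear attention: K'ᵀV, then Q'(K'ᵀV)."""
    return 2 * n_tokens * dim * proj_dim


# Illustrative setting: a 2048x2048 image with 16x16 patches -> 128*128 = 16384 tokens.
n, d, d_proj = 128 * 128, 1152, 64
ratio = attention_flops(n, d) / linear_attention_flops(n, d, d_proj)
print(f"softmax attention needs ≈ {ratio:,.0f}x the FLOPs of linear attention here")
```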

Empirical studies consistently show that attention-replacement architectures, sparse-expert activation, and convolutional frontends enable transformer-class generative performance at or below the memory/FLOP envelope of UNet backbones for large image models.

6. Limitations, Ablations, and Open Directions

Despite rapid progress in scaling diffusion models with transformers, open questions remain:

  • The tolerable reduction in attention width d′ for GLA, and its exact scaling limits as N increases, remain to be determined. Adaptive, sparse, or learned gating may yield further gains (Zhu et al., 28 May 2024).
  • For MoE backbones, further gains in specialization or hybridization (e.g., heterogeneous experts across modalities) are anticipated, especially in text-to-image/video; integration of advanced regularization could further improve inference cost (Fei et al., 16 Jul 2024).
  • Scaling laws for loss vs. parameters/data, reliable cross-task transfer, and multi-modal harmonization all remain areas of active research. Deriving explicit analytic scaling relationships is an unresolved theoretical challenge (Li et al., 16 Dec 2024).
  • Implementation best practices now favor initialization-aware tuning (μP), width-tied scaling in all modules, and early layer/capacity balancing in MoE and shared-expert designs (Zheng et al., 21 May 2025).

In summary, the combination of transformer architectures with scalable and efficient diffusion modeling, using linear attention/SSM mechanisms, sparse mixture-of-experts, and token-free convolutional blocks, defines the state of the art for high-resolution, multi-modal generative models. Across image, video, text, and scientific domains, these methods provide robust scaling, tractable resource requirements, and an extensible architectural basis for future diffusion-based AI systems.
