
Transformer-Based Diffusion Model (TDDPM)

Updated 28 May 2026
  • TDDPM is a generative model that combines the denoising diffusion framework with transformer architectures to capture long-range dependencies across diverse modalities.
  • The approach utilizes pure transformer and hybrid UNet-transformer designs to boost performance in tasks like image synthesis, restoration, and 3D scene generation.
  • Advanced conditioning, efficient attention modulation, and multi-modal tokenization enable improved empirical results and scalability while reducing computational overhead.

A Transformer-Based Diffusion Model (TDDPM) is a class of generative models that integrates the denoising diffusion probabilistic model (DDPM) framework with architectures based on the transformer mechanism. These models replace or augment the conventional convolutional (typically UNet) backbone of the learned reverse (denoising or generation) process with self-attention-centric transformer networks, while the forward (noising) process remains a fixed Gaussian corruption. TDDPMs have shown strong empirical advantages in modeling long-range dependencies across a broad array of data types, including images, time series, layouts, medical volumes, 3D scenes, and cross-modal structured outputs.

1. Theoretical Foundations and Core Formulation

TDDPMs inherit the mathematical foundation of DDPMs. The forward process is a Markov chain that adds incrementally scaled Gaussian noise to clean data $x_0$ over $T$ steps:

$$q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I), \quad t = 1, \ldots, T$$

A key property is that the marginal of $x_t$ given $x_0$ is available in closed form:

$$q(x_t|x_0) = \mathcal{N}(x_t;\,\sqrt{\bar\alpha_t}\,x_0,\,(1-\bar\alpha_t)I), \quad \text{with}~\alpha_t = 1-\beta_t,~\bar\alpha_t = \prod_{s=1}^t(1-\beta_s)$$
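The closed-form marginal means $x_t$ can be drawn in a single step rather than by iterating the chain. The following is a minimal sketch of that idea using only the Python standard library. The linear schedule, its endpoints, and the helper names (`linear_betas`, `alpha_bars`, `q_sample`) are assumptions chosen for illustration, not details taken from the cited papers.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed, commonly used choice)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def alpha_bars(betas):
    """Cumulative products: alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is 1-indexed as in the text."""
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

betas = linear_betas(1000)
abar = alpha_bars(betas)
rng = random.Random(0)
x0 = [1.0, -0.5, 0.25]
x_t, eps = q_sample(x0, t=500, abar=abar, rng=rng)
```

Because `abar` decreases monotonically toward zero, late timesteps are dominated by the noise term, which is what lets sampling start from a standard Gaussian.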

The reverse process is parameterized by a neural network $\epsilon_\theta(x_t, t, c)$ (where dependence on the auxiliary condition $c$ is optional):

$$p_\theta(x_{t-1}|x_t, c) = \mathcal{N}(x_{t-1};\, \mu_\theta(x_t, t, c), \sigma_t^2 I)$$

$$\mu_\theta(x_t, t, c) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t, c) \right)$$
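As a sanity check on the mean formula, the sketch below evaluates it with the true noise substituted for the network output. It assumes $\alpha_t = 1-\beta_t$ and 1-indexed timesteps; `reverse_mean` and the two-step toy schedule are hypothetical names and values used only for illustration. At $t = 1$ we have $\bar\alpha_1 = \alpha_1$, so a perfect noise prediction should return $x_0$ exactly.

```python
import math

def reverse_mean(x_t, eps_hat, t, betas, abar):
    """mu = (x_t - (1 - alpha_t) / sqrt(1 - abar_t) * eps_hat) / sqrt(alpha_t)."""
    alpha_t = 1.0 - betas[t - 1]
    coef = (1.0 - alpha_t) / math.sqrt(1.0 - abar[t - 1])
    return [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps_hat)]

# Toy two-step schedule (values are arbitrary).
betas = [0.1, 0.2]
abar = [0.9, 0.9 * 0.8]

x0 = [2.0, -1.0]
eps = [0.3, -0.7]  # stands in for a perfect network prediction

# Noise x0 to t = 1 with the closed-form marginal, then apply the reverse mean.
x1 = [math.sqrt(abar[0]) * x + math.sqrt(1.0 - abar[0]) * e for x, e in zip(x0, eps)]
mu = reverse_mean(x1, eps, 1, betas, abar)
```

In an actual model, `eps_hat` would come from the transformer denoiser, and a Gaussian sample with variance $\sigma_t^2$ would be added to `mu` for every step except the last.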

The simplified training loss is

$$\mathbb{E}_{t, x_0, \epsilon} \left\| \epsilon - \epsilon_\theta(x_t, t, c) \right\|^p$$

with $p = 2$ for most image domains but often $p = 1$ for sharper recovery (e.g., in image restoration) (Anwar et al., 25 Jun 2025, Tang et al., 2023).
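To make the role of the exponent $p$ concrete, the sketch below compares a squared-error and an absolute-error variant on a toy batch. It assumes the norm is applied elementwise and averaged, which matches the usual mean-squared-error and mean-absolute-error losses. The function name and the numbers are illustrative only.

```python
def noise_prediction_loss(eps_true, eps_pred, p=2):
    """Average of |eps - eps_hat|^p over all elements of a batch."""
    total, count = 0.0, 0
    for row_true, row_pred in zip(eps_true, eps_pred):
        for a, b in zip(row_true, row_pred):
            total += abs(a - b) ** p
            count += 1
    return total / count

# Toy batch: two samples with two noise components each.
eps_true = [[0.5, -1.0], [2.0, 0.0]]
eps_pred = [[0.0, -1.0], [0.0, 0.5]]

l2 = noise_prediction_loss(eps_true, eps_pred, p=2)  # 1.125
l1 = noise_prediction_loss(eps_true, eps_pred, p=1)  # 0.75
```

The single large residual (2.0) contributes 4.0 of the 4.5 total under $p = 2$ but only 2.0 of 3.0 under $p = 1$, which is one way to see why the absolute-error variant is less dominated by outliers.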

2. Transformer Integration Architectures

TDDPMs may insert transformers at various loci:

Multi-modal TDDPMs (e.g., UniDiffuser, DiffSurf) concatenate or interleave modality-specific tokens, inject per-modality timestep embeddings, and use shared transformer attention to entangle signals across modalities and time (Bao et al., 2023, Yoshiyasu et al., 2024).
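The token layout described above can be sketched as follows. This is one possible arrangement, not the layout of any specific cited model: it assumes each modality's timestep is encoded with a sinusoidal embedding and prepended as an extra token before that modality's tokens, and that `dim` is even. The names `timestep_embedding` and `build_sequence` are hypothetical.

```python
import math

def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep (an assumed, common choice)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

def build_sequence(streams, timesteps, dim):
    """Concatenate modality token streams, each prefixed by its own timestep token.

    streams:   dict name -> list of tokens (each token is a list of `dim` floats)
    timesteps: dict name -> timestep used for that modality
    Returns the joint sequence and the (start, end) span of each modality's tokens.
    """
    seq, spans = [], {}
    for name, tokens in streams.items():
        seq.append(timestep_embedding(timesteps[name], dim))
        start = len(seq)
        seq.extend(tokens)
        spans[name] = (start, len(seq))
    return seq, spans

dim = 8
streams = {
    "image": [[0.0] * dim for _ in range(4)],  # 4 placeholder image tokens
    "text": [[1.0] * dim for _ in range(3)],   # 3 placeholder text tokens
}
seq, spans = build_sequence(streams, {"image": 300, "text": 0}, dim)
```

A shared transformer would then attend over all of `seq`, and the spans indicate which output positions hold the noise prediction for each modality.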

3. Conditioning, Modulation, and Guidance

TDDPMs exhibit diverse conditioning and guidance strategies:

4. Applications and Empirical Performance

Visual Content Generation and Restoration

  • Image Synthesis: Pure transformer backbones (e.g., DiT/SDiT) achieve competitive or superior FID/IS, particularly under large-scale settings. TransDiff (Zhen et al., 11 Jun 2025) combines AR-Transformer and diffusion to reach FID 1.42 on ImageNet 256×256 in ≈1s with MRAR sampling.
  • Restoration Tasks: TDiR (Anwar et al., 25 Jun 2025) and underwater TDDPM (Tang et al., 2023) leverage transformer-U-Net or shallow transformer denoisers: e.g., TDiR surpasses MetaUE and baseline PromptIR on underwater benchmarks, while TDDPM achieves +3.5dB PSNR over prior transformer-enhanced CNNs at 5×–10× runtime speedup using DDIM and skip strategies.
  • Layout Generation: LayoutDM (Chai et al., 2023) deploys a pure-set transformer for layout element denoising, outperforming U-Net and GAN/VAE rivals in FID, alignment, and content diversity.

High-dimensional Data

  • 3D Shape and Scene Generation: DiffTF (Cao et al., 2023) uses triplane representations and a cross-plane transformer to achieve state-of-the-art diversity and fidelity over 200+ 3D object categories. DiffSurf (Yoshiyasu et al., 2024) generalizes joint transformer diffusion to surface meshes and skeletal joints, outperforming prior mesh generative baselines in diversity (1-NNA), fitting accuracy (PA-MPJPE), and inference speed (≈21 fps, N=431).
  • Medical Imaging: MC-DDPM (Pan et al., 2023) integrates Swin transformer blocks for MRI-to-CT translation, leading to lower MAE and higher SSIM than GAN and CNN-based DDPMs. MedSegDiff-V2 (Wu et al., 2023) combines UNet backbone with transformer-based Anchor and SS-Former modules, achieving consistently higher Dice scores across 20 segmentation datasets.

Sequential and Graph Data

  • Time Series: TDSTF (Chang et al., 2023) and Diffusion-TS (Yuan et al., 2024) use encoder–decoder transformers for probabilistic forecasting, imputation, and interpretable sample generation, outperforming prior RNNs and diffusion models in MSE and c-FID on medical and multivariate benchmarks.
  • Graph learning: DIFFormer (Wu et al., 2023) parameterizes diffusion-induced attention matrices with closed-form optimality, improving node classification and semi-supervised image/text classification compared to GCN/GAT, GNNs, and prior transformers.

5. Model Efficiency, Scalability, and Specializations

Recent TDDPM work systematically addresses the higher computational burden of transformer backbones:

  • Attention Modulation Matrix: EDT (Chen et al., 2024) introduces AMM to decorrelate attention locality, improving detail fidelity and reducing computational cost (up to 3.9× training speedup and 2.3× inference speedup over MDTv2).
  • Layer-wise and cross-scale efficiency: Swin-Vnet leverages 3D shifted window attention to combine low-res transformer and high-res convolution, effectively modeling volumetric dependencies for brain and prostate sCT (Pan et al., 2023).
  • Spiking Adaptation: SDiT (Yang et al., 2024) merges transformer blocks with LIF spiking neurons, producing competitive FID/IS with fewer diffusion steps and lower multiply–accumulate counts, facilitating deployment on neuromorphic hardware.
  • Diffusion Policy in Control: MTDP (Wang et al., 13 Feb 2025) employs modulated transformer blocks for denoising policies, enabling faster sampling with DDIM while maintaining or improving manipulation success rates versus previous transformer and UNet baselines.
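The computational burden mentioned at the start of this section comes largely from self-attention comparing every token with every other token. The back-of-the-envelope count below contrasts global attention with window-restricted attention of the kind used by shifted-window designs. It counts query-key pairs only and ignores heads, channel width, and constant factors. The image size, patch size, and window size are illustrative assumptions, not figures from the cited papers.

```python
def num_tokens(height, width, patch):
    """Number of non-overlapping patch tokens for a 2D input."""
    return (height // patch) * (width // patch)

def global_attention_pairs(n):
    """Every token attends to every token: n^2 query-key pairs."""
    return n * n

def windowed_attention_pairs(n, window_tokens):
    """Each token attends only within its window: n * window_tokens pairs."""
    return n * window_tokens

n = num_tokens(256, 256, patch=8)                        # 32 * 32 = 1024 tokens
full = global_attention_pairs(n)                         # 1,048,576 pairs
local = windowed_attention_pairs(n, window_tokens=8 * 8)  # 65,536 pairs
ratio = full // local                                    # 16
```

Halving the patch size quadruples the token count and multiplies the global pair count by sixteen, which is why efficiency techniques tend to target the attention pattern itself.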

6. Unified and Multi-Modal Diffusion with Transformers

UniDiffuser (Bao et al., 2023) demonstrates a single transformer backbone operating on multi-modal data (image, text) by encoding each modality as separate token streams, injecting per-modality timestep embeddings, and jointly predicting noise vectors. By varying which modalities are clean/noisy per sample, UniDiffuser supports unconditional, marginal, conditional, and joint generation in a single, parameter-efficient model, matching or exceeding FID and CLIP scores of task-specialized systems (e.g., Stable Diffusion, DALL-E 2).

Generalizations such as DiffSurf (Yoshiyasu et al., 2024) and cross-plane diffusion over 3D-aware representations (Cao et al., 2023) illustrate the flexibility of TDDPMs in handling multiple heterogeneously structured modalities, with conditioning at inference accomplished by freezing some input streams at $t = 0$.
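The generation modes described in this section can be read as different choices of per-modality timesteps. The sketch below is a hedged illustration of that idea for two modalities. It assumes a clean conditioning stream is held at timestep 0 and an ignored stream is held at the maximum timestep $T$ (pure noise). The mode names and the function `modality_timesteps` are hypothetical.

```python
def modality_timesteps(mode, t, T):
    """Return (t_image, t_text) for one denoising step at time t.

    Assumed convention: 0 means "clean, held fixed as a condition";
    T means "pure noise", i.e. the modality is marginalized out.
    """
    if mode == "joint":
        return t, t
    if mode == "text_to_image":
        return t, 0
    if mode == "image_to_text":
        return 0, t
    if mode == "image_only":
        return t, T
    if mode == "text_only":
        return T, t
    raise ValueError(f"unknown mode: {mode}")

T = 1000
# Timesteps passed to the shared backbone at step t = 400 for each mode.
schedule = {m: modality_timesteps(m, 400, T)
            for m in ("joint", "text_to_image", "image_to_text", "image_only", "text_only")}
```

Under these assumptions, a single set of weights serves every mode because the only thing that changes between them is the pair of timesteps fed alongside the tokens.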

7. Limitations, Runtime, and Future Directions

While TDDPMs bring architectural flexibility and domain generalization, their attention (and, in hybrid variants, autoregressive) components can impose substantial memory and compute overhead. Recent methods tackle this via efficient attention modulation (EDT), shallow or bottleneck transformer blocks (e.g., MedSegDiff-V2, PromptIR), or fast deterministic samplers (DDIM, DPM-Solver).
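As an illustration of the fast deterministic samplers mentioned above, the following is a minimal sketch of one DDIM update with zero stochasticity, together with a strided timestep schedule. It assumes the noise-prediction parameterization used earlier and that `num_steps` divides `T`. The function names are hypothetical, and the true noise stands in for a network prediction.

```python
import math

def ddim_step(x_t, eps_hat, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0) from noise level abar_t to abar_prev."""
    # Estimate the clean sample implied by the current noise prediction.
    x0_hat = [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
              for x, e in zip(x_t, eps_hat)]
    # Re-noise that estimate to the earlier (less noisy) level with the same eps.
    return [math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * e
            for x0, e in zip(x0_hat, eps_hat)]

def strided_timesteps(T, num_steps):
    """Descending, evenly spaced subset of 1..T (assumes num_steps divides T)."""
    return list(range(T, 0, -(T // num_steps)))

steps = strided_timesteps(1000, 50)  # 1000, 980, ..., 20

# Toy check: with the true noise, stepping to abar_prev = 1 recovers x0.
x0 = [1.0, -2.0]
eps = [0.3, 0.4]
abar_t = 0.5
x_t = [math.sqrt(abar_t) * x + math.sqrt(1.0 - abar_t) * e for x, e in zip(x0, eps)]
x_rec = ddim_step(x_t, eps, abar_t, 1.0)
```

With 50 strided steps in place of 1000, the denoiser is evaluated 20 times less often, which matters most when each evaluation is a full transformer forward pass.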

Empirical results frequently show that transformer-based denoisers can match or surpass the best convolutional methods across diverse tasks and domains, especially as data and model scales increase (Zhen et al., 11 Jun 2025, Chai et al., 2023, Bao et al., 2023). However, careful integration of transformer layers (location, conditioning, normalization), domain-specific tokenization, and computational optimizations (AMM, masking, cross-scale fusion) remain critical for stable convergence and practical training and deployment.

Current research advances TDDPMs through increased multi-modality, interpretability (trend/seasonal disentanglement), data efficiency (energy-aware diffusion attention), and semantics-driven policy generation, marking TDDPMs as a central technology for general-purpose generative modeling across modalities and tasks.
