Fully Convolutional Diffusion Model

Updated 13 March 2026
  • FCDM is a deep generative model that uses fully convolutional operators to implement diffusion processes without self-attention or Transformer modules.
  • Its architecture employs U-shaped and hierarchical encoder-decoder designs with innovations like channel attention (CCA/GRN) for enhanced performance and efficiency.
  • FCDMs offer hardware-efficient training and scalable performance by combining probabilistic diffusion frameworks with inductive biases such as locality, translation equivariance, and patch mosaic mechanisms.

Fully Convolutional Diffusion Models (FCDMs) are a class of deep generative models that realize the denoising diffusion probabilistic model (DDPM) or related frameworks with a network backbone built entirely from convolutional (ConvNet) operators, with no self-attention or Transformer modules anywhere in the architecture. Using U-shaped (encoder-decoder) or hierarchical ConvNet designs together with architectural and conditioning innovations, FCDMs are reported to reach state-of-the-art sample quality and efficiency, rivaling or exceeding transformer-based diffusion models on large-scale image generation and other tasks. Their inductive biases, locality and translation equivariance, give them distinctive generative and efficiency properties and make them attractive for scalable, hardware-efficient generative modeling.

1. Mathematical Foundations and Modeling Framework

FCDMs instantiate the DDPM framework, parameterizing the forward and reverse stochastic processes with purely convolutional neural networks. The forward (noising) chain is given by

$$q(x_{1:T}\mid x_0)=\prod_{t=1}^T q(x_t\mid x_{t-1}), \quad q(x_t\mid x_0)=\mathcal{N}\left(x_t;\sqrt{\bar\alpha_t}\,x_0,(1-\bar\alpha_t)I\right),$$

with $\bar\alpha_t=\prod_{s=1}^t(1-\beta_s)$ and $\beta_t$ the variance schedule.
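The closed-form marginal can be sketched numerically as follows. This is a minimal illustration, assuming a linear $\beta_t$ schedule (a common DDPM default, not a value taken from the FCDM papers); the helper names `alpha_bar` and `q_sample` are hypothetical.

```python
import numpy as np

def alpha_bar(betas):
    """Cumulative product of (1 - beta_s): the fraction of signal variance kept at step t."""
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, abar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is a 1-indexed timestep."""
    eps = rng.standard_normal(x0.shape)
    a = abar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

# Assumed linear schedule; the actual schedule used by a given FCDM may differ.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
abar = alpha_bar(betas)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 32, 32))   # stand-in for a 32x32x4 latent, channels first
xT, _ = q_sample(x0, T, abar, rng)      # at t = T the sample is almost pure noise
```

Because $\bar\alpha_t$ decreases monotonically toward zero, the same function covers every noise level without simulating the chain step by step.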

The time-indexed score network $f(x_t, t)$ is implemented as a translation-equivariant ConvNet; for denoising score matching or DDPM objectives, the training loss is

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon_\theta(x_t, t) - \epsilon\|_2^2\right],$$

with $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ and $\epsilon \sim \mathcal{N}(0,I)$. Reverse-time inference is parameterized by the ConvNet noise predictor via the standard $\mu_\theta$ reparameterization.
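A small sketch of the objective and of the reverse-step mean may help here. It assumes the standard DDPM form $\mu_\theta = \big(x_t - \tfrac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta\big)/\sqrt{1-\beta_t}$; the function names are hypothetical, and the true noise stands in for a trained ConvNet predictor.

```python
import numpy as np

def simple_loss(eps_pred, eps):
    """Monte Carlo estimate of L_simple: squared L2 error per sample, averaged over the batch."""
    diff = (eps_pred - eps).reshape(eps.shape[0], -1)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def reverse_mean(x_t, eps_pred, beta_t, abar_t):
    """Standard DDPM mean of p(x_{t-1} | x_t) expressed through the predicted noise."""
    return (x_t - beta_t / np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(1.0 - beta_t)

# Sanity check at t = 1, where abar_1 = 1 - beta_1: a perfect noise
# prediction makes the reverse mean land exactly on x_0.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((2, 4, 8, 8))      # batch of two toy latents
eps = rng.standard_normal(x0.shape)
beta_1 = 1e-4                               # assumed first-step variance
abar_1 = 1.0 - beta_1
x1 = np.sqrt(abar_1) * x0 + np.sqrt(1.0 - abar_1) * eps
x0_rec = reverse_mean(x1, eps, beta_1, abar_1)
```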

FCDMs maintain translation equivariance and a local receptive field throughout the network, which is what makes their generative behavior amenable to tractable analytic characterization (Ai et al., 16 May 2025, Kamb et al., 2024).
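Translation equivariance can be checked directly on a toy example. The sketch below assumes periodic (wrap-around) padding, under which a convolution commutes exactly with shifts; with zero padding, as used in most real networks, the property holds only away from the borders. `circular_conv2d` is a hypothetical helper.

```python
import numpy as np

def circular_conv2d(x, k):
    """Cross-correlate a 2D array with a small odd-sized kernel using periodic padding."""
    kh, kw = k.shape
    out = np.zeros_like(x, dtype=float)
    for i in range(kh):
        for j in range(kw):
            # np.roll aligns the neighbour at offset (i - kh//2, j - kw//2) with each output pixel.
            out += k[i, j] * np.roll(x, shift=(kh // 2 - i, kw // 2 - j), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
k = rng.standard_normal((3, 3))
shift = (2, -3)

shift_then_conv = circular_conv2d(np.roll(x, shift, axis=(0, 1)), k)
conv_then_shift = np.roll(circular_conv2d(x, k), shift, axis=(0, 1))
```

The two results agree, which is the defining property: shifting the input and then convolving equals convolving and then shifting the output.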

2. Network Architectures: Hierarchy, Conditioning, and Convolutional Blocks

FCDM architectures span several ConvNet baselines, unified by their avoidance of attention modules and systematic use of spatial and channel mixing via convolution and normalization.

  • DiCo (Ai et al., 16 May 2025): Uses a U-shaped (encoder–bottleneck–decoder) hierarchy with pixel-unshuffle/shuffle for down/upsampling and skip connections. Core "DiCo blocks" combine $1\times1$ pointwise convolution, $3\times3$ depthwise convolution, GELU nonlinearity, and a compact channel attention (CCA) module. Layer normalization and a final convolutional head produce the denoising estimates.
  • ConvNeXt-inspired FCDM (Kwon et al., 10 Mar 2026): Adopts large $7\times7$ depthwise convolutions, an inverted bottleneck (pointwise channel expansion and projection), Global Response Normalization (GRN), and Adaptive LayerNorm (AdaLN) for block-wise conditioning. The network is organized as a symmetric 5-stage U-Net whose size is governed by a scaling law over channel and block counts.
  • DiC (Tian et al., 2024): Employs an encoder–decoder hourglass U-Net built with only $3\times3$ convolutions, GroupNorm and GELU, sparse skip connections (one per stage), and stage-specific, mid-block timestep and conditional injection via AdaLN gating. Efficient block conditioning and skip topologies are crucial for scaling.
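The pixel-unshuffle/shuffle resampling mentioned for DiCo can be sketched in a few lines. This is a NumPy illustration that assumes the channel ordering used by PyTorch's `PixelUnshuffle`/`PixelShuffle`; it is not code from any of the cited papers.

```python
import numpy as np

def pixel_unshuffle(x, r):
    """(C, H, W) -> (C*r*r, H/r, W/r): fold each r x r spatial block into channels."""
    c, h, w = x.shape
    x = x.reshape(c, h // r, r, w // r, r)
    return x.transpose(0, 2, 4, 1, 3).reshape(c * r * r, h // r, w // r)

def pixel_shuffle(x, r):
    """(C*r*r, H, W) -> (C, H*r, W*r): exact inverse of pixel_unshuffle."""
    crr, h, w = x.shape
    c = crr // (r * r)
    x = x.reshape(c, r, r, h, w)
    return x.transpose(0, 3, 1, 4, 2).reshape(c, h * r, w * r)

latent = np.arange(4 * 32 * 32, dtype=float).reshape(4, 32, 32)
down = pixel_unshuffle(latent, 2)    # shape (16, 16, 16)
up = pixel_shuffle(down, 2)          # back to (4, 32, 32)
```

Unlike strided convolution or pooling, this rearrangement discards no information, so downsampling followed by upsampling is lossless.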

All FCDM variants operate directly in a VAE latent space (e.g., $32 \times 32 \times 4$ for 256px ImageNet), enabling compute- and memory-efficient training with high-fidelity outputs.

3. Channel Attention and Diversity Mechanisms

Efficient ConvNet backbones often suffer from channel redundancy relative to attention-based models. To address this, FCDMs incorporate lightweight modules to enhance channel utilization:

  • Compact Channel Attention (CCA) (Ai et al., 16 May 2025): Reweights channels using globally pooled statistics:

$$\mathrm{CCA}(X) = X \odot \sigma\big(W_p\,\mathrm{GAP}(X)\big), \quad \mathrm{GAP}(X)_k = \frac{1}{HW}\sum_{i,j} X_{i,j,k}$$

Here, $W_p$ is a learned pointwise map, $\sigma$ is the sigmoid, and $\odot$ denotes channel-wise multiplication. CCA drives dynamic, content-dependent channel activation with marginal computational overhead.

  • Global Response Normalization (GRN) (Kwon et al., 10 Mar 2026): Distributes activation energy across channels, increasing diversity. GRN is parameter-free and improves channel usage versus naive ConvNet stacking.
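Both channel modules are small enough to sketch directly. The CCA function follows the formula given above; the GRN function assumes the ConvNeXt V2 definition (per-channel L2 energy divided by its mean across channels) and keeps only its parameter-free core, leaving out the learnable scale, bias, and residual that ConvNeXt V2 wraps around it. Whether the FCDM paper uses exactly this variant is an assumption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cca(x, w_p):
    """Compact channel attention on a (C, H, W) feature map.

    w_p is a (C, C) matrix standing in for the learned pointwise map W_p.
    """
    gap = x.mean(axis=(1, 2))            # GAP(X)_k: one scalar per channel
    gate = sigmoid(w_p @ gap)            # per-channel gate in (0, 1)
    return x * gate[:, None, None]       # channel-wise rescaling

def grn(x, eps=1e-6):
    """Parameter-free core of Global Response Normalization on a (C, H, W) map."""
    g = np.sqrt((x ** 2).sum(axis=(1, 2)))   # L2 energy of each channel
    n = g / (g.mean() + eps)                  # energy relative to the channel average
    return x * n[:, None, None]

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 16))
w_p = rng.standard_normal((8, 8)) * 0.1      # hypothetical learned weights
out_cca = cca(feat, w_p)
out_grn = grn(feat)
```

Each module costs one global reduction and one per-channel multiply, which is consistent with the "marginal overhead" described above.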

Ablation studies confirm that omitting channel-diversity modules significantly degrades FID and IS, with CCA/GRN critical for closing the performance gap with transformer baselines.

4. Empirical Performance, Efficiency, and Scalability

FCDMs demonstrate major advantages in hardware efficiency, convergence speed, and sample quality, as shown in head-to-head benchmarks with DiT and other transformer-based diffusion models:

| Model | Params (M) | GFLOPs | Throughput (it/s) | FID (256×256) | FID (512×512) | Notes |
|---|---|---|---|---|---|---|
| DiT-XL/2 | ~675–784 | 119 | 76.9 | 19.5 | 20.9 | Transformer baseline |
| DiCo-XL | 701 | 87.3 | 208.5 | 2.05 | 2.53 | FCDM + CCA |
| FCDM-XL | 699 | 65 | 208.5 | 10.7 | 10.2 | ConvNeXt backbone |
| DiC-XL | 116.1 | 57.2* | 84.2 | 3.89 | 15.32 | 3×3 conv, Winograd* |

  • Training cost: FCDM achieves comparable or lower FID in roughly 7× fewer steps and at 50% of the FLOPs relative to DiT (Kwon et al., 10 Mar 2026).
  • Throughput: FCDM-XL runs at 208.5 it/s (256px) versus 76.9 it/s for DiT, a 2.7× gain (Ai et al., 16 May 2025).
  • Sample quality: DiCo-H achieves FID 1.90 on ImageNet 256px (1B parameters), surpassing all attention-based comparators (Ai et al., 16 May 2025).
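The headline ratios can be recomputed from the figures in the benchmark table. The snippet below only copies those values; it does not reproduce any measurement.

```python
# Values copied from the benchmark table (256px setting).
dit_throughput, conv_throughput = 76.9, 208.5   # it/s
dit_gflops, fcdm_gflops = 119.0, 65.0

speedup = conv_throughput / dit_throughput      # about 2.71
flops_ratio = fcdm_gflops / dit_gflops          # about 0.55

print(f"throughput gain: {speedup:.2f}x, FLOPs ratio: {flops_ratio:.2f}")
# prints "throughput gain: 2.71x, FLOPs ratio: 0.55"
```

The tabulated GFLOPs give a ratio of about 0.55, slightly above the rounded 50% quoted in the text.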

These characteristics are consistent across multiple scales and training regimes, with channel-diversity modules and U-shaped topologies essential for optimal performance.

5. Theoretical Analysis: Locality, Equivariance, and Patch Mosaic Creativity

FCDMs' fully convolutional nature imposes strong inductive biases of locality and translation equivariance. Analytic theory (Kamb et al., 2024) explains this as follows:

  • The score predictor $M_t[\phi]$ at pixel $x$ depends only on a local $P\times P$ patch $\Omega_x$, enabling the formulation of Local Score (LS) and Equivariant Local Score (ELS) "machines" as closed-form, Bayes-optimal approximations to the empirical score matching solution, but restricted to local and translation-invariant operations.
  • Time-dependent patch size $P(t)$: The effective receptive field of convolutional nets shrinks over the course of reverse diffusion; $P(t)$ is calibrated such that ELS precisely predicts the outputs of a trained FCDM. Measured $r^2$ between analytic ELS and learned FCDM outputs reaches $0.90$–$0.95$ on standard datasets.
  • Patch mosaic mechanism: At each pixel, the final generated value is the center pixel of some $P\times P$ patch from the training set, selected globally but constrained by the patch matching, which enables combinatorial generation of novel images by local patch recombination.
  • Role of self-attention: When self-attention layers are added, they impose long-range consistency across this patch mosaic, carving out semantically coherent objects/focus from otherwise locally consistent images.
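The local, equivariant denoiser described in this list can be illustrated in one dimension. The sketch below is a simplified reading of the ELS idea, not the construction from Kamb et al.: it assumes circular 1D signals, a fixed odd patch size `P`, and a patch bank pooled over every location of every training signal; all names are hypothetical.

```python
import numpy as np

def extract_patches(signal, P):
    """All length-P circular patches of a 1D signal, one centred on each position."""
    half = P // 2
    n = len(signal)
    idx = (np.arange(n)[:, None] + np.arange(-half, half + 1)[None, :]) % n
    return signal[idx]                                   # shape (n, P)

def els_denoise(x_t, train, P, abar):
    """Posterior-mean estimate of x_0 at every pixel using only local patch evidence."""
    bank = np.concatenate([extract_patches(s, P) for s in train])  # patches from all locations
    centres = bank[:, P // 2]
    noisy = extract_patches(x_t, P)
    # Squared distance between each noisy patch and each rescaled training patch.
    d2 = ((noisy[:, None, :] - np.sqrt(abar) * bank[None, :, :]) ** 2).sum(-1)
    logw = -d2 / (2.0 * (1.0 - abar))
    logw -= logw.max(axis=1, keepdims=True)              # stabilise the softmax
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)
    return w @ centres                                   # weighted mix of centre pixels

rng = np.random.default_rng(0)
a, b = rng.standard_normal(16), rng.standard_normal(16)
abar = 0.9999                                            # nearly noise-free regime
x0_hat = els_denoise(np.sqrt(abar) * a, [a, b], P=3, abar=abar)
```

At low noise the softmax weights collapse onto the single best-matching patch, so each output pixel becomes the centre value of one training patch; this is the patch mosaic behaviour in miniature.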

A plausible implication is that the flexibility and sample uniqueness of FCDMs derive from this combinatorial patch mosaic, rather than global feature integration.

6. Application Domains and Specialized Variants

Besides large-scale image generation, FCDM approaches have been adapted for specialized modalities:

  • Physics-guided FCDMs for Sinogram Inpainting (E et al., 2024):
    • Employ bidirectional frequency-domain convolutions (FFT–conv–iFFT blocks) to disentangle sinogram features, integrated with physics-informed losses including absorption consistency and frequency-domain matching.
    • Diffusion operates in latent space with frequency-adaptive noise schedules and Fourier-enhanced mask embeddings for angular multiplexing. These components deliver state-of-the-art masked inpainting (SSIM > 0.95, PSNR > 30 dB), with ablations confirming the critical impact of frequency convolution and physics constraints.
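An FFT–conv–iFFT block can be read as: transform to the frequency domain, mix channels with learned complex weights at each frequency, and transform back. The sketch below is one possible interpretation under that assumption; the weight layout and the name `spectral_conv` are hypothetical, and the operator used in the cited work may differ.

```python
import numpy as np

def spectral_conv(x, weights):
    """FFT -> per-frequency channel mixing -> inverse FFT on a (C_in, H, W) real array.

    weights is complex with shape (C_out, C_in, H, W // 2 + 1).
    """
    x_hat = np.fft.rfft2(x, axes=(-2, -1))                  # (C_in, H, W//2+1)
    y_hat = np.einsum("oihw,ihw->ohw", weights, x_hat)      # mix channels at each frequency
    return np.fft.irfft2(y_hat, s=x.shape[-2:], axes=(-2, -1))

C, H, W = 2, 8, 8
rng = np.random.default_rng(0)
sino = rng.standard_normal((C, H, W))                       # toy stand-in for a sinogram latent

# Identity weights: every frequency passes through unchanged.
identity = np.eye(C, dtype=complex)[:, :, None, None] * np.ones((1, 1, H, W // 2 + 1))
restored = spectral_conv(sino, identity)
```

Because a pointwise product in the frequency domain corresponds to a circular convolution in the spatial domain, such a block has a global receptive field while remaining a convolutional operator.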

7. Ablation Studies and Mechanistic Insights

Systematic ablations across FCDM designs (Ai et al., 16 May 2025, Kwon et al., 10 Mar 2026, Tian et al., 2024) reveal the following mechanisms to be decisive for sample quality and performance:

  • Architecture shape: U-shaped encoder–decoder hierarchies with skip connections outperform isotropic and shallow alternatives.
  • Kernel size: Larger depthwise convolutions (5×5, 7×7) incrementally improve FID and IS, with modest compute cost.
  • Skip connection density: Sparse, stage-wise skips preserve U-Net signal flow while minimizing memory; dense block-wise skips are inefficient and redundant.
  • Conditioning: Stage-specific timestep embeddings, mid-block conditional injection, AdaLN or GroupNorm gating all confer significant benefits over naïve global or pre-block conditioning.
  • Channel attention: CCA/GRN-type modules are essential to mitigate channel inactivity, maximizing representational flexibility.
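The AdaLN-style conditioning named in this list can be sketched as a shift, scale, and gate regressed from the conditioning vector. The form below assumes the adaLN-Zero pattern popularized by DiT; how each FCDM variant places and parameterizes the modulation is not specified here, so `adaln_block`, `w_mod`, and `body` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize a (C, H, W) feature map over channels at each spatial position."""
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln_block(x, cond, w_mod, body):
    """Compute x + gate * body(LN(x) * (1 + scale) + shift).

    w_mod has shape (3 * C, D) and maps the conditioning vector (timestep and
    class embedding, length D) to per-channel shift, scale, and gate.
    """
    shift, scale, gate = np.split(w_mod @ cond, 3)
    h = layer_norm(x) * (1.0 + scale[:, None, None]) + shift[:, None, None]
    return x + gate[:, None, None] * body(h)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
cond = rng.standard_normal(32)
body = np.tanh                                   # placeholder for the convolutional block body

w_zero = np.zeros((24, 32))                      # zero init: gate = 0, block is the identity
w_rand = rng.standard_normal((24, 32)) * 0.05
y_init = adaln_block(x, cond, w_zero, body)
y_cond = adaln_block(x, cond, w_rand, body)
```

With zero-initialized modulation weights the block starts as the identity, which is one reason this style of gating tends to train stably.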

In aggregate, these results challenge the assumption that self-attention is strictly needed for high-fidelity, globally coherent diffusion sampling. Instead, with rigorous convolutional architectural and conditioning design, FCDMs efficiently scale to competitive or superior performance on major benchmarks.

