Fully Convolutional Diffusion Model

Updated 13 March 2026

FCDM is a deep generative model that uses fully convolutional operators to implement diffusion processes without self-attention or Transformer modules.
Its architecture employs U-shaped and hierarchical encoder-decoder designs with innovations like channel attention (CCA/GRN) for enhanced performance and efficiency.
FCDMs offer hardware-efficient training and scalable performance by combining probabilistic diffusion frameworks with inductive biases such as locality, translation equivariance, and patch mosaic mechanisms.

A Fully Convolutional Diffusion Model (FCDM) is a deep generative model that realizes the denoising diffusion probabilistic model (DDPM) or a related framework with a network backbone built entirely from convolutional (ConvNet) operators, with no self-attention or Transformer modules anywhere in the architecture. Using U-shaped (encoder-decoder) or hierarchical ConvNet designs together with architectural and conditioning innovations, FCDMs reach sample quality and efficiency that rival or exceed Transformer-based diffusion models on large-scale image generation and other tasks. Their inductive biases, locality and translation equivariance, give them distinctive generative and efficiency properties and make them a candidate for scalable, hardware-efficient generative modeling.

1. Mathematical Foundations and Modeling Framework

FCDMs instantiate the DDPM framework, parameterizing the forward and reverse stochastic processes with purely convolutional neural networks. The forward (noising) chain is given by

$q(x_{1:T}|x_0)=\prod_{t=1}^T q(x_t|x_{t-1}), \quad q(x_t|x_0)=\mathcal{N}\left(x_t;\sqrt{\bar\alpha_t}x_0,(1-\bar\alpha_t)I\right),$

with $\bar\alpha_t=\prod_{s=1}^t(1-\beta_s)$ and $\beta_t$ the variance schedule.
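The closed-form marginal $q(x_t|x_0)$ above can be sampled directly, without iterating the chain. The following is a minimal NumPy sketch; the linear schedule endpoints, the number of steps `T`, and the function names `make_alpha_bar` and `q_sample` are illustrative assumptions, not values taken from the cited papers.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Linear variance schedule (an assumed choice) and its cumulative product
    alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    betas = np.linspace(beta_start, beta_end, T)
    return betas, np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in closed form; t is a 0-based index into alpha_bar."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
betas, alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((4, 32, 32))  # stand-in for a 4-channel 32x32 latent
x_t, eps = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

Because $\bar\alpha_t$ decreases monotonically toward zero, late timesteps are dominated by the noise term and early ones by the data term.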

The time-indexed noise-prediction (score) network $\epsilon_\theta(x_t, t)$ is implemented as a translation-equivariant ConvNet and trained with a denoising score matching or DDPM objective such as

$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon_\theta(x_t, t) - \epsilon\|_2^2\right],$

with $x_t = \sqrt{\bar\alpha_t} x_0 + \sqrt{1-\bar\alpha_t}\epsilon$ and $\epsilon \sim \mathcal{N}(0,I)$. Reverse-time inference is parameterized by the ConvNet noise predictor via the standard $\mu_\theta$ reparameterization.
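As a rough illustration of the objective and the reverse-step mean, the sketch below evaluates $\mathcal{L}_{\text{simple}}$ for one sample and computes the standard DDPM mean $\mu_\theta = \big(x_t - \tfrac{\beta_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta\big)/\sqrt{1-\beta_t}$. The function `eps_model` is a hypothetical placeholder for the convolutional noise predictor, and the linear schedule is an assumption.

```python
import numpy as np

def eps_model(x_t, t):
    """Hypothetical stand-in for the ConvNet noise predictor eps_theta(x_t, t)."""
    return np.zeros_like(x_t)

def simple_loss(x0, t, alpha_bar, rng, model=eps_model):
    """One-sample estimate of L_simple, averaged over elements
    (proportional to the squared L2 norm in the text)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((model(x_t, t) - eps) ** 2)

def reverse_mean(x_t, t, betas, alpha_bar, model=eps_model):
    """Mean of the reverse transition p_theta(x_{t-1} | x_t) in the eps-parameterization."""
    eps_hat = model(x_t, t)
    return (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(1.0 - betas[t])

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 2e-2, 1000)      # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
x0 = rng.standard_normal((4, 8, 8))
loss = simple_loss(x0, t=10, alpha_bar=alpha_bar, rng=rng)
mu = reverse_mean(x0, t=10, betas=betas, alpha_bar=alpha_bar)
```

With the zero-output placeholder the loss is close to the variance of $\epsilon$, which is the baseline a trained predictor has to beat.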

FCDMs maintain translation equivariance and a local receptive field throughout, which is what makes an analytic characterization of their generative behavior tractable (Ai et al., 16 May 2025, Kamb et al., 2024).
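Translation equivariance can be checked numerically: shifting the input and then convolving gives the same result as convolving and then shifting. The sketch below assumes periodic (circular) boundaries, for which the property holds exactly; with zero padding it holds only away from the image border. The helper name `circular_conv2d` is illustrative.

```python
import numpy as np

def circular_conv2d(x, k):
    """Cross-correlation of a single-channel image x with an odd-sized kernel k,
    using periodic boundary conditions."""
    r = k.shape[0] // 2
    out = np.zeros_like(x)
    for di in range(-r, r + 1):
        for dj in range(-r, r + 1):
            # np.roll by (-di, -dj) places x[i + di, j + dj] at position (i, j)
            out += k[di + r, dj + r] * np.roll(x, shift=(-di, -dj), axis=(0, 1))
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
k = rng.standard_normal((3, 3))
shift = (3, 5)

shift_then_conv = circular_conv2d(np.roll(x, shift, axis=(0, 1)), k)
conv_then_shift = np.roll(circular_conv2d(x, k), shift, axis=(0, 1))
equivariant = np.allclose(shift_then_conv, conv_then_shift)
```

Each output pixel here depends only on a 3×3 neighborhood, which is the locality property referred to above.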

2. Network Architectures: Hierarchy, Conditioning, and Convolutional Blocks

FCDM architectures span several ConvNet baselines, unified by their avoidance of attention modules and systematic use of spatial and channel mixing via convolution and normalization.

DiCo (Ai et al., 16 May 2025): Utilizes a U-shaped (encoder–bottleneck–decoder) hierarchy with pixel-unshuffle/shuffle for down/upsampling, and skip connections. Core "DiCo blocks" combine $1\times1$ pointwise convolution, $3\times3$ depthwise convolution, GELU nonlinearity, and a compact channel attention (CCA) module. Layer normalization and a final convolutional head produce the denoising estimates.
ConvNeXt-inspired FCDM (Kwon et al., 10 Mar 2026): Adopts large $7\times7$ depthwise convolutions, an inverted bottleneck (pointwise channel expansion and projection), Global Response Normalization (GRN), and Adaptive LayerNorm (AdaLN) for block-wise conditioning. The network is organized as a symmetric 5-stage U-Net, with channel widths and block counts governed by a scaling law.
DiC (Tian et al., 2024): Employs an encoder–decoder hourglass U-Net built with only $3\times3$ convolutions, GroupNorm and GELU, sparse skip connections (one per stage), and stage-specific, mid-block timestep and conditional injection via AdaLN gating. Efficient block conditioning and skip topologies are crucial for scaling.
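The AdaLN conditioning mentioned for two of these backbones can be sketched as follows. This follows the common shift/scale/gate pattern (as in adaLN-Zero); the exact placement and parameterization in the cited models are not specified here, so the shapes, the names `adaln_block`, `W`, `b`, and the `inner` callable are all assumptions.

```python
import numpy as np

def layer_norm_channels(x, eps=1e-6):
    """Normalize each spatial position over the channel axis; x has shape (C, H, W)."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaln_block(x, cond, W, b, inner):
    """Residual block modulated by a conditioning vector.
    cond: (D,) timestep/class embedding; W: (3C, D) and b: (3C,) regress
    per-channel shift, scale and gate; inner: the convolutional sub-block."""
    shift, scale, gate = np.split(W @ cond + b, 3)
    h = layer_norm_channels(x) * (1.0 + scale[:, None, None]) + shift[:, None, None]
    return x + gate[:, None, None] * inner(h)

rng = np.random.default_rng(0)
C, D = 8, 16
x = rng.standard_normal((C, 4, 4))
cond = rng.standard_normal(D)
# Zero-initialized modulation makes the block an identity map at the start of training.
y = adaln_block(x, cond, W=np.zeros((3 * C, D)), b=np.zeros(3 * C), inner=lambda h: 2.0 * h)
```

Here `inner` would be a depthwise/pointwise convolution stack in a real model; a scalar multiply is used only to keep the example self-contained.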

All of these FCDM variants operate directly in a VAE latent space (e.g., $32 \times 32 \times 4$ for 256px ImageNet), enabling compute- and memory-efficient training with high-fidelity outputs.
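To make the pixel-unshuffle/shuffle resampling concrete, the sketch below applies it to an array with the latent shape quoted above (stored channels-first). The channel ordering mirrors the convention used by common deep learning libraries; the function names are illustrative, and a real model would follow each step with a convolution to adjust channel width.

```python
import numpy as np

def pixel_unshuffle(x, r):
    """Downsample (C, H*r, W*r) -> (C*r*r, H, W) by moving r x r spatial blocks into channels."""
    C, H, W = x.shape
    x = x.reshape(C, H // r, r, W // r, r)
    x = x.transpose(0, 2, 4, 1, 3)
    return x.reshape(C * r * r, H // r, W // r)

def pixel_shuffle(x, r):
    """Upsample (C*r*r, H, W) -> (C, H*r, W*r); exact inverse of pixel_unshuffle."""
    C, H, W = x.shape
    c = C // (r * r)
    x = x.reshape(c, r, r, H, W)
    x = x.transpose(0, 3, 1, 4, 2)
    return x.reshape(c, H * r, W * r)

rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 32, 32))
down = pixel_unshuffle(latent, 2)   # shape (16, 16, 16)
up = pixel_shuffle(down, 2)         # shape (4, 32, 32)
```

Both operations are lossless rearrangements, so no information is discarded at resolution changes, unlike strided pooling.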

3. Channel Attention and Diversity Mechanisms

Efficient ConvNet backbones often suffer from channel redundancy relative to attention-based models. To address this, FCDMs incorporate lightweight modules to enhance channel utilization:

Compact Channel Attention (CCA) (Ai et al., 16 May 2025):

$\mathrm{CCA}(X) = X \odot \sigma\big(W_p\,\mathrm{GAP}(X)\big), \quad \mathrm{GAP}(X)_k = \frac1{HW}\sum_{i,j} X_{i,j,k}$

Here, $W_p$ is a learned pointwise map, $\sigma$ is the sigmoid, and $\odot$ denotes channel-wise multiplication. CCA drives dynamic, content-dependent channel activation with marginal computational overhead.
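The CCA formula translates almost line for line into array code. The sketch below assumes a channels-first layout `(C, H, W)` and treats `W_p` as a plain `C × C` matrix with no bias, matching the equation as written; the random initialization is for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cca(x, W_p):
    """Compact channel attention: gate each channel by a sigmoid of a
    pointwise map applied to the global-average-pooled features."""
    gap = x.mean(axis=(1, 2))           # GAP(X), shape (C,)
    gate = sigmoid(W_p @ gap)           # per-channel weights in (0, 1)
    return x * gate[:, None, None]      # channel-wise multiplication

rng = np.random.default_rng(0)
C = 8
x = rng.standard_normal((C, 16, 16))
W_p = rng.standard_normal((C, C)) * 0.1
y = cca(x, W_p)
```

The cost is one `C × C` matrix-vector product per feature map, independent of spatial resolution, which is why the overhead is marginal.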

Global Response Normalization (GRN) (Kwon et al., 10 Mar 2026): Distributes activation energy across channels, increasing diversity. GRN is parameter-free and improves channel usage versus naive ConvNet stacking.
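For reference, a sketch of GRN in the form introduced with ConvNeXt V2 is given below. The normalization itself has no learned weights; that formulation wraps it in a per-channel scale `gamma` and bias `beta` plus a residual connection. Whether the cited FCDM keeps these affine terms is not stated here, so they are passed explicitly and should be read as an assumption.

```python
import numpy as np

def grn(x, gamma, beta, eps=1e-6):
    """Global Response Normalization on a (C, H, W) feature map."""
    # Global aggregation: L2 norm of each channel over its spatial positions.
    g = np.sqrt((x ** 2).sum(axis=(1, 2), keepdims=True))   # (C, 1, 1)
    # Divisive normalization: each channel's energy relative to the channel mean.
    n = g / (g.mean(axis=0, keepdims=True) + eps)
    # Calibrate features, with optional affine terms and a residual path.
    return gamma[:, None, None] * (x * n) + beta[:, None, None] + x

rng = np.random.default_rng(0)
C = 8
x = rng.standard_normal((C, 4, 4))
y = grn(x, gamma=np.ones(C), beta=np.zeros(C))
```

Channels with above-average energy are amplified and weak ones suppressed relative to each other, which encourages channels to specialize rather than duplicate one another.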

Ablation studies confirm that omitting channel-diversity modules significantly degrades FID and IS, with CCA/GRN critical for closing the performance gap with transformer baselines.

4. Empirical Performance, Efficiency, and Scalability

FCDMs demonstrate major advantages in hardware efficiency, convergence speed, and sample quality, as shown in head-to-head benchmarks with DiT and other transformer-based diffusion models:

Model	Params (M)	GFLOPs	Throughput (it/s)	FID (256²)	FID (512²)	Notes
DiT-XL/2	~675–784	119	76.9	19.5	20.9	Transformer baseline
DiCo-XL	701	87.3	208.5	2.05	2.53	FCDM + CCA
FCDM-XL	699	65	208.5	10.7	10.2	ConvNeXt backbone
DiC-XL	116.1	57.2*	84.2	3.89	15.32	3x3 Conv, Winograd*

Training cost: FCDM achieves comparable or lower FID in roughly 7× fewer steps and at about 50% of the FLOPs relative to DiT (Kwon et al., 10 Mar 2026).
Throughput: FCDM-XL runs at 208.5 it/s (256px) vs. 76.9 it/s for DiT, a 2.7× gain (Ai et al., 16 May 2025).
Sample quality: DiCo-H achieves FID 1.90 on ImageNet 256px (1B parameters), surpassing all attention-based comparators (Ai et al., 16 May 2025).

These characteristics are consistent across multiple scales and training regimes, with channel-diversity modules and U-shaped topologies essential for optimal performance.

5. Theoretical Analysis: Locality, Equivariance, and Patch Mosaic Creativity

FCDMs' fully convolutional nature imposes strong inductive biases of locality and translation equivariance. Analytic theory (Kamb et al., 2024) explains this as follows:

The score predictor $M_t[\phi]$ at pixel $x$ depends only on a local $P\times P$ patch $\Omega_x$. This allows Local Score (LS) and Equivariant Local Score (ELS) “machines” to be formulated as closed-form, Bayes-optimal approximations to the empirical score matching solution under the restriction to local and translation-equivariant operations.
Time-dependent patch size $P(t)$: The effective receptive field of convolutional nets shrinks over the course of reverse diffusion; $P(t)$ is calibrated such that ELS closely predicts the outputs of a trained FCDM. Measured $r^2$ between analytic ELS and learned FCDM outputs reaches $0.90$–$0.95$ on standard datasets.
Patch mosaic mechanism: At each pixel, the final generated value is the center pixel of some $P\times P$ patch from the training set, drawn from anywhere in the training data but required to match the local context around that pixel. Novel images thus arise combinatorially, by recombining local patches.
Role of self-attention: When self-attention layers are added, they impose long-range consistency across this patch mosaic, carving out semantically coherent objects/focus from otherwise locally consistent images.
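A toy version of the patch-based estimate helps make the mechanism concrete. The sketch below is written in the spirit of the ELS machine, not as a reproduction of the paper's formulation: for every pixel it compares the surrounding noisy patch against all training patches at all locations and returns a posterior-weighted average of their center pixels. Periodic boundaries, a fixed patch size `P`, single-channel images, and the function names are simplifying assumptions.

```python
import numpy as np

def extract_patches(img, P):
    """All P x P patches of a (H, W) image with periodic boundaries,
    one centered on each pixel. Returns an array of shape (H*W, P*P)."""
    r = P // 2
    shifted = [np.roll(img, (-di, -dj), axis=(0, 1))
               for di in range(-r, r + 1) for dj in range(-r, r + 1)]
    return np.stack(shifted, axis=-1).reshape(-1, P * P)

def els_denoise(x_t, train_imgs, P, alpha_bar_t):
    """Local, translation-equivariant estimate of x_0 at every pixel of x_t."""
    H, W = x_t.shape
    query = extract_patches(x_t, P)                                        # (H*W, P*P)
    bank = np.concatenate([extract_patches(im, P) for im in train_imgs])   # (N*H*W, P*P)
    centers = bank[:, (P * P) // 2]                                        # center pixel of each patch
    # Gaussian log-likelihood of each noisy patch under each scaled training patch.
    d2 = ((query[:, None, :] - np.sqrt(alpha_bar_t) * bank[None, :, :]) ** 2).sum(-1)
    logw = -d2 / (2.0 * (1.0 - alpha_bar_t))
    logw -= logw.max(axis=1, keepdims=True)                                # numerical stability
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)
    return (w @ centers).reshape(H, W)

rng = np.random.default_rng(0)
train_imgs = rng.standard_normal((3, 8, 8))
alpha_bar_t = 0.999                                  # low-noise regime
x_t = np.sqrt(alpha_bar_t) * train_imgs[0]
x0_hat = els_denoise(x_t, train_imgs, P=3, alpha_bar_t=alpha_bar_t)
```

At low noise the weights collapse onto a single best-matching patch per pixel, so each output pixel is copied from one training patch center. Different pixels can draw from different training images, which is the patch mosaic described above.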

A plausible implication is that the flexibility and sample uniqueness of FCDMs derive from this combinatorial patch mosaic, rather than global feature integration.

6. Application Domains and Specialized Variants

Besides large-scale image generation, FCDM approaches have been adapted for specialized modalities:

Physics-guided FCDMs for Sinogram Inpainting (E et al., 2024):
- Employ bidirectional frequency-domain convolutions (FFT–conv–iFFT blocks) to disentangle sinogram features, integrated with physics-informed losses including absorption consistency and frequency-domain matching.
- Diffusion operates in latent space with frequency-adaptive noise schedules and Fourier-enhanced mask embeddings for angular multiplexing. These components deliver state-of-the-art masked inpainting (SSIM > 0.95, PSNR > 30 dB), with ablations confirming the critical impact of frequency convolution and physics constraints.
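The FFT–conv–iFFT pattern can be sketched generically as: transform to the frequency domain, mix channels with a convolution over the stacked real and imaginary parts, and transform back. The version below uses a $1\times1$ convolution (a channel-mixing matrix) and omits the bidirectional and physics-informed components; the layout, the name `fft_conv_ifft`, and the weight shape are assumptions rather than the paper's design.

```python
import numpy as np

def fft_conv_ifft(x, W_mix):
    """x: real (C, H, W) feature map; W_mix: (2C, 2C) weights of a 1x1 convolution
    applied to the concatenated real and imaginary spectra."""
    C, H, Wd = x.shape
    spec = np.fft.rfft2(x, axes=(-2, -1))                       # (C, H, Wd//2 + 1), complex
    feats = np.concatenate([spec.real, spec.imag], axis=0)      # (2C, H, Wd//2 + 1), real
    mixed = np.einsum('oc,chw->ohw', W_mix, feats)              # 1x1 conv = per-frequency channel mixing
    out_spec = mixed[:C] + 1j * mixed[C:]
    return np.fft.irfft2(out_spec, s=(H, Wd), axes=(-2, -1))    # back to the spatial domain

rng = np.random.default_rng(0)
C = 4
x = rng.standard_normal((C, 16, 16))
y = fft_conv_ifft(x, W_mix=np.eye(2 * C))   # identity weights reproduce the input
```

Because every frequency coefficient depends on the whole input, even a $1\times1$ operation in this domain has a global spatial footprint, which suits sinograms where structure spans all projection angles.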

7. Ablation Studies and Mechanistic Insights

Systematic ablations across FCDM designs (Ai et al., 16 May 2025, Kwon et al., 10 Mar 2026, Tian et al., 2024) reveal the following mechanisms to be decisive for sample quality and performance:

Architecture shape: U-shaped encoder–decoder hierarchies with skip connections outperform isotropic and shallow alternatives.
Kernel size: Larger depthwise convolutions (5×5, 7×7) incrementally improve FID and IS, with modest compute cost.
Skip connection density: Sparse, stage-wise skips preserve U-Net signal flow while minimizing memory; dense block-wise skips are inefficient and redundant.
Conditioning: Stage-specific timestep embeddings, mid-block conditional injection, AdaLN or GroupNorm gating all confer significant benefits over naïve global or pre-block conditioning.
Channel attention: CCA/GRN-type modules are essential to mitigate channel inactivity, maximizing representational flexibility.
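The claim that larger depthwise kernels add only modest cost can be checked with simple counting. The sketch below compares parameter and multiply-accumulate (MAC) counts of depthwise convolutions against a pointwise expansion layer; the channel width of 256, the 4× expansion ratio, and the 32×32 resolution are hypothetical values chosen for illustration, and biases are ignored.

```python
def depthwise_cost(C, k, H, W):
    """Parameters and MACs of a k x k depthwise convolution on a (C, H, W) map."""
    params = C * k * k
    return params, params * H * W

def pointwise_cost(C_in, C_out, H, W):
    """Parameters and MACs of a 1x1 convolution from C_in to C_out channels."""
    params = C_in * C_out
    return params, params * H * W

C, H, W = 256, 32, 32                          # hypothetical stage width and resolution
dw = {k: depthwise_cost(C, k, H, W)[0] for k in (3, 5, 7)}
# dw == {3: 2304, 5: 6400, 7: 12544} parameters
pw = pointwise_cost(C, 4 * C, H, W)[0]         # 262144 parameters for a 4x expansion
share_7x7 = dw[7] / pw                         # about 0.048
```

Under these assumptions, moving from 3×3 to 7×7 multiplies the depthwise cost by roughly 5.4, yet the 7×7 layer still costs under 5% of a single pointwise expansion, so the block total barely changes.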

In aggregate, these results challenge the assumption that self-attention is strictly needed for high-fidelity, globally coherent diffusion sampling. Instead, with rigorous convolutional architectural and conditioning design, FCDMs efficiently scale to competitive or superior performance on major benchmarks.

References:

DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling (Ai et al., 16 May 2025)
An analytic theory of creativity in convolutional diffusion models (Kamb et al., 2024)
Reviving ConvNeXt for Efficient Convolutional Diffusion Models (Kwon et al., 10 Mar 2026)
DiC: Rethinking Conv3x3 Designs in Diffusion Models (Tian et al., 2024)
FCDM: A Physics-Guided Bidirectional Frequency Aware Convolution and Diffusion-Based Model for Sinogram Inpainting (E et al., 2024)