Can We Achieve Efficient Diffusion without Self-Attention? Distilling Self-Attention into Convolutions
(2504.21292v1)
Published 30 Apr 2025 in cs.CV
Abstract: Contemporary diffusion models built upon U-Net or Diffusion Transformer (DiT) architectures have revolutionized image generation through transformer-based attention mechanisms. The prevailing paradigm has commonly employed self-attention with quadratic computational complexity to handle global spatial relationships in complex images, thereby synthesizing high-fidelity images with coherent visual semantics. Contrary to conventional wisdom, our systematic layer-wise analysis reveals an interesting discrepancy: self-attention in pre-trained diffusion models predominantly exhibits localized attention patterns, closely resembling convolutional inductive biases. This suggests that global interactions in self-attention may be less critical than commonly assumed. Driven by this, we propose ΔConvFusion to replace conventional self-attention modules with Pyramid Convolution Blocks (ΔConvBlocks). By distilling attention patterns into localized convolutional operations while keeping other components frozen, ΔConvFusion achieves performance comparable to transformer-based counterparts while reducing computational cost by 6929× and surpassing LinFusion by 5.42× in efficiency, all without compromising generative fidelity.
This paper, "Can We Achieve Efficient Diffusion without Self-Attention? Distilling Self-Attention into Convolutions" (Dong et al., 30 Apr 2025), investigates the role of self-attention in modern diffusion models and proposes a more efficient convolutional alternative. The core motivation stems from the observation that self-attention, a key component in U-Net and Diffusion Transformer (DiT) architectures, exhibits quadratic computational complexity with respect to image resolution. This becomes a major bottleneck for generating high-resolution images.
Contrary to the common assumption that self-attention is crucial for capturing global relationships, the authors' analysis reveals that self-attention in pre-trained diffusion models primarily exhibits localized attention patterns. They demonstrate this through visualizations of attention maps and quantitative analysis, showing that attention strength decays rapidly with distance from the query pixel (a high-frequency, distance-dependent component) while also containing a broad, spatially invariant bias (a low-frequency component). Effective Receptive Field (ERF) analysis further supports this, showing that ERFs are relatively small (e.g., below 15x15 or 20x20 for most layers in PixArt and SD1.5). An ablation study replacing all self-attention layers with Neighborhood Attention (a localized mechanism) also maintained image quality, suggesting global interactions aren't strictly necessary in every block.
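As a rough illustration of this kind of locality analysis, the sketch below (not the authors' code; the attention-map shape and distance binning are assumptions) averages the weights of a pre-computed self-attention map by pixel distance from the query token. A profile that decays rapidly with distance corresponds to the localized pattern described above.

```python
import torch

def attention_decay_profile(attn: torch.Tensor, h: int, w: int, num_bins: int = 32):
    """Average attention weight as a function of pixel distance from the query.

    attn: pre-computed attention map of shape (heads, h*w, h*w) on an h x w latent grid.
    """
    # Coordinates of every token on the h x w grid.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (h*w, 2)
    # Pairwise Euclidean distances between query and key positions.
    dist = torch.cdist(coords, coords)                                  # (h*w, h*w)
    attn_mean = attn.mean(dim=0)                                        # average over heads
    # Bin attention weights by query-key distance and average within each bin.
    edges = torch.linspace(0.0, dist.max().item(), num_bins + 1)
    profile = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (dist >= lo) & (dist < hi)
        profile.append(attn_mean[mask].mean().item() if mask.any() else float("nan"))
    return profile  # rapid decay with distance indicates a localized attention pattern
```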
Based on these findings, the paper proposes ΔConvFusion, an architecture that replaces self-attention modules with a novel convolutional block called ΔConvBlock. The ΔConvBlock is specifically designed to mimic the two observed components of self-attention:
Pyramid Convolution: This branch captures the high-frequency, distance-dependent signal using a multi-scale convolutional structure. It processes the input feature map through multiple pyramid stages $\Delta_i$. Each stage involves downsampling (average pooling), a non-linear scaled simple gate (implemented with depth-wise convolutions to improve numerical stability), and upsampling (bilinear interpolation). By summing the outputs of stages with different downsampling factors $2^i$, the block achieves diverse receptive fields, allowing pixels closer to the center to accumulate higher weights, similar to the observed decay in self-attention.
Given an input feature map $\tilde{\mathbf{z}}^{l}_{t} \in \mathbb{R}^{H' \times W' \times C'}$, it is first passed through a $1\times1$ convolution $\psi_{\text{in}}$ to reduce channels: $\mathbf{z}^{l}_{t,\text{in}} = \psi_{\text{in}}(\tilde{\mathbf{z}}^{l}_{t})$. The pyramid convolution is then $\sum_{i=1}^{n} \uparrow_{2^i}\!\big(\rho(\downarrow_{2^i}(\mathbf{z}^{l}_{t,\text{in}}))\big)$, where $\downarrow_{2^i}$ is average pooling, $\uparrow_{2^i}$ is bilinear interpolation, and $\rho$ is the scaled simple gate $\rho(\mathbf{f}) = C' \cdot \mathbf{f}_{<C'/2} \odot \mathbf{f}_{\geq C'/2}$, i.e., an element-wise product of the first and second halves of the channels, scaled by $C'$.
Average Pooling Branch: This component captures the low-frequency, spatially invariant bias. It consists of a global average pooling layer followed by a $1\times1$ convolutional layer $\psi_p$. This operation provides a global context vector $\mathbf{f}^{\text{avg}}_{\text{out}} = \psi_p(\text{GlobalAvgPool}(\tilde{\mathbf{z}}^{l}_{t}))$ that is added to the output of the pyramid convolution branch.
The total output of the ΔConvBlock is the sum of the outputs from the pyramid convolution branch and the average pooling branch, potentially followed by an output 1×1 convolution.
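A minimal PyTorch sketch of a ΔConvBlock along these lines is shown below. It is an interpretation of the description above, not the authors' implementation: module names, the placement of the depth-wise convolutions inside the gate, channel widths, and the number of pyramid stages are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledSimpleGate(nn.Module):
    """Split channels in half, mix each half with a depth-wise conv, multiply, scale."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        half = channels // 2
        self.dw1 = nn.Conv2d(half, half, kernel_size, padding=kernel_size // 2, groups=half)
        self.dw2 = nn.Conv2d(half, half, kernel_size, padding=kernel_size // 2, groups=half)
        self.scale = channels  # the C' factor from the scaled simple gate above

    def forward(self, f):
        a, b = f.chunk(2, dim=1)
        return self.scale * self.dw1(a) * self.dw2(b)

class DeltaConvBlock(nn.Module):
    """Pyramid convolution branch plus global average-pooling branch, as described above."""
    def __init__(self, in_channels: int, hidden_channels: int, num_stages: int = 3):
        super().__init__()
        self.proj_in = nn.Conv2d(in_channels, hidden_channels, 1)         # psi_in
        self.gates = nn.ModuleList(
            ScaledSimpleGate(hidden_channels) for _ in range(num_stages)
        )
        self.pool_proj = nn.Conv2d(in_channels, hidden_channels // 2, 1)  # psi_p
        self.proj_out = nn.Conv2d(hidden_channels // 2, in_channels, 1)   # output 1x1 conv

    def forward(self, z):
        h, w = z.shape[-2:]
        z_in = self.proj_in(z)
        out = 0
        # Pyramid convolution: downsample by 2^i, apply the gate, upsample back, and sum.
        for i, gate in enumerate(self.gates, start=1):
            s = 2 ** i
            down = F.avg_pool2d(z_in, kernel_size=s)
            up = F.interpolate(gate(down), size=(h, w), mode="bilinear", align_corners=False)
            out = out + up
        # Average-pooling branch: a spatially constant global-context bias.
        out = out + self.pool_proj(F.adaptive_avg_pool2d(z, 1))
        return self.proj_out(out)
```

Under these assumptions, such a block operates on the same (B, C, H, W) latents as the self-attention layer it replaces, and summing stages with strides 2, 4, 8 gives the multi-scale receptive field described above.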
For practical implementation, the authors propose an efficient training strategy: replacing self-attention blocks in a pre-trained diffusion model with initialized ΔConvBlocks and only training the parameters of these new blocks, keeping the rest of the model frozen. Knowledge distillation is used to transfer the functionality of the original self-attention modules. This involves two loss terms:
Feature-level loss ($\mathcal{L}_f$): Minimizes the squared L2 distance between the outputs of the ΔConvBlock and the original self-attention module at each layer: $\mathcal{L}_f = \sum_{l=1}^{N} \lVert \Delta^{l}_{\theta}(\mathbf{z}^{l}_{t}) - \mathbf{z}^{l}_{t,\text{out}} \rVert^2$.
Output-level loss ($\mathcal{L}_z$): Uses the $\epsilon$-prediction objective of the diffusion model with Min-SNR weighting to accelerate convergence: $\mathcal{L}_z = \min\!\big(\gamma \cdot (\sigma^{z}_{t}/\sigma^{\epsilon}_{t})^2, 1\big) \cdot \big(\lVert \tilde{\epsilon} - \hat{\epsilon} \rVert^2 + \lVert \epsilon - \hat{\epsilon} \rVert^2\big)$.
The overall loss is $\mathcal{L} = \mathcal{L}_z + \beta \mathcal{L}_f$.
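A hedged sketch of how these losses could be combined is given below; the tensor names, the snr_weight argument (standing in for $(\sigma^{z}_{t}/\sigma^{\epsilon}_{t})^2$), the default $\gamma$, and the per-element averaging are illustrative assumptions rather than the paper's exact code.

```python
import torch

def feature_loss(delta_outs, attn_outs):
    """L_f: distance between each ΔConvBlock output and the frozen attention output it replaces."""
    return sum(
        torch.mean((d - a.detach()) ** 2)  # averaged per element here for scale stability
        for d, a in zip(delta_outs, attn_outs)
    )

def output_loss(eps_student, eps_teacher, eps_true, snr_weight: torch.Tensor, gamma: float = 5.0):
    """L_z: Min-SNR-weighted sum of a teacher-matching term and a noise-prediction term."""
    w = torch.clamp(gamma * snr_weight, max=1.0)
    return w * (
        torch.mean((eps_teacher - eps_student) ** 2)
        + torch.mean((eps_true - eps_student) ** 2)
    )

def total_loss(delta_outs, attn_outs, eps_student, eps_teacher, eps_true,
               snr_weight, beta: float = 1.0):
    """L = L_z + beta * L_f, optimized over the ΔConvBlock parameters only."""
    return (output_loss(eps_student, eps_teacher, eps_true, snr_weight)
            + beta * feature_loss(delta_outs, attn_outs))
```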
Experiments were conducted on SD1.5 (U-Net), SDXL (U-Net), and PixArt (DiT) architectures. For SD1.5, ΔConvBlocks were designed around an effective kernel size of K=9, while for SDXL and PixArt the effective receptive field estimation guiding the design used K=13. Training was done on synthetic (Midjourney) and curated real images, with evaluation on high-aesthetic LAION images using DINOv2 Score (DS), Fréchet DINOv2 Distance (FDD), and CLIP Score.
The results demonstrate significant computational efficiency improvements:
FLOPs: ΔConvFusion achieves drastically lower FLOPs compared to self-attention across all tested resolutions, reaching up to a 6929× reduction at 16K for SD1.5. Compared to LinFusion (a state-of-the-art efficient method), ΔConvFusion shows lower FLOPs at common resolutions (512x512, 1024x1024) and maintains competitive efficiency at higher resolutions while offering linear scalability.
Inference Latency: ΔConvFusion shows notable speedups (e.g., 3.4× over LinFusion at 1024x1024), outperforming methods that still rely on less memory-efficient global mechanisms, even with larger effective kernel sizes. This indicates better memory efficiency in addition to FLOPs reduction.
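As a back-of-envelope illustration of the scaling argument (my own rough arithmetic, not the paper's benchmark numbers), the snippet below compares the quadratic token-mixing cost of self-attention with the linear cost of a fixed-kernel convolution; the channel width and kernel size are placeholder values.

```python
def attn_flops(n_tokens: int, dim: int) -> float:
    # QK^T plus attention-weighted V: roughly 2 * N^2 * d multiply-adds.
    return 2.0 * n_tokens ** 2 * dim

def conv_flops(n_tokens: int, channels: int, kernel: int) -> float:
    # Depth-wise local mixing: roughly N * k^2 * C multiply-adds.
    return float(n_tokens) * kernel ** 2 * channels

for side in (64, 128, 256):  # latent grid sides, e.g. 512px, 1024px, 2048px images with an 8x VAE
    n = side * side
    ratio = attn_flops(n, 320) / conv_flops(n, 320, 9)
    print(f"{side}x{side} latent: attention/convolution FLOPs ratio ~ {ratio:,.0f}x")
```

The gap grows with resolution because the attention term scales with the square of the token count while the convolutional term scales linearly, which is the source of the large reductions reported above.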
Crucially, these efficiency gains do not come at the cost of performance:
Image Quality & Realism:ΔConvFusion achieves comparable or superior DS (quality/diversity) and FDD (realism) scores compared to original self-attention models and LinFusion, across different base models and resolutions.
Text-Image Alignment:ΔConvFusion shows superior CLIP scores, suggesting better semantic consistency with text prompts, implying the hierarchical features learned by ΔConvBlocks are effective for this task.
Cross-Resolution Generation:ΔConvFusion models trained at a lower resolution (e.g., 512x512) generalize much better to higher resolutions (e.g., 1024x1024) than their self-attention counterparts, which often produce fragmented results.
In summary, the paper presents a compelling case that the primary benefit of self-attention in diffusion models is its ability to model localized spatial interactions, which can be efficiently replicated by a structured convolutional design like the ΔConvBlock. By distilling the behavior of self-attention into convolutions, ΔConvFusion offers a practical path to deploying efficient diffusion models that achieve high generation quality and semantic coherence with substantially reduced computational cost and latency, especially beneficial for high-resolution synthesis. The approach is modular and can be applied to replace self-attention in existing U-Net and DiT-based diffusion models through efficient distillation training.