- The paper proposes ΔConvFusion, a novel architecture that distills self-attention into convolutional blocks to significantly reduce computational complexity in diffusion models.
- It employs a dual-branch design with pyramid convolution and average pooling, combined with feature- and output-level losses, to effectively mimic localized attention patterns.
- Experiments on SD1.5, SDXL, and PixArt demonstrate up to 6929× FLOPs reduction and improved inference latency while maintaining robust image quality and text-image alignment.
This paper, "Can We Achieve Efficient Diffusion without Self-Attention? Distilling Self-Attention into Convolutions" (2504.21292), investigates the role of self-attention in modern diffusion models and proposes a more efficient convolutional alternative. The core motivation stems from the observation that self-attention, a key component in U-Net and Diffusion Transformer (DiT) architectures, exhibits quadratic computational complexity with respect to image resolution. This becomes a major bottleneck for generating high-resolution images.
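To make the quadratic bottleneck concrete, the following back-of-the-envelope sketch counts multiply-accumulates (MACs) for one single-head self-attention layer and, for contrast, one depth-wise convolution. The cost formulas are the usual textbook approximations, and the function names, the channel width of 320, and the 3×3 kernel are illustrative assumptions rather than figures from the paper.

```python
def pairwise_macs(tokens, channels):
    """MACs for the QK^T score matrix plus the attention-weighted sum of V."""
    return 2 * tokens * tokens * channels


def attention_macs(height, width, channels):
    """Approximate MACs for one single-head self-attention layer."""
    tokens = height * width
    projections = 4 * tokens * channels ** 2  # Q, K, V and output projections
    return projections + pairwise_macs(tokens, channels)


def depthwise_conv_macs(height, width, channels, kernel=3):
    """MACs for one depth-wise convolution with a kernel x kernel filter."""
    return height * width * channels * kernel * kernel


if __name__ == "__main__":
    channels = 320  # assumed feature width, for illustration only
    for side in (64, 128, 256):
        attn = attention_macs(side, side, channels)
        conv = depthwise_conv_macs(side, side, channels)
        print(f"{side}x{side} latent: attention {attn:.3e} MACs, depth-wise conv {conv:.3e} MACs")
```

Doubling the side length quadruples the token count, so the pairwise term grows by 16× while the convolution cost grows by only 4×.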
Contrary to the common assumption that self-attention is crucial for capturing global relationships, the authors' analysis reveals that self-attention in pre-trained diffusion models primarily exhibits localized attention patterns. They demonstrate this through visualizations of attention maps and quantitative analysis, showing that attention strength decays rapidly with distance from the query pixel (high-frequency, distance-dependent component) and also contains a broad, spatially invariant bias (low-frequency component). Effective Receptive Field (ERF) analysis further supports this, showing that ERFs are relatively small (e.g., below 15x15 or 20x20 for most layers in PixArt and SD1.5). An ablation study replacing all self-attention layers with Neighborhood Attention (a localized mechanism) also maintained image quality, suggesting global interactions aren't strictly necessary in every block.
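One way to picture the two components is to measure how much of a query's attention mass falls inside a small window around it. The sketch below uses a synthetic attention map, a Gaussian decay mixed with a uniform bias, purely as an assumed stand-in for real attention maps; `synthetic_attention`, `mass_within`, and the parameters `sigma` and `bias` are hypothetical names and values, not taken from the paper.

```python
import numpy as np


def synthetic_attention(h, w, qy, qx, sigma=2.0, bias=0.2):
    """Toy attention map for one query: local Gaussian decay plus a uniform bias."""
    ys, xs = np.mgrid[0:h, 0:w]
    dist2 = (ys - qy) ** 2 + (xs - qx) ** 2
    local = np.exp(-dist2 / (2.0 * sigma ** 2))
    local /= local.sum()                      # distance-dependent component
    uniform = np.full((h, w), 1.0 / (h * w))  # spatially invariant component
    return (1.0 - bias) * local + bias * uniform


def mass_within(attn, qy, qx, radius):
    """Fraction of attention mass inside a (2*radius+1)^2 window around the query."""
    h, w = attn.shape
    y0, y1 = max(qy - radius, 0), min(qy + radius + 1, h)
    x0, x1 = max(qx - radius, 0), min(qx + radius + 1, w)
    return float(attn[y0:y1, x0:x1].sum())


attn = synthetic_attention(32, 32, qy=16, qx=16)
for radius in (1, 3, 7):
    size = 2 * radius + 1
    print(f"{size}x{size} window holds {mass_within(attn, 16, 16, radius):.2f} of the mass")
```

With a real model, `attn` would instead be one row of a softmax attention matrix reshaped to the latent grid; the same windowed sum then indicates how local that layer is.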
Based on these findings, the paper proposes ΔConvFusion, an architecture that replaces self-attention modules with a novel convolutional block called ΔConvBlock. The ΔConvBlock is specifically designed to mimic the two observed components of self-attention:
- Pyramid Convolution: This branch captures the high-frequency, distance-dependent signal using a multi-scale convolutional structure. It processes the input feature map through multiple "pyramid stages" ($\Delta_i$). Each stage involves downsampling (average pooling), a non-linear scaled simple gate (implemented with depth-wise convolutions to improve numerical stability), and upsampling (bilinear interpolation). By summing the outputs of stages with different downsampling factors ($2^i$), the block achieves diverse receptive fields, allowing pixels closer to the center to accumulate higher weights, similar to the observed decay in self-attention.
Given an input feature map $\tilde{\mathbf{z}}^{l}_{t} \in \mathbb{R}^{H' \times W' \times C'}$, a $1 \times 1$ convolution $\psi_{\text{in}}$ first reduces the channels: $\mathbf{z}^{l}_{t,\text{in}} = \psi_{\text{in}}(\tilde{\mathbf{z}}^{l}_{t})$. The pyramid convolution is then $\sum_{i=1}^{n} \uparrow_{2^i}(\rho(\downarrow_{2^i}(\mathbf{z}^{l}_{t,\text{in}})))$, where $\downarrow_{2^i}$ is average pooling, $\uparrow_{2^i}$ is bilinear interpolation, and $\rho$ is the scaled simple gate.
- Average Pooling Branch: This component captures the low-frequency, spatially invariant bias. It consists of a global average pooling layer followed by a convolutional layer. This operation provides a global context vector that is added to the output of the pyramid convolution branch.
The total output of the ΔConvBlock is the sum of the outputs from the pyramid convolution branch and the average pooling branch, potentially followed by an output convolution.
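The two branches can be sketched in NumPy as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the gate is a stand-in (the product of two depth-wise 3×3 convolutions) because the exact scaled simple gate is not given here, the average pooling branch is assumed to read the channel-reduced features, the 1×1 convolutions are written as per-pixel matrix products, and all function and parameter names are hypothetical.

```python
import numpy as np


def avg_pool(x, factor):
    """Average-pool an (H, W, C) array; H and W must be divisible by factor."""
    h, w, c = x.shape
    return x.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))


def upsample_bilinear(x, factor):
    """Bilinear upsampling of an (H, W, C) array using half-pixel sample centers."""
    def axis_weights(n):
        pos = np.clip((np.arange(n * factor) + 0.5) / factor - 0.5, 0, n - 1)
        lo = np.floor(pos).astype(int)
        hi = np.minimum(lo + 1, n - 1)
        return lo, hi, pos - lo

    ylo, yhi, yf = axis_weights(x.shape[0])
    xlo, xhi, xf = axis_weights(x.shape[1])
    rows = x[ylo] * (1 - yf)[:, None, None] + x[yhi] * yf[:, None, None]
    return rows[:, xlo] * (1 - xf)[None, :, None] + rows[:, xhi] * xf[None, :, None]


def depthwise_conv3x3(x, kernel):
    """Depth-wise 3x3 convolution with zero padding; kernel has shape (3, 3, C)."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w] * kernel[dy, dx]
    return out


def gate(x, k_a, k_b):
    """Assumed stand-in for the scaled simple gate: product of two depth-wise convs."""
    return depthwise_conv3x3(x, k_a) * depthwise_conv3x3(x, k_b)


def delta_conv_block(z, w_in, w_avg, w_out, gate_kernels):
    z_in = z @ w_in                                # 1x1 conv reducing channels
    pyramid = np.zeros_like(z_in)
    for i, (k_a, k_b) in enumerate(gate_kernels, start=1):
        f = 2 ** i                                 # stage i downsamples by 2^i
        pyramid += upsample_bilinear(gate(avg_pool(z_in, f), k_a, k_b), f)
    context = z_in.mean(axis=(0, 1)) @ w_avg       # global average pooling branch
    return (pyramid + context) @ w_out             # sum of branches, then output conv


rng = np.random.default_rng(0)
z = rng.normal(size=(16, 16, 8))
kernels = [(rng.normal(size=(3, 3, 4)), rng.normal(size=(3, 3, 4))) for _ in range(2)]
out = delta_conv_block(z, rng.normal(size=(8, 4)), rng.normal(size=(4, 4)),
                       rng.normal(size=(4, 8)), kernels)
print(out.shape)
```

Every operation here costs time linear in the number of pixels, which is the property that distinguishes this kind of block from global self-attention.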
For practical implementation, the authors propose an efficient training strategy: replacing self-attention blocks in a pre-trained diffusion model with initialized ΔConvBlocks and only training the parameters of these new blocks, keeping the rest of the model frozen. Knowledge distillation is used to transfer the functionality of the original self-attention modules. This involves two loss terms:
- Feature-level loss: Minimizes the L2 distance between the outputs of the ΔConvBlock and the original self-attention module at each layer.
- Output-level loss: Uses the diffusion model's prediction objective, with Min-SNR loss weighting to accelerate convergence.
The overall loss combines the two terms.
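The two distillation terms can be sketched as below. This is a hedged illustration rather than the paper's exact formulation: it assumes a noise-prediction objective, uses the standard Min-SNR weight min(SNR, γ)/SNR with γ = 5 as an assumed default, and introduces a hypothetical balancing factor `beta`; the function names are likewise illustrative.

```python
import numpy as np


def feature_loss(student_feats, teacher_feats):
    """Mean squared L2 distance between student and teacher block outputs, per layer."""
    per_layer = [np.mean((s - t) ** 2) for s, t in zip(student_feats, teacher_feats)]
    return float(np.mean(per_layer))


def min_snr_weight(snr, gamma=5.0):
    """Min-SNR weight for a noise-prediction objective: min(SNR, gamma) / SNR."""
    return np.minimum(snr, gamma) / snr


def output_loss(pred_noise, true_noise, snr, gamma=5.0):
    """Min-SNR weighted noise-prediction loss over a batch (first axis is the batch)."""
    reduce_axes = tuple(range(1, pred_noise.ndim))
    per_sample = np.mean((pred_noise - true_noise) ** 2, axis=reduce_axes)
    return float(np.mean(min_snr_weight(snr, gamma) * per_sample))


def total_loss(student_feats, teacher_feats, pred_noise, true_noise, snr, beta=1.0):
    """Sum of both terms; beta is a hypothetical balancing factor."""
    return feature_loss(student_feats, teacher_feats) + beta * output_loss(
        pred_noise, true_noise, snr)
```

In a training loop only the parameters of the replacement blocks would receive gradients from this loss, since the rest of the model stays frozen.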
Experiments were conducted on SD1.5 (U-Net), SDXL (U-Net), and PixArt (DiT) architectures. The ΔConvBlock kernel size was chosen per model, guided by the effective receptive field estimates, with one setting for SD1.5 and another for SDXL and PixArt. Training used synthetic (Midjourney) and curated real images, and evaluation used high-aesthetic LAION images with DINOv2 Score (DS), Fréchet DINOv2 Distance (FDD), and CLIP Score.
The results demonstrate significant computational efficiency improvements:
- FLOPs: ΔConvFusion achieves drastically lower FLOPs compared to self-attention across all tested resolutions, reaching up to 6929× reduction at 16K for SD1.5. Compared to LinFusion (a state-of-the-art efficient method), ΔConvFusion shows lower FLOPs at common resolutions (512x512, 1024x1024) and maintains competitive efficiency at higher resolutions while offering linear scalability.
- Inference Latency: ΔConvFusion shows notable speedups (e.g., 3.4× over LinFusion at 1024x1024), outperforming methods that still rely on less memory-efficient global mechanisms, even with larger effective kernel sizes. This indicates better memory efficiency in addition to FLOPs reduction.
Crucially, these efficiency gains do not come at the cost of performance:
- Image Quality & Realism: ΔConvFusion achieves comparable or superior DS (quality/diversity) and FDD (realism) scores compared to original self-attention models and LinFusion, across different base models and resolutions.
- Text-Image Alignment: ΔConvFusion shows superior CLIP scores, suggesting better semantic consistency with text prompts and implying that the hierarchical features learned by ΔConvBlocks are effective for this task.
- Cross-Resolution Generation: ΔConvFusion models trained at a lower resolution (e.g., 512x512) generalize much better to higher resolutions (e.g., 1024x1024) than their self-attention counterparts, which often produce fragmented results.
In summary, the paper presents a compelling case that the primary benefit of self-attention in diffusion models is its ability to model localized spatial interactions, which can be efficiently replicated by a structured convolutional design like the ΔConvBlock. By distilling the behavior of self-attention into convolutions, ΔConvFusion offers a practical path to deploying efficient diffusion models that achieve high generation quality and semantic coherence with substantially reduced computational cost and latency, which is especially beneficial for high-resolution synthesis. The approach is modular and can be applied to replace self-attention in existing U-Net and DiT-based diffusion models through efficient distillation training.