Can We Achieve Efficient Diffusion without Self-Attention? Distilling Self-Attention into Convolutions

Published 30 Apr 2025 in cs.CV | (2504.21292v1)

Abstract: Contemporary diffusion models built upon U-Net or Diffusion Transformer (DiT) architectures have revolutionized image generation through transformer-based attention mechanisms. The prevailing paradigm has commonly employed self-attention with quadratic computational complexity to handle global spatial relationships in complex images, thereby synthesizing high-fidelity images with coherent visual semantics. Contrary to conventional wisdom, our systematic layer-wise analysis reveals an interesting discrepancy: self-attention in pre-trained diffusion models predominantly exhibits localized attention patterns, closely resembling convolutional inductive biases. This suggests that global interactions in self-attention may be less critical than commonly assumed. Driven by this, we propose $\Delta$ConvFusion to replace conventional self-attention modules with Pyramid Convolution Blocks ($\Delta$ConvBlocks). By distilling attention patterns into localized convolutional operations while keeping other components frozen, $\Delta$ConvFusion achieves performance comparable to transformer-based counterparts while reducing computational cost by 6929$\times$ and surpassing LinFusion by 5.42$\times$ in efficiency, all without compromising generative fidelity.


Authors (6)

Summary

The paper proposes ΔConvFusion, a novel architecture that distills self-attention into convolutional blocks to significantly reduce computational complexity in diffusion models.
It employs a dual-branch design with pyramid convolution and average pooling, combined with feature- and output-level losses, to effectively mimic localized attention patterns.
Experiments on SD1.5, SDXL, and PixArt demonstrate up to 6929× FLOPs reduction and improved inference latency while maintaining robust image quality and text-image alignment.

This paper, "Can We Achieve Efficient Diffusion without Self-Attention? Distilling Self-Attention into Convolutions" (2504.21292), investigates the role of self-attention in modern diffusion models and proposes a more efficient convolutional alternative. The core motivation stems from the observation that self-attention, a key component in U-Net and Diffusion Transformer (DiT) architectures, exhibits quadratic computational complexity with respect to image resolution. This becomes a major bottleneck for generating high-resolution images.
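The quadratic scaling can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative only: the 8x latent downsampling, the channel width of 320, and the 3x3 kernel are assumptions chosen for the example, not figures from the paper, and the counts cover only the two attention matrix products versus a single depth-wise convolution.

```python
def attention_tokens(height: int, width: int, downsample: int = 8) -> int:
    """Number of spatial tokens, assuming the feature grid is `downsample`
    times smaller than the image on each side (illustrative assumption)."""
    return (height // downsample) * (width // downsample)


def attention_macs(tokens: int, dim: int) -> int:
    """Multiply-accumulates for Q @ K^T plus attn @ V: 2 * N^2 * d."""
    return 2 * tokens * tokens * dim


def depthwise_conv_macs(tokens: int, dim: int, kernel: int) -> int:
    """Multiply-accumulates for one depth-wise k x k convolution: N * k^2 * d."""
    return tokens * kernel * kernel * dim


if __name__ == "__main__":
    dim = 320  # hypothetical channel width
    for side in (512, 1024, 2048):
        n = attention_tokens(side, side)
        ratio = attention_macs(n, dim) / depthwise_conv_macs(n, dim, kernel=3)
        print(f"{side}x{side}: tokens={n}, attention/conv MAC ratio={ratio:.0f}")
```

Doubling the image side quadruples the token count, so the attention cost grows 16-fold while the convolution cost grows only 4-fold. The gap therefore widens linearly with the number of tokens.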

Contrary to the common assumption that self-attention is crucial for capturing global relationships, the authors' analysis reveals that self-attention in pre-trained diffusion models primarily exhibits localized attention patterns. They demonstrate this through visualizations of attention maps and quantitative analysis, showing that attention strength decays rapidly with distance from the query pixel (high-frequency, distance-dependent component) and also contains a broad, spatially invariant bias (low-frequency component). Effective Receptive Field (ERF) analysis further supports this, showing that ERFs are relatively small (e.g., below 15x15 or 20x20 for most layers in PixArt and SD1.5). An ablation study replacing all self-attention layers with Neighborhood Attention (a localized mechanism) also maintained image quality, suggesting global interactions aren't strictly necessary in every block.
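One way to quantify the distance-dependent decay described above is to average attention weights over all query-key pairs at the same spatial distance. The following NumPy sketch is not the authors' analysis code: the function names are hypothetical, and the synthetic attention map (a local Gaussian plus a uniform bias) only stands in for a real attention map extracted from a pre-trained model.

```python
import numpy as np


def attention_by_distance(attn: np.ndarray, height: int, width: int) -> np.ndarray:
    """Mean attention weight per rounded Euclidean query-key distance.

    attn: (H*W, H*W) matrix, rows = queries, columns = keys, raster order.
    """
    ys, xs = np.divmod(np.arange(height * width), width)
    dist = np.hypot(ys[:, None] - ys[None, :], xs[:, None] - xs[None, :])
    bins = np.rint(dist).astype(int).ravel()
    sums = np.bincount(bins, weights=attn.ravel())
    counts = np.bincount(bins)
    return sums / np.maximum(counts, 1)


def synthetic_attention(height: int, width: int, sigma: float = 1.5,
                        bias: float = 0.2) -> np.ndarray:
    """Toy attention: a local Gaussian term mixed with a uniform bias term."""
    ys, xs = np.divmod(np.arange(height * width), width)
    d2 = (ys[:, None] - ys[None, :]) ** 2 + (xs[:, None] - xs[None, :]) ** 2
    local = np.exp(-d2 / (2.0 * sigma ** 2))
    local /= local.sum(axis=1, keepdims=True)  # each row sums to 1
    uniform = np.full_like(local, 1.0 / (height * width))
    return (1.0 - bias) * local + bias * uniform


if __name__ == "__main__":
    profile = attention_by_distance(synthetic_attention(8, 8), 8, 8)
    for d, w in enumerate(profile):
        print(f"distance {d}: mean weight {w:.4f}")
```

On a real model, `attn` would be the softmax output of one self-attention layer averaged over heads. A profile that drops sharply and then flattens to a constant floor corresponds to the two components the authors describe.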

Based on these findings, the paper proposes $\Delta$ConvFusion, an architecture that replaces self-attention modules with a novel convolutional block called $\Delta$ConvBlock. The $\Delta$ConvBlock is specifically designed to mimic the two observed components of self-attention:

Pyramid Convolution: This branch captures the high-frequency, distance-dependent signal using a multi-scale convolutional structure. It processes the input feature map through multiple "pyramid stages" ($\Delta_i$). Each stage involves downsampling (average pooling), a non-linear scaled simple gate (implemented with depth-wise convolutions to improve numerical stability), and upsampling (bilinear interpolation). By summing the outputs of stages with different downsampling factors ($2^i$), the block achieves diverse receptive fields, allowing pixels closer to the center to accumulate higher weights, similar to the observed decay in self-attention. Given an input feature map $\tilde{\mathbf{z}}^l_t \in \mathbb{R}^{H' \times W' \times C'}$, it is first passed through a $1 \times 1$ convolution $\boldsymbol{\psi}_{\text{in}}$ to reduce channels: $\mathbf{z}^{l}_{t,\text{in}} = \boldsymbol{\psi}_{\text{in}}(\tilde{\mathbf{z}}^l_t)$. The pyramid convolution is then $\sum_{i=1}^{n} \uparrow 2^i(\rho(\downarrow 2^i(\mathbf{z}^{l}_{t,\text{in}})))$, where $\downarrow 2^i$ is average pooling, $\uparrow 2^i$ is bilinear interpolation, and $\rho$ is the scaled simple gate.
Average Pooling Branch: This component captures the low-frequency, spatially invariant bias. It consists of a global average pooling layer followed by a $1 \times 1$ convolutional layer. This operation provides a global context vector that is added to the output of the pyramid convolution branch.

The total output of the $\Delta$ConvBlock is the sum of the outputs from the pyramid convolution branch and the average pooling branch, potentially followed by an output $1 \times 1$ convolution.
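The two-branch structure described above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names, the gate formulation (a depth-wise convolution whose output halves are multiplied and scaled by a constant), the 3x3 kernel, the hidden width, the number of stages, and the choice to feed the channel-reduced features into the pooling branch are all hypothetical. Inputs are assumed to be in `(B, C, H, W)` layout with `H` and `W` divisible by `2 ** num_stages`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledSimpleGate(nn.Module):
    """Hypothetical gate: depth-wise conv to 2C channels, then the two halves
    are multiplied element-wise and scaled. The paper's exact form may differ."""

    def __init__(self, channels: int, kernel_size: int = 3, scale: float = 0.5):
        super().__init__()
        self.dw = nn.Conv2d(channels, 2 * channels, kernel_size,
                            padding=kernel_size // 2, groups=channels)
        self.scale = scale  # assumed constant scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = self.dw(x).chunk(2, dim=1)
        return a * b * self.scale


class DeltaConvBlock(nn.Module):
    """Sketch of a pyramid-convolution branch plus a global-average-pooling branch."""

    def __init__(self, channels: int, hidden: int, num_stages: int = 3):
        super().__init__()
        self.proj_in = nn.Conv2d(channels, hidden, kernel_size=1)   # psi_in
        self.gates = nn.ModuleList([ScaledSimpleGate(hidden) for _ in range(num_stages)])
        self.pool_proj = nn.Conv2d(hidden, hidden, kernel_size=1)   # pooling-branch conv
        self.proj_out = nn.Conv2d(hidden, channels, kernel_size=1)  # output conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.proj_in(x)
        size = tuple(z.shape[-2:])
        pyramid = torch.zeros_like(z)
        for i, gate in enumerate(self.gates, start=1):
            down = F.avg_pool2d(z, kernel_size=2 ** i)           # downsample by 2^i
            up = F.interpolate(gate(down), size=size,
                               mode="bilinear", align_corners=False)  # back to H x W
            pyramid = pyramid + up
        context = self.pool_proj(F.adaptive_avg_pool2d(z, 1))   # (B, hidden, 1, 1)
        return self.proj_out(pyramid + context)                 # context broadcasts


if __name__ == "__main__":
    block = DeltaConvBlock(channels=64, hidden=32, num_stages=3)
    print(block(torch.randn(2, 64, 32, 32)).shape)
```

Because every operation is a pooling, convolution, or interpolation, the cost grows linearly with the number of pixels, and the same weights apply at any resolution that meets the divisibility assumption.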

For practical implementation, the authors propose an efficient training strategy: replacing self-attention blocks in a pre-trained diffusion model with initialized $\Delta$ConvBlocks and only training the parameters of these new blocks, keeping the rest of the model frozen. Knowledge distillation is used to transfer the functionality of the original self-attention modules. This involves two loss terms:

Feature-level loss: Minimizes the L2 distance between the outputs of the $\Delta$ConvBlock and the original self-attention module for each layer.
Output-level loss: Uses the $\epsilon$-prediction objective of the diffusion model with Min-SNR loss weighting to accelerate convergence. The overall loss combines the feature-level and output-level terms.
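For orientation, the two terms can be written out in a generic form. The notation below is an assumption made for illustration and is not copied from the paper: $\mathcal{L}_{\text{feat}}$, $\mathcal{L}_{\text{out}}$, $\text{SA}^l$, the balancing weight $\beta$, the squared norm, and feeding both modules the same input $\tilde{\mathbf{z}}^l_t$ are all hypothetical choices. The weighting term follows the standard Min-SNR-$\gamma$ formulation for noise prediction, with $\gamma$ a clipping threshold.

$$\mathcal{L}_{\text{feat}} = \sum_{l} \left\| \Delta\text{ConvBlock}^l(\tilde{\mathbf{z}}^l_t) - \text{SA}^l(\tilde{\mathbf{z}}^l_t) \right\|_2^2$$

$$\mathcal{L}_{\text{out}} = \frac{\min\{\text{SNR}(t), \gamma\}}{\text{SNR}(t)} \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, c) \right\|_2^2$$

$$\mathcal{L} = \mathcal{L}_{\text{out}} + \beta \, \mathcal{L}_{\text{feat}}$$

Here $\boldsymbol{\epsilon}_\theta$ denotes the student model with $\Delta$ConvBlocks in place of self-attention, and only the parameters of those blocks receive gradients.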

Experiments were conducted on SD1.5 (U-Net), SDXL (U-Net), and PixArt (DiT) architectures. $\Delta$ConvBlock kernel sizes were set per model, with one setting for SD1.5 and another for SDXL and PixArt, guided by the effective receptive field estimation. Training used synthetic (Midjourney) and curated real images; evaluation used high-aesthetic LAION images with DINOv2 Score (DS), Fréchet DINOv2 Distance (FDD), and CLIP Score.
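As background on the realism metric, FDD is assumed here to be the usual Fréchet distance between Gaussians fitted to feature embeddings (the same formula as FID), computed on DINOv2 features rather than Inception features. The NumPy sketch below shows that formula only; the function name is hypothetical, random arrays stand in for real DINOv2 embeddings, and the paper's exact evaluation protocol may differ.

```python
import numpy as np


def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (n, d) feature sets:
    ||mu_a - mu_b||^2 + Tr(C_a) + Tr(C_b) - 2 Tr((C_a C_b)^(1/2))."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # For positive semi-definite covariances, the eigenvalues of C_a @ C_b are
    # real and non-negative, so the trace of the matrix square root equals the
    # sum of their square roots. Clipping guards against round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = float(((mu_a - mu_b) ** 2).sum())
    return mean_term + float(np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 8))          # stand-in for real-image features
    fake = rng.normal(loc=0.1, size=(500, 8))  # stand-in for generated-image features
    print(frechet_distance(real, fake))
```

Lower values indicate that the generated feature distribution is closer to the reference distribution.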

The results demonstrate significant computational efficiency improvements:

FLOPs: $\Delta$ConvFusion achieves drastically lower FLOPs compared to self-attention across all tested resolutions, reaching up to 6929$\times$ reduction at 16K for SD1.5. Compared to LinFusion (a state-of-the-art efficient method), $\Delta$ConvFusion shows lower FLOPs at common resolutions (512x512, 1024x1024) and maintains competitive efficiency at higher resolutions while offering linear scalability.
Inference Latency: $\Delta$ConvFusion shows notable speedups (e.g., 3.4$\times$ over LinFusion at 1024x1024), outperforming methods that still rely on less memory-efficient global mechanisms, even with larger effective kernel sizes. This indicates better memory efficiency in addition to FLOPs reduction.

Crucially, these efficiency gains do not come at the cost of performance:

Image Quality & Realism: $\Delta$ConvFusion achieves comparable or superior DS (quality/diversity) and FDD (realism) scores compared to original self-attention models and LinFusion, across different base models and resolutions.
Text-Image Alignment: $\Delta$ConvFusion shows superior CLIP scores, suggesting better semantic consistency with text prompts and implying that the hierarchical features learned by $\Delta$ConvBlocks are effective for this task.
Cross-Resolution Generation: $\Delta$ConvFusion models trained at a lower resolution (e.g., 512x512) generalize much better to higher resolutions (e.g., 1024x1024) than their self-attention counterparts, which often produce fragmented results.

In summary, the paper presents a compelling case that the primary benefit of self-attention in diffusion models is its ability to model localized spatial interactions, which can be efficiently replicated by a structured convolutional design like the $\Delta$ConvBlock. By distilling the behavior of self-attention into convolutions, $\Delta$ConvFusion offers a practical path to deploying efficient diffusion models that achieve high generation quality and semantic coherence with substantially reduced computational cost and latency, which is especially beneficial for high-resolution synthesis. The approach is modular and can be applied to replace self-attention in existing U-Net and DiT-based diffusion models through efficient distillation training.