Can We Achieve Efficient Diffusion without Self-Attention? Distilling Self-Attention into Convolutions
(2504.21292v1)
Published 30 Apr 2025 in cs.CV
Abstract: Contemporary diffusion models built upon U-Net or Diffusion Transformer (DiT) architectures have revolutionized image generation through transformer-based attention mechanisms. The prevailing paradigm has commonly employed self-attention with quadratic computational complexity to handle global spatial relationships in complex images, thereby synthesizing high-fidelity images with coherent visual semantics. Contrary to conventional wisdom, our systematic layer-wise analysis reveals an interesting discrepancy: self-attention in pre-trained diffusion models predominantly exhibits localized attention patterns, closely resembling convolutional inductive biases. This suggests that global interactions in self-attention may be less critical than commonly assumed. Driven by this, we propose ΔConvFusion to replace conventional self-attention modules with Pyramid Convolution Blocks (ΔConvBlocks). By distilling attention patterns into localized convolutional operations while keeping other components frozen, ΔConvFusion achieves performance comparable to transformer-based counterparts while reducing computational cost by 6929× and surpassing LinFusion by 5.42× in efficiency, all without compromising generative fidelity.
This paper, "Can We Achieve Efficient Diffusion without Self-Attention? Distilling Self-Attention into Convolutions" (Dong et al., 30 Apr 2025), investigates the role of self-attention in modern diffusion models and proposes a more efficient convolutional alternative. The core motivation stems from the observation that self-attention, a key component in U-Net and Diffusion Transformer (DiT) architectures, exhibits quadratic computational complexity with respect to image resolution. This becomes a major bottleneck for generating high-resolution images.
Contrary to the common assumption that self-attention is crucial for capturing global relationships, the authors' analysis reveals that self-attention in pre-trained diffusion models primarily exhibits localized attention patterns. They demonstrate this through visualizations of attention maps and quantitative analysis, showing that attention strength decays rapidly with distance from the query pixel (a high-frequency, distance-dependent component) while also containing a broad, spatially invariant bias (a low-frequency component). Effective Receptive Field (ERF) analysis further supports this, showing that ERFs are relatively small (e.g., below 15x15 or 20x20 for most layers in PixArt and SD1.5). An ablation study replacing all self-attention layers with Neighborhood Attention (a localized mechanism) also maintained image quality, suggesting global interactions aren't strictly necessary in every block.
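As a rough illustration of this kind of locality analysis, the sketch below (not the authors' code; the attention-map shape and distance binning are assumptions) averages the weights of a pre-computed self-attention map by pixel distance from the query token. A profile that decays rapidly with distance corresponds to the localized pattern described above.

```python
import torch

def attention_decay_profile(attn: torch.Tensor, h: int, w: int, num_bins: int = 32):
    """Average attention weight as a function of pixel distance from the query.

    attn: pre-computed attention map of shape (heads, h*w, h*w) on an h x w latent grid.
    """
    # Coordinates of every token on the h x w grid.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (h*w, 2)
    # Pairwise Euclidean distances between query and key positions.
    dist = torch.cdist(coords, coords)                                  # (h*w, h*w)
    attn_mean = attn.mean(dim=0)                                        # average over heads
    # Bin attention weights by query-key distance and average within each bin.
    edges = torch.linspace(0.0, dist.max().item(), num_bins + 1)
    profile = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (dist >= lo) & (dist < hi)
        profile.append(attn_mean[mask].mean().item() if mask.any() else float("nan"))
    return profile  # rapid decay with distance indicates a localized attention pattern
```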
Based on these findings, the paper proposes ΔConvFusion, an architecture that replaces self-attention modules with a novel convolutional block called ΔConvBlock. The ΔConvBlock is specifically designed to mimic the two observed components of self-attention:
Pyramid Convolution: This branch captures the high-frequency, distance-dependent signal using a multi-scale convolutional structure. It processes the input feature map through multiple pyramid stages $\Delta_i$. Each stage involves downsampling (average pooling), a non-linear scaled simple gate (implemented with depth-wise convolutions to improve numerical stability), and upsampling (bilinear interpolation). By summing the outputs of stages with different downsampling factors $2^i$, the block achieves diverse receptive fields, allowing pixels closer to the center to accumulate higher weights, similar to the observed decay in self-attention.
Given an input feature map $\tilde{\mathbf{z}}^{l}_{t} \in \mathbb{R}^{H' \times W' \times C'}$, it is first passed through a $1\times1$ convolution $\psi_{\text{in}}$ to reduce channels: $\mathbf{z}^{l}_{t,\text{in}} = \psi_{\text{in}}(\tilde{\mathbf{z}}^{l}_{t})$. The pyramid convolution is then $\sum_{i=1}^{n} \uparrow_{2^i}\!\big(\rho(\downarrow_{2^i}(\mathbf{z}^{l}_{t,\text{in}}))\big)$, where $\downarrow_{2^i}$ is average pooling, $\uparrow_{2^i}$ is bilinear interpolation, and $\rho$ is the scaled simple gate $\rho(\mathbf{f}) = C' \cdot \mathbf{f}_{<C'/2} \odot \mathbf{f}_{\geq C'/2}$, i.e., an element-wise product of the first and second halves of the channels, scaled by $C'$.
Average Pooling Branch: This component captures the low-frequency, spatially invariant bias. It consists of a global average pooling layer followed by a $1\times1$ convolutional layer $\psi_p$. This operation provides a global context vector $\mathbf{f}^{\text{avg}}_{\text{out}} = \psi_p(\text{GlobalAvgPool}(\tilde{\mathbf{z}}^{l}_{t}))$ that is added to the output of the pyramid convolution branch.
The total output of the ΔConvBlock is the sum of the outputs from the pyramid convolution branch and the average pooling branch, potentially followed by an output 1×1 convolution.
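A minimal PyTorch sketch of a ΔConvBlock along these lines is shown below. It is an interpretation of the description above, not the authors' implementation: module names, the placement of the depth-wise convolutions inside the gate, channel widths, and the number of pyramid stages are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledSimpleGate(nn.Module):
    """Split channels in half, mix each half with a depth-wise conv, multiply, scale."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        half = channels // 2
        self.dw1 = nn.Conv2d(half, half, kernel_size, padding=kernel_size // 2, groups=half)
        self.dw2 = nn.Conv2d(half, half, kernel_size, padding=kernel_size // 2, groups=half)
        self.scale = channels  # the C' factor from the scaled simple gate above

    def forward(self, f):
        a, b = f.chunk(2, dim=1)
        return self.scale * self.dw1(a) * self.dw2(b)

class DeltaConvBlock(nn.Module):
    """Pyramid convolution branch plus global average-pooling branch, as described above."""
    def __init__(self, in_channels: int, hidden_channels: int, num_stages: int = 3):
        super().__init__()
        self.proj_in = nn.Conv2d(in_channels, hidden_channels, 1)         # psi_in
        self.gates = nn.ModuleList(
            ScaledSimpleGate(hidden_channels) for _ in range(num_stages)
        )
        self.pool_proj = nn.Conv2d(in_channels, hidden_channels // 2, 1)  # psi_p
        self.proj_out = nn.Conv2d(hidden_channels // 2, in_channels, 1)   # output 1x1 conv

    def forward(self, z):
        h, w = z.shape[-2:]
        z_in = self.proj_in(z)
        out = 0
        # Pyramid convolution: downsample by 2^i, apply the gate, upsample back, and sum.
        for i, gate in enumerate(self.gates, start=1):
            s = 2 ** i
            down = F.avg_pool2d(z_in, kernel_size=s)
            up = F.interpolate(gate(down), size=(h, w), mode="bilinear", align_corners=False)
            out = out + up
        # Average-pooling branch: a spatially constant global-context bias.
        out = out + self.pool_proj(F.adaptive_avg_pool2d(z, 1))
        return self.proj_out(out)
```

Under these assumptions, such a block operates on the same (B, C, H, W) latents as the self-attention layer it replaces, and summing stages with strides 2, 4, 8 gives the multi-scale receptive field described above.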
For practical implementation, the authors propose an efficient training strategy: replacing self-attention blocks in a pre-trained diffusion model with initialized ΔConvBlocks and only training the parameters of these new blocks, keeping the rest of the model frozen. Knowledge distillation is used to transfer the functionality of the original self-attention modules. This involves two loss terms:
Feature-level loss ($\mathcal{L}_f$): Minimizes the squared L2 distance between the outputs of the ΔConvBlock and the original self-attention module at each layer: $\mathcal{L}_f = \sum_{l=1}^{N} \lVert \Delta^{l}_{\theta}(\mathbf{z}^{l}_{t}) - \mathbf{z}^{l}_{t,\text{out}} \rVert^2$.
Output-level loss ($\mathcal{L}_z$): Uses the $\epsilon$-prediction objective of the diffusion model with Min-SNR weighting to accelerate convergence: $\mathcal{L}_z = \min\!\big(\gamma \cdot (\sigma^{z}_{t}/\sigma^{\epsilon}_{t})^2, 1\big) \cdot \big(\lVert \tilde{\epsilon} - \hat{\epsilon} \rVert^2 + \lVert \epsilon - \hat{\epsilon} \rVert^2\big)$.
The overall loss is $\mathcal{L} = \mathcal{L}_z + \beta \mathcal{L}_f$.
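A hedged sketch of how these losses could be combined is given below; the tensor names, the snr_weight argument (standing in for $(\sigma^{z}_{t}/\sigma^{\epsilon}_{t})^2$), the default $\gamma$, and the per-element averaging are illustrative assumptions rather than the paper's exact code.

```python
import torch

def feature_loss(delta_outs, attn_outs):
    """L_f: distance between each ΔConvBlock output and the frozen attention output it replaces."""
    return sum(
        torch.mean((d - a.detach()) ** 2)  # averaged per element here for scale stability
        for d, a in zip(delta_outs, attn_outs)
    )

def output_loss(eps_student, eps_teacher, eps_true, snr_weight: torch.Tensor, gamma: float = 5.0):
    """L_z: Min-SNR-weighted sum of a teacher-matching term and a noise-prediction term."""
    w = torch.clamp(gamma * snr_weight, max=1.0)
    return w * (
        torch.mean((eps_teacher - eps_student) ** 2)
        + torch.mean((eps_true - eps_student) ** 2)
    )

def total_loss(delta_outs, attn_outs, eps_student, eps_teacher, eps_true,
               snr_weight, beta: float = 1.0):
    """L = L_z + beta * L_f, optimized over the ΔConvBlock parameters only."""
    return (output_loss(eps_student, eps_teacher, eps_true, snr_weight)
            + beta * feature_loss(delta_outs, attn_outs))
```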
Experiments were conducted on SD1.5 (U-Net), SDXL (U-Net), and PixArt (DiT) architectures. For SD1.5, ΔConvBlocks were designed around an effective kernel size of K=9, while for SDXL and PixArt the effective receptive field estimation guiding the design used K=13. Training was done on synthetic (Midjourney) and curated real images, with evaluation on high-aesthetic LAION images using DINOv2 Score (DS), Fréchet DINOv2 Distance (FDD), and CLIP Score.
The results demonstrate significant computational efficiency improvements:
FLOPs: ΔConvFusion achieves drastically lower FLOPs compared to self-attention across all tested resolutions, reaching up to a 6929× reduction at 16K for SD1.5. Compared to LinFusion (a state-of-the-art efficient method), ΔConvFusion shows lower FLOPs at common resolutions (512x512, 1024x1024) and maintains competitive efficiency at higher resolutions while offering linear scalability.
Inference Latency: ΔConvFusion shows notable speedups (e.g., 3.4× over LinFusion at 1024x1024), outperforming methods that still rely on less memory-efficient global mechanisms, even with larger effective kernel sizes. This indicates better memory efficiency in addition to FLOPs reduction.
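As a back-of-envelope illustration of the scaling argument (my own rough arithmetic, not the paper's benchmark numbers), the snippet below compares the quadratic token-mixing cost of self-attention with the linear cost of a fixed-kernel convolution; the channel width and kernel size are placeholder values.

```python
def attn_flops(n_tokens: int, dim: int) -> float:
    # QK^T plus attention-weighted V: roughly 2 * N^2 * d multiply-adds.
    return 2.0 * n_tokens ** 2 * dim

def conv_flops(n_tokens: int, channels: int, kernel: int) -> float:
    # Depth-wise local mixing: roughly N * k^2 * C multiply-adds.
    return float(n_tokens) * kernel ** 2 * channels

for side in (64, 128, 256):  # latent grid sides, e.g. 512px, 1024px, 2048px images with an 8x VAE
    n = side * side
    ratio = attn_flops(n, 320) / conv_flops(n, 320, 9)
    print(f"{side}x{side} latent: attention/convolution FLOPs ratio ~ {ratio:,.0f}x")
```

The gap grows with resolution because the attention term scales with the square of the token count while the convolutional term scales linearly, which is the source of the large reductions reported above.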
Crucially, these efficiency gains do not come at the cost of performance:
Image Quality & Realism:ΔConvFusion achieves comparable or superior DS (quality/diversity) and FDD (realism) scores compared to original self-attention models and LinFusion, across different base models and resolutions.
Text-Image Alignment:ΔConvFusion shows superior CLIP scores, suggesting better semantic consistency with text prompts, implying the hierarchical features learned by ΔConvBlocks are effective for this task.
Cross-Resolution Generation:ΔConvFusion models trained at a lower resolution (e.g., 512x512) generalize much better to higher resolutions (e.g., 1024x1024) than their self-attention counterparts, which often produce fragmented results.
In summary, the paper presents a compelling case that the primary benefit of self-attention in diffusion models is its ability to model localized spatial interactions, which can be efficiently replicated by a structured convolutional design like the ΔConvBlock. By distilling the behavior of self-attention into convolutions, ΔConvFusion offers a practical path to deploying efficient diffusion models that achieve high generation quality and semantic coherence with substantially reduced computational cost and latency, especially beneficial for high-resolution synthesis. The approach is modular and can be applied to replace self-attention in existing U-Net and DiT-based diffusion models through efficient distillation training.