LinFusion: 1 GPU, 1 Minute, 16K Image (2409.02097v3)

Published 3 Sep 2024 in cs.CV and cs.LG

Abstract: Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this existing paradigm faces significant challenges in generating high-resolution visual content due to its quadratic time and memory complexity with respect to the number of spatial tokens. To address this limitation, we aim at a novel linear attention mechanism as an alternative in this paper. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba2, RWKV6, Gated Linear Attention, etc, and identify two key features--attention normalization and non-causal inference--that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To save the training cost and better leverage pre-trained models, we initialize our models and distill the knowledge from pre-trained StableDiffusion (SD). We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation, accommodating ultra-resolution images like 16K on a single GPU. Moreover, it is highly compatible with pre-trained SD components and pipelines, such as ControlNet, IP-Adapter, DemoFusion, DistriFusion, etc, requiring no adaptation efforts. Codes are available at https://github.com/Huage001/LinFusion.

LinFusion: 1 GPU, 1 Minute, 16K Image

The paper presents LinFusion, a novel diffusion model designed to overcome the computational and memory challenges posed by high-resolution image generation. The core innovation in LinFusion stems from a generalized linear attention mechanism that replaces the traditional self-attention layers in diffusion models, specifically Stable Diffusion (SD), to achieve linear time and memory complexity with respect to the number of image pixels.
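
To make the complexity argument concrete, the sketch below contrasts standard softmax attention, which materializes an n x n attention map, with kernel-based linear attention, which reorders the computation around a d x d key-value summary. The feature map (ELU + 1) and the shapes are illustrative assumptions, not the paper's exact generalized formulation.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Quadratic in the token count n: the full n x n attention map is materialized.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # Linear in n: with a non-negative feature map phi (ELU + 1 here, a common
    # choice and only an assumption), the d x d summary phi(K)^T V is shared by
    # every query, so no n x n map is ever formed.
    phi = lambda x: F.elu(x) + 1.0
    q, k = phi(q), phi(k)
    kv = k.transpose(-2, -1) @ v                           # (d, d) summary
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)  # (n, 1) normalizer
    return (q @ kv) / (z + eps)

n, d = 4096, 64          # 4096 tokens, e.g. a 64x64 latent feature map
q, k, v = (torch.randn(n, d) for _ in range(3))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```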

Key Innovations and Methodology

  1. Normalization-Aware Mamba:
    • The authors identify that existing models with linear complexity, such as Mamba2, face performance degradation in cross-resolution scenarios due to feature distribution shifts. To address this, the paper introduces a normalization mechanism ensuring consistent feature distributions across different resolutions. This adaptation is critical for maintaining high performance during zero-shot cross-resolution image generation.
  2. Non-Causal Inference:
    • Unlike auto-regressive tasks where tokens are processed sequentially, diffusion models allow simultaneous access to all tokens. The authors eliminate the causal restriction inherent in models like Mamba2 and develop a non-causal linear attention mechanism. This modification ensures that the model can efficiently handle spatial dependencies in high-resolution images without imposing unnecessary constraints.
  3. Implementation and Distillation:
    • The approach integrates LinFusion into the existing SD backbone by replacing self-attention layers with the proposed linear attention modules. The authors employ a knowledge distillation framework to initialize and train LinFusion, ensuring that it achieves performance on par with or superior to the original SD with significantly reduced computational resources (a code sketch illustrating these three ingredients follows this list).
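
A minimal PyTorch sketch of how these three ingredients might fit together is given below: a non-causal linear attention block whose output is normalized by the key sum, so its scale stays stable when the token count (i.e., the resolution) changes, plus a distillation-style loss that matches the student to the frozen SD teacher. The module layout, feature map, and loss weighting are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonCausalLinearAttention(nn.Module):
    """Illustrative normalized, non-causal linear attention block."""

    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (batch, tokens, dim)
        q = F.elu(self.to_q(x)) + 1.0          # non-negative feature map
        k = F.elu(self.to_k(x)) + 1.0
        v = self.to_v(x)
        # Non-causal: every token reads the full (k, v) summary; no causal
        # mask or recurrent scan as in auto-regressive Mamba-style models.
        kv = torch.einsum('bnd,bne->bde', k, v)
        # Normalization: divide by the sum over keys so the output scale
        # does not drift when the number of tokens changes at test time.
        z = torch.einsum('bnd,bd->bn', q, k.sum(dim=1)).unsqueeze(-1)
        out = torch.einsum('bnd,bde->bne', q, kv) / (z + 1e-6)
        return self.proj(out)

def distillation_loss(student_out, teacher_out, noise_pred, target_noise,
                      lambda_feat=0.5):
    """Illustrative objective: the usual denoising loss plus a term matching
    the frozen SD teacher's outputs; the weighting is an assumption."""
    return (F.mse_loss(noise_pred, target_noise)
            + lambda_feat * F.mse_loss(student_out, teacher_out))
```

Only the self-attention layers are swapped for blocks of this kind; the rest of the SD backbone keeps its pre-trained weights, which is what keeps the distillation cost modest.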

Experimental Evaluation

The performance of LinFusion is validated through extensive experiments on multiple versions of SD, including SD-v1.5, SD-v2.1, and SD-XL. The results demonstrate that LinFusion not only matches but in some cases exceeds the performance of the original SD models while significantly reducing GPU memory consumption and running time.

  1. Efficiency and Memory Consumption:
    • LinFusion substantially reduces GPU memory consumption and inference time, making it feasible to generate 16K-resolution images on a single GPU (a back-of-the-envelope estimate of why full attention cannot scale that far follows this list). For instance, at 512x512 resolution LinFusion consumes 4.43 GB of GPU memory versus 5.17 GB for the original SD.
  2. Cross-Resolution Performance:
    • The normalization mechanism in LinFusion plays a crucial role in keeping performance consistent across resolutions. For example, LinFusion achieves satisfactory zero-shot generalization on the COCO benchmark when generating 1024x1024 images, a resolution unseen during training.
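
As a back-of-the-envelope illustration of why the quadratic attention map is the bottleneck at such resolutions (assuming an SD-style VAE with 8x spatial downsampling, fp16 activations, and a head dimension of 64; head count and exact shapes are simplified), consider what a single full attention map would cost for a 16K image:

```python
# Rough memory estimate for one attention head at 16K resolution.
side_px = 16384
latent_side = side_px // 8              # 8x VAE downsampling -> 2048 positions/side
n_tokens = latent_side ** 2             # ~4.2 million spatial tokens
head_dim = 64
bytes_fp16 = 2

attn_map = n_tokens ** 2 * bytes_fp16                      # full n x n map
qk_features = 2 * n_tokens * head_dim * bytes_fp16         # phi(Q), phi(K): n x d each
kv_summary = head_dim ** 2 * bytes_fp16                    # d x d state in linear attention

print(f"full attention map      : {attn_map / 1e12:.1f} TB")
print(f"linear-attention features: {qk_features / 1e9:.2f} GB (+ {kv_summary} B summary)")
```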

Practical Implications and Future Directions

The research presents a significant step towards making high-resolution image generation more accessible and efficient. By reducing computational and memory constraints, LinFusion enables the use of advanced diffusion models on more modest hardware, thus broadening the potential applications of AI-generated content.

The normalized and non-causal linear attention mechanism proposed in LinFusion serves as a general framework that can be incorporated into various diffusion backbones. This opens up possibilities for further research into optimizing other architectures that traditionally rely on self-attention operations.

Compatibility with Existing Frameworks

One of LinFusion's strengths is its high degree of compatibility with existing components and plugins for SD. The authors demonstrate that LinFusion can seamlessly integrate with ControlNet and IP-Adapter without requiring additional adaptation or training. This ensures that users can leverage existing tools and workflows while benefiting from the enhanced performance and efficiency of LinFusion.
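
As a rough illustration of what this plug-and-play usage can look like with Hugging Face diffusers (the diffusers classes and model IDs below are real; the LinFusion injection step is a hypothetical placeholder for the entry point provided in the authors' repository):

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Standard diffusers setup: a ControlNet-conditioned SD-v1.5 pipeline.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Hypothetical step: replace the UNet's self-attention layers with LinFusion's
# linear attention modules. The actual entry point lives in the authors'
# repository (https://github.com/Huage001/LinFusion); the names used here are
# assumptions for illustration only.
# from linfusion import LinFusion
# LinFusion.construct_for(pipe)

# Prompts, ControlNet conditioning images, and schedulers are then used
# exactly as with the original SD pipeline.
```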

Conclusion

LinFusion addresses the inherent limitations of traditional diffusion models in high-resolution image generation through innovative linear attention mechanisms. The results indicate that LinFusion achieves superior efficiency and maintains high performance across different resolutions, making it a valuable contribution to the field of AI-generated content. Moving forward, this research opens avenues for further exploration into linear-complexity models and their applications in a wide range of visual generation tasks.

Authors (4)
  1. Songhua Liu (33 papers)
  2. Weihao Yu (36 papers)
  3. Zhenxiong Tan (14 papers)
  4. Xinchao Wang (203 papers)