LinFusion: 1 GPU, 1 Minute, 16K Image (2409.02097v3)

Published 3 Sep 2024 in cs.CV and cs.LG

Abstract: Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this existing paradigm faces significant challenges in generating high-resolution visual content due to its quadratic time and memory complexity with respect to the number of spatial tokens. To address this limitation, we aim at a novel linear attention mechanism as an alternative in this paper. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba2, RWKV6, Gated Linear Attention, etc, and identify two key features--attention normalization and non-causal inference--that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To save the training cost and better leverage pre-trained models, we initialize our models and distill the knowledge from pre-trained StableDiffusion (SD). We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation, accommodating ultra-resolution images like 16K on a single GPU. Moreover, it is highly compatible with pre-trained SD components and pipelines, such as ControlNet, IP-Adapter, DemoFusion, DistriFusion, etc, requiring no adaptation efforts. Codes are available at https://github.com/Huage001/LinFusion.

LinFusion: 1 GPU, 1 Minute, 16K Image

The paper presents LinFusion, a novel diffusion model designed to overcome the computational and memory challenges posed by high-resolution image generation. The core innovation in LinFusion stems from a generalized linear attention mechanism that replaces the traditional self-attention layers in diffusion models, specifically Stable Diffusion (SD), to achieve linear time and memory complexity with respect to the number of image pixels.
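
To make the complexity argument concrete, the sketch below contrasts standard softmax attention, which materializes an n x n attention map, with kernel-based linear attention, which reorders the computation around a d x d key-value summary. The feature map (ELU + 1) and the shapes are illustrative assumptions, not the paper's exact generalized formulation.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Quadratic in the token count n: the full n x n attention map is materialized.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # Linear in n: with a non-negative feature map phi (ELU + 1 here, a common
    # choice and only an assumption), the d x d summary phi(K)^T V is shared by
    # every query, so no n x n map is ever formed.
    phi = lambda x: F.elu(x) + 1.0
    q, k = phi(q), phi(k)
    kv = k.transpose(-2, -1) @ v                           # (d, d) summary
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)  # (n, 1) normalizer
    return (q @ kv) / (z + eps)

n, d = 4096, 64          # 4096 tokens, e.g. a 64x64 latent feature map
q, k, v = (torch.randn(n, d) for _ in range(3))
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```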

Key Innovations and Methodology

  1. Normalization-Aware Mamba:
    • The authors identify that existing models with linear complexity, such as Mamba2, face performance degradation in cross-resolution scenarios due to feature distribution shifts. To address this, the paper introduces a normalization mechanism ensuring consistent feature distributions across different resolutions. This adaptation is critical for maintaining high performance during zero-shot cross-resolution image generation.
  2. Non-Causal Inference:
    • Unlike auto-regressive tasks where tokens are processed sequentially, diffusion models allow simultaneous access to all tokens. The authors eliminate the causal restriction inherent in models like Mamba2 and develop a non-causal linear attention mechanism. This modification ensures that the model can efficiently handle spatial dependencies in high-resolution images without imposing unnecessary constraints.
  3. Implementation and Distillation:
    • The approach integrates LinFusion into the existing SD backbone by replacing self-attention layers with the proposed linear attention modules. The authors employ a knowledge distillation framework to initialize and train LinFusion, ensuring that it achieves performance on par with or superior to the original SD with significantly reduced computational resources (a code sketch illustrating these three ingredients follows this list).
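
A minimal PyTorch sketch of how these three ingredients might fit together is given below: a non-causal linear attention block whose output is normalized by the key sum, so its scale stays stable when the token count (i.e., the resolution) changes, plus a distillation-style loss that matches the student to the frozen SD teacher. The module layout, feature map, and loss weighting are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonCausalLinearAttention(nn.Module):
    """Illustrative normalized, non-causal linear attention block."""

    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (batch, tokens, dim)
        q = F.elu(self.to_q(x)) + 1.0          # non-negative feature map
        k = F.elu(self.to_k(x)) + 1.0
        v = self.to_v(x)
        # Non-causal: every token reads the full (k, v) summary; no causal
        # mask or recurrent scan as in auto-regressive Mamba-style models.
        kv = torch.einsum('bnd,bne->bde', k, v)
        # Normalization: divide by the sum over keys so the output scale
        # does not drift when the number of tokens changes at test time.
        z = torch.einsum('bnd,bd->bn', q, k.sum(dim=1)).unsqueeze(-1)
        out = torch.einsum('bnd,bde->bne', q, kv) / (z + 1e-6)
        return self.proj(out)

def distillation_loss(student_out, teacher_out, noise_pred, target_noise,
                      lambda_feat=0.5):
    """Illustrative objective: the usual denoising loss plus a term matching
    the frozen SD teacher's outputs; the weighting is an assumption."""
    return (F.mse_loss(noise_pred, target_noise)
            + lambda_feat * F.mse_loss(student_out, teacher_out))
```

Only the self-attention layers are swapped for blocks of this kind; the rest of the SD backbone keeps its pre-trained weights, which is what keeps the distillation cost modest.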

Experimental Evaluation

The performance of LinFusion is validated through extensive experiments on multiple versions of SD, including SD-v1.5, SD-v2.1, and SD-XL. The results demonstrate that LinFusion not only matches but in some cases exceeds the performance of the original SD models while significantly reducing GPU memory consumption and running time.

  1. Efficiency and Memory Consumption:
    • LinFusion substantially reduces GPU memory consumption and inference time, making it feasible to generate 16K-resolution images on a single GPU (a back-of-the-envelope estimate of why full attention cannot scale that far follows this list). For instance, at 512x512 resolution LinFusion consumes 4.43 GB of GPU memory versus 5.17 GB for the original SD.
  2. Cross-Resolution Performance:
    • The normalization mechanism in LinFusion plays a crucial role in keeping performance consistent across resolutions. For example, LinFusion achieves satisfactory zero-shot generalization on the COCO benchmark when generating 1024x1024 images, a resolution unseen during training.
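
As a back-of-the-envelope illustration of why the quadratic attention map is the bottleneck at such resolutions (assuming an SD-style VAE with 8x spatial downsampling, fp16 activations, and a head dimension of 64; head count and exact shapes are simplified), consider what a single full attention map would cost for a 16K image:

```python
# Rough memory estimate for one attention head at 16K resolution.
side_px = 16384
latent_side = side_px // 8              # 8x VAE downsampling -> 2048 positions/side
n_tokens = latent_side ** 2             # ~4.2 million spatial tokens
head_dim = 64
bytes_fp16 = 2

attn_map = n_tokens ** 2 * bytes_fp16                      # full n x n map
qk_features = 2 * n_tokens * head_dim * bytes_fp16         # phi(Q), phi(K): n x d each
kv_summary = head_dim ** 2 * bytes_fp16                    # d x d state in linear attention

print(f"full attention map      : {attn_map / 1e12:.1f} TB")
print(f"linear-attention features: {qk_features / 1e9:.2f} GB (+ {kv_summary} B summary)")
```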

Practical Implications and Future Directions

The research presents a significant step towards making high-resolution image generation more accessible and efficient. By reducing computational and memory constraints, LinFusion enables the use of advanced diffusion models on more modest hardware, thus broadening the potential applications of AI-generated content.

The normalized and non-causal linear attention mechanism proposed in LinFusion serves as a general framework that can be incorporated into various diffusion backbones. This opens up possibilities for further research into optimizing other architectures that traditionally rely on self-attention operations.

Compatibility with Existing Frameworks

One of LinFusion's strengths is its high degree of compatibility with existing components and plugins for SD. The authors demonstrate that LinFusion can seamlessly integrate with ControlNet and IP-Adapter without requiring additional adaptation or training. This ensures that users can leverage existing tools and workflows while benefiting from the enhanced performance and efficiency of LinFusion.
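
As a rough illustration of what this plug-and-play usage can look like with Hugging Face diffusers (the diffusers classes and model IDs below are real; the LinFusion injection step is a hypothetical placeholder for the entry point provided in the authors' repository):

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Standard diffusers setup: a ControlNet-conditioned SD-v1.5 pipeline.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Hypothetical step: replace the UNet's self-attention layers with LinFusion's
# linear attention modules. The actual entry point lives in the authors'
# repository (https://github.com/Huage001/LinFusion); the names used here are
# assumptions for illustration only.
# from linfusion import LinFusion
# LinFusion.construct_for(pipe)

# Prompts, ControlNet conditioning images, and schedulers are then used
# exactly as with the original SD pipeline.
```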

Conclusion

LinFusion addresses the inherent limitations of traditional diffusion models in high-resolution image generation through innovative linear attention mechanisms. The results indicate that LinFusion achieves superior efficiency and maintains high performance across different resolutions, making it a valuable contribution to the field of AI-generated content. Moving forward, this research opens avenues for further exploration into linear-complexity models and their applications in a wide range of visual generation tasks.

Authors (4)
  1. Songhua Liu (33 papers)
  2. Weihao Yu (36 papers)
  3. Zhenxiong Tan (14 papers)
  4. Xinchao Wang (203 papers)