SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers (2410.10629v3)

Published 14 Oct 2024 in cs.CV

Abstract: We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096×4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include: (1) Deep compression autoencoder: unlike traditional AEs, which compress images only 8×, we trained an AE that can compress images 32×, effectively reducing the number of latent tokens. (2) Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality. (3) Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment. (4) Efficient training and sampling: we propose Flow-DPM-Solver to reduce sampling steps, with efficient caption labeling and selection to accelerate convergence. As a result, Sana-0.6B is very competitive with modern giant diffusion model (e.g. Flux-12B), being 20 times smaller and 100+ times faster in measured throughput. Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024×1024 resolution image. Sana enables content creation at low cost. Code and model will be publicly released.

Overview of "Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers"

Introduction

The paper "Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers" proposes a novel framework named Sana, designed to efficiently generate high-resolution images up to 4096×4096 pixels. Sana addresses the computational inefficiencies of existing diffusion models by integrating several innovative components—including a deep compression autoencoder, linear attention mechanisms, and the deployment of a decoder-only LLM as a text encoder.

Key Contributions

  1. Deep Compression Autoencoder: Sana introduces a deep compression autoencoder that downsamples images 32× spatially, as opposed to the conventional 8×. Because the latent token count falls with the square of the downsampling factor, moving from 8× to 32× yields 16× fewer tokens, enabling efficient training and high-resolution generation (see the token-count sketch after this list).
  2. Linear DiT Architecture: The paper replaces the quadratic self-attention in the DiT with linear attention, reducing computational complexity from O(N²) to O(N). This is critical for maintaining efficiency at higher resolutions without sacrificing quality, as demonstrated by a 1.7× speed improvement at 4K resolution; a short PyTorch sketch of the idea follows the list.
  3. Text Encoder Enhancements: Sana employs a small decoder-only LLM, Gemma, as the text encoder in place of T5, improving text comprehension and alignment with image generation. Combined with complex human instructions and in-context learning, this stabilizes training and improves semantic alignment between prompts and generated images.
  4. Efficient Training and Sampling: A new Flow-DPM-Solver roughly halves the number of sampling steps relative to traditional solvers, accelerating inference while maintaining or improving sample quality.
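To make the token-count savings concrete, the arithmetic below compares an 8× and a 32× autoencoder (a minimal sketch; `patch_size` is an assumed knob for any additional DiT patchification and defaults to 1):

```python
# Latent token count for a square image: tokens = (res / (f * p))^2,
# where f is the AE's spatial downsampling factor and p the DiT patch size.
def latent_tokens(resolution: int, ae_factor: int, patch_size: int = 1) -> int:
    side = resolution // (ae_factor * patch_size)
    return side * side

for res in (1024, 4096):
    t8 = latent_tokens(res, ae_factor=8)
    t32 = latent_tokens(res, ae_factor=32)
    print(f"{res}x{res}: AE-F8 -> {t8:,} tokens | AE-F32 -> {t32:,} tokens "
          f"({t8 // t32}x fewer)")
```

At 4096×4096, the 32× autoencoder leaves 16,384 latent tokens instead of 262,144, a 16× reduction that compounds with the linear-attention savings sketched next.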

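The linear-attention replacement can likewise be sketched in a few lines of PyTorch. This is a minimal illustration of ReLU-kernel linear attention, not Sana's production kernel (the paper's implementation adds further details such as fused kernels):

```python
import torch

def linear_attention(q, k, v, eps: float = 1e-6):
    """ReLU linear attention over (batch, heads, N, d) tensors.

    Applying the kernel feature map (here ReLU) to q and k and
    reassociating the matrix products drops the cost from
    O(N^2 * d) for softmax attention to O(N * d^2).
    """
    q, k = torch.relu(q), torch.relu(k)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)    # sum_n k_n v_n^T: O(N d^2)
    z = k.sum(dim=2)                              # sum_n k_n, for normalization
    num = torch.einsum("bhnd,bhde->bhne", q, kv)  # each query reads the summary
    den = torch.einsum("bhnd,bhd->bhn", q, z)     # per-query normalizer
    return num / (den.unsqueeze(-1) + eps)
```

Because the (d × d) key-value summary is computed once and shared across all N queries, the cost grows linearly in sequence length, which is what makes 4K-scale token counts affordable.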
Results and Impact

The Sana-0.6B model, with just 590 million parameters, delivers more than 100× higher measured throughput than state-of-the-art models such as Flux-12B, generating a 1024×1024 image in under one second on a 16GB laptop GPU. This performance is backed by competitive results on standard benchmarks, including FID and CLIP Score.

Implications and Future Work

Sana represents a substantial leap forward in high-resolution, efficient image synthesis, potentially enabling wide adoption in practical applications where computational resources are limited, such as edge devices. The innovative integration of linear attention and advanced autoencoding methods shows promising directions for further research. Future developments could explore the extension of Sana's framework to video generation, enhancing the versatility of diffusion models in multimedia applications.

Conclusion

The paper presents a methodologically sound and experimentally verified framework that addresses existing inefficiencies in high-resolution image generation. By leveraging novel computational strategies, Sana sets a new benchmark in the field, balancing quality and efficiency to achieve scalable deployment. The advancements outlined in the paper offer significant insights into optimizing diffusion models, with implications that extend across both theoretical and applied dimensions of AI research.

Authors (11)
  1. Enze Xie (84 papers)
  2. Junsong Chen (13 papers)
  3. Junyu Chen (52 papers)
  4. Han Cai (79 papers)
  5. Yujun Lin (23 papers)
  6. Zhekai Zhang (11 papers)
  7. Muyang Li (23 papers)
  8. Yao Lu (212 papers)
  9. Song Han (155 papers)
  10. Haotian Tang (28 papers)
  11. Ligeng Zhu (22 papers)