- The paper introduces SANA-Sprint, an ultra-fast text-to-image diffusion model that uses hybrid distillation to achieve high-quality image generation in 1-4 steps.
- SANA-Sprint employs a hybrid sCM and LADD distillation strategy and is a unified step-adaptive model capable of high-quality generation across 1 to 4 steps without step-specific training.
- Demonstrating a state-of-the-art speed-quality tradeoff, SANA-Sprint generates 1024x1024 images in 0.1 seconds on an H100, significantly outperforming prior methods such as FLUX-schnell.
SANA-Sprint is an efficient diffusion model designed for ultra-fast text-to-image generation, leveraging a pre-trained foundation model (SANA) and incorporating hybrid distillation to drastically reduce inference steps from 20 to 1-4.
The method employs a hybrid distillation strategy combining sCM (simplified continuous-time consistency distillation) and LADD (Latent Adversarial Diffusion Distillation). To apply sCM, a training-free, lossless mathematical transformation first converts the pre-trained flow-matching model into a TrigFlow model, eliminating the need to train from scratch; sCM distillation then keeps the student aligned with the teacher model (SANA). LADD is combined with sCM to enhance single-step generation fidelity: an adversarial loss provides direct global supervision across timesteps, improving convergence speed and output quality, with a discriminator trained on features extracted by the teacher model in the latent space. The hybrid loss is $\mathcal{L} = \mathcal{L}_{\text{sCM}} + \lambda \mathcal{L}_{\text{adv}}$, where $\lambda$ is a weighting factor (default 0.5); in addition, the maximum timestep $t = \pi/2$ receives extra weight by being sampled with probability $p$ during training to improve one- and few-step generation.
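The PyTorch sketch below illustrates this hybrid objective under stated assumptions: `loss_scm` is assumed to be computed elsewhere by the sCM consistency objective, `fake_logits`/`real_logits` come from hypothetical LADD discriminator heads attached to the teacher's latent features, and the value of `p` is a placeholder rather than the paper's setting.

```python
import math
import torch
import torch.nn.functional as F

def sample_t(batch_size, p=0.25, device="cpu"):
    """Sample TrigFlow timesteps in (0, pi/2]; with probability `p` (placeholder value)
    the maximum time t = pi/2 is used, emphasizing the one-step regime."""
    t = torch.rand(batch_size, device=device) * (math.pi / 2)
    pin = torch.rand(batch_size, device=device) < p
    return torch.where(pin, torch.full_like(t, math.pi / 2), t)

def generator_loss(loss_scm, fake_logits, lam=0.5):
    """Hybrid objective L = L_sCM + lambda * L_adv, with lambda defaulting to 0.5.
    The adversarial term here is a simple non-saturating GAN loss (an assumption)."""
    loss_adv = -fake_logits.mean()
    return loss_scm + lam * loss_adv

def discriminator_loss(real_logits, fake_logits):
    """Hinge loss for the LADD-style discriminator heads (a common choice, assumed here)."""
    return F.relu(1.0 - real_logits).mean() + F.relu(1.0 + fake_logits).mean()
```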
SANA-Sprint is also a unified step-adaptive model capable of high-quality generation in 1-4 steps, eliminating the need for step-specific training and improving efficiency and flexibility. Timesteps for multi-step inference are optimized sequentially to maximize performance. The model can also be integrated with ControlNet for real-time interactive image generation, providing instant visual feedback and enabling applications such as real-time image editing, in which Holistically-Nested Edge Detection (HED) scribbles extracted from input images guide the generation process.
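As an illustration of step-adaptive inference, the sketch below performs few-step consistency sampling under the TrigFlow parameterization. The callable `model(x, t, cond)`, the data scale `sigma_d`, and the timestep schedule are assumptions made for exposition; the paper's sequentially optimized timesteps are not reproduced here.

```python
import math
import torch

@torch.no_grad()
def few_step_sample(model, cond, shape, timesteps=(math.pi / 2, 1.3, 0.6),
                    sigma_d=0.5, device="cpu"):
    # Start from pure noise at the maximum time t = pi/2 (cos(t) = 0).
    x = torch.randn(shape, device=device) * sigma_d
    for i, t in enumerate(timesteps):
        t_vec = torch.full((shape[0],), t, device=device)
        s, c = torch.sin(t_vec).view(-1, 1, 1, 1), torch.cos(t_vec).view(-1, 1, 1, 1)
        # TrigFlow-style consistency prediction of the clean sample x0 at time t.
        x0 = c * x - s * sigma_d * model(x / sigma_d, t_vec, cond)
        if i + 1 < len(timesteps):
            # Re-noise the prediction to the next (smaller) timestep and repeat.
            t_next = torch.full_like(t_vec, timesteps[i + 1]).view(-1, 1, 1, 1)
            x = torch.cos(t_next) * x0 + torch.sin(t_next) * sigma_d * torch.randn_like(x0)
        else:
            x = x0
    return x
```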
To stabilize continuous-time distillation and curb large gradient norms, SANA-Sprint refines the time embedding and integrates QK-Normalization into its attention mechanisms. A denser time embedding, with noise conditioning $c_{\text{noise}}(t) = t$ rather than $c_{\text{noise}}(t) = 1000t$, reduces gradient fluctuations and improves training stability. QK-Normalization applies RMS normalization to the queries and keys in the self- and cross-attention modules, which further stabilizes training, especially when scaling up model size.
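A compact PyTorch sketch of these two stabilization choices follows; the module layout (head count, embedding dimension, the hand-rolled RMSNorm) is illustrative and not taken from the SANA-Sprint codebase.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Minimal RMS normalization over the last dimension."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class QKNormAttention(nn.Module):
    """Self-attention with RMS-normalized queries and keys (QK-Normalization)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.q_norm = RMSNorm(self.head_dim)   # applied per head before the dot product
        self.k_norm = RMSNorm(self.head_dim)

    def forward(self, x):
        B, N, C = x.shape
        q, k, v = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim).unbind(2)
        q, k = self.q_norm(q), self.k_norm(k)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))   # (B, heads, N, head_dim)
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(B, N, C))

def time_embedding(t, dim=256, max_period=10000.0):
    """Sinusoidal embedding fed with c_noise(t) = t (t in [0, pi/2]) instead of 1000*t,
    giving a denser conditioning signal that reduces gradient fluctuations."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = t[:, None].float() * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
```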
SANA-Sprint establishes a new Pareto frontier in speed-quality tradeoffs, generating 1024x1024 images in 0.1 seconds on an H100 GPU and 0.31 seconds on an RTX 4090. In a single step, SANA-Sprint achieves 7.59 FID and 0.74 GenEval, outperforming FLUX-schnell's 7.94 FID and 0.71 GenEval while being 10x faster end-to-end (0.1s vs. 1.1s on an H100); measured by transformer latency alone on an A100, the speedup over FLUX-schnell is 64.7x. The method also enables near-real-time interaction with ControlNet at approximately 200 ms per image on an H100.
In summary, SANA-Sprint achieves rapid, high-quality image generation through hybrid distillation, a step-adaptive model, and integration with ControlNet, offering significant speed improvements and competitive performance metrics compared to existing methods.