- The paper introduces SANA-Sprint, an ultra-fast text-to-image diffusion model that uses hybrid distillation to achieve high-quality image generation in 1-4 steps.
- SANA-Sprint employs a hybrid sCM and LADD distillation strategy and is a unified step-adaptive model capable of high-quality generation across 1 to 4 steps without step-specific training.
- Demonstrating a state-of-the-art speed-quality tradeoff, SANA-Sprint generates 1024x1024 images in 0.1 seconds on an H100, significantly outperforming prior methods such as FLUX-schnell.
SANA-Sprint is an efficient diffusion model designed for ultra-fast text-to-image generation, leveraging a pre-trained foundation model (SANA) and incorporating hybrid distillation to drastically reduce inference steps from 20 to 1-4.
The method employs a hybrid distillation strategy combining sCM (simplified continuous-time consistency distillation) and LADD (Latent Adversarial Diffusion Distillation). To apply sCM, a training-free, lossless mathematical transformation first converts the pre-trained flow-matching model into a TrigFlow model, eliminating the need to train from scratch; sCM distillation then keeps the student aligned with the teacher model (SANA). LADD is combined with sCM to enhance single-step generation fidelity: an adversarial loss provides direct global supervision across timesteps, improving convergence speed and output quality, with a discriminator trained on features extracted by the teacher model in the latent space. The hybrid loss is $\mathcal{L} = \mathcal{L}_{\text{sCM}} + \lambda \mathcal{L}_{\text{adv}}$, where $\lambda$ is a weighting factor (default 0.5); in addition, the maximum timestep $t = \pi/2$ receives extra weight by being sampled with probability $p$ during training to improve one- and few-step generation.
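The PyTorch sketch below illustrates this hybrid objective under stated assumptions: `loss_scm` is assumed to be computed elsewhere by the sCM consistency objective, `fake_logits`/`real_logits` come from hypothetical LADD discriminator heads attached to the teacher's latent features, and the value of `p` is a placeholder rather than the paper's setting.

```python
import math
import torch
import torch.nn.functional as F

def sample_t(batch_size, p=0.25, device="cpu"):
    """Sample TrigFlow timesteps in (0, pi/2]; with probability `p` (placeholder value)
    the maximum time t = pi/2 is used, emphasizing the one-step regime."""
    t = torch.rand(batch_size, device=device) * (math.pi / 2)
    pin = torch.rand(batch_size, device=device) < p
    return torch.where(pin, torch.full_like(t, math.pi / 2), t)

def generator_loss(loss_scm, fake_logits, lam=0.5):
    """Hybrid objective L = L_sCM + lambda * L_adv, with lambda defaulting to 0.5.
    The adversarial term here is a simple non-saturating GAN loss (an assumption)."""
    loss_adv = -fake_logits.mean()
    return loss_scm + lam * loss_adv

def discriminator_loss(real_logits, fake_logits):
    """Hinge loss for the LADD-style discriminator heads (a common choice, assumed here)."""
    return F.relu(1.0 - real_logits).mean() + F.relu(1.0 + fake_logits).mean()
```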
SANA-Sprint is also a unified step-adaptive model capable of high-quality generation in 1-4 steps, eliminating the need for step-specific training and improving efficiency and flexibility. Timesteps for multi-step inference are optimized sequentially to maximize performance. The model can also be integrated with ControlNet for real-time interactive image generation, providing instant visual feedback and enabling applications such as real-time image editing, in which Holistically-Nested Edge Detection (HED) scribbles extracted from input images guide the generation process.
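As an illustration of step-adaptive inference, the sketch below performs few-step consistency sampling under the TrigFlow parameterization. The callable `model(x, t, cond)`, the data scale `sigma_d`, and the timestep schedule are assumptions made for exposition; the paper's sequentially optimized timesteps are not reproduced here.

```python
import math
import torch

@torch.no_grad()
def few_step_sample(model, cond, shape, timesteps=(math.pi / 2, 1.3, 0.6),
                    sigma_d=0.5, device="cpu"):
    # Start from pure noise at the maximum time t = pi/2 (cos(t) = 0).
    x = torch.randn(shape, device=device) * sigma_d
    for i, t in enumerate(timesteps):
        t_vec = torch.full((shape[0],), t, device=device)
        s, c = torch.sin(t_vec).view(-1, 1, 1, 1), torch.cos(t_vec).view(-1, 1, 1, 1)
        # TrigFlow-style consistency prediction of the clean sample x0 at time t.
        x0 = c * x - s * sigma_d * model(x / sigma_d, t_vec, cond)
        if i + 1 < len(timesteps):
            # Re-noise the prediction to the next (smaller) timestep and repeat.
            t_next = torch.full_like(t_vec, timesteps[i + 1]).view(-1, 1, 1, 1)
            x = torch.cos(t_next) * x0 + torch.sin(t_next) * sigma_d * torch.randn_like(x0)
        else:
            x = x0
    return x
```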
To stabilize continuous-time distillation and curb large gradient norms, SANA-Sprint refines the time embedding and integrates QK-Normalization into its attention mechanisms. A denser time embedding, with noise conditioning $c_{\text{noise}}(t) = t$ rather than $c_{\text{noise}}(t) = 1000t$, reduces gradient fluctuations and improves training stability. QK-Normalization applies RMS normalization to the queries and keys in the self- and cross-attention modules, which further stabilizes training, especially when scaling up model size.
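A compact PyTorch sketch of these two stabilization choices follows; the module layout (head count, embedding dimension, the hand-rolled RMSNorm) is illustrative and not taken from the SANA-Sprint codebase.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Minimal RMS normalization over the last dimension."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class QKNormAttention(nn.Module):
    """Self-attention with RMS-normalized queries and keys (QK-Normalization)."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.q_norm = RMSNorm(self.head_dim)   # applied per head before the dot product
        self.k_norm = RMSNorm(self.head_dim)

    def forward(self, x):
        B, N, C = x.shape
        q, k, v = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim).unbind(2)
        q, k = self.q_norm(q), self.k_norm(k)
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))   # (B, heads, N, head_dim)
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(B, N, C))

def time_embedding(t, dim=256, max_period=10000.0):
    """Sinusoidal embedding fed with c_noise(t) = t (t in [0, pi/2]) instead of 1000*t,
    giving a denser conditioning signal that reduces gradient fluctuations."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = t[:, None].float() * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
```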
SANA-Sprint establishes a new Pareto frontier in speed-quality tradeoffs, generating 1024x1024 images in 0.1 seconds on an H100 GPU and 0.31 seconds on an RTX 4090. In a single step, SANA-Sprint achieves 7.59 FID and 0.74 GenEval, outperforming FLUX-schnell's 7.94 FID and 0.71 GenEval while being 10x faster end-to-end (0.1s vs. 1.1s on an H100); measured by transformer latency alone on an A100, the speedup over FLUX-schnell is 64.7x. The method also enables near-real-time interaction with ControlNet at approximately 200 ms per image on an H100.
In summary, SANA-Sprint achieves rapid, high-quality image generation through hybrid distillation, a step-adaptive model, and integration with ControlNet, offering significant speed improvements and competitive performance metrics compared to existing methods.