SDXS Models: Real-Time One-Step Diffusion
- SDXS models are one-step latent diffusion frameworks that compress traditional U-Net and VAE components for rapid high-resolution image synthesis.
- They achieve near real-time speeds (up to 100 FPS at 512²) by replacing iterative denoising with a single-step generator and a distilled architecture.
- Conditional synthesis via ControlNet distillation supports image-to-image tasks with competitive fidelity while reducing computational cost.
SDXS models are real-time, one-step latent diffusion models for high-resolution image synthesis with image conditions, combining aggressive architecture miniaturization and a novel training objective to deliver near-GAN inference speeds in the Latent Diffusion Model (LDM) paradigm. The SDXS framework achieves this by distilling and shrinking traditional U-Net and VAE decoder components and reducing the iterative denoising process of standard diffusion models to a single-step generator. SDXS supports conditional image-to-image tasks through ControlNet distillation and demonstrates competitive fidelity and alignment metrics at a fraction of the computational cost of previous methods (Song et al., 25 Mar 2024).
1. Framework Objectives and Distinction from Standard Diffusion Models
The central aim of SDXS is to enable high-quality, high-resolution image synthesis at real-time speed (up to 100 FPS at 512×512 and 30 FPS at 1024×1024) using a latent diffusion architecture. Unlike conventional models such as Stable Diffusion (SD v1.5 or SDXL), which require 16–32 function evaluations (NFEs) of a large U-Net plus a heavy VAE decoder and therefore incur hundreds of milliseconds of latency per image, SDXS uses:
- A tiny CNN-based VAE decoder (≈1.2 M parameters; replaces typical 50 M-parameter decoders)
- A compact, block-pruned U-Net (0.32 B or 0.74 B parameters depending on resolution; replaces 0.87–2.56 B parameter baselines)
- A single-step generator trained with a custom two-term objective (feature matching and score distillation), enabling ∼100 FPS at 512² and ∼30 FPS at 1024².
This approach collapses the iterative diffusion process into a one-shot mapping, thus drastically reducing inference latency while supporting image-conditioned tasks (e.g., canny-to-image or depth-to-image translation).
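To make the one-shot mapping concrete, the following sketch runs a one-step SDXS-style checkpoint through the standard Hugging Face diffusers text-to-image pipeline. The checkpoint identifier and device placement are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch: one-step text-to-image inference with a distilled SDXS-style
# checkpoint via Hugging Face diffusers. The model id below is illustrative
# (any one-step SDXS checkpoint exported in diffusers format would work).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "IDKiro/sdxs-512-0.9",          # illustrative checkpoint id
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="a photo of a red fox in the snow",
    num_inference_steps=1,          # single U-Net evaluation instead of 16-32
    guidance_scale=0.0,             # one-step students are typically run without CFG
).images[0]
image.save("fox.png")
```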
2. Architecture Compression via Knowledge Distillation
SDXS achieves model compactness through a multi-stage knowledge distillation process targeted at both the VAE decoder and U-Net backbone.
2.1 Tiny VAE Decoder
A lightweight decoder $D_{\text{tiny}}$ is trained to mimic the output of the pretrained VAE decoder $D$, given latent codes $z = E(x)$. The VAE distillation loss is
$$\mathcal{L}_{\mathrm{VD}} = \big\| f\big(D(E(x))\big) - f\big(D_{\text{tiny}}(E(x))\big) \big\|_1 + \lambda_{\mathrm{GAN}}\, \mathcal{L}_{\mathrm{GAN}}\big(D_{\text{tiny}}(E(x)),\, x,\, D_{\psi}\big),$$
where $D_{\psi}$ is the discriminator and $f(\cdot)$ denotes 8× downsampling, so the reconstruction term focuses on 8×-downsampled images. The architecture consists solely of residual blocks and upsampling layers, omitting normalization and self-attention for maximum efficiency.
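A minimal sketch of such a decoder and of the reconstruction part of the distillation loss is shown below, assuming an SD-style 4-channel latent space and an overall 8× upsampling factor; channel widths and layer counts are illustrative, and the adversarial term is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Plain residual block: no normalization, no attention."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return x + self.conv2(F.relu(self.conv1(F.relu(x))))

class TinyDecoder(nn.Module):
    """Latent (B, 4, H/8, W/8) -> image (B, 3, H, W) via three upsampling stages."""
    def __init__(self, latent_ch=4, ch=64):
        super().__init__()
        layers = [nn.Conv2d(latent_ch, ch, 3, padding=1)]
        for _ in range(3):                       # 8x total upsampling
            layers += [ResBlock(ch),
                       nn.Upsample(scale_factor=2, mode="nearest"),
                       nn.Conv2d(ch, ch, 3, padding=1)]
        layers += [ResBlock(ch), nn.Conv2d(ch, 3, 3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)

def vae_distill_loss(student_img, teacher_img, down=8):
    """Reconstruction term of the VD loss on downsampled images
    (the GAN term of the full objective is omitted in this sketch)."""
    s = F.avg_pool2d(student_img, down)
    t = F.avg_pool2d(teacher_img, down)
    return F.l1_loss(s, t)
```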
2.2 Block-Removal Distilled U-Net
Starting from a pretrained teacher U-Net $\epsilon_T$, entire residual or transformer blocks are removed to create the student U-Net $\epsilon_S$. Two knowledge distillation (KD) losses are leveraged:
- Output Knowledge Distillation (OKD): $\mathcal{L}_{\mathrm{OKD}} = \mathbb{E}_{z_t, t, c}\, \big\| \epsilon_S(z_t, t, c) - \epsilon_T(z_t, t, c) \big\|_2^2$
- Feature Knowledge Distillation (FKD): $\mathcal{L}_{\mathrm{FKD}} = \mathbb{E}_{z_t, t, c} \sum_{l} \big\| f^{l}_S(z_t, t, c) - f^{l}_T(z_t, t, c) \big\|_2^2$, where $f^{l}$ denotes the feature map after stage $l$
The combined distillation loss is $\mathcal{L}_{\mathrm{KD}} = \mathcal{L}_{\mathrm{OKD}} + \lambda_F\, \mathcal{L}_{\mathrm{FKD}}$. For SDXS-512, specific block removals include the middle U-Net stage, the last downsample and first upsample stages, and pruning of high-resolution transformer blocks. For SDXS-1024, most transformer blocks are removed while preserving basic conditional capacity.
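In code, the two terms reduce to mean-squared errors on outputs and on matched intermediate features. The sketch below assumes both U-Nets expose their per-stage feature maps (e.g., via forward hooks), and the weighting λ_F is illustrative.

```python
import torch
import torch.nn.functional as F

def kd_losses(student_out, student_feats, teacher_out, teacher_feats, lambda_f=1.0):
    """Output KD (OKD) + feature KD (FKD) for block-removed U-Net distillation.

    student_out / teacher_out   : predicted noise, shape (B, 4, h, w)
    student_feats / teacher_feats: lists of matched intermediate feature maps
    """
    l_okd = F.mse_loss(student_out, teacher_out.detach())
    l_fkd = sum(F.mse_loss(fs, ft.detach())
                for fs, ft in zip(student_feats, teacher_feats))
    return l_okd + lambda_f * l_fkd
```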
2.3 ControlNet Distillation
ControlNet extends U-Net by duplicating the encoder and appending “zero-convs” for spatial controls. The teacher’s ControlNet is integrated into both teacher and student pipelines, with distillation (OKD/FKD) applied to the decoder, yielding a control-capable one-step student.
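For clarity, a "zero-conv" is simply a convolution whose weights and bias are initialized to zero, so the control branch contributes nothing at the start of training; a minimal PyTorch sketch (names are illustrative) follows.

```python
import torch.nn as nn

def zero_conv(channels):
    """1x1 convolution initialized to zero ("zero-conv"): at the start of
    training the ControlNet branch adds nothing to the frozen U-Net."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# Control features are added to the U-Net decoder's skip connections, e.g.
# skip = skip + zero_conv(skip_channels)(control_feature)
```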
3. One-Step Diffusion Training: Feature Matching and Score Distillation
The training objective for the one-step generator comprises two key phases:
3.1 Feature Matching Warmup
High-level feature matching is applied as a warmup phase to circumvent blurry averages from crossing ODE trajectories. Intermediate features are extracted from both student and teacher outputs using a frozen encoder backbone, with multiscale similarity (via SSIM) computed per layer:
$$\mathcal{L}_{\mathrm{FM}} = \sum_{l} \lambda_l \Big( 1 - \mathrm{SSIM}\big( \phi_l(\hat{x}_S),\, \phi_l(\hat{x}_T) \big) \Big),$$
where $\phi_l(\cdot)$ denotes the $l$-th layer of the frozen encoder, $\hat{x}_S$ and $\hat{x}_T$ are the student and teacher outputs, and $\lambda_l$ are per-layer weights.
This encourages perceptually sharp reconstructions during the initial steps of training.
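A minimal sketch of such a loss is given below, under simplifying assumptions: the frozen backbone is represented only by the lists of per-layer feature maps it produces, and SSIM is computed globally over each feature map rather than with the usual sliding window.

```python
import torch

def global_ssim(a, b, c1=0.01**2, c2=0.03**2):
    """Simplified, non-windowed SSIM over the spatial dims of (B, C, H, W) features."""
    mu_a, mu_b = a.mean(dim=(2, 3)), b.mean(dim=(2, 3))
    var_a, var_b = a.var(dim=(2, 3)), b.var(dim=(2, 3))
    cov = ((a - mu_a[..., None, None]) * (b - mu_b[..., None, None])).mean(dim=(2, 3))
    ssim = ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))
    return ssim.mean()

def feature_matching_loss(student_feats, teacher_feats, layer_weights=None):
    """Multiscale (per-layer) SSIM dissimilarity between student and teacher features."""
    if layer_weights is None:
        layer_weights = [1.0] * len(student_feats)
    return sum(w * (1.0 - global_ssim(fs, ft.detach()))
               for w, fs, ft in zip(layer_weights, student_feats, teacher_feats))
```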
3.2 Segmented Score Distillation (SSD)
The student's marginal score $\nabla_{x_t}\log p_\theta(x_t)$ is aligned with the teacher network's score $\nabla_{x_t}\log p_T(x_t)$, adopting an Inverse KL (IKL) gradient formulation:
$$\nabla_\theta \mathcal{D}_{\mathrm{IKL}} = \mathbb{E}_{t,\, x_0 = G_\theta(z),\, x_t \sim q(x_t \mid x_0)} \Big[ w(t)\, \big( s_\theta(x_t, t) - s_T(x_t, t) \big)\, \frac{\partial x_t}{\partial \theta} \Big].$$
A time segmentation divides the backpropagation between direct score-matching gradients (for $t \ge t_s$) and feature-matching gradients (for $t < t_s$):
$$\nabla_\theta \mathcal{L}_{\mathrm{SSD}} = \mathbb{E}_{t \ge t_s} \Big[ \Delta s(x_t, t)\, \frac{\partial x_t}{\partial \theta} \Big] + \mathbb{E}_{t < t_s} \big[ \nabla_\theta \mathcal{L}_{\mathrm{FM}} \big],$$
where $\Delta s(x_t, t) = w(t)\big( s_\theta(x_t, t) - s_T(x_t, t) \big)$ encapsulates the weighted difference in scores between student and teacher. The score-matching term for $t \ge t_s$ and the feature-matching term for $t < t_s$ together ensure convergence to the teacher's one-step marginal.
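The structure of this segmented gradient can be sketched as follows. The scheduler interface, the threshold `t_seg`, and the omission of the weighting $w(t)$ inside the surrogate term are simplifications, and all names are placeholders rather than the paper's implementation.

```python
import torch

def ssd_loss(generator, online_eps, teacher_eps, scheduler,
             z, cond, teacher_x0, t_seg, fm_loss_fn):
    """Segmented score distillation (SSD) surrogate loss for the one-step generator.

    online_eps / teacher_eps : student (online) and frozen teacher noise predictors,
                               standing in for the two score functions.
    teacher_x0               : reference latents produced by the multi-step teacher.
    scheduler                : DDPM-style noise scheduler (e.g. diffusers' DDPMScheduler).
    fm_loss_fn               : feature-matching loss, e.g. the SSIM-based loss above.
    """
    x0 = generator(z, cond)                               # one-step student sample
    t = torch.randint(0, scheduler.config.num_train_timesteps,
                      (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = scheduler.add_noise(x0, noise, t)               # diffuse the student sample

    high = t >= t_seg                                     # score-matching segment
    low = ~high                                           # feature-matching segment
    loss = x0.new_zeros(())
    if high.any():
        with torch.no_grad():
            delta = online_eps(x_t[high], t[high], cond[high]) \
                  - teacher_eps(x_t[high], t[high], cond[high])
        # Surrogate term: its gradient w.r.t. the generator parameters is
        # E[(s_student - s_teacher) * d x_t / d theta], the IKL gradient
        # up to the timestep weighting w(t).
        loss = loss + (delta * x_t[high]).mean()
    if low.any():
        # Low-noise segment: back-propagate the feature-matching loss instead.
        loss = loss + fm_loss_fn(x0[low], teacher_x0[low])
    return loss
```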
3.3 Formal Loss Expressions
In summary, the one-step generator relies on two auxiliary losses: the feature-matching loss $\mathcal{L}_{\mathrm{FM}}$, which drives the warmup phase and the low-noise segment $t < t_s$, and the segmented score-distillation gradient $\nabla_\theta \mathcal{L}_{\mathrm{SSD}}$ defined above, which governs the remainder of training.
4. Quantitative Performance and Efficiency
Empirical results establish that SDXS models achieve substantial speedups over traditional diffusion baselines, with modest trade-offs in quality metrics:
| Model | Res. | Steps | U-Net Size | FID | CLIP | Latency (ms) | FPS |
|---|---|---|---|---|---|---|---|
| SD v1.5 | 512² | 16 | 860 M | 24.3 | 31.8 | 276 | 3.6 |
| SDXS-512 | 512² | 1 | 319 M | 28.2 | 32.8 | 9 | 110 |
| SDXL | 1024² | 32 | 2.56 B | 24.6 | 33.8 | 1,869 | 0.54 |
| SDXS-1024 | 1024² | 1 | 0.74 B | 30.9 | 32.3 | 32 | 30 |
Model shrinkage (U-Net ×2–4 smaller) induces a ~4–6 point FID degradation but may retain or improve CLIP alignment. Collapsing sampling to a single step costs an additional ~5–6 FID points relative to 32-step baselines but, combined with the smaller backbone, yields roughly 30–60× lower end-to-end latency (see the table above).
Ablation studies confirm that feature matching with a perceptual/SSIM objective yields significantly sharper outputs than plain MSE, while the full segmented score distillation (SSD) gives the best fidelity/speed trade-off.
5. Conditional Synthesis and Image-to-Image Translation
Through joint ControlNet distillation, SDXS supports robust one-step image-conditioned synthesis (e.g., canny edge maps, depth maps). Training pairs are generated by extracting a control map from a teacher-synthesized image and presenting it to the student, as sketched below. The SSD objective simultaneously aligns the output distributions and enforces structural control.
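A minimal sketch of this pair construction for the canny condition, using OpenCV's edge detector on a teacher-generated image; the thresholds and array handling are illustrative.

```python
import cv2
import numpy as np

def make_canny_pair(teacher_image_rgb, low=100, high=200):
    """Build a (control_map, target_image) pair from a teacher-synthesized image.

    teacher_image_rgb: uint8 array of shape (H, W, 3) produced by the teacher model.
    Returns the 3-channel Canny edge map used as the student's spatial condition.
    """
    gray = cv2.cvtColor(teacher_image_rgb, cv2.COLOR_RGB2GRAY)
    edges = cv2.Canny(gray, low, high)            # (H, W) uint8 edge map
    control = np.stack([edges] * 3, axis=-1)      # replicate to 3 channels
    return control, teacher_image_rgb
```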
Experimental results demonstrate accurate structural preservation in conditional synthesis, though sample diversity is noted to decrease, a trade-off flagged for future work.
6. Implications and Future Directions
SDXS exemplifies the feasibility of compressing diffusion models to real-time, one-step architectures with competitive generation quality, particularly for deployment contexts where low latency is critical. The architecture’s flexibility for conditioned generation also encourages exploration in real-time image editing, vision-based control, and resource-constrained inference.
A plausible implication is that the segmented score distillation approach could potentially generalize to other domains where matching high-level distributional statistics is preferable to pixel-level losses or raw MSE. The observed reduction in sample diversity suggests a need for further research into preserving generative uncertainty under aggressive distillation and sampling compression (Song et al., 25 Mar 2024).