SDXS Models: Real-Time One-Step Diffusion

Updated 17 November 2025
  • SDXS models are one-step latent diffusion frameworks that compress traditional U-Net and VAE components for rapid high-resolution image synthesis.
  • They achieve near real-time speeds (up to 100 FPS at 512²) by replacing iterative denoising with a single-step generator and a distilled architecture.
  • Conditional synthesis via ControlNet distillation supports image-to-image tasks with competitive fidelity while reducing computational cost.

SDXS models are real-time, one-step latent diffusion models for high-resolution image synthesis with image conditions, combining aggressive architecture miniaturization and a novel training objective to deliver near-GAN inference speeds in the Latent Diffusion Model (LDM) paradigm. The SDXS framework achieves this by distilling and shrinking traditional U-Net and VAE decoder components and reducing the iterative denoising process of standard diffusion models to a single-step generator. SDXS supports conditional image-to-image tasks through ControlNet distillation and demonstrates competitive fidelity and alignment metrics at a fraction of the computational cost of previous methods (Song et al., 25 Mar 2024).

1. Framework Objectives and Distinction from Standard Diffusion Models

The central aim of SDXS is to enable high-quality, high-resolution image synthesis at real-time speed (up to 100 FPS at 512×512 and 30 FPS at 1024×1024) using a latent diffusion architecture. Unlike conventional models such as Stable Diffusion (SD v1.5 or SDXL), which require 16–32 network function evaluations (NFEs) through large U-Nets and heavy VAE decoders, resulting in latencies of hundreds of milliseconds per image, SDXS uses:

  • A tiny CNN-based VAE decoder (≈1.2 M parameters; replaces typical 50 M-parameter decoders)
  • A compact, block-pruned U-Net (0.32 B or 0.74 B parameters depending on resolution; replaces 0.87–2.56 B parameter baselines)
  • A single-step generator trained with a custom two-term objective (feature matching and score distillation), enabling ∼100 FPS at 512² and ∼30 FPS at 1024².

This approach collapses the iterative diffusion process into a one-shot mapping, thus drastically reducing inference latency while supporting image-conditioned tasks (e.g., canny-to-image or depth-to-image translation).
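
To make the one-shot mapping concrete, below is a minimal inference sketch assuming a distilled one-step U-Net and the tiny VAE decoder are already loaded as PyTorch modules; the interfaces shown (the `encoder_hidden_states` argument, the fixed timestep) are illustrative assumptions, not the released SDXS API.

```python
import torch

@torch.no_grad()
def one_step_generate(unet, tiny_decoder, text_emb,
                      latent_shape=(1, 4, 64, 64), t_fixed=999, device="cuda"):
    """Single U-Net evaluation from latent noise, followed by the distilled decoder."""
    z = torch.randn(latent_shape, device=device)            # latent noise; no iterative denoising loop
    t = torch.full((latent_shape[0],), t_fixed, device=device, dtype=torch.long)
    x0_latent = unet(z, t, encoder_hidden_states=text_emb)  # one-step prediction of the clean latent
    image = tiny_decoder(x0_latent)                          # ~1.2M-parameter distilled VAE decoder
    return image.clamp(-1, 1)
```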

2. Architecture Compression via Knowledge Distillation

SDXS achieves model compactness through a multi-stage knowledge distillation process targeted at both the VAE decoder and U-Net backbone.

2.1 Tiny VAE Decoder

A lightweight decoder $G(z)$ is trained to mimic the output $\tilde{x}$ of a pretrained VAE, given latent codes $z$. The VAE distillation loss is

$$\mathcal{L}_{VD} = \big\| G(z)_{\downarrow 8\times} - \tilde{x}_{\downarrow 8\times} \big\|_1 + \lambda_{GAN}\,\mathcal{L}_{GAN}\big(G(z), \tilde{x}, D\big)$$

where $D$ is the discriminator and the $L_1$ reconstruction term is computed on $8\times$-downsampled images. The architecture consists solely of residual blocks and upsampling layers, omitting normalization and self-attention for maximum efficiency.
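
A minimal sketch of this distillation loss is shown below, assuming a student decoder `G`, precomputed teacher reconstructions `x_tilde`, and a patch discriminator `D`; the average-pooling downsampler and the non-saturating GAN term are common choices made here for illustration and may differ from the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def vae_distill_loss(G, D, z, x_tilde, lambda_gan=0.1):
    x_student = G(z)
    # L1 term on 8x-downsampled images (average pooling used as the downsampler here)
    l1 = F.l1_loss(F.avg_pool2d(x_student, 8), F.avg_pool2d(x_tilde, 8))
    # generator-side GAN term (non-saturating variant, chosen for illustration)
    gan = F.softplus(-D(x_student)).mean()
    return l1 + lambda_gan * gan
```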

2.2 Block-Removal Distilled U-Net

Starting from a pretrained teacher U-Net $\phi$, entire residual or transformer blocks are removed to create the student U-Net $\theta$. Two knowledge distillation (KD) losses are leveraged:

  • Output Knowledge Distillation (OKD):

$$\mathcal{L}_{OKD} = \int_0^T \mathbb{E}_{x_0,\epsilon,t}\, \big\| s_\theta(x_t, t) - s_\phi(x_t, t) \big\|_2^2 \, dt$$

  • Feature Knowledge Distillation (FKD):

$$\mathcal{L}_{FKD} = \int_0^T \mathbb{E}_{x_0,\epsilon,t} \sum_{l} \big\| f^l_\theta(x_t, t) - f^l_\phi(x_t, t) \big\|_2^2 \, dt$$

The combined distillation loss is $\mathcal{L}_{KD} = \mathcal{L}_{OKD} + \lambda_F \mathcal{L}_{FKD}$. For SDXS-512, specific block removals include the middle U-Net stage, last downsample and first upsample stages, and pruning of high-res transformers. For SDXS-1024, most transformer blocks are removed while preserving basic conditional capacity.
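
The combined objective can be sketched as follows, assuming wrappers around the teacher and student U-Nets that also return the intermediate block features used for FKD (the `return_features` flag is an assumption about such a wrapper, not a diffusers API).

```python
import torch
import torch.nn.functional as F

def unet_kd_loss(student, teacher, x_t, t, cond, lambda_f=1.0):
    with torch.no_grad():
        s_teacher, feats_teacher = teacher(x_t, t, cond, return_features=True)
    s_student, feats_student = student(x_t, t, cond, return_features=True)
    okd = F.mse_loss(s_student, s_teacher)                        # output knowledge distillation
    fkd = sum(F.mse_loss(fs, ft)                                  # feature knowledge distillation
              for fs, ft in zip(feats_student, feats_teacher))
    return okd + lambda_f * fkd                                   # L_KD = L_OKD + lambda_F * L_FKD
```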

2.3 ControlNet Distillation

ControlNet extends U-Net by duplicating the encoder and appending “zero-convs” for spatial controls. The teacher’s ControlNet is integrated into both teacher and student pipelines, with distillation (OKD/FKD) applied to the decoder, yielding a control-capable one-step student.
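
A rough sketch of this setup, with the frozen teacher ControlNet feeding the same control residuals into both U-Nets (function and argument names are illustrative, not the actual SDXS interfaces):

```python
import torch
import torch.nn.functional as F

def controlnet_kd_loss(student_unet, teacher_unet, controlnet, x_t, t, cond, control_map):
    # The frozen teacher ControlNet turns the control map into spatial residuals.
    with torch.no_grad():
        residuals = controlnet(x_t, t, cond, control_map)
        s_teacher = teacher_unet(x_t, t, cond, control_residuals=residuals)
    # The same residuals condition the student, so distillation targets the U-Net decoder path.
    s_student = student_unet(x_t, t, cond, control_residuals=residuals)
    return F.mse_loss(s_student, s_teacher)
```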

3. One-Step Diffusion Training: Feature Matching and Score Distillation

The training objective for the one-step generator $x_\theta(z)$ comprises two key phases:

3.1 Feature Matching Warmup

High-level feature matching is applied as a warmup phase to avoid the blurry averages that arise when ODE trajectories cross. Intermediate features $f^l$ are extracted from both student and teacher outputs using a frozen encoder backbone, with multiscale similarity (via SSIM) computed per layer:

$$\mathcal{L}_{FM} = \sum_l w_l \; \mathrm{SSIM}\big(f^l(x_\theta(\epsilon)),\, f^l(\psi(x_\phi(\epsilon)))\big)$$

This encourages perceptually sharp reconstructions during the initial steps of training.
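
A sketch of this warmup loss, assuming a frozen backbone that returns a list of intermediate feature maps; using `1 - SSIM` to turn the similarity into a minimizable term is a standard convention adopted here, with `torchmetrics` supplying the SSIM implementation.

```python
import torch
from torchmetrics.functional import structural_similarity_index_measure as ssim

def feature_matching_loss(feat_extractor, x_student, x_teacher, weights):
    feats_s = feat_extractor(x_student)                 # per-layer features of the one-step sample
    with torch.no_grad():
        feats_t = feat_extractor(x_teacher)             # per-layer features of the teacher sample
    loss = x_student.new_zeros(())
    for w, fs, ft in zip(weights, feats_s, feats_t):
        loss = loss + w * (1.0 - ssim(fs, ft))          # weighted SSIM similarity per layer
    return loss
```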

3.2 Segmented Score Distillation (SSD)

The student’s marginal score $s_\theta(x_t, t)$ is aligned with the teacher network’s score $s_{p_t}(x_t)$, adopting an Inverse KL (IKL) gradient formulation:

$$\operatorname{Grad}(\theta) = \int_{0}^{T} w(t)\; \mathbb{E}_{z, x_t}\big[ s_\theta(x_t, t) - s_{p_t}(x_t) \big] \frac{\partial x_t}{\partial \theta} \, dt$$

A time segmentation at $\alpha T$ divides the backpropagation between direct score-matching gradients (for $t \leq \alpha T$) and feature-matching gradients (for $t > \alpha T$):

$$\operatorname{Grad}(\theta) \approx \int_{0}^{\alpha T} F(t, x_t)\, dt + \lambda_{FM}\, \frac{\partial \mathcal{L}_{FM}}{\partial \theta}$$

where $F(t, x_t)$ encapsulates the weighted difference in scores between student and teacher. Taking $\alpha \to 0$ and $\lambda_{FM} \to 0$ ensures convergence to the teacher’s one-step marginal.
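
The following is a simplified training-step sketch of SSD under several assumptions: a score-network interface for both teacher and student score models, a diffusers-style `scheduler.add_noise`, and the usual surrogate-loss trick for injecting the IKL score gradient; the actual implementation details may differ.

```python
import torch
import torch.nn.functional as F

def ssd_step(generator, teacher_score, student_score, fm_loss_fn,
             z, cond, scheduler, alpha=0.25, lambda_fm=1.0, T=1000):
    x0 = generator(z, cond)                                   # one-step sample from the student generator
    t_scalar = int(torch.randint(0, T, (1,)))
    t = torch.full((x0.shape[0],), t_scalar, device=x0.device, dtype=torch.long)
    noise = torch.randn_like(x0)
    x_t = scheduler.add_noise(x0, noise, t)                   # diffuse the generated sample to time t

    if t_scalar <= alpha * T:
        # score-matching segment: push the student's marginal score toward the teacher's
        with torch.no_grad():
            grad = student_score(x_t, t, cond) - teacher_score(x_t, t, cond)
        # surrogate loss whose gradient w.r.t. x_t is proportional to `grad`
        loss = F.mse_loss(x_t, (x_t - grad).detach())
    else:
        # feature-matching segment handles the remaining (large-t) portion
        loss = lambda_fm * fm_loss_fn(x0)
    return loss
```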

3.3 Formal Loss Expressions

The auxiliary losses used are:

$$\mathcal{L}_{FM}^{(2)} = \mathbb{E}_z\, \big\| f_{teacher}(z) - f_{student}(z) \big\|_2^2,$$

$$\mathcal{L}_{SD} = \mathbb{E}_{x_t, t}\, \big\| s_{teacher}(x_t, t) - s_{student}(x_t, t) \big\|_2^2$$
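
These two terms translate directly into MSE losses; a minimal transcription, where the callables `f_*` and `s_*` are assumed to return feature maps and score estimates respectively:

```python
import torch.nn.functional as F

def aux_losses(f_teacher, f_student, s_teacher, s_student, z, x_t, t):
    l_fm = F.mse_loss(f_student(z), f_teacher(z))                # L_FM^(2): feature-matching MSE
    l_sd = F.mse_loss(s_student(x_t, t), s_teacher(x_t, t))     # L_SD: score-distillation MSE
    return l_fm, l_sd
```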

4. Quantitative Performance and Efficiency

Empirical results establish that SDXS models achieve substantial speedups over traditional diffusion baselines, with modest trade-offs in quality metrics:

| Model | Resolution | Steps | U-Net Size | FID | CLIP | Latency (ms) | FPS |
|---|---|---|---|---|---|---|---|
| SD v1.5 | 512² | 16 | 860 M | 24.3 | 31.8 | 276 | 3.6 |
| SDXS-512 | 512² | 1 | 319 M | 28.2 | 32.8 | 9 | 110 |
| SDXL | 1024² | 32 | 2.56 B | 24.6 | 33.8 | 1,869 | 0.54 |
| SDXS-1024 | 1024² | 1 | 0.74 B | 30.9 | 32.3 | 32 | 30 |

Model shrinkage (a U-Net 2–4× smaller) induces a ~4–6 point FID degradation but may retain or even improve CLIP alignment. Collapsing sampling to a single step costs an additional ~5–6 FID points relative to 32-step baselines but enables order-of-magnitude speed gains.

Ablation studies confirm that including feature matching (perceptual/SSIM similarity) yields significantly sharper results than plain MSE, while the full segmented score distillation (SSD) objective gives the best fidelity/speed trade-off.

5. Conditional Synthesis and Image-to-Image Translation

Through joint ControlNet distillation, SDXS supports robust one-step image-conditioned synthesis (e.g., canny-to-image, depth-to-image). Training pairs are generated by extracting a control map $c$ from a teacher-synthesized image $x_0$ and presenting $(z, c)$ to the student. The SSD objective simultaneously aligns distributions and enforces structural control.
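
A sketch of how such a training pair $(z, c)$ might be constructed for the canny-to-image case, using OpenCV's Canny detector on a teacher-generated image; the pipeline interface and tensor layout are assumptions for illustration.

```python
import cv2
import numpy as np
import torch

def make_control_pair(teacher_pipeline, prompt, device="cuda"):
    x0 = teacher_pipeline(prompt)                          # teacher-synthesized image, assumed HxWx3 uint8
    gray = cv2.cvtColor(np.asarray(x0), cv2.COLOR_RGB2GRAY)
    edges = cv2.Canny(gray, 100, 200)                      # control map c extracted from the teacher image
    c = torch.from_numpy(edges).float().div(255.0)[None, None].to(device)
    z = torch.randn(1, 4, c.shape[-2] // 8, c.shape[-1] // 8, device=device)  # fresh latent noise
    return z, c
```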

Experimental results demonstrate accurate structural preservation in conditional synthesis, though sample diversity is noted to decrease—a trade-off for future work.

6. Implications and Future Directions

SDXS exemplifies the feasibility of compressing diffusion models to real-time, one-step architectures with competitive generation quality, particularly for deployment contexts where low latency is critical. The architecture’s flexibility for conditioned generation also encourages exploration in real-time image editing, vision-based control, and resource-constrained inference.

A plausible implication is that the segmented score distillation approach could potentially generalize to other domains where matching high-level distributional statistics is preferable to pixel-level losses or raw MSE. The observed reduction in sample diversity suggests a need for further research into preserving generative uncertainty under aggressive distillation and sampling compression (Song et al., 25 Mar 2024).
