An Overview of Scale-wise Distillation of Diffusion Models
The paper presents Scale-wise Distillation (SwD), a framework for diffusion models (DMs) aimed at making diffusion-based high-resolution image synthesis more efficient. The proposal builds on the observation that diffusion processes perform an implicit spectral autoregression, which suggests that substantial computation can be saved by starting generation at reduced resolutions.
Framework Introduction and Methodology
SwD modifies existing diffusion models by introducing a scale-wise progression into the image generation process. The core idea is to begin sampling at a reduced resolution and to progressively upscale the latent at each denoising step. This mirrors the coarse-to-fine structure of natural image formation observed in human perception, and it eliminates the redundant high-resolution computation that traditional DMs perform at high noise levels, where fine detail is drowned out by noise anyway.
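The scale-wise sampling loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `upsample` routine, the toy time schedule, the re-noising rule, and the placeholder one-step denoiser are all hypothetical stand-ins for the actual SwD components.

```python
import numpy as np

def upsample(x, factor=2):
    # nearest-neighbor upsampling (a stand-in for the latent upscaling step)
    return x.repeat(factor, axis=-2).repeat(factor, axis=-1)

def scale_wise_sample(denoise_fn, scales=(8, 16, 32, 64), channels=4, seed=0):
    """Hypothetical SwD-style sampler: start from noise at the lowest
    resolution, take one generator step per scale, and upscale the
    partially denoised latent before each subsequent step."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((channels, scales[0], scales[0]))
    for i, res in enumerate(scales):
        t = 1.0 - i / len(scales)          # toy schedule: high noise first
        x = denoise_fn(x, t)               # one few-step-generator step
        if i + 1 < len(scales):
            x = upsample(x, scales[i + 1] // res)
            # re-noise the upscaled latent to match the next noise level
            t_next = 1.0 - (i + 1) / len(scales)
            x = (1 - t_next) * x + t_next * rng.standard_normal(x.shape)
    return x

# dummy one-step "generator" purely for illustration
toy_denoiser = lambda x, t: x * (1 - t)
out = scale_wise_sample(toy_denoiser)
print(out.shape)  # (4, 64, 64)
```

Note how expensive high-resolution evaluations occur only in the final, low-noise steps; the early, high-noise steps run at a fraction of the cost.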
The framework integrates with existing diffusion distillation methods built on distribution matching principles. A key distinction of SwD is a newly introduced patch-level objective, Patch Distribution Matching (PDM), which enforces fine-grained similarity to the target distribution and strengthens the generation of intricate image details, improving both computational efficiency and quality. By minimizing the distance between distributions of patches rather than of whole images, PDM exploits the expressive power of pretrained models' intermediate features.
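One simple way to realize a patch-level distribution distance is Maximum Mean Discrepancy (MMD) between patch feature vectors. The sketch below is an assumption-laden illustration of the idea, not the paper's exact PDM loss: treating each spatial location of a feature map as a "patch" and using an RBF-kernel MMD are choices made here for concreteness.

```python
import numpy as np

def patch_vectors(feats, n_patches=64, seed=0):
    # treat each spatial location of a (C, H, W) feature map as a patch vector
    C, H, W = feats.shape
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, H * W, size=n_patches)
    return feats.reshape(C, -1).T[idx]          # (n_patches, C)

def mmd2(x, y, sigma=1.0):
    # biased estimator of squared Maximum Mean Discrepancy with an RBF
    # kernel: one way to measure distance between two patch distributions
    def k(a, b):
        d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(1)
student_feats = rng.standard_normal((16, 8, 8))  # generator features (toy)
teacher_feats = rng.standard_normal((16, 8, 8))  # pretrained features (toy)
loss = mmd2(patch_vectors(student_feats), patch_vectors(teacher_feats, seed=1))
print(f"patch-level MMD^2: {loss:.4f}")
```

Because the loss compares sets of patches rather than full images, it stays sensitive to local texture statistics even when global structure already matches.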
Implementation and Results
The authors evaluate SwD on several state-of-the-art text-to-image diffusion models. The results show a tangible reduction in inference time, comparable to that of two full-resolution steps, along with superior performance under constrained computational budgets. For instance, the generated images achieve improved FID scores and human preference metrics. SwD obtains these results while being up to 10 times faster than comparable methodologies.
The application of SwD is primarily demonstrated using latent transformer-based diffusion models, specifically variants of the DiT architecture. Additional modifications include time schedule adjustments and reliance on synthetic data during training, which collectively contribute to the model's capability to optimize speed and quality simultaneously.
Theoretical and Practical Implications
The theoretical foundation of SwD is a spectral analysis of diffusion latent spaces, which shows that under high noise the process can be modeled effectively at lower resolutions. Practically, this insight enables few-step generators that produce high-fidelity images at reduced computational cost. The dual-purpose functionality of the distilled models, serving as both efficient generators and image upscalers, merits particular attention for its versatility.
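The spectral argument can be illustrated with a toy calculation. Natural images (and, approximately, their latents) have power spectra that decay rapidly with frequency, while Gaussian diffusion noise is spectrally flat, so at high noise levels only low frequencies remain above the noise floor. The 1/f² spectrum and the noise magnitude below are idealized assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

n = 256
freqs = np.fft.rfftfreq(n, d=1.0)[1:]      # frequency grid, skipping DC
signal_power = 1.0 / freqs ** 2            # idealized natural-image spectrum
noise_power = np.full_like(freqs, 50.0)    # strong, spectrally flat noise

snr = signal_power / noise_power
informative = freqs[snr > 1.0]             # bands where signal beats noise
cutoff = informative.max() if informative.size else 0.0
print(f"signal dominates noise only below f ~ {cutoff:.3f} (Nyquist = 0.5)")
```

Since everything above the cutoff is noise-dominated at this noise level, a model operating on a downsampled (band-limited) latent loses essentially no recoverable information, which is the intuition behind starting SwD's generation at low resolution.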
Potential Applications and Future Directions
The implications of SwD extend beyond conventional image generation tasks. Its application offers promising avenues for research in video generation, where temporal coherence and resolution dynamics are paramount. Additionally, the framework could inspire the development of adaptive scale scheduling techniques that dynamically adjust resolution based on image complexity or specific task requirements.
In conclusion, the scale-wise distillation framework represents a significant advancement in the efficiency of diffusion models, providing an elegant and computationally feasible approach to high-resolution image synthesis. The methodology's competitive performance positions it as a viable alternative or complement to existing generative strategies, marking a step forward in the ongoing evolution of diffusion-based models in artificial intelligence.