
SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation (2312.05239v7)

Published 8 Dec 2023 in cs.CV

Abstract: Despite their ability to generate high-resolution and diverse images from text prompts, text-to-image diffusion models often suffer from slow iterative sampling processes. Model distillation is one of the most effective directions to accelerate these models. However, previous distillation methods fail to retain the generation quality while requiring a significant amount of images for training, either from real data or synthetically generated by the teacher model. In response to this limitation, we present a novel image-free distillation scheme named $\textbf{SwiftBrush}$. Drawing inspiration from text-to-3D synthesis, in which a 3D neural radiance field that aligns with the input prompt can be obtained from a 2D text-to-image diffusion prior via a specialized loss without the use of any 3D ground-truth data, our approach re-purposes that same loss for distilling a pretrained multi-step text-to-image model to a student network that can generate high-fidelity images with just a single inference step. In spite of its simplicity, our model stands as one of the first one-step text-to-image generators that can produce images of comparable quality to Stable Diffusion without reliance on any training image data. Remarkably, SwiftBrush achieves an FID score of $\textbf{16.67}$ and a CLIP score of $\textbf{0.29}$ on the COCO-30K benchmark, on par with or even substantially surpassing existing state-of-the-art distillation techniques.

Citations (39)

Summary

  • The paper presents a novel one-step diffusion framework that leverages Variational Score Distillation to efficiently distill multi-step text-to-image models without relying on image data.
  • It achieves an FID score of 16.67 and a CLIP score of 0.29 on the COCO-30K benchmark while generating high-quality images roughly 20 times faster than its multi-step teacher, Stable Diffusion.
  • The results highlight practical benefits for resource-constrained environments and suggest broad applicability in video, audio, and text generation tasks.

Summary of SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation

In this paper, the authors introduce SwiftBrush, a one-step text-to-image diffusion model trained with an image-free distillation strategy built on Variational Score Distillation (VSD), a loss originally developed for text-to-3D synthesis. The central aim is to significantly improve the efficiency of text-to-image diffusion models, which are traditionally slow due to their iterative sampling processes, without compromising generation quality.

The proposed SwiftBrush model is directly motivated by text-to-3D synthesis techniques, in particular the successful training of 3D Neural Radiance Fields (NeRFs) without any 3D ground-truth data. The authors repurpose that distillation loss to compress a multi-step text-to-image teacher into a one-step student network. Crucially, this removes any reliance on image data during training, a major limitation of previous distillation methods.
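
To make the mechanics concrete, below is a minimal, illustrative PyTorch sketch of one image-free VSD-style distillation step. It is not the authors' code: `student`, `teacher`, and `lora_teacher` are hypothetical callables standing in for the one-step generator, the frozen pretrained diffusion UNet, and its trainable LoRA copy, and details such as timestep weighting, classifier-free guidance, and the text encoder are omitted.

```python
# Minimal sketch of one VSD-style distillation step (illustrative only).
# `student`, `teacher`, `lora_teacher` are hypothetical callables with the
# rough signature model(latents, timesteps, text_emb) -> epsilon prediction.
import torch
import torch.nn.functional as F

def vsd_step(student, teacher, lora_teacher, text_emb,
             alphas_cumprod, opt_student, opt_lora):
    batch, device = text_emb.shape[0], text_emb.device

    # 1) Image-free, one-step generation: the student maps pure noise
    #    (plus the prompt embedding) straight to a latent "image".
    z = torch.randn(batch, 4, 64, 64, device=device)
    x0 = student(z, text_emb)

    # 2) Re-noise the student's output at a random diffusion timestep t,
    #    exactly as in standard diffusion training.
    t = torch.randint(0, alphas_cumprod.shape[0], (batch,), device=device)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise

    # 3) VSD gradient: frozen-teacher score minus LoRA-teacher score.
    #    (Per-timestep weighting w(t) omitted for brevity.)
    with torch.no_grad():
        eps_teacher = teacher(x_t, t, text_emb)
        eps_lora = lora_teacher(x_t, t, text_emb)
    grad = eps_teacher - eps_lora

    # Surrogate loss whose gradient w.r.t. x0 equals `grad` (up to scale).
    loss_student = (grad * x0).mean()
    opt_student.zero_grad(); loss_student.backward(); opt_student.step()

    # 4) Keep the LoRA teacher in sync with the student's current samples
    #    via the usual epsilon-prediction (denoising) objective.
    eps_pred = lora_teacher(x_t.detach(), t, text_emb)
    loss_lora = F.mse_loss(eps_pred, noise)
    opt_lora.zero_grad(); loss_lora.backward(); opt_lora.step()
    return loss_student.item(), loss_lora.item()
```

The key design point in this sketch is the second, trainable teacher: its score tracks the student's current output distribution, so the student's gradient pushes generated samples toward the frozen teacher's distribution rather than collapsing onto a single mode, which is what allows training to proceed without any images.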

Quantitatively, SwiftBrush demonstrates impressive performance, achieving an FID score of 16.67 and a CLIP score of 0.29 on the COCO-30K benchmark. These results are either on par with or superior to existing state-of-the-art distillation techniques that often require extensive training data. Moreover, the authors report that the SwiftBrush model generates high-fidelity images approximately 20 times faster than Stable Diffusion, marking a substantial advancement in the deployment of diffusion models in resource-constrained environments such as consumer devices.
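
The speedup comes almost entirely from collapsing the sampler: a standard diffusion pipeline calls the denoising network 25-50 times per image, while the distilled student calls it once. The toy comparison below, using a small stand-in network rather than a real Stable Diffusion UNet (illustrative only, not a benchmark of the actual models), shows where the wall-clock savings originate.

```python
# Toy illustration: the multi-step sampler calls the denoiser N times,
# the distilled one-step student calls it once. `TinyUNet` is a stand-in,
# not a real diffusion backbone.
import time
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(4, 64, 3, padding=1), nn.SiLU(),
                                 nn.Conv2d(64, 4, 3, padding=1))
    def forward(self, x):
        return self.net(x)

model = TinyUNet().eval()
z = torch.randn(1, 4, 64, 64)

with torch.no_grad():
    t0 = time.perf_counter()
    x = z
    for _ in range(25):          # typical multi-step sampler budget
        x = x - 0.04 * model(x)  # stand-in for one denoising update
    multi = time.perf_counter() - t0

    t0 = time.perf_counter()
    _ = model(z)                 # distilled student: a single forward pass
    single = time.perf_counter() - t0

print(f"multi-step: {multi:.4f}s, one-step: {single:.4f}s "
      f"(~{multi / single:.0f}x)")
```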

The authors thoroughly position their work within the broader landscape of diffusion models, highlighting its applicability across various domains, including video synthesis, audio generation, and text generation. They also offer insights into how techniques from text-to-3D training can be effectively adapted to improve text-to-image diffusion.

The experimental section details the evaluation methodology: SwiftBrush is assessed on standard benchmarks, including MS COCO 2014 and Human Preference Score v2. The model's ability to maintain high-quality outputs in a one-step regime, where prior methods produced noticeably degraded results, underscores the efficacy of the Variational Score Distillation approach.
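
For readers reproducing this style of evaluation, the sketch below shows one plausible way to compute FID and CLIP scores with the torchmetrics package. The two metric classes are real, but the random tensors are toy stand-ins for COCO reference images and generated samples, and the paper's exact protocol (prompt set, image resolution, CLIP backbone) may differ.

```python
# Hedged sketch of an FID / CLIP-score evaluation using torchmetrics.
# The random uint8 tensors below are toy stand-ins; a real run would
# iterate over ~30K COCO prompt/image pairs and generated samples.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

fid = FrechetInceptionDistance(feature=2048)
clip = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

prompts = ["a photo of a cat", "a red bicycle leaning on a wall"]
real_images = torch.randint(0, 255, (2, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 255, (2, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)    # reference distribution
fid.update(fake_images, real=False)   # generated distribution
clip.update(fake_images, prompts)     # image-text alignment

print(f"FID: {fid.compute().item():.2f}  "
      f"CLIP score: {clip.compute().item():.4f}")
```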

In summary, SwiftBrush represents a significant methodological step for diffusion models: an efficient one-step distillation method that eschews traditional training-data dependencies. This positions it as a strong approach for streamlining text-to-image synthesis, with promising implications for broader AI domains. Future work could extend the framework to few-step generation to further improve output quality, or integrate specialized modules for user-driven customization of generated content.
