- The paper introduces SnapGen, a model that reduces size by 45% and computation by 68% while maintaining high-quality image generation with a 2.06 FID score.
- The method features a compact UNet architecture, a tiny fast decoder that is 36× smaller and 54× faster, and employs multi-level knowledge distillation.
- SnapGen enables 1024 px image generation in 1.4 seconds on mobile devices, paving the way for real-time, accessible AI applications on constrained hardware.
SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
The paper "SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training" addresses the problem of deploying high-performance text-to-image (T2I) models on mobile devices. Large-scale diffusion models, while effective for generating high-quality images, often struggle with their size and computational demands, making them impractical for mobile deployment. This paper introduces innovations through SnapGen, a highly efficient model that enables high-resolution T2I generation on mobile devices by integrating novel architectural designs and advanced training techniques.
Key Architectural and Training Innovations:
- Efficient Network Architecture: The paper develops a compact version of the UNet architecture. Reducing the number of transformer blocks and replacing standard convolutions with separable convolutions cuts the parameter count substantially while preserving visual quality; compared to prior models, the architecture is 45% smaller and requires 68% less computation without compromising generation quality as measured by FID. (A minimal separable-convolution sketch follows this list.)
- Tiny and Fast Decoder: Unlike existing autoencoders deployed in T2I models, which are resource-intensive, SnapGen's decoder leverages a streamlined design that maintains competitive reconstruction quality but is 36× smaller and 54× faster. Such optimization is essential for achieving real-time performance on mobile devices.
- Advanced Training Techniques: The model is trained with a bespoke routine that includes multi-level knowledge distillation, transferring rich, high-capacity representations from a teacher model into the smaller, more efficient student. A customized step distillation method further reduces the number of denoising steps, speeding up generation without sacrificing quality. (Illustrative distillation sketches follow this list.)
- Performance and Speed: With only 379 million parameters, SnapGen reports strong results on several benchmarks, including an FID of 2.06 on the ImageNet-1K dataset, and generates 1024 px images on a mobile device in approximately 1.4 seconds, a marked improvement over existing models.
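To make the separable-convolution idea concrete, here is a minimal PyTorch sketch comparing a standard 3x3 convolution with a depthwise-separable replacement. The block structure and channel widths are illustrative assumptions, not SnapGen's exact layers.

```python
# Illustrative sketch (not SnapGen's exact block): replacing a standard 3x3
# convolution with a depthwise + pointwise pair, the separable-convolution
# idea used to shrink a UNet's convolutional layers.
import torch
import torch.nn as nn


class SeparableConv2d(nn.Module):
    """A 3x3 depthwise convolution followed by a 1x1 pointwise convolution."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups == in_channels).
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size=3, padding=1, groups=in_channels
        )
        # Pointwise: 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))


def param_count(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


if __name__ == "__main__":
    c_in, c_out = 320, 320  # hypothetical channel widths for one UNet stage
    standard = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)
    separable = SeparableConv2d(c_in, c_out)
    # The separable variant needs roughly (1/9 + 1/c_out) of the parameters.
    print(f"standard:  {param_count(standard):,} params")
    print(f"separable: {param_count(separable):,} params")
```

Because the pointwise convolution carries most of the remaining parameters, the savings grow with kernel size and channel width, which is why this substitution pays off inside a large UNet.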
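The multi-level distillation idea can be sketched as a loss that combines the standard denoising objective with output- and feature-level imitation of a frozen teacher. All names, layer choices, and weights below are assumptions for illustration; the paper's actual formulation may differ.

```python
# Hedged sketch of multi-level knowledge distillation for a diffusion backbone.
# The feature lists, projections, and weights are hypothetical stand-ins.
import torch
import torch.nn.functional as F


def multi_level_kd_loss(student_out, teacher_out,
                        student_feats, teacher_feats,
                        projections, target_noise,
                        w_task=1.0, w_out=1.0, w_feat=0.5):
    """Combine the ordinary denoising loss with output- and feature-level
    distillation terms.

    student_out / teacher_out: predicted noise tensors, shape (B, C, H, W).
    student_feats / teacher_feats: lists of intermediate activations.
    projections: 1x1 convs mapping student feature widths to teacher widths.
    target_noise: ground-truth noise for the standard diffusion objective.
    """
    # Task loss: the usual epsilon-prediction objective.
    task = F.mse_loss(student_out, target_noise)

    # Output-level distillation: imitate the frozen teacher's prediction.
    out_kd = F.mse_loss(student_out, teacher_out.detach())

    # Feature-level distillation: align intermediate representations after
    # projecting the student's (narrower) features to the teacher's width.
    feat_kd = sum(
        F.mse_loss(proj(s), t.detach())
        for proj, s, t in zip(projections, student_feats, teacher_feats)
    ) / max(len(student_feats), 1)

    return w_task * task + w_out * out_kd + w_feat * feat_kd
```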
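Step distillation can likewise be sketched in the spirit of progressive distillation: the student learns to cover in one jump what the teacher achieves over several smaller denoising steps. SnapGen's customized recipe (which also involves adversarial objectives) differs in detail; the `denoise_step` interface here is a hypothetical placeholder.

```python
# Minimal step-distillation sketch: the student matches, in one step, the
# result of the frozen teacher running two consecutive smaller steps.
# All model interfaces below are illustrative, not SnapGen's actual API.
import torch
import torch.nn.functional as F


@torch.no_grad()
def teacher_two_steps(teacher, x_t, t, t_mid, t_next, cond):
    """Run the frozen teacher for two consecutive denoising steps."""
    x_mid = teacher.denoise_step(x_t, t, t_mid, cond)
    return teacher.denoise_step(x_mid, t_mid, t_next, cond)


def step_distillation_loss(student, teacher, x_t, t, t_mid, t_next, cond):
    """The student jumps directly from t to t_next and is trained to match
    the teacher's two-step output."""
    target = teacher_two_steps(teacher, x_t, t, t_mid, t_next, cond)
    pred = student.denoise_step(x_t, t, t_next, cond)
    return F.mse_loss(pred, target)
```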
Implications and Future Directions:
From a practical standpoint, SnapGen represents a significant step toward making advanced AI models usable on constrained hardware such as smartphones. This could open up applications where privacy, latency, or limited compute rule out cloud infrastructure, including in-the-field uses such as augmented reality, media creation, and personalized content generation.
Theoretically, SnapGen challenges existing understandings of the scalability and efficiency of diffusion models in constrained environments. The demonstrated success of cross-architecture knowledge distillation and adversarial training techniques provides a blueprint for bringing large-model benefits into more compact frameworks. Future research could push parameter counts and latency lower, extend the techniques to other generative tasks, or refine the distillation process to better bridge dissimilar architectures across operational contexts.
In summary, SnapGen advances the pursuit of efficient, high-quality image generation on mobile platforms, offering a balanced perspective between model capacity, computational efficiency, and accessibility. This work enhances our understanding of diffusion models, leading to promising avenues for future exploration in AI model deployment.