MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices (2311.16567v2)

Published 28 Nov 2023 in cs.CV

Abstract: The deployment of large-scale text-to-image diffusion models on mobile devices is impeded by their substantial model size and slow inference speed. In this paper, we propose MobileDiffusion, a highly efficient text-to-image diffusion model obtained through extensive optimizations in both architecture and sampling techniques. We conduct a comprehensive examination of model architecture design to reduce redundancy, enhance computational efficiency, and minimize the model's parameter count, while preserving image generation quality. Additionally, we employ distillation and diffusion-GAN finetuning techniques on MobileDiffusion to achieve 8-step and 1-step inference respectively. Empirical studies, conducted both quantitatively and qualitatively, demonstrate the effectiveness of our proposed techniques. MobileDiffusion achieves a remarkable sub-second inference speed for generating a $512\times512$ image on mobile devices, establishing a new state of the art.

Authors (5)
  1. Yang Zhao (382 papers)
  2. Yanwu Xu (78 papers)
  3. Zhisheng Xiao (17 papers)
  4. Tingbo Hou (25 papers)
  5. Haolin Jia (4 papers)
Citations (3)

Summary

MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices

In "MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices," Zhao et al. address the significant challenge of deploying large-scale text-to-image diffusion models on mobile devices due to their substantial model size and slow inference speed. The proposed solution, MobileDiffusion, introduces a highly efficient text-to-image diffusion model optimized through comprehensive architectural and sampling technique improvements. This paper offers valuable insights into enabling state-of-the-art text-to-image generation within the constraints of mobile computing environments.

Summary of Contributions

The paper makes several key contributions:

  1. Efficient Model Architecture: The authors investigate and optimize the UNet-based architecture commonly used in diffusion models. They introduce modifications to reduce redundancy, enhance computational efficiency, and minimize model parameters.
  2. Advanced Sampling Techniques: The paper combines fast numerical solvers with distillation techniques to sharply reduce the number of sampling steps required for image generation (a few-step sampling sketch follows this list).
  3. Empirical Validation: Through extensive empirical studies, both quantitative and qualitative, the authors demonstrate that MobileDiffusion achieves sub-second inference speeds for generating high-quality images on mobile devices.
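
To make the step-count reduction concrete, here is a minimal sketch of few-step sampling with a fast solver. MobileDiffusion itself is not publicly released, so the example uses Stable Diffusion and the DPM-Solver++ scheduler from the `diffusers` library as stand-ins; the model name and prompt are purely illustrative.

```python
# Illustrative only: MobileDiffusion is not publicly released, so this uses
# Stable Diffusion with the DPM-Solver++ multistep scheduler from `diffusers`
# to show what few-step sampling looks like in practice.
import torch
from diffusers import DPMSolverMultistepScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap the default scheduler for a fast higher-order ODE solver, analogous
# to the advanced numerical solvers discussed in the paper.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Eight denoising steps instead of the usual 25-50.
image = pipe("a photo of a corgi on a beach", num_inference_steps=8).images[0]
image.save("corgi_8_steps.png")
```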

Architecture Optimization

The inefficiency of text-to-image diffusion models stems from the need for iterative denoising and from a complex, heavily parameterized network architecture. The authors address these issues with a detailed examination of the UNet architecture. Key optimizations include:

  • Transformer and Convolutional Block Reorganization: They investigate the role of transformer blocks and advocate for selective removal of self-attention layers at high resolutions while retaining cross-attention. This approach maintains model performance while enhancing efficiency.
  • Activation and Parameter Sharing: Replacing $\mathsf{gelu}$ with $\mathsf{swish}$ and sharing parameters between attention layers reduces computational costs without quality degradation.
  • Lightweight Convolutions: Adopting separable convolutions in deeper network sections further reduces parameter count and enhances runtime efficiency.

These optimizations culminate in a model architecture with fewer than 400 million parameters and substantial gains in computational efficiency; the sketch below illustrates two of the ideas.
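
The following PyTorch sketch, which is not the authors' implementation, shows a transformer block whose self-attention can be dropped at high resolutions while text cross-attention is retained, and a depthwise-separable convolution as a lightweight substitute for a dense convolution. All class names and dimensions are assumptions for illustration.

```python
# A minimal sketch (not the paper's code) of an attention block without
# self-attention and a depthwise-separable convolution.
import torch
import torch.nn as nn

class EfficientTransformerBlock(nn.Module):
    def __init__(self, dim: int, text_dim: int, heads: int = 8,
                 use_self_attention: bool = True):
        super().__init__()
        # Self-attention cost grows quadratically with the number of pixels,
        # so it can be omitted at high resolutions.
        self.self_attn = (nn.MultiheadAttention(dim, heads, batch_first=True)
                          if use_self_attention else None)
        # Cross-attention to the text embeddings is always kept.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True,
                                                kdim=text_dim, vdim=text_dim)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        # SiLU ("swish") in place of GELU, matching the activation swap above.
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        if self.self_attn is not None:
            h = self.norm1(x)
            x = x + self.self_attn(h, h, h)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, text, text)[0]
        return x + self.ffn(self.norm3(x))

class SeparableConv2d(nn.Module):
    """Depthwise + pointwise convolution: far fewer parameters than a dense
    k x k convolution with the same receptive field."""
    def __init__(self, ch: int, k: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(ch, ch, k, padding=k // 2, groups=ch)
        self.pointwise = nn.Conv2d(ch, ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))
```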

Sampling Efficiency

To further enhance the model's deployment feasibility on mobile devices, the authors implement:

  • Progressive Distillation: By repeatedly distilling the model to halve its sampling schedule, MobileDiffusion reduces the required sampling steps to as few as eight while preserving image quality and reducing inference time.
  • Diffusion-GAN Hybrid: Following the UFOGen approach, the model is fine-tuned with a hybrid adversarial-diffusion objective, enabling inference in a single step without significant quality loss (a distillation sketch follows this list).
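
As a rough illustration of one round of progressive distillation, the sketch below trains a student to match two deterministic DDIM steps of a frozen teacher with a single step; repeating the round halves the step count each time. The model signature `model(x, t, text)`, the data loader, and the cumulative noise schedule `alphas` are assumptions for the example, not the paper's code.

```python
# Schematic progressive distillation: one student step matches two teacher steps.
import copy
import torch
import torch.nn.functional as F

def ddim_step(model, x, t, t_next, text, alphas):
    """One deterministic DDIM step from timestep t to t_next, assuming the
    model predicts the noise eps and alphas holds cumulative alpha-bars."""
    a_t, a_next = alphas[t], alphas[t_next]
    eps = model(x, t, text)
    x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # estimate of the clean latent
    return a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps

def distill_round(teacher, loader, alphas, make_optimizer):
    student = copy.deepcopy(teacher)           # student is initialized from teacher
    opt = make_optimizer(student.parameters())
    for x_t, t, text in loader:                # noisy latent, timestep, text embedding
        with torch.no_grad():                  # two teacher steps define the target
            target = ddim_step(teacher, x_t, t, t - 1, text, alphas)
            target = ddim_step(teacher, target, t - 1, t - 2, text, alphas)
        pred = ddim_step(student, x_t, t, t - 2, text, alphas)  # one student step
        loss = F.mse_loss(pred, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student  # needs half as many sampling steps; repeat to halve again
```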

Empirical Results

Empirical validation demonstrates MobileDiffusion's capabilities. The model achieves a Fréchet Inception Distance (FID) of 9.01 with eight sampling steps, comparable to larger and slower models. CLIP scores and visual inspection further confirm that the architectural and sampling optimizations do not sacrifice image quality.

Quantitative comparisons with other state-of-the-art text-to-image models underscore MobileDiffusion's efficiency. The demonstration on mobile devices, specifically achieving sub-second inference on an iPhone 15 Pro, establishes a new benchmark in mobile text-to-image generation.
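
For readers who want to sanity-check latency claims on their own hardware, a simple wall-clock harness like the one below is often sufficient. It does not reproduce the paper's on-device mobile measurements; `generate_fn` is whatever generation callable is being benchmarked.

```python
import time

def time_generation(generate_fn, warmup: int = 2, runs: int = 5) -> float:
    """Mean wall-clock seconds per call, after warm-up runs that amortize
    one-time costs such as compilation and weight loading."""
    for _ in range(warmup):
        generate_fn()
    start = time.perf_counter()
    for _ in range(runs):
        generate_fn()
    return (time.perf_counter() - start) / runs

# e.g. with the pipeline from the earlier sketch:
# print(time_generation(lambda: pipe("a corgi", num_inference_steps=8)))
```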

Practical and Theoretical Implications

The practical implications of this research are profound, offering a pathway for deploying high-quality generative models on resource-constrained devices. This advancement opens up numerous applications, from real-time image editing and augmented reality to personalization features in mobile applications. Theoretically, the approach sets a precedent for future research in optimizing large-scale generative models for edge devices, highlighting the trade-offs between architectural complexity, parameter count, and inference efficiency.

Future Directions

Anticipated future developments include extending these optimizations to pixel-based models and exploring more advanced distillation and finetuning techniques. Continued research could also investigate integrating these models with other on-device functionalities to enhance user experience further.

In conclusion, Zhao et al.'s "MobileDiffusion" delivers significant advancements in making high-quality text-to-image generation feasible on mobile devices. The comprehensive architectural redesign and innovative sampling techniques highlight the potential for deploying sophisticated AI models on constrained hardware, paving the way for broader accessibility and utility of AI-driven applications.
