MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices (2311.16567v2)

Published 28 Nov 2023 in cs.CV

Abstract: The deployment of large-scale text-to-image diffusion models on mobile devices is impeded by their substantial model size and slow inference speed. In this paper, we propose MobileDiffusion, a highly efficient text-to-image diffusion model obtained through extensive optimizations in both architecture and sampling techniques. We conduct a comprehensive examination of model architecture design to reduce redundancy, enhance computational efficiency, and minimize the model's parameter count, while preserving image generation quality. Additionally, we employ distillation and diffusion-GAN finetuning techniques on MobileDiffusion to achieve 8-step and 1-step inference respectively. Empirical studies, conducted both quantitatively and qualitatively, demonstrate the effectiveness of our proposed techniques. MobileDiffusion achieves a remarkable sub-second inference speed for generating a $512\times512$ image on mobile devices, establishing a new state of the art.

Authors (5)
  1. Yang Zhao (382 papers)
  2. Yanwu Xu (78 papers)
  3. Zhisheng Xiao (17 papers)
  4. Tingbo Hou (25 papers)
  5. Haolin Jia (4 papers)
Citations (3)

Summary

MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices

In "MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices," Zhao et al. address the significant challenge of deploying large-scale text-to-image diffusion models on mobile devices due to their substantial model size and slow inference speed. The proposed solution, MobileDiffusion, introduces a highly efficient text-to-image diffusion model optimized through comprehensive architectural and sampling technique improvements. This paper offers valuable insights into enabling state-of-the-art text-to-image generation within the constraints of mobile computing environments.

Summary of Contributions

The paper makes several key contributions:

  1. Efficient Model Architecture: The authors investigate and optimize the UNet-based architecture commonly used in diffusion models. They introduce modifications to reduce redundancy, enhance computational efficiency, and minimize model parameters.
  2. Advanced Sampling Techniques: The paper combines fast numerical solvers with distillation techniques to sharply reduce the number of sampling steps required for image generation (a few-step sampling sketch follows this list).
  3. Empirical Validation: Through extensive empirical studies, both quantitative and qualitative, the authors demonstrate that MobileDiffusion achieves sub-second inference speeds for generating high-quality images on mobile devices.
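
To make the step-count reduction concrete, here is a minimal sketch of few-step sampling with a fast solver. MobileDiffusion itself is not publicly released, so the example uses Stable Diffusion and the DPM-Solver++ scheduler from the `diffusers` library as stand-ins; the model name and prompt are purely illustrative.

```python
# Illustrative only: MobileDiffusion is not publicly released, so this uses
# Stable Diffusion with the DPM-Solver++ multistep scheduler from `diffusers`
# to show what few-step sampling looks like in practice.
import torch
from diffusers import DPMSolverMultistepScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Swap the default scheduler for a fast higher-order ODE solver, analogous
# to the advanced numerical solvers discussed in the paper.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Eight denoising steps instead of the usual 25-50.
image = pipe("a photo of a corgi on a beach", num_inference_steps=8).images[0]
image.save("corgi_8_steps.png")
```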

Architecture Optimization

The inefficiency of text-to-image diffusion models stems from the need for iterative denoising and from a complex, heavily parameterized network architecture. The authors address these issues with a detailed examination of the UNet architecture. Key optimizations include:

  • Transformer and Convolutional Block Reorganization: They investigate the role of transformer blocks and advocate for selective removal of self-attention layers at high resolutions while retaining cross-attention. This approach maintains model performance while enhancing efficiency.
  • Activation and Parameter Sharing: Replacing $\mathsf{gelu}$ with $\mathsf{swish}$ and sharing parameters between attention layers reduces computational costs without quality degradation.
  • Lightweight Convolutions: Adopting separable convolutions in deeper network sections further reduces parameter count and enhances runtime efficiency.

These optimizations culminate in a model architecture with fewer than 400 million parameters and substantial gains in computational efficiency; the sketch below illustrates two of the ideas.
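
The following PyTorch sketch, which is not the authors' implementation, shows a transformer block whose self-attention can be dropped at high resolutions while text cross-attention is retained, and a depthwise-separable convolution as a lightweight substitute for a dense convolution. All class names and dimensions are assumptions for illustration.

```python
# A minimal sketch (not the paper's code) of an attention block without
# self-attention and a depthwise-separable convolution.
import torch
import torch.nn as nn

class EfficientTransformerBlock(nn.Module):
    def __init__(self, dim: int, text_dim: int, heads: int = 8,
                 use_self_attention: bool = True):
        super().__init__()
        # Self-attention cost grows quadratically with the number of pixels,
        # so it can be omitted at high resolutions.
        self.self_attn = (nn.MultiheadAttention(dim, heads, batch_first=True)
                          if use_self_attention else None)
        # Cross-attention to the text embeddings is always kept.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True,
                                                kdim=text_dim, vdim=text_dim)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        # SiLU ("swish") in place of GELU, matching the activation swap above.
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        if self.self_attn is not None:
            h = self.norm1(x)
            x = x + self.self_attn(h, h, h)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, text, text)[0]
        return x + self.ffn(self.norm3(x))

class SeparableConv2d(nn.Module):
    """Depthwise + pointwise convolution: far fewer parameters than a dense
    k x k convolution with the same receptive field."""
    def __init__(self, ch: int, k: int = 3):
        super().__init__()
        self.depthwise = nn.Conv2d(ch, ch, k, padding=k // 2, groups=ch)
        self.pointwise = nn.Conv2d(ch, ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))
```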

Sampling Efficiency

To further enhance the model's deployment feasibility on mobile devices, the authors implement:

  • Progressive Distillation: By repeatedly distilling the model to halve its sampling schedule, MobileDiffusion reduces the required sampling steps to as few as eight while preserving image quality and reducing inference time.
  • Diffusion-GAN Hybrid: Following the UFOGen approach, the model is fine-tuned with a hybrid adversarial-diffusion objective, enabling inference in a single step without significant quality loss (a distillation sketch follows this list).
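
As a rough illustration of one round of progressive distillation, the sketch below trains a student to match two deterministic DDIM steps of a frozen teacher with a single step; repeating the round halves the step count each time. The model signature `model(x, t, text)`, the data loader, and the cumulative noise schedule `alphas` are assumptions for the example, not the paper's code.

```python
# Schematic progressive distillation: one student step matches two teacher steps.
import copy
import torch
import torch.nn.functional as F

def ddim_step(model, x, t, t_next, text, alphas):
    """One deterministic DDIM step from timestep t to t_next, assuming the
    model predicts the noise eps and alphas holds cumulative alpha-bars."""
    a_t, a_next = alphas[t], alphas[t_next]
    eps = model(x, t, text)
    x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # estimate of the clean latent
    return a_next.sqrt() * x0 + (1 - a_next).sqrt() * eps

def distill_round(teacher, loader, alphas, make_optimizer):
    student = copy.deepcopy(teacher)           # student is initialized from teacher
    opt = make_optimizer(student.parameters())
    for x_t, t, text in loader:                # noisy latent, timestep, text embedding
        with torch.no_grad():                  # two teacher steps define the target
            target = ddim_step(teacher, x_t, t, t - 1, text, alphas)
            target = ddim_step(teacher, target, t - 1, t - 2, text, alphas)
        pred = ddim_step(student, x_t, t, t - 2, text, alphas)  # one student step
        loss = F.mse_loss(pred, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student  # needs half as many sampling steps; repeat to halve again
```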

Empirical Results

Empirical validation demonstrates MobileDiffusion's capabilities. The model achieves a Fréchet Inception Distance (FID) of 9.01 with eight sampling steps, comparable to larger and slower models. CLIP scores and visual inspection further confirm that the architectural and sampling optimizations do not sacrifice image quality.

Quantitative comparisons with other state-of-the-art text-to-image models underscore MobileDiffusion's efficiency. The demonstration on mobile devices, specifically achieving sub-second inference on an iPhone 15 Pro, establishes a new benchmark in mobile text-to-image generation.
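
For readers who want to sanity-check latency claims on their own hardware, a simple wall-clock harness like the one below is often sufficient. It does not reproduce the paper's on-device mobile measurements; `generate_fn` is whatever generation callable is being benchmarked.

```python
import time

def time_generation(generate_fn, warmup: int = 2, runs: int = 5) -> float:
    """Mean wall-clock seconds per call, after warm-up runs that amortize
    one-time costs such as compilation and weight loading."""
    for _ in range(warmup):
        generate_fn()
    start = time.perf_counter()
    for _ in range(runs):
        generate_fn()
    return (time.perf_counter() - start) / runs

# e.g. with the pipeline from the earlier sketch:
# print(time_generation(lambda: pipe("a corgi", num_inference_steps=8)))
```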

Practical and Theoretical Implications

The practical implications of this research are profound, offering a pathway for deploying high-quality generative models on resource-constrained devices. This advancement opens up numerous applications, from real-time image editing and augmented reality to personalization features in mobile applications. Theoretically, the approach sets a precedent for future research in optimizing large-scale generative models for edge devices, highlighting the trade-offs between architectural complexity, parameter count, and inference efficiency.

Future Directions

Anticipated future developments include extending these optimizations to pixel-based models and exploring more advanced distillation and finetuning techniques. Continued research could also investigate integrating these models with other on-device functionalities to enhance user experience further.

In conclusion, Zhao et al.'s "MobileDiffusion" delivers significant advancements in making high-quality text-to-image generation feasible on mobile devices. The comprehensive architectural redesign and innovative sampling techniques highlight the potential for deploying sophisticated AI models on constrained hardware, paving the way for broader accessibility and utility of AI-driven applications.
