
EdgeFusion: On-Device Text-to-Image Generation (2404.11925v1)

Published 18 Apr 2024 in cs.LG, cs.AI, and cs.CV

Abstract: The intensive computational burden of Stable Diffusion (SD) for text-to-image generation poses a significant hurdle for its practical application. To tackle this challenge, recent research focuses on methods to reduce sampling steps, such as Latent Consistency Model (LCM), and on employing architectural optimizations, including pruning and knowledge distillation. Diverging from existing approaches, we uniquely start with a compact SD variant, BK-SDM. We observe that directly applying LCM to BK-SDM with commonly used crawled datasets yields unsatisfactory results. It leads us to develop two strategies: (1) leveraging high-quality image-text pairs from leading generative models and (2) designing an advanced distillation process tailored for LCM. Through our thorough exploration of quantization, profiling, and on-device deployment, we achieve rapid generation of photo-realistic, text-aligned images in just two steps, with latency under one second on resource-limited edge devices.


Summary

  • The paper introduces EdgeFusion, a novel framework that significantly reduces inference time for Stable Diffusion models on edge devices.
  • It employs advanced teacher-based distillation and fine-tuning with an LCM scheduler to dramatically cut denoising steps.
  • Optimized data preprocessing and mixed-precision quantization ensure high image quality and efficient deployment on NPUs.

Enhancing Stable Diffusion Models for Edge Deployments with Advanced Distillation and Optimized Data Strategies

Introduction to the Research

The research paper presents an innovative approach, termed EdgeFusion, aimed at addressing the significant computational challenges associated with deploying Stable Diffusion (SD) models on resource-constrained edge devices. The authors propose solutions that integrate architectural refinement, advanced model distillation, and tailored optimization of image-text data quality to significantly reduce inference time while maintaining high-quality text-to-image generation capabilities.

Proposed Methodology

Advanced Distillation for LCM

EdgeFusion builds on BK-SDM-Tiny, a compact variant of SD, and applies the Latent Consistency Model (LCM) to cut the number of sampling steps. The central challenge is that directly applying LCM to such a compact model with commonly used crawled datasets yields unsatisfactory results. The researchers address this with a two-phase training process (a minimal sketch of both phases follows the list):

  1. Initial Training: Utilizing advanced "teacher" models to perform feature-level knowledge distillation.
  2. Fine-Tuning: Employing an LCM scheduler to refine the model further, ensuring robust reduction in denoising steps.
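
To make the two phases concrete, here is a minimal PyTorch sketch, assuming the student and teacher U-Nets return a noise prediction plus a list of intermediate feature maps, and that a scheduler object exposes `add_noise` and a one-step `step` solver. All names and interfaces are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def distillation_step(student, teacher, latents, text_emb, timesteps):
    """Phase 1 (sketch): feature-level knowledge distillation from an advanced teacher."""
    with torch.no_grad():
        t_out, t_feats = teacher(latents, timesteps, text_emb)
    s_out, s_feats = student(latents, timesteps, text_emb)
    # Match the teacher's denoising output and its intermediate feature maps.
    loss = F.mse_loss(s_out, t_out)
    loss = loss + sum(F.mse_loss(s, t) for s, t in zip(s_feats, t_feats))
    return loss


def lcm_finetune_step(student, ema_student, scheduler, latents, text_emb, t, t_prev):
    """Phase 2 (sketch): consistency-style fine-tuning so few denoising steps suffice."""
    noise = torch.randn_like(latents)
    noisy = scheduler.add_noise(latents, noise, t)
    pred = student(noisy, t, text_emb)[0]
    with torch.no_grad():
        # One solver step toward t_prev; the EMA copy of the student then
        # provides the consistency target for the online student.
        stepped = scheduler.step(noisy, t, t_prev)
        target = ema_student(stepped, t_prev, text_emb)[0]
    return F.mse_loss(pred, target)
```

In the paper's pipeline the fine-tuning stage is driven by the LCM scheduler noted above; the sketch only conveys the shape of the two losses.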

Enhanced Data Quality

A significant portion of the paper is devoted to improving the quality of the training data (an illustrative preprocessing sketch follows this list):

  • Data Preprocessing: Techniques like deduplication and optimized cropping are used to improve the existing real-world dataset.
  • Synthetic Data Generation: To overcome the limitations of real-world data, the team uses AI to generate synthetic image-text pairs, ensuring higher control over data quality and diversity.
  • Manual Data Curation: Even with automated filtering, manual curation is shown to further improve data quality and, in turn, model training outcomes.
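
As a concrete illustration of the preprocessing bullets above, the sketch below removes exact duplicates by content hashing and produces square center crops with Pillow. The directory layout, file pattern, and 512-pixel resolution are assumptions for illustration, not the paper's exact pipeline.

```python
import hashlib
from pathlib import Path

from PIL import Image, ImageOps


def deduplicate(image_dir: str) -> list[Path]:
    """Keep one file per unique image content (byte-level SHA-256 hash)."""
    seen, kept = set(), []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(path)
    return kept


def crop_for_training(path: Path, size: int = 512) -> Image.Image:
    """Resize and center-crop an image to the square resolution used for training."""
    return ImageOps.fit(Image.open(path).convert("RGB"), (size, size))
```

A fuller pipeline would typically extend this with near-duplicate detection (for example, perceptual hashing or embedding similarity) and caption filtering, but the sketch captures the basic flow.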

Deployment on Edge Devices

The method includes specific adaptations for deployment on Neural Processing Units (NPUs), with a small illustrative quantization sketch after the list:

  • Model-level Tiling (MLT): Partitions the model so that each tile fits within the NPU's limited on-chip memory, reducing costly transfers between on-chip and external memory during inference.
  • Quantization: Mixed-precision quantization adapts the model to the target hardware, balancing computational demand against output quality.
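
The quantization idea can be sketched with PyTorch's dynamic quantization utility: precision-sensitive layers keep their original floating-point weights, while the remaining linear layers are quantized to INT8. The sensitivity set and the use of `quantize_dynamic` are illustrative assumptions; the paper adapts mixed precision to its target NPU rather than this CPU-oriented API.

```python
import torch
import torch.nn as nn


def mixed_precision_quantize(model: nn.Module, sensitive: set[str]) -> nn.Module:
    """Quantize all nn.Linear submodules to INT8 except the precision-sensitive ones.

    `sensitive` holds fully qualified submodule names (hypothetical) that must
    stay in their original floating-point precision.
    """
    to_int8 = {
        name
        for name, module in model.named_modules()
        if isinstance(module, nn.Linear) and name not in sensitive
    }
    # quantize_dynamic accepts a set of submodule names; everything not listed
    # is left untouched.
    return torch.ao.quantization.quantize_dynamic(model, to_int8, dtype=torch.qint8)
```

Keeping, for example, cross-attention projections in floating point while quantizing the rest mirrors the trade-off between computational cost and output quality described above.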

Experimental Setup and Data

The experimental setup spans model training and on-device deployment: high-performance GPUs are used for training, and edge NPUs for deployment evaluation. On the data side, high-quality synthetic datasets and manually curated subsets play a central role, enabling refinement of the training process.

Results and Observations

The EdgeFusion method demonstrated promising results:

  • Inference Efficiency: The model generates images in as few as two denoising steps, with latency under one second on resource-constrained edge devices.
  • Image Quality: The research provides substantial empirical evidence showing that the image quality remains high even with the reduced computational overhead.
  • Comparative Analysis: When compared with previous models, EdgeFusion shows a significant advancement in reducing inference steps while maintaining or enhancing the text-image alignment and image realism.

Implications and Future Work

The implications of this research are vast for real-world applications, especially in areas where computing resources are limited, such as mobile devices and embedded systems. The ability to deploy powerful generative models on such platforms could transform various industries, including mobile photography, augmented reality, and real-time visual content generation.

Looking forward, the strategies developed in this research could set a foundational framework for further explorations into model optimization for edge devices. Future work could explore the integration of these approaches with other AI-driven tasks, expanding the utility and efficiency of generative models in practical applications. Additionally, continuous improvements in dataset quality and distillation methods might lead to even faster and more efficient model deployments.

In summary, EdgeFusion represents a significant step forward in making sophisticated text-to-image models more accessible on devices with limited computational capacity, opening up new avenues for both academic research and practical applications in the field of generative AI.