EdgeFusion: On-Device Text-to-Image Generation (2404.11925v1)
Abstract: The intensive computational burden of Stable Diffusion (SD) for text-to-image generation poses a significant hurdle to its practical application. To tackle this challenge, recent research has focused on reducing the number of sampling steps, for example with the Latent Consistency Model (LCM), and on architectural optimizations such as pruning and knowledge distillation. Diverging from existing approaches, we uniquely start with a compact SD variant, BK-SDM. We observe that directly applying LCM to BK-SDM with commonly used crawled datasets yields unsatisfactory results. This leads us to develop two strategies: (1) leveraging high-quality image-text pairs from leading generative models and (2) designing an advanced distillation process tailored for LCM. Through a thorough exploration of quantization, profiling, and on-device deployment, we achieve rapid generation of photo-realistic, text-aligned images in just two steps, with latency under one second on resource-limited edge devices.
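The few-step recipe summarized above, a compact SD backbone paired with LCM-style sampling, can be approximated with off-the-shelf tooling. The sketch below is a minimal illustration, not the authors' pipeline: it assumes the Hugging Face diffusers library and loads the public BK-SDM-Tiny checkpoint (referenced below), swapping in an LCM scheduler for two-step sampling. The stock checkpoint has not undergone the paper's advanced LCM distillation or quantization, so two-step output quality and latency will not match the reported results.

```python
# Minimal sketch: two-step sampling with an LCM scheduler on BK-SDM-Tiny.
# Assumes the `diffusers` library; the stock checkpoint is not LCM-distilled,
# so this only illustrates the inference flow, not the paper's results.
import torch
from diffusers import StableDiffusionPipeline, LCMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "nota-ai/bk-sdm-tiny",   # compact SD variant used as the starting point
    torch_dtype=torch.float16,
)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)  # few-step sampler
pipe = pipe.to("cuda")

image = pipe(
    "a photo of a corgi wearing sunglasses on a beach",
    num_inference_steps=2,   # two-step generation, as targeted by EdgeFusion
    guidance_scale=1.0,      # LCM-style sampling typically uses low or no CFG
).images[0]
image.save("sample.png")
```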
- Nota AI. BK-SDM-Tiny. https://huggingface.co/nota-ai/bk-sdm-tiny, 2023.
- PixArt-δ: Fast and controllable image generation with latent consistency models. arXiv preprint arXiv:2401.05252, 2024a.
- PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In ICLR, 2024b.
- Squeezing large-scale diffusion models for mobile. In ICML Workshop, 2023.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022.
- Diffusion models beat GANs on image synthesis. In NeurIPS, 2021.
- Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In ICML, 2021.
- An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
- Progressive knowledge distillation of Stable Diffusion XL using layer level loss. arXiv preprint arXiv:2401.02677, 2024.
- PTQD: Accurate post-training quantization for diffusion models. In NeurIPS, 2024.
- A comprehensive overhaul of feature distillation. In ICCV, 2019.
- Scaling up GANs for text-to-image synthesis. In CVPR, 2023.
- BK-SDM: A lightweight, fast, and cheap version of Stable Diffusion. arXiv preprint arXiv:2305.15798, 2023.
- SnapFusion: Text-to-image diffusion model on mobile devices within two seconds. In NeurIPS, 2023.
- SPHINX: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575, 2023.
- Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023.
- On distillation of guided diffusion models. In CVPR, 2023.
- OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data only. In NeurIPS Workshop, 2023.
- SDXL: Improving latent diffusion models for high-resolution image synthesis. In ICLR, 2024.
- Stable Diffusion v1-4. https://huggingface.co/CompVis/stable-diffusion-v1-4, 2022.
- High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
- FitNets: Hints for thin deep nets. In ICLR, 2015.
- Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.
- Samsung Semiconductor. Samsung exynos 2400. https://semiconductor.samsung.com/processor/mobile-processor/exynos-2400/, 2024.
- Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042, 2023.
- LAION-Aesthetics. https://laion.ai/blog/laion-aesthetics, 2022.
- LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS Workshop, 2022.
- SG161222. Realistic-Vision-V5.1. https://huggingface.co/SG161222/Realistic_Vision_V5.1_noVAE, 2023.
- Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, 13(3):260–274, 2002.
- UFOGen: You forward once large scale text-to-image generation via diffusion GANs. arXiv preprint arXiv:2311.09257, 2023.
- MobileDiffusion: Subsecond text-to-image generation on mobile devices. arXiv preprint arXiv:2311.16567, 2023.