- The paper demonstrates that token merging reduces token count by up to 60%, doubling generation speed without requiring retraining.
- It adapts the ToMe technique to diffusion models with an innovative unmerging process that preserves dense predictions and image quality.
- Experimental evaluations confirm significant speed and memory improvements with negligible quality degradation, advancing efficient image generation.
Token Merging for Efficient Image Generation with Stable Diffusion
The paper "Token Merging for Fast Stable Diffusion" introduces an innovative approach for enhancing the efficiency of diffusion-based image generation models, particularly focusing on Stable Diffusion. Diffusion models, lauded for their ability to generate high-quality images by iteratively denoising through transformer-based mechanisms, inherently demand significant computational resources. This paper proposes a novel method—building on the concept of Token Merging (ToMe)—to intelligently reduce computational overhead without additional training and while preserving image quality.
Key Contributions and Methodology
At the crux of this research lies the observation that the token representations inside generated images are highly redundant: neighboring regions often carry near-identical information. Exploiting this redundancy, the authors adapt the Token Merging (ToMe) technique, previously applied to ViT-based classification models, to diffusion-based image generation. Unlike methods that require retraining, the adapted ToMe merges tokens directly in the transformer blocks of a pretrained Stable Diffusion model. This reduces the token count by as much as 60%, speeds up generation by up to 2x, and cuts memory consumption by up to 5.6x.
Highlights of the Methodology:
- Token Merging without Retraining: The method requires no additional training or fine-tuning, so it can be applied directly to an off-the-shelf Stable Diffusion model, substantially lowering the barrier to practical use.
- Unmerging Tokens for Dense Predictions: The paper introduces unmerging as a critical enhancement: after attention is computed on the reduced token set, merged tokens are expanded back to their original positions, preserving the dense output that high-fidelity image generation requires (see the sketch after this list).
- Adaptation to Diffusion Contexts: Experimental modifications, such as applying ToMe selectively to self-attention or specific network layers, illustrate that strategic token management can retain quality while significantly enhancing efficiency.
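To make the merge-then-unmerge pattern concrete, the sketch below implements a simplified bipartite matching in PyTorch: tokens are split into source and destination sets, the r most redundant source tokens are averaged into their best destination match, and unmerging copies each destination back to the positions it absorbed so the tensor regains its full token count. The function names, the alternating split, and the toy dimensions are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F


def bipartite_merge_unmerge(metric: torch.Tensor, r: int):
    """Return (merge, unmerge) callables for ToMe-style token reduction.

    `metric` is a (B, N, C) tensor used to measure similarity (e.g. attention
    keys); N is assumed even and r <= N // 2. Compact sketch, not the authors'
    exact code. Requires PyTorch >= 1.12 for scatter_reduce.
    """
    B, N, _ = metric.shape
    with torch.no_grad():
        m = F.normalize(metric, dim=-1)
        src, dst = m[:, ::2], m[:, 1::2]                 # alternating src/dst split
        scores = src @ dst.transpose(-1, -2)             # (B, N/2, N/2) cosine similarity
        node_max, node_idx = scores.max(dim=-1)          # best destination per source
        order = node_max.argsort(dim=-1, descending=True)
        merged_src = order[..., :r]                      # the r most redundant sources
        kept_src = order[..., r:]                        # sources that stay as-is
        dst_of_merged = node_idx.gather(-1, merged_src)  # destination of each merged source

    def merge(x: torch.Tensor) -> torch.Tensor:
        """(B, N, C') -> (B, N - r, C'): average merged sources into their destinations."""
        c = x.shape[-1]
        xsrc, xdst = x[:, ::2], x[:, 1::2]
        kept = xsrc.gather(1, kept_src.unsqueeze(-1).expand(-1, -1, c))
        moved = xsrc.gather(1, merged_src.unsqueeze(-1).expand(-1, -1, c))
        xdst = xdst.scatter_reduce(1, dst_of_merged.unsqueeze(-1).expand(-1, -1, c),
                                   moved, reduce="mean")
        return torch.cat([kept, xdst], dim=1)

    def unmerge(x: torch.Tensor) -> torch.Tensor:
        """(B, N - r, C') -> (B, N, C'): copy each destination back to the sources it absorbed."""
        c = x.shape[-1]
        kept, xdst = x[:, : kept_src.shape[1]], x[:, kept_src.shape[1]:]
        xsrc = torch.zeros(B, N // 2, c, device=x.device, dtype=x.dtype)
        xsrc.scatter_(1, kept_src.unsqueeze(-1).expand(-1, -1, c), kept)
        xsrc.scatter_(1, merged_src.unsqueeze(-1).expand(-1, -1, c),
                      xdst.gather(1, dst_of_merged.unsqueeze(-1).expand(-1, -1, c)))
        out = torch.empty(B, N, c, device=x.device, dtype=x.dtype)
        out[:, ::2], out[:, 1::2] = xsrc, xdst
        return out

    return merge, unmerge


# Example: halve 4096 tokens (a 64x64 latent with 320 channels) around an attention call.
x = torch.randn(2, 4096, 320)
merge, unmerge = bipartite_merge_unmerge(x, r=2048)
y = unmerge(merge(x))          # attention would run on the (2, 2048, 320) merged tensor
print(y.shape)                 # torch.Size([2, 4096, 320])
```

Inside a Stable Diffusion transformer block, merge would be applied to the tokens entering self-attention and unmerge to its output, so the residual stream and all subsequent layers still see the full set of tokens.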
Experimental Evaluation
The authors support their claims with comprehensive ablation studies and qualitative assessments that validate their adjustments to ToMe, such as using randomized token partitioning and restricting merging to specific model blocks. These experiments culminate in a robust, adaptable method that retains image quality, with FID scores that remain close to (and in some configurations slightly better than) the baseline, alongside consistent speed and memory gains across settings.
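One of the ablated design choices, replacing a fixed source/destination split with a randomized one, can be sketched as follows: choose one destination token at random inside each 2x2 patch of the latent grid and treat the remaining tokens as sources, with matching and merging then proceeding as in the earlier sketch. The helper below is a hedged illustration of that idea; its name, interface, and even-dimension assumption are mine, not the paper's.

```python
import torch


def random_2x2_dst_mask(h: int, w: int, generator=None):
    """Pick one destination token per 2x2 patch of an (h, w) latent grid.

    Returns a boolean mask of shape (h * w,) that is True for destination
    tokens and False for sources. Assumes h and w are even.
    """
    choice = torch.randint(0, 4, (h // 2, w // 2), generator=generator)  # 1 of 4 cells per patch
    dy, dx = choice // 2, choice % 2                                     # offsets inside each patch
    ys = torch.arange(0, h, 2).view(-1, 1) + dy                          # row of the chosen cell
    xs = torch.arange(0, w, 2).view(1, -1) + dx                          # column of the chosen cell
    mask = torch.zeros(h, w, dtype=torch.bool)
    mask[ys, xs] = True
    return mask.flatten()


# Example: for a 64x64 latent, exactly one quarter of the tokens become destinations.
mask = random_2x2_dst_mask(64, 64)
print(mask.sum().item())  # 1024 of 4096 tokens
```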
Experimentally, the findings are robust:
- Even at a high token reduction (up to 60%), the model exhibits negligible loss in image quality.
- The combination of Token Merging with efficient attention implementations such as xFormers yields up to a 5.4 times speedup for large image outputs, underscoring its practical applicability in resource-constrained environments (a usage sketch follows this list).
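In practice, applying these optimizations can be as simple as patching an existing pipeline. The sketch below assumes the authors' publicly released tomesd package together with the Hugging Face diffusers library and xFormers; exact parameter names and defaults may differ between versions.

```python
import torch
import tomesd  # reference implementation released with the paper (assumed installed)
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Merge roughly half of the tokens in each patched attention block; no retraining needed.
tomesd.apply_patch(pipe, ratio=0.5)

# Combine with a memory-efficient attention backend for additional speedups.
pipe.enable_xformers_memory_efficient_attention()

image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```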
Implications and Future Directions
The implications of this research extend to both practical and theoretical realms. Practically, it offers an immediately applicable way to reduce the computational and memory footprint of already-trained image generation models, which is particularly valuable for resource-intensive applications or for deployment on hardware with limited capacity. Theoretically, the work highlights an area ripe for further exploration: the trade-off between token efficiency and model fidelity in generative models. Because ToMe operates without retraining, it could plausibly be applied across other architectures and tasks, particularly those involving dense predictions or large token sets.
In conclusion, "Token Merging for Fast Stable Diffusion" presents a compelling advancement in the optimization of diffusion models, paving the way for more efficient and accessible image generation technologies. As AI continues to evolve, the methods illustrated here may serve as foundational components for future adaptive and resource-efficient architectures.