
ToDo: Token Downsampling for Efficient Generation of High-Resolution Images (2402.13573v3)

Published 21 Feb 2024 in cs.CV, cs.AI, and cs.LG

Abstract: The attention mechanism has been crucial for image diffusion models; however, its quadratic computational complexity limits the sizes of images we can process within reasonable time and memory constraints. This paper investigates the importance of dense attention in generative image models, which often contain redundant features, making them suitable for sparser attention mechanisms. We propose ToDo, a novel training-free method that relies on token downsampling of key and value tokens to accelerate Stable Diffusion inference by up to 2x for common sizes and up to 4.5x or more for high resolutions like 2048x2048. We demonstrate that our approach outperforms previous methods in balancing efficient throughput and fidelity.

Authors (3)
  1. Ethan Smith (27 papers)
  2. Nayan Saxena (8 papers)
  3. Aninda Saha (1 paper)
Citations (2)

Summary

  • The paper presents a novel training-free token downsampling approach, ToDo, to reduce the computational load in high-resolution image generation.
  • It employs a grid-based downsampling technique to reduce key and value token counts, achieving up to a 2x speedup for typical sizes and 4.5x for very high resolutions.
  • Empirical evaluations show that ToDo maintains image quality as measured by MSE and HPF scores, balancing efficiency and fidelity in diffusion models.

Efficient High-Resolution Image Generation through Token Downsampling

Introduction to Token Downsampling (ToDo)

In the pursuit of computational efficiency in image diffusion models, particularly those built on attention mechanisms such as Transformers, the quadratic computational complexity of attention presents a significant challenge: it impedes the processing of high-resolution images within reasonable timeframes and memory constraints on typical consumer-grade GPUs. The paper introduces Token Downsampling (ToDo), a training-free method that downsamples the key and value tokens in the attention mechanism, accelerating Stable Diffusion inference by up to 2x for common image sizes and up to 4.5x or more for very high resolutions such as 2048x2048 pixels. ToDo distinguishes itself from previous methods by balancing throughput and fidelity without requiring any training-time modifications.
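
To make the scaling concrete, the back-of-the-envelope sketch below (not from the paper) estimates how self-attention cost grows with image size, assuming Stable Diffusion's 8x VAE downsampling so that a 512x512 image yields a 64x64 grid of latent tokens at the largest attention layer; the exact layer shapes are an assumption for illustration.

```python
# Back-of-the-envelope scaling of self-attention cost with image size,
# assuming an 8x VAE downsampling factor: an HxW image yields
# (H/8) * (W/8) tokens at the largest U-Net attention layer.

def attention_cost(height: int, width: int, vae_factor: int = 8):
    tokens = (height // vae_factor) * (width // vae_factor)
    return tokens, tokens ** 2          # score-matrix entries grow as tokens^2

_, base_cost = attention_cost(512, 512)
for side in (512, 1024, 2048):
    tokens, cost = attention_cost(side, side)
    print(f"{side}x{side}: {tokens} tokens, "
          f"~{cost / base_cost:.0f}x the 512x512 attention cost")
```

At 2048x2048 the attention score matrix is roughly 256 times larger than at 512x512, which is why dense attention dominates runtime at high resolutions.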

Methodology Overview

Sparse Attention and Prior Efforts

Prior approaches to mitigating the computational load of Transformers in image generation have explored sparse attention mechanisms and attention approximation techniques. These methods reduce the number of computations either by merging similar tokens or by mathematically simplifying the attention operation. However, they often require training-phase modifications, add runtime overhead of their own (for example, pairwise similarity computations), or compromise image quality.
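
For context, the snippet below sketches the kind of similarity-based merging used by prior token merging methods. It is a deliberately simplified PyTorch illustration, not the published ToMe algorithm (which uses bipartite soft matching); its purpose is to show the pairwise similarity computation that such methods incur at every step.

```python
import torch
import torch.nn.functional as F

def similarity_merge(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Greatly simplified sketch of similarity-based token merging.

    tokens: (N, C). Note the (N - k) x k similarity matrix below: this
    pairwise comparison is the per-step overhead that a fixed grid
    downsampling scheme avoids.
    """
    n_keep = int(tokens.shape[0] * keep_ratio)
    kept, rest = tokens[:n_keep], tokens[n_keep:]
    sim = F.normalize(rest, dim=-1) @ F.normalize(kept, dim=-1).T  # pairwise cosine
    nearest = sim.argmax(dim=-1)                 # most similar kept token per source
    merged = kept.clone()
    merged.index_add_(0, nearest, rest)          # fold each source into its target
    counts = torch.ones(n_keep).index_add_(0, nearest, torch.ones(len(rest)))
    return merged / counts.unsqueeze(-1)         # average each merged group

print(similarity_merge(torch.randn(4096, 320)).shape)  # torch.Size([2048, 320])
```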

Introducing Token Downsampling

ToDo counters these drawbacks with a training-free approach. Exploiting the inherent similarity between adjacent pixels (and hence adjacent latent tokens), ToDo uses a grid-based downsampling technique that bypasses exhaustive pairwise similarity calculations, reducing computational overhead. Moreover, ToDo adjusts the attention mechanism so that only the keys and values are downsampled while the original queries are kept, mitigating the information loss typically associated with token merging methods.
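
A minimal PyTorch sketch of this idea follows. It assumes the latent tokens come from an h x w grid in raster order and uses simple strided grid subsampling of keys and values with a hypothetical downsampling factor; it is an illustrative reimplementation, not the authors' code.

```python
import torch
import torch.nn.functional as F

def todo_style_attention(q, k, v, h, w, factor=2):
    """Attention with grid-downsampled keys/values and full-resolution queries.

    q, k, v: (B, N, C) with N = h * w latent tokens in raster order.
    factor: stride of the downsampling grid (hypothetical default); keys and
    values shrink by roughly factor**2 while the output keeps all N tokens.
    """
    B, N, C = k.shape
    assert N == h * w, "token count must match the latent grid"
    # Reshape the token sequence back onto its 2D grid and keep every
    # `factor`-th row/column: grid downsampling, no similarity search.
    k_ds = k.view(B, h, w, C)[:, ::factor, ::factor, :].reshape(B, -1, C)
    v_ds = v.view(B, h, w, C)[:, ::factor, ::factor, :].reshape(B, -1, C)
    # Queries stay untouched, so every original token still attends and the
    # output sequence length is unchanged.
    return F.scaled_dot_product_attention(q, k_ds, v_ds)

# Example: a 64x64 latent grid (a 512x512 image after the 8x VAE).
B, h, w, C = 1, 64, 64, 320
q, k, v = (torch.randn(B, h * w, C) for _ in range(3))
print(todo_style_attention(q, k, v, h, w, factor=2).shape)  # (1, 4096, 320)
```

Because only the key/value side is shrunk, the attention output retains one vector per original query token, which is why the approach avoids the blurring that aggressive token merging can introduce.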

Empirical Evaluation

Experimental Setup

The researchers conducted their experiments with the finetuned DreamshaperV7 model across various resolutions and token merging ratios. They assessed the fidelity and throughput of generated images using two quantitative measures: Mean Squared Error (MSE) against the full-attention baseline and High-Pass Filter (HPF) scores as a proxy for sharpness.
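
The snippet below sketches how such metrics can be computed: the MSE is taken against the baseline generation for the same prompt and seed, and the HPF score here uses a Laplacian kernel as a sharpness proxy, since the paper's exact filter definition is not reproduced in this summary.

```python
import torch
import torch.nn.functional as F

def mse(img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
    """Mean squared error between two (C, H, W) images, e.g. a ToDo
    generation vs. the full-attention baseline for the same seed/prompt."""
    return ((img_a - img_b) ** 2).mean()

def hpf_score(img: torch.Tensor) -> torch.Tensor:
    """Sharpness proxy: mean magnitude of a Laplacian high-pass response on a
    grayscale version of the image (one common choice; the paper's exact
    definition may differ)."""
    laplacian = torch.tensor([[0., 1., 0.],
                              [1., -4., 1.],
                              [0., 1., 0.]]).view(1, 1, 3, 3)
    gray = img.mean(dim=0, keepdim=True).unsqueeze(0)      # (1, 1, H, W)
    return F.conv2d(gray, laplacian, padding=1).abs().mean()

baseline = torch.rand(3, 512, 512)
candidate = torch.rand(3, 512, 512)
print(mse(candidate, baseline).item(), hpf_score(candidate).item())
```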

Findings on Image Quality and Throughput

The results show that ToDo not only closely mirrors the baseline in terms of MSE but also maintains HPF values, indicating that image sharpness and texture detail are retained. ToDo also delivered superior throughput, especially at higher resolutions, underscoring its ability to balance efficiency and fidelity.

Investigation of Latent Feature Redundancy

The paper also explores the redundancy of latent features within the U-Net of the Stable Diffusion model. Higher similarity among neighboring tokens suggests potential for computational savings without significant loss in image quality, supporting the foundational assumption behind ToDo.
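
A simple way to probe this redundancy is sketched below: it measures the mean cosine similarity between spatially adjacent tokens in a U-Net activation map of assumed shape (C, H, W). This is an illustrative probe, not the paper's exact analysis.

```python
import torch
import torch.nn.functional as F

def neighbor_similarity(features: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between horizontally and vertically adjacent
    tokens of a (C, H, W) activation map. High values indicate the local
    redundancy that makes grid downsampling nearly lossless."""
    x = F.normalize(features, dim=0)                     # unit-norm channel vectors
    horiz = (x[:, :, :-1] * x[:, :, 1:]).sum(dim=0)      # similarity to right neighbor
    vert = (x[:, :-1, :] * x[:, 1:, :]).sum(dim=0)       # similarity to lower neighbor
    return torch.cat([horiz.flatten(), vert.flatten()]).mean()

# Random noise gives ~0; real U-Net activations would score much higher.
print(neighbor_similarity(torch.randn(320, 64, 64)).item())
```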

Conclusions and Future Directions

ToDo offers a promising direction for accelerating inference in generative image models through token downsampling, proving particularly beneficial for high-resolution image generation. The method's training-free nature and its success in maintaining image fidelity while improving throughput mark a significant advancement in the field. Future research could explore the potential differentiability of ToDo and its application in finetuning diffusion models for larger image dimensions, unlocking new possibilities in high-resolution image generative tasks.

The investigation into token redundancy and the effectiveness of ToDo's downsampling approach provides valuable insights into the structure of images processed by attention mechanisms. Speculation on the broader applicability of ToDo to other attention-based generative models opens avenues for further exploration and optimization, potentially extending the method's benefits beyond the scope of Stable Diffusion models.
