Introduction
The Hourglass Diffusion Transformer (HDiT) is a contribution to high-resolution image synthesis with diffusion models. The HDiT framework departs from approaches that depend on latent diffusion, generating directly in pixel space at resolutions such as 1024×1024 while attaining a computational cost that scales linearly with pixel count. This addresses the quadratic scaling of self-attention that limits existing transformer architectures at high resolution.
Related Work
Transformers scale well with data and model size, but their self-attention mechanism incurs a computational cost quadratic in sequence length. As a result, diffusion models have so far adapted transformers primarily to operate on compressed latent representations, and have seldom applied them directly in pixel space. HDiT addresses this by combining the hierarchical efficiency of convolutional U-Nets with the scalability of transformers, sidestepping high-resolution workarounds such as multiscale training pipelines or complex self-conditioning techniques.
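As a rough sketch of the scaling argument (using standard attention cost estimates, not figures from the paper): an image of HW pixels patched with patch size p yields n = HW/p² tokens, and

```latex
\underbrace{O(n^2)}_{\text{global attention}} = O\!\left(\frac{H^2 W^2}{p^4}\right)
\qquad \text{vs.} \qquad
\underbrace{O(n\,k^2)}_{k \times k \text{ neighborhood attention}} = O\!\left(\frac{HW\,k^2}{p^2}\right),
```

which is quadratic versus linear in pixel count for fixed k and p.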
Hourglass Diffusion Transformers
The proposed architecture, HDiT, avoids these high-resolution workarounds through a pure transformer architecture that exploits the hierarchical structure of images. The 'hourglass' refers to the shape of the token sequence as it passes through the network: high-resolution levels apply efficient local attention, the token grid is progressively downsampled to a low-resolution bottleneck where global attention captures overall structure, and the grid is then upsampled back to full resolution, with skip connections carrying fine detail between matching levels, as sketched below. Notably, these efficiency gains do not sacrifice generative quality at lower resolutions, and the architecture demonstrates competitive performance in terms of both image quality and computational cost.
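To make the layered structure concrete, here is a minimal structural sketch in PyTorch. It is not the paper's implementation: the module names (TokenMerge, TokenSplit, Hourglass), the 2×2 merge factor, and the additive 1/√2-scaled skip are illustrative assumptions, and the attention blocks are left as identity stand-ins.

```python
import torch
import torch.nn as nn

class TokenMerge(nn.Module):
    """Downsample a (B, H, W, C) token grid by merging 2x2 neighborhoods."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.proj = nn.Linear(4 * dim_in, dim_out)

    def forward(self, x):                      # x: (B, H, W, C)
        b, h, w, c = x.shape
        x = x.view(b, h // 2, 2, w // 2, 2, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, h // 2, w // 2, 4 * c)
        return self.proj(x)                    # (B, H/2, W/2, dim_out)

class TokenSplit(nn.Module):
    """Upsample a token grid by splitting each token into a 2x2 block."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.proj = nn.Linear(dim_in, 4 * dim_out)

    def forward(self, x):                      # x: (B, H, W, C)
        b, h, w, c = x.shape
        x = self.proj(x).view(b, h, w, 2, 2, -1)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, 2 * h, 2 * w, -1)
        return x

class Hourglass(nn.Module):
    """One encoder level, a global bottleneck, and one decoder level."""
    def __init__(self, dim_hi=128, dim_lo=256):
        super().__init__()
        self.enc = nn.Identity()               # stand-in: local-attention blocks
        self.merge = TokenMerge(dim_hi, dim_lo)
        self.mid = nn.Identity()               # stand-in: global-attention blocks
        self.split = TokenSplit(dim_lo, dim_hi)
        self.dec = nn.Identity()               # stand-in: local-attention blocks
        self.skip_scale = 2 ** -0.5            # one common choice for merging skips

    def forward(self, x):                      # x: (B, H, W, C) tokens
        skip = self.enc(x)                     # high-res level, local attention
        x = self.merge(skip)                   # downsample to the bottleneck
        x = self.mid(x)                        # low-res bottleneck, global attention
        x = self.split(x)                      # upsample back to high resolution
        x = self.skip_scale * (x + skip)       # merge the skip connection
        return self.dec(x)

x = torch.randn(1, 64, 64, 128)                # e.g. a 64x64 token grid
print(Hourglass()(x).shape)                    # torch.Size([1, 64, 64, 128])
```

In the full model, each identity stand-in would be a stack of transformer blocks, with neighborhood attention at the high-resolution levels and global attention in the bottleneck, and deeper hourglasses would nest additional resolution levels.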
Preliminaries and Results
To validate the approach, the paper contrasts computational costs between HDiT and prevalent diffusion backbones, illustrating its efficiency. Experiments show that HDiT handles synthesis tasks on the FFHQ-1024 dataset, producing sharp, detailed imagery without classifier-free guidance. Further evaluations on ImageNet-256 demonstrate the model's scalability and fidelity, with a competitive FID even against larger state-of-the-art models that employ additional generative techniques.
Conclusion and Future Directions
HDiT stands to influence future research in efficient high-resolution image synthesis. While the reported experiments are confined to unconditional and class-conditional image generation, the potential applications are broad, ranging from text-to-image generation to other modalities such as audio and video. The architecture, balancing efficiency and scalability, leaves ample room for future work, whether within latent diffusion setups, super-resolution, or other generative applications, promising a breadth of efficient diffusion-based generation methods.