Scalable training of Diffusion Transformers at ultra-high resolutions

Develop training methodologies or architectures that enable Diffusion Transformers with self-attention to be trained at ultra-high image resolutions (e.g., 4K and beyond) without prohibitive memory and compute costs, by overcoming or circumventing the quadratic complexity of self-attention in the number of image tokens.

Background

Diffusion Transformers combine diffusion modeling with transformer-based self-attention, yielding strong text-to-image performance but incurring costs that grow quadratically in the number of tokens. This makes direct training at ultra-high resolutions (such as 4K and above) prohibitive in both compute and memory.
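For a sense of scale, the following back-of-the-envelope sketch estimates how the token count and the attention cost grow with resolution. The 8x VAE downsampling factor, 2x2 patchification, and 64-dimensional attention head are common DiT-style settings assumed here for illustration; they are not taken from the paper.

```python
# Rough estimate of self-attention cost versus image resolution.
# Assumptions (illustrative, not from the paper): 8x VAE downsampling,
# 2x2 patchify, 64-dimensional attention head.

def attention_cost(height_px: int, width_px: int,
                   vae_factor: int = 8, patch: int = 2, head_dim: int = 64):
    """Return (num_tokens, attention_matrix_entries, approx_flops_per_head)."""
    tokens = (height_px // vae_factor // patch) * (width_px // vae_factor // patch)
    attn_entries = tokens ** 2          # the attention score matrix is N x N
    flops = 2 * tokens ** 2 * head_dim  # QK^T and softmax(.)V, per head per layer
    return tokens, attn_entries, flops

for res in (1024, 2048, 4096):
    n, entries, flops = attention_cost(res, res)
    print(f"{res}x{res}: {n:>7,} tokens, {entries:>14,} attention entries, "
          f"~{flops:.2e} FLOPs/head/layer")
```

Under these assumptions, moving from 1024x1024 to 4096x4096 multiplies the token count by 16 and the attention cost by 256, which is why training directly at 4K quickly becomes infeasible.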

The DyPE method introduced in the paper focuses on inference-time positional encoding adaptation to improve ultra-high-resolution generation without retraining, but it does not solve the fundamental training scalability issue. Consequently, establishing feasible training approaches for ultra-high-resolution Diffusion Transformers remains unresolved.

References

Yet, training these architectures on ultra-high resolutions (e.g., 4K and beyond) remains an open challenge due to the quadratic cost of self-attention, which quickly becomes prohibitive in both memory and computation at such resolutions.

DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion (2510.20766 - Issachar et al., 23 Oct 2025) in Related Work, first paragraph