Scalable training of Diffusion Transformers at ultra-high resolutions
Develop training methodologies or architectures that enable Diffusion Transformers with self-attention to be trained at ultra-high image resolutions (e.g., 4K and beyond) without prohibitive memory or compute costs, by overcoming or circumventing the quadratic complexity of self-attention in the number of image tokens.
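To make the quadratic scaling concrete, the sketch below estimates per-layer attention cost as resolution grows. All constants (an 8x VAE downsampling factor, 2x2 patchification, hidden width 1152, fp16 activations) are illustrative assumptions, not values from the DyPE paper; only the quadratic trend itself is the point.

```python
# Back-of-the-envelope cost of full self-attention in a DiT-style
# Diffusion Transformer. All constants below are assumptions for
# illustration, not values taken from the source.

VAE_DOWNSAMPLE = 8   # assumed latent compression factor
PATCH_SIZE = 2       # assumed patch size in latent space
HIDDEN_DIM = 1152    # assumed model width (DiT-XL-like)
BYTES_PER_ELEM = 2   # fp16

def attention_cost(height_px: int, width_px: int):
    """Tokens, attention FLOPs, and attention-matrix memory for one layer."""
    tokens = (height_px // (VAE_DOWNSAMPLE * PATCH_SIZE)) * (
        width_px // (VAE_DOWNSAMPLE * PATCH_SIZE)
    )
    # QK^T and attn @ V each cost ~2 * tokens^2 * dim FLOPs.
    flops = 4 * tokens**2 * HIDDEN_DIM
    # One full tokens x tokens attention matrix if materialized naively
    # (flash-style kernels avoid storing it, but the FLOPs remain).
    attn_matrix_bytes = tokens**2 * BYTES_PER_ELEM
    return tokens, flops, attn_matrix_bytes

for res in (256, 1024, 2048, 4096):
    n, f, m = attention_cost(res, res)
    print(f"{res:>4}px: {n:>6} tokens, {f / 1e12:8.3f} TFLOPs/layer, "
          f"{m / 2**30:8.4f} GiB attn matrix")
```

Under these assumptions, a 4096x4096 image yields 65,536 tokens, so a single naively materialized attention matrix is ~8 GiB in fp16 and each layer costs ~20 TFLOPs, versus negligible cost at 256px, which is the quadratic blow-up the problem statement targets.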
References
Yet, training these architectures on ultra-high resolutions (e.g., 4K and beyond) remains an open challenge due to the quadratic cost of self-attention, which quickly becomes prohibitive in both memory and computation at such resolutions.
— DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion
(arXiv:2510.20766, Issachar et al., 23 Oct 2025), Related Work, first paragraph