Decoupled Diffusion Transformer: An Overview
The paper presents the Decoupled Diffusion Transformer (DDT), an architecture for diffusion-based image generation. By splitting the network into separate components for semantic extraction and velocity decoding, the authors address an optimization dilemma inherent to traditional diffusion transformers.
Key Innovations and Methodology
The core innovation of the DDT lies in its architectural decoupling. Traditional diffusion transformers face an optimization dilemma: the same modules must both encode low-frequency semantic content and decode high-frequency detail, and these two tasks place conflicting demands on the network, limiting how well it can do either. To resolve this, the DDT splits the network into a condition encoder and a velocity decoder.
The condition encoder extracts self-condition features, which carry the semantic content, from the noisy input. The velocity decoder then combines these features with the noisy latent to predict the velocity field, recovering high-frequency detail. This division lets each module specialize, improving both training efficiency and output quality.
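The sketch below illustrates this split in PyTorch, assuming standard pre-norm transformer blocks. The class names (ConditionEncoder, VelocityDecoder), the additive conditioning scheme, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the encoder/decoder split described above.
# Block counts, widths, and the additive conditioning are assumptions.
import torch
import torch.nn as nn

class Block(nn.Module):
    """A plain pre-norm transformer block standing in for DiT-style blocks."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class ConditionEncoder(nn.Module):
    """Extracts low-frequency self-condition features z from the noisy tokens."""
    def __init__(self, dim: int, depth: int):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x_t, t_emb, y_emb):
        h = x_t + (t_emb + y_emb).unsqueeze(1)  # inject timestep/class conditioning
        for blk in self.blocks:
            h = blk(h)
        return h                                 # self-condition features z

class VelocityDecoder(nn.Module):
    """Decodes high-frequency detail: predicts velocity from x_t given z."""
    def __init__(self, dim: int, depth: int):
        super().__init__()
        self.proj_z = nn.Linear(dim, dim)
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))
        self.out = nn.Linear(dim, dim)

    def forward(self, x_t, z):
        h = x_t + self.proj_z(z)                 # condition on encoder features
        for blk in self.blocks:
            h = blk(h)
        return self.out(h)                       # predicted velocity field

# Toy usage: batch of 2 sequences of 16 latent tokens, width 64.
x_t = torch.randn(2, 16, 64)
t_emb, y_emb = torch.randn(2, 64), torch.randn(2, 64)
z = ConditionEncoder(64, depth=4)(x_t, t_emb, y_emb)
v = VelocityDecoder(64, depth=2)(x_t, z)
print(v.shape)  # torch.Size([2, 16, 64])
```

The key point is the interface: the decoder never re-derives semantics on its own; it receives them as z from the encoder, which is what makes the sharing strategy described next possible.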
To further reduce inference cost, the authors use a statistical dynamic programming approach to identify encoder sharing strategies: because the self-condition features change slowly between adjacent denoising steps, the encoder's output can be cached and reused across steps without compromising output quality.
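To make the idea concrete, here is a hedged sketch of one plausible formulation: allow the encoder to actually run at only a fixed budget of steps, and choose those steps with a dynamic program that minimizes an offline-measured reuse penalty. The function name, the dissim statistic, and the objective are illustrative assumptions; the paper's exact formulation may differ.

```python
# Sketch of a dynamic program that picks which denoising steps recompute
# encoder features versus reuse cached ones. The pairwise penalty
# dissim[i][j] (cost of reusing step i's features at step j) is assumed
# to be measured offline on a validation set.
from functools import lru_cache

def plan_encoder_sharing(dissim, num_steps, budget):
    """Choose `budget` steps at which to run the encoder, minimizing the
    total penalty of reusing each chosen step's features until the next run."""

    def segment_cost(i, j):
        # Penalty of running the encoder at step i and reusing its
        # features for all steps in [i, j).
        return sum(dissim[i][k] for k in range(i, j))

    @lru_cache(maxsize=None)
    def best(start, runs_left):
        if start == num_steps:
            return 0.0, ()
        if runs_left == 0:
            return float("inf"), ()
        best_cost, best_plan = float("inf"), ()
        for end in range(start + 1, num_steps + 1):
            tail_cost, tail_plan = best(end, runs_left - 1)
            cost = segment_cost(start, end) + tail_cost
            if cost < best_cost:
                best_cost, best_plan = cost, (start,) + tail_plan
        return best_cost, best_plan

    return best(0, budget)

# Toy example: 6 denoising steps, budget of 3 encoder evaluations.
# Here the penalty grows as reused features drift from the current step.
T = 6
dissim = [[abs(i - j) * 0.1 for j in range(T)] for i in range(T)]
cost, recompute_at = plan_encoder_sharing(dissim, T, budget=3)
print(cost, recompute_at)  # steps at which the encoder actually runs
```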
Experimental Results and Numerical Highlights
Empirical evaluation on ImageNet demonstrates the DDT's performance. On ImageNet 256×256, the DDT-XL/2 variant achieves a Fréchet Inception Distance (FID) of 1.31, a marked improvement over existing diffusion transformers, while converging nearly four times faster than prior models. On ImageNet 512×512, it achieves an FID of 1.28, underscoring its ability to handle high-resolution image generation effectively.
Practical and Theoretical Implications
Practically, the DDT's design reduces the computational cost typically associated with diffusion models. Because it accelerates both training and inference while maintaining output quality, it is well suited to applications that demand real-time image generation or operate under resource constraints.
Theoretically, the decoupled approach prompts a reevaluation of the traditional coupled methods employed in transformer-based models, suggesting that a modular approach may offer benefits beyond the scope of image generation. The insights gleaned from this research open avenues for exploring similar architectural innovations in other areas of machine learning, including natural language processing and video synthesis, potentially leading to advancements in both speed and performance.
Future Directions
Looking ahead, there are several intriguing directions for further exploration:
- Generalization to Other Tasks: Investigating whether the decoupled architecture can be effectively adapted for other tasks within and outside of computer vision.
- Scalability and Robustness: Further work could explore the scalability of DDT on larger and more diverse datasets, as well as its robustness under different noise conditions or in limited-data settings.
- Cross-disciplinary Applications: Given the successful decoupling of complex tasks in image processing, similar methodologies could be tested in contexts like audio processing or autonomous systems.
In summary, the Decoupled Diffusion Transformer represents a noteworthy advance in the design of transformer-based models for generative tasks. Its efficient architecture not only improves performance metrics on contemporary datasets but also challenges existing paradigms, paving the way for innovation across various applications in artificial intelligence.