The paper presents an in-depth study of transformer-based diffusion models for text-conditioned image and video synthesis. The work builds on prior transformer diffusion architectures by extending them beyond class-conditioned scenarios to free-form text conditioning and by significantly scaling model capacity. Detailed empirical evaluations underscore the advantages of transformer designs over traditional convolutional U-Net backbones for generative modeling, with particular emphasis on compositionality and spatial accuracy.
The paper addresses several key aspects:
- Text Conditioning Mechanisms:
The paper transitions from conventional class-based conditioning (as used in earlier diffusion models) to free-form text conditioning. It rigorously investigates different conditioning strategies, comparing adaptive layer normalization (adaLN) and cross-attention. The results indicate that while adaLN is effective for compact, global conditioning signals such as class labels, cross-attention is markedly superior for handling the spatially heterogeneous information present in natural language descriptions. The authors also perform a comparative study of text encoders, including multimodal models such as CLIP, pure language models such as Flan-T5, and their combination, to leverage complementary strengths in providing robust textual guidance.
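To make the contrast concrete, here is a minimal PyTorch sketch of the two conditioning routes. Module names, dimensions, and the residual placement are illustrative assumptions, not the paper's implementation: adaLN consumes a single pooled embedding and modulates per-channel scale/shift, whereas cross-attention lets image tokens attend over the full sequence of text-token embeddings.

```python
import torch
import torch.nn as nn

class AdaLNConditioning(nn.Module):
    """Conditioning via adaptive layer norm: one pooled embedding produces a
    per-channel scale and shift, with no spatial or per-word structure."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) image tokens; cond: (B, cond_dim) pooled text/class embedding
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class CrossAttnConditioning(nn.Module):
    """Conditioning via cross-attention: image tokens query the per-token
    text embeddings, preserving word-level, spatially usable detail."""
    def __init__(self, dim: int, cond_dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, kdim=cond_dim,
                                          vdim=cond_dim, batch_first=True)

    def forward(self, x: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) image tokens; text_tokens: (B, L, cond_dim) per-token embeddings
        h = self.norm(x)
        return x + self.attn(h, text_tokens, text_tokens, need_weights=False)[0]

# Usage with placeholder shapes (e.g., 77 text tokens, 256 latent tokens):
x = torch.randn(2, 256, 1024)
pooled, tokens = torch.randn(2, 768), torch.randn(2, 77, 768)
y_adaln = AdaLNConditioning(1024, 768)(x, pooled)
y_xattn = CrossAttnConditioning(1024, 768)(x, tokens)
```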
- Scaling Transformer-Based Diffusion:
A core contribution is the exploration of scaling transformer-based diffusion models. The baseline model, derived from DiT, is scaled from approximately 900 million parameters to over 3 billion parameters by increasing the model depth (number of transformer blocks), width (embedding dimensions), and MLP hidden dimensions. Quantitative evaluations on compositional benchmarks highlight consistent improvements across metrics such as attribute binding (color, shape, texture) and object relationships. Qualitative comparisons also reveal significant refinement in the spatial layout and fine details, particularly for complex prompts.
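As a back-of-the-envelope illustration of how these three axes drive parameter count, the helper below estimates the size of a plain transformer stack. The depth/width values are hypothetical placeholders rather than the paper's configurations, and the estimate deliberately omits cross-attention, adaLN modulation, embeddings, and biases, so it undercounts the full model.

```python
def approx_transformer_params(depth: int, width: int, mlp_dim: int) -> int:
    """Crude per-block count: 4*width^2 for the Q/K/V/output projections
    plus a two-layer MLP; multiplied by the number of blocks."""
    per_block = 4 * width * width + 2 * width * mlp_dim
    return depth * per_block

# Hypothetical configurations, illustrating how depth/width/MLP growth
# moves the count from hundreds of millions to billions of parameters.
for name, (depth, width) in {"base": (28, 1152), "scaled": (48, 2304)}.items():
    print(f"{name:>6}: ~{approx_transformer_params(depth, width, 4 * width) / 1e9:.2f}B")
```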
- Extension to Text-to-Video Generation:
The authors extend the framework to video synthesis by incorporating temporal modeling within each transformer block. A lightweight temporal self-attention layer is interleaved between the cross-attention and MLP modules, and the model is further adapted via joint image-video training. A novel technique called motion-free guidance (MFG) is introduced. Inspired by classifier-free guidance, MFG replaces the temporal self-attention mask with an identity matrix at a preset probability during training. This effectively disables motion modeling for the selected training steps, preserving per-frame visual quality while still capturing coherent temporal dynamics at generation time. The inference process uses a modified score estimate in which distinct guidance scales control the text and motion conditions independently.
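The PyTorch sketch below illustrates the two mechanics described above under stated assumptions: the class and argument names, the drop probability, and the guidance composition are placeholders for illustration, not the paper's exact formulation. The identity-mask trick is realized by keeping only the diagonal of the temporal attention matrix, and the guidance function shows one plausible way to combine separate text and motion scales in the style of classifier-free guidance.

```python
from typing import Optional
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """Temporal attention over frames at each spatial location. When
    motion_free is active, only the diagonal of the attention matrix is kept
    (identity attention), so no information mixes across frames."""
    def __init__(self, dim: int, heads: int = 8, p_motion_free: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.p_motion_free = p_motion_free  # hypothetical training-time probability

    def forward(self, x: torch.Tensor, motion_free: Optional[bool] = None) -> torch.Tensor:
        # x: (batch * spatial, frames, dim) -- tokens regrouped so attention runs over frames
        if motion_free is None:
            motion_free = self.training and torch.rand(()).item() < self.p_motion_free
        mask = None
        if motion_free:
            num_frames = x.shape[1]
            # Block every off-diagonal entry so the attention matrix is the identity.
            mask = ~torch.eye(num_frames, dtype=torch.bool, device=x.device)
        h = self.norm(x)
        return x + self.attn(h, h, h, attn_mask=mask, need_weights=False)[0]

def guided_eps(model, x_t, t, text_emb, w_text: float, w_motion: float):
    """Assumed composition of text and motion guidance (classifier-free style);
    `model` is a hypothetical callable and the exact formula in the paper may differ."""
    eps_both   = model(x_t, t, text=text_emb, motion_free=False)  # text + motion
    eps_motion = model(x_t, t, text=None,     motion_free=False)  # motion only, no text
    eps_none   = model(x_t, t, text=None,     motion_free=True)   # neither condition
    return (eps_none
            + w_text   * (eps_both   - eps_motion)
            + w_motion * (eps_motion - eps_none))
```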
- Empirical and Comparative Evaluation:
Extensive experiments demonstrate that the cross-attention mechanism and the joint use of CLIP and T5-based text embeddings yield superior performance compared to alternative approaches. The scaled model (over 3B parameters) shows robust improvements on benchmarks like T2I-CompBench, with a significant margin in compositional tasks. Human studies reveal that the proposed method attains higher win rates in visual quality and text alignment compared to competitive baselines such as SDXL. The paper also provides detailed ablations showing that motion-free guidance leads to enhanced focus on the key objects mentioned in prompts and improved overall video quality.
- Implementation Details and Training Strategies:
The training procedure is multi-staged, starting with low-resolution image generation and progressively moving to higher resolutions. The methodology uses advanced parallelization techniques like Fully Sharded Data Parallel (FSDP) and activation checkpointing to manage the increased computational requirements of the larger model. The video training leverages a large-scale dataset with joint image-video samples to counterbalance the limited quality and quantity of available video data. Furthermore, the paper details theoretical formulations of the diffusion process, including equations for the forward and reverse diffusion steps, and the incorporation of temporal self-attention, providing mathematical clarity on the underlying processes.
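As a rough illustration of how FSDP and activation checkpointing fit together in PyTorch, the sketch below wraps each transformer block as an FSDP unit and recomputes block activations during the backward pass. The module names and dimensions are placeholders, not the paper's code, and a distributed process group (e.g., launched via torchrun) is assumed to be initialized.

```python
import functools
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from torch.utils.checkpoint import checkpoint

class TransformerBlock(nn.Module):
    """Stand-in for a DiT-style block; its forward activations are recomputed
    in the backward pass (activation checkpointing) to save memory."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def _block(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

    def forward(self, x):
        return checkpoint(self._block, x, use_reentrant=False)

def shard_model(model: nn.Module) -> FSDP:
    """Shard parameters, gradients, and optimizer state across ranks,
    treating each TransformerBlock as one FSDP wrapping unit."""
    policy = functools.partial(transformer_auto_wrap_policy,
                               transformer_layer_cls={TransformerBlock})
    return FSDP(model, auto_wrap_policy=policy)
```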
Overall, the paper offers a comprehensive exploration of how transformer architectures can be effectively integrated with diffusion models for both image and video generation. It not only illustrates the scalability advantages of transformer-based models in this domain but also pioneers strategies, such as motion-free guidance, that mitigate inherent challenges in video synthesis while maintaining high visual fidelity and compositional consistency.