DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation (2412.18597v1)

Published 24 Dec 2024 in cs.CV, cs.AI, and cs.MM

Abstract: Sora-like video generation models have achieved remarkable progress with a Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models predominantly focus on single-prompt generation, struggling to generate coherent scenes with multiple sequential prompts that better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, a training-free multi-prompt video generation method under MM-DiT architectures for the first time. Our key idea is to treat the multi-prompt video generation task as temporal video editing with smooth transitions. To achieve this goal, we first analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to that of the cross/self-attention blocks in UNet-like diffusion models, enabling mask-guided precise semantic control across different prompts with attention sharing for multi-prompt video generation. Based on our careful design, the video generated by DiTCtrl achieves smooth transitions and consistent object motion given multiple sequential prompts without additional training. In addition, we present MPVBench, a new benchmark specially designed to evaluate multi-prompt video generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance without additional training.

DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer

This paper addresses the challenge of multi-prompt video generation with DiTCtrl, a training-free method built on the Multi-Modal Diffusion Transformer (MM-DiT) architecture. Existing text-to-video models often fail to produce coherent video sequences when given multiple sequential prompts. DiTCtrl sidesteps additional training by recasting multi-prompt generation as a temporal video editing problem and using attention control to achieve semantic consistency and smooth transitions.

The research acknowledges the limitations of current video generation models that predominantly focus on single-prompt scenarios. While there are some efforts toward multi-prompt video generation, these are hampered by strict training requirements and ineffective prompt transitions. DiTCtrl stands out as a training-free solution, building on the MM-DiT architecture’s inherent strengths. The paper offers a meticulous analysis of MM-DiT's attention mechanism, revealing that its 3D full attention operates similarly to the cross/self-attention blocks found in UNet-like diffusion models. This insight is harnessed to apply mask-guided attention sharing, which maintains semantic stability across multiple prompts, thereby facilitating the generation of coherent and consistent video transitions.
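
To make this attention analysis concrete, the PyTorch snippet below is a minimal sketch, not the authors' implementation: the tensor layout, the fg_mask input, and both function names are assumptions made for exposition. It shows how full attention over concatenated text and video tokens can be sliced into cross-attention-like and self-attention-like regions, and how a foreground mask could gate key/value sharing between prompt segments.

```python
import torch

def split_mmdit_attention(q, k, v, n_text):
    """Slice 3D full attention over concatenated [text; video] tokens into
    regions that play the roles of cross- and self-attention.
    Assumed shapes: q, k, v are (batch, heads, n_text + n_video, head_dim)."""
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    text_to_video = attn[..., :n_text, n_text:]   # cross-attention-like: prompt tokens attending to video tokens
    video_to_video = attn[..., n_text:, n_text:]  # self-attention-like: video tokens attending to each other
    return attn @ v, text_to_video, video_to_video

def masked_kv_share(k_new, v_new, k_ref, v_ref, fg_mask):
    """Hypothetical mask-guided KV sharing over video-token keys/values:
    inside the foreground region (fg_mask, e.g. thresholded from a
    text-to-video attention map for the subject word), reuse the reference
    segment's keys/values so the subject stays consistent across prompts;
    elsewhere keep the current segment's own keys/values."""
    m = fg_mask.unsqueeze(-1)  # (batch, heads, n_video, 1), broadcast over head_dim
    return torch.where(m, k_ref, k_new), torch.where(m, v_ref, v_new)
```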

The paper also introduces MPVBench, a new benchmark tailored to multi-prompt video generation, covering diverse transition types and specialized metrics. On this benchmark, DiTCtrl achieves state-of-the-art transitions between prompts without any additional training, underscoring the method's computational efficiency and effectiveness.

Quantitatively, DiTCtrl achieves strong prompt adherence and produces long, coherent video sequences with smooth transitions. This is accomplished through mask-guided KV-sharing combined with a latent blending strategy that seamlessly connects consecutive video clip segments. Evaluations on the proposed MPVBench support the claim that DiTCtrl generates high-quality multi-prompt video sequences.
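
The latent blending step can be pictured as a position-weighted average over the latent frames shared by consecutive clips. The sketch below illustrates this idea under assumptions of this summary; the linear weighting schedule and the function name are hypothetical, not taken from the paper.

```python
import torch

def blend_overlapping_latents(latent_a, latent_b, overlap):
    """Blend two consecutive clip latents that share `overlap` frames.
    Assumed shapes: (frames, channels, height, width); the last `overlap`
    frames of latent_a depict the same moments as the first `overlap`
    frames of latent_b."""
    # Weight for clip A falls linearly from 1 to 0 across the shared frames,
    # so the transition between prompts is gradual rather than an abrupt cut.
    w = torch.linspace(1.0, 0.0, overlap).view(-1, 1, 1, 1)
    blended = w * latent_a[-overlap:] + (1.0 - w) * latent_b[:overlap]
    return torch.cat([latent_a[:-overlap], blended, latent_b[overlap:]], dim=0)
```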

The practical implications of DiTCtrl are far-reaching, offering potential advancements in applications such as automated video content creation and editing, where multiple narrative components need integration. On the theoretical side, the paper contributes to the understanding of how attention mechanisms in MM-DiT can be adapted for more nuanced video generation tasks, suggesting new pathways for enhancing diffusion transformer architectures.

In terms of future developments, the method opens avenues for further research by leveraging the scalability of DiT architectures to incorporate more complex motion dynamics and richer semantic transitions in longer videos. Refining the attention control mechanisms to handle more diverse and complex prompt scenarios could extend the technique to sophisticated multimedia content generation and other domains requiring fine-grained semantic control. DiTCtrl represents a significant advance in AI-driven video generation, offering a scalable, efficient, and coherent approach that aligns closely with real-world dynamic scenarios.

Authors (8)
  1. Minghong Cai (3 papers)
  2. Xiaodong Cun (61 papers)
  3. Xiaoyu Li (348 papers)
  4. Wenze Liu (12 papers)
  5. Zhaoyang Zhang (273 papers)
  6. Yong Zhang (660 papers)
  7. Ying Shan (252 papers)
  8. Xiangyu Yue (93 papers)