Multi-Task Video Generative Foundation Model with Full Attention: FullDiT
The paper introduces FullDiT, a video generative foundation model that addresses the limitations of existing models, which focus primarily on text-to-video generation. Adapter-based control methods such as ControlNet and T2I-Adapter struggle to incorporate multiple conditions: independently trained adapters conflict with one another, redundant parameters raise computational cost, and the results underperform full fine-tuning. FullDiT overcomes these challenges with a unified architecture that uses full self-attention to integrate multi-task conditions, providing scalability and unlocking emergent capabilities.
The core innovation of FullDiT is its unified attention framework. By treating all conditions as parts of a single coherent token sequence, FullDiT learns robust cross-modal representations and models complex temporal and spatial correlations directly, resolving the conflicts that plague adapter-based methods and eliminating the redundant parameters of separate adapters. As a result, FullDiT can be trained end-to-end efficiently while maintaining high-quality video generation across diverse tasks.
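To make the idea concrete, here is a minimal PyTorch sketch (not the authors' code) of a full-attention block operating on one concatenated sequence of condition tokens and video latent tokens; the class, argument names, and dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class UnifiedFullAttentionBlock(nn.Module):
    """Sketch of a transformer block that applies full self-attention over a
    single sequence containing both condition tokens and video latent tokens."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, video_tokens: torch.Tensor, condition_tokens: torch.Tensor):
        # Concatenate conditions and video latents into one sequence:
        # shape (batch, num_condition_tokens + num_video_tokens, dim).
        x = torch.cat([condition_tokens, video_tokens], dim=1)
        h = self.norm1(x)
        # Full self-attention: every token attends to every other token,
        # so condition information reaches the video latents directly.
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        # Return only the video part of the sequence for the next stage.
        return x[:, condition_tokens.shape[1]:]
```

Because every token attends to every other token in one shared sequence, no modality-specific cross-attention branches or adapters are needed, which is the property the unified framework relies on.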
FullDiT supports flexible task incorporation, accepting various input combinations and modalities beyond text, such as camera trajectories, character identities, and depth maps. This adaptability is crucial for applications in creative industries, where multifaceted control is paramount. The architecture is designed to accommodate additional modalities without major modification, broadening its applicability, as the sketch below illustrates.
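As a hedged illustration of how such flexible combinations might be wired, the sketch below tokenizes whichever condition modalities are present and concatenates them before the shared full-attention backbone; all names, dimensions, and encoders here are hypothetical, not FullDiT's actual interface.

```python
import torch
import torch.nn as nn

class ConditionTokenizer(nn.Module):
    """Projects raw features of one condition modality into the shared token space."""

    def __init__(self, in_dim: int, dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # (batch, tokens, in_dim) -> (batch, tokens, dim)
        return self.proj(features)

def build_condition_sequence(available: dict, tokenizers: dict) -> torch.Tensor:
    """Concatenate tokens for whichever condition modalities are provided."""
    parts = [tokenizers[name](feats) for name, feats in available.items()]
    return torch.cat(parts, dim=1)

# Example: a camera-plus-depth conditioned sample; text or identity tokens could
# be added or dropped without changing the backbone.
tokenizers = {
    "camera": ConditionTokenizer(in_dim=12),   # e.g. per-frame pose parameters
    "depth": ConditionTokenizer(in_dim=256),   # e.g. patchified depth features
}
sample = {
    "camera": torch.randn(1, 16, 12),
    "depth": torch.randn(1, 64, 256),
}
condition_tokens = build_condition_sequence(sample, tokenizers)  # (1, 80, 512)
```

Because each modality only needs a lightweight projection into the shared token space, adding a new condition type amounts to adding one tokenizer rather than retraining a separate adapter.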
Empirical results demonstrate FullDiT's ability to manage multiple video generation tasks simultaneously. Evaluated against standard benchmarks, FullDiT consistently achieves state-of-the-art results on metrics of visual quality, condition fidelity, and task flexibility. The paper also introduces FullBench, the first dedicated benchmark for multi-condition video generation, which further highlights FullDiT's strength in these complex settings.
On the theoretical side, FullDiT's unified approach paves the way for more integrated and adaptable generative models, pointing toward video generation frameworks that handle multimodal conditions with greater efficiency and effectiveness. Practically, its ability to incorporate multiple simultaneous inputs with minimal overhead promises advances in fields that require precise video control, from automated filmmaking to interactive digital content creation.
The implications of this research are broad. FullDiT shows how multi-task learning principles can be adapted to generative models, marking a shift from siloed adapter designs to unified frameworks that accommodate wide-ranging inputs. As integrating additional modalities and conditions into generative models becomes feasible, FullDiT exemplifies how AI-driven applications can grow more robust, versatile, and aligned with complex real-world demands.
In summary, FullDiT presents a compelling case study in overcoming traditional video generation limitations, leveraging full attention to manage diverse input conditions within a single model. Its contribution is both practical, offering a scalable solution for creative industries, and theoretical, suggesting new directions for unified generative architectures that balance computational efficiency with high output quality.