DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control
The paper introduces DreamVideo-2, a framework for zero-shot subject-driven video customization that generates videos featuring a specified subject and following a precise motion trajectory. Crucially, it requires no test-time fine-tuning, a step that remains a significant limitation of existing methods.
Key Innovations
- Reference Attention: The authors leverage the inherent capabilities of video diffusion models to extract multi-scale subject features through reference attention. This mechanism integrates the subject image as a single-frame video, enhancing subject identity representation during training without additional network overhead.
- Mask-Guided Motion Module: To achieve precise motion control, the paper proposes a mask-guided motion module that converts bounding-box sequences into binary box masks. The module, composed of a spatiotemporal encoder and a spatial ControlNet, substantially improves motion-control precision.
- Masking and Loss Design: A key challenge the authors identify is that motion control tends to dominate subject learning. To address this, they introduce masked reference attention, which uses blended masks to prioritize the subject region, and a reweighted diffusion loss that balances the contributions of subject learning and motion control. These ideas are illustrated in the sketch after this list.
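The paper gives the full designs of these components; the PyTorch sketch below is only a minimal illustration of three of the ideas above, under assumptions stated in the comments: feeding the subject image to the model as a single extra frame, converting a per-frame bounding-box sequence into binary box masks for the motion module, and a diffusion loss that up-weights the subject region. All tensor shapes, the weighting scheme, and the helper names (`as_reference_frame`, `boxes_to_masks`, `reweighted_diffusion_loss`) are illustrative, not the authors' implementation.

```python
import torch

def as_reference_frame(subject_latent, video_latents):
    """Prepend the subject image latent as an extra frame so existing
    attention layers can attend to it as a single-frame "video".
    subject_latent: [B, C, H, W]; video_latents: [B, F, C, H, W].
    (Where the paper's reference attention concatenates may differ in detail.)"""
    return torch.cat([subject_latent.unsqueeze(1), video_latents], dim=1)

def boxes_to_masks(boxes, height, width):
    """Convert a per-frame bounding-box sequence into binary box masks.
    boxes: [F, 4] tensor of (x1, y1, x2, y2) in pixels.
    Returns a [F, 1, height, width] tensor with 1 inside the box."""
    masks = torch.zeros(boxes.shape[0], 1, height, width)
    for f, (x1, y1, x2, y2) in enumerate(boxes.round().long().tolist()):
        x1, x2 = max(x1, 0), min(x2, width)
        y1, y2 = max(y1, 0), min(y2, height)
        masks[f, 0, y1:y2, x1:x2] = 1.0
    return masks

def reweighted_diffusion_loss(pred_noise, true_noise, subject_mask, subject_weight=2.0):
    """Noise-prediction MSE with errors inside the subject region up-weighted,
    so subject learning is not drowned out by motion control.
    pred_noise, true_noise: [B, F, C, H, W]; subject_mask: [B, F, 1, H, W].
    The subject_weight value here is arbitrary, not the paper's setting."""
    weights = 1.0 + (subject_weight - 1.0) * subject_mask
    return (weights * (pred_noise - true_noise) ** 2).mean()
```

In a training loop of this kind, the box masks would feed the motion module's spatiotemporal encoder and ControlNet branch, while the reweighted loss would replace the usual unweighted MSE term.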
Empirical Validation
DreamVideo-2 was evaluated on a newly curated dataset that is larger and more diverse than those used in prior work. The framework consistently outperforms state-of-the-art methods in both subject fidelity and motion-control precision: quantitative metrics such as mIoU and CD confirm superior motion control, while qualitative comparisons highlight its ability to generate coherent, subject-accurate videos.
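The paper defines its own evaluation protocol; as a point of reference, the sketch below shows how box-level motion-control metrics of this kind are typically computed, assuming mIoU is the Intersection over Union between generated and target subject boxes averaged over frames, and CD is the distance between their centroids, normalized here by the image diagonal. The function names and the normalization choice are assumptions, not the paper's exact definitions.

```python
import math

def box_iou(a, b):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def centroid_distance(a, b, width, height):
    """Distance between box centroids, normalized by the image diagonal."""
    ca = ((a[0] + a[2]) / 2, (a[1] + a[3]) / 2)
    cb = ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)
    return math.hypot(ca[0] - cb[0], ca[1] - cb[1]) / math.hypot(width, height)

def motion_metrics(pred_boxes, target_boxes, width, height):
    """Per-video mIoU (higher is better) and mean CD (lower is better)."""
    ious = [box_iou(p, t) for p, t in zip(pred_boxes, target_boxes)]
    cds = [centroid_distance(p, t, width, height) for p, t in zip(pred_boxes, target_boxes)]
    return sum(ious) / len(ious), sum(cds) / len(cds)
```

Higher mIoU and lower CD indicate that the generated subject follows the target trajectory more closely.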
Implications and Future Directions
The results suggest both practical and theoretical implications. The approach points toward user-centric video generation applications such as personalized content creation and interactive media. Remaining limitations include the difficulty of decoupling camera motion from object motion and the inherent constraints of the base diffusion model.
Future research could explore:
- Advanced Base Models: Integrating more powerful text-to-video models to capture complex scene dynamics and expand subject and motion variability.
- Decoupling Motion Controls: Developing advanced mechanisms to distinguish between camera and object motions could enhance the realism and applicability of generated content.
- Multi-subject and Multi-trajectory Learning: Expanding the framework to handle multiple subjects and trajectories concurrently will be crucial for broader real-world deployments.
In conclusion, DreamVideo-2 is a robust advance in customized video generation, pushing the boundaries of subject-driven customization without test-time fine-tuning. The paper presents a balanced and effective approach, laying the groundwork for future work on AI-driven video generation.