Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion (2402.03162v2)

Published 5 Feb 2024 in cs.CV

Abstract: Recent text-to-video diffusion models have achieved impressive progress. In practice, users often desire the ability to control object motion and camera movement independently for customized video creation. However, current methods lack the focus on separately controlling object motion and camera movement in a decoupled manner, which limits the controllability and flexibility of text-to-video models. In this paper, we introduce Direct-a-Video, a system that allows users to independently specify motions for multiple objects as well as camera's pan and zoom movements, as if directing a video. We propose a simple yet effective strategy for the decoupled control of object motion and camera movement. Object motion is controlled through spatial cross-attention modulation using the model's inherent priors, requiring no additional optimization. For camera movement, we introduce new temporal cross-attention layers to interpret quantitative camera movement parameters. We further employ an augmentation-based approach to train these layers in a self-supervised manner on a small-scale dataset, eliminating the need for explicit motion annotation. Both components operate independently, allowing individual or combined control, and can generalize to open-domain scenarios. Extensive experiments demonstrate the superiority and effectiveness of our method. Project page and code are available at https://direct-a-video.github.io/.

Authors (8)
  1. Shiyuan Yang (5 papers)
  2. Liang Hou (24 papers)
  3. Haibin Huang (60 papers)
  4. Chongyang Ma (52 papers)
  5. Pengfei Wan (86 papers)
  6. Di Zhang (231 papers)
  7. Xiaodong Chen (31 papers)
  8. Jing Liao (100 papers)
Citations (44)

Summary

  • The paper introduces a novel system that independently controls camera movements and object trajectories in text-to-video diffusion models.
  • It employs self-supervised temporal cross-attention layers to interpret camera pan and zoom parameters, and training-free spatial cross-attention modulation to follow user-drawn object trajectories.
  • Empirical results show improved FID-vid and FVD scores, highlighting the model's capacity for producing high-quality, motion-controlled videos.

Introduction

In the field of text-to-video (T2V) diffusion models, there has been remarkable progress, especially in generating videos with controllable elements. However, a significant limitation of current methods is the lack of independent control over two critical aspects: object motion and camera movement. This paper introduces Direct-a-Video, a novel system that empowers users to direct video generation by independently manipulating camera movement as well as the motion of one or multiple objects within the video.

Camera Movement Control

The paper parameterizes camera movements such as panning and zooming and controls them through newly introduced temporal cross-attention layers, collectively termed the camera module. These layers learn to interpret quantitative camera-movement parameters and are trained in a self-supervised manner: augmentation is used to simulate camera motion on existing clips, so no explicit motion annotation is required. The effectiveness of this design is reflected in the quantitative results, where lower flow error scores indicate more precise camera control.
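
As a rough illustration of how such a camera module might be wired up, here is a minimal PyTorch-style sketch. The module and helper names (CameraTemporalCrossAttention, fourier_embed, simulate_pan_zoom), the (pan_x, pan_y, zoom) parameter triplet, and all dimensions are assumptions made for exposition, not the authors' released implementation.

```python
# Minimal sketch, assuming a PyTorch-style UNet backbone. Names, the Fourier
# embedding, and tensor shapes are illustrative guesses, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def fourier_embed(params: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Embed raw camera parameters (B, 3) into (B, 3 * 2 * num_freqs)."""
    freqs = (2.0 ** torch.arange(num_freqs, device=params.device)) * torch.pi
    angles = params.unsqueeze(-1) * freqs              # (B, 3, num_freqs)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)


class CameraTemporalCrossAttention(nn.Module):
    """Per-frame features attend to a token built from (pan_x, pan_y, zoom)."""

    def __init__(self, dim: int = 320, num_freqs: int = 8, heads: int = 8):
        super().__init__()
        self.num_freqs = num_freqs
        self.cam_proj = nn.Linear(3 * 2 * num_freqs, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, cam_params: torch.Tensor) -> torch.Tensor:
        # x: (B, F, dim) features along the temporal axis; cam_params: (B, 3)
        cam_tok = self.cam_proj(fourier_embed(cam_params, self.num_freqs)).unsqueeze(1)
        out, _ = self.attn(query=x, key=cam_tok, value=cam_tok)
        return x + out                                  # residual injection


def simulate_pan_zoom(frames: torch.Tensor, pan_x: float, pan_y: float,
                      zoom: float, out_size: int = 256) -> torch.Tensor:
    """Create pseudo camera motion (and a free training label) by sliding and
    shrinking a crop window across a clip. frames: (F, C, H, W); pan in [-1, 1];
    zoom >= 1 (zoom-in)."""
    n, _, h, w = frames.shape
    crops = []
    for t in range(n):
        a = t / max(n - 1, 1)                           # progress in [0, 1]
        scale = 1.0 / (1.0 + a * (zoom - 1.0))          # window shrinks while zooming in
        cw, ch = int(w * scale), int(h * scale)
        cx = int((w - cw) * 0.5 * (1.0 + a * pan_x))
        cy = int((h - ch) * 0.5 * (1.0 + a * pan_y))
        crop = frames[t:t + 1, :, cy:cy + ch, cx:cx + cw]
        crops.append(F.interpolate(crop, size=(out_size, out_size),
                                   mode="bilinear", align_corners=False))
    return torch.cat(crops, dim=0)                      # label: (pan_x, pan_y, zoom)
```

In this sketch, the crop-window helper yields both an augmented clip and a free (pan_x, pan_y, zoom) label, mirroring the self-supervised training idea; only the new cross-attention parameters would need to be trained.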

Object Motion Control

Direct-a-Video also showcases a training-free strategy for object motion control, which is particularly valuable given the scarcity of video datasets with detailed motion annotations. By modulating the spatial cross-attention mechanisms within the T2V model, the system lets users guide an object's trajectory simply by drawing bounding boxes on key frames, which the model interpolates into a full motion path. This removes the need for intensive data collection and enables finer control over the motion of multiple objects simultaneously.
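
To make the training-free control concrete, the following sketch interpolates user-drawn key-frame boxes into per-frame boxes and converts each into an additive bias on the spatial cross-attention logits of an object's text token. The helper names (interpolate_boxes, box_attention_bias) and the simple inside/outside bias rule are illustrative assumptions rather than the paper's exact modulation scheme.

```python
# Minimal sketch, assuming boxes are given in normalized [0, 1] coordinates.
# Helper names and the additive-bias rule are assumptions for illustration.
from typing import Dict

import torch


def interpolate_boxes(key_boxes: Dict[int, torch.Tensor], num_frames: int) -> torch.Tensor:
    """key_boxes maps frame index -> (x0, y0, x1, y1). Returns (F, 4) per-frame boxes."""
    keys = sorted(key_boxes)
    out = torch.zeros(num_frames, 4)
    for t in range(num_frames):
        if t <= keys[0]:                       # hold the first box before the first key frame
            out[t] = key_boxes[keys[0]]
        elif t >= keys[-1]:                    # hold the last box after the last key frame
            out[t] = key_boxes[keys[-1]]
        else:                                  # linear interpolation between surrounding keys
            lo = max(k for k in keys if k <= t)
            hi = min(k for k in keys if k >= t)
            a = 0.0 if hi == lo else (t - lo) / (hi - lo)
            out[t] = (1 - a) * key_boxes[lo] + a * key_boxes[hi]
    return out


def box_attention_bias(boxes: torch.Tensor, h: int, w: int,
                       strength: float = 5.0) -> torch.Tensor:
    """Additive bias (F, h * w) for one object's text token: +strength inside
    its box, -strength outside, to be added to spatial cross-attention logits."""
    num_frames = boxes.shape[0]
    ys = torch.linspace(0, 1, h).view(h, 1).expand(h, w)
    xs = torch.linspace(0, 1, w).view(1, w).expand(h, w)
    bias = torch.full((num_frames, h, w), -strength)
    for t in range(num_frames):
        x0, y0, x1, y1 = boxes[t]
        inside = (xs >= x0) & (xs <= x1) & (ys >= y0) & (ys <= y1)
        bias[t][inside] = strength
    return bias.view(num_frames, h * w)
```

At sampling time, such a bias would be added to the attention scores of the corresponding prompt token at each denoising step, steering where the object appears frame by frame without any fine-tuning.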

Empirical Validation

Extensive experiments validate the model's superiority and its capability for customized video creation. It outperforms existing models on several quantitative measures, including FID-vid and FVD, establishing its ability to generate high-quality, motion-controlled videos. Moreover, the paper shows that decoupled control does not compromise the visual continuity or intrinsic coherence of the generated videos.

Conclusion

Direct-a-Video stands out for its unique ability to decouple and manipulate camera and object motion independently in the video generation process. It offers a level of precision and user-friendliness that marks a significant advancement in generative video modeling. This leap opens new doors for users to create engaging and dynamic content, mimicking the nuanced control of a film director. Future directions could explore refining the conflict resolution between camera and object inputs and expanding the capabilities for more complex 3D camera movements.
