An Expert Review of "MotionBooth: Motion-Aware Customized Text-to-Video Generation"
The paper "MotionBooth" introduces an advanced framework aimed at enhancing text-to-video (T2V) generation by incorporating motion awareness specific to customized subjects. This framework seeks to address the dual challenge of preserving the subject's fidelity while simultaneously injecting nuanced object and camera movements. Authored by Jianzong Wu and colleagues, the paper is a noteworthy contribution to the field of deep learning-based video generation.
Overview and Methodology
MotionBooth fine-tunes a base T2V diffusion model on a few images of the target subject to capture its appearance accurately. The framework keeps the subject's appearance faithful during generation while integrating motion controls at inference time.
Key Innovations
- Subject Region Loss: To mitigate background overfitting, the authors introduce a subject region loss. By restricting the diffusion reconstruction loss to the subject region, defined by binary masks, the model avoids memorizing the specific backgrounds of the training images and generalizes to diverse video backgrounds (a minimal sketch of this loss, together with the video preservation loss, follows this list).
- Video Preservation Loss: Recognizing that fine-tuning on images can degrade the model's video generation capability, the paper proposes a video preservation loss. By incorporating generic video data rather than class-specific videos, this loss helps maintain the diverse motion priors of the base T2V model while accommodating the new subject.
- Subject Token Cross-Attention (STCA) Loss: To facilitate precise subject motion control during video generation, the STCA loss is introduced. This mechanism ties the special token representing the customized subject to its spatial position in the cross-attention maps, enabling explicit control during inference (a sketch also follows this list).
- Training-Free Motion Control Techniques: During inference, MotionBooth controls both subject and camera motion without additional training. Subject motion is managed by manipulating cross-attention maps, while a novel latent shift module governs camera movement by directly shifting the noised latent (a latent-shift sketch follows this list as well).
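To make the two training losses concrete, the following PyTorch sketch shows one way they could be combined. The function names, tensor shapes, and the weighting factor lam are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def subject_region_loss(pred_noise, target_noise, subject_mask):
    # pred_noise, target_noise: (B, C, H, W) predicted and ground-truth noise
    # subject_mask: (B, 1, H, W) binary mask, 1 inside the subject region
    per_pixel = F.mse_loss(pred_noise, target_noise, reduction="none")
    masked = per_pixel * subject_mask
    # Average over subject pixels only, so the backgrounds of the
    # customization images never contribute to the gradient.
    return masked.sum() / (subject_mask.sum() * pred_noise.shape[1] + 1e-8)

def video_preservation_loss(pred_noise, target_noise):
    # Plain diffusion loss on clips from a generic video dataset,
    # intended to keep the base model's motion prior intact.
    return F.mse_loss(pred_noise, target_noise)

def total_loss(pred_img, tgt_img, mask, pred_vid, tgt_vid, lam=1.0):
    # lam is an assumed weighting hyperparameter, not a value from the paper.
    return subject_region_loss(pred_img, tgt_img, mask) + lam * video_preservation_loss(pred_vid, tgt_vid)
```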
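The STCA objective can be pictured as a penalty on attention mass that leaks outside the subject mask. The sketch below is a hedged reading of the description above; the tensor layout and the exact penalty are assumptions rather than the paper's definition.

```python
def stca_loss(cross_attn, subject_token_idx, subject_mask):
    # cross_attn: (B, heads, H*W, num_text_tokens) attention weights from
    #     spatial positions to text tokens at one cross-attention layer
    # subject_token_idx: index of the special token bound to the subject
    # subject_mask: (B, 1, H*W) binary mask resized to this layer's resolution
    attn = cross_attn[..., subject_token_idx]        # (B, heads, H*W)
    inside = (attn * subject_mask).sum(dim=-1)       # attention mass on the subject
    total = attn.sum(dim=-1) + 1e-8
    # Penalize attention that falls outside the subject region so the
    # token's attention map can be steered explicitly at inference time.
    return (1.0 - inside / total).mean()
```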
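For camera control, the latent shift idea can be approximated by translating the noised latent by a small offset at each denoising step. The sketch below uses a simple torch.roll and is an assumed interface; the paper's module may treat newly exposed borders differently.

```python
import torch

def shift_latent(latent, dx, dy):
    # latent: (B, C, T, H, W) noised video latent at the current denoising step
    # dx, dy: camera displacement in latent-grid cells for this step.
    # A positive dx pans the scene left, i.e. the camera appears to move right.
    # torch.roll wraps content around the border; wrapped-in regions are
    # effectively re-inpainted by the remaining denoising steps.
    return torch.roll(latent, shifts=(-dy, -dx), dims=(-2, -1))

# Example: simulate a slow rightward pan by shifting one cell per step.
# latent = shift_latent(latent, dx=1, dy=0)
```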
Empirical Validation
Quantitative Results
The experimental setup utilized different T2V base models, such as Zeroscope and LaVie, to evaluate the efficacy of MotionBooth. Metrics include region CLIP similarity (R-CLIP), region DINO similarity (R-DINO), and flow error, among others (a sketch of the R-CLIP computation appears after the list below). The results show that MotionBooth outperforms state-of-the-art methods such as DreamBooth, CustomVideo, and DreamVideo on several key metrics:
- For Zeroscope, R-CLIP and R-DINO were 0.667 and 0.306, respectively, indicating superior subject fidelity.
- Flow error, an indicator of motion precision, dropped to 0.252, reflecting more faithful camera control.
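To make the region-restricted metrics concrete, the sketch below shows one plausible way to compute a region CLIP similarity with the Hugging Face CLIP implementation: score the prompt against each frame with the background masked out. The masking and averaging choices are assumptions; the paper's evaluation protocol may differ.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def region_clip_score(frames, masks, prompt):
    # frames: list of (H, W, 3) uint8 numpy arrays (video frames)
    # masks:  list of (H, W, 1) {0, 1} numpy arrays marking the subject
    # Blacking out the background restricts the CLIP image-text similarity
    # to the subject region.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    scores = []
    for frame, mask in zip(frames, masks):
        masked = (frame * mask).astype("uint8")
        inputs = processor(text=[prompt], images=masked,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        scores.append((img * txt).sum().item())  # cosine similarity per frame
    return sum(scores) / len(scores)
```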
Qualitative Results
Qualitative comparisons illustrated that MotionBooth generates videos with better subject fidelity and motion alignment, avoiding the common pitfall of over-smoothed backgrounds observed in baseline methods. Improvements in temporal consistency and video quality were particularly notable.
Implications and Future Work
Practical Implications
The proposed MotionBooth framework holds substantial promise for practical applications in personalized content creation, short films, and animated stories. The capacity to generate high-fidelity, customized video content with controlled motion can significantly reduce production costs and time, democratizing access to professional-grade video generation tools.
Theoretical Implications
The innovative loss functions and training-free motion control techniques contribute to the broader understanding of integrating subject-specific features with motion dynamics in T2V generation. These findings encourage further exploration into the modular optimization of diffusion models for multi-faceted tasks.
Future Developments
Future avenues for research include:
- Enhancing the framework's ability to handle multi-object scenarios.
- Exploring more sophisticated masking and segmentation techniques for improved subject-background differentiation.
- Extending the framework to utilize more diverse and enriched datasets for better generalization.
Conclusion
The "MotionBooth" paper presents a sophisticated and effective approach to motion-aware, customized T2V generation, tackling significant challenges in the field with innovative solutions. The comprehensive experimental validation underlines its robustness and potential impact. This framework not only advances the state of the art but also sets a strong foundation for future research and practical implementations in AI-driven video generation.