CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities
Overview
The paper introduces CustomCrafter, an advanced framework designed to enhance customized video generation. This framework addresses key deficiencies in Video Diffusion Models (VDMs) related to concept combination and motion generation, particularly when the model is fine-tuned for subject-specific video synthesis. Through the introduction of two main methodological advances—Spatial Subject Learning Module (SSLM) and Dynamic Weighted Video Sampling Strategy (DWVSS)—the paper showcases significant improvements in generating high-quality, subject-specific videos without the need for additional video guidance or extensive model re-tuning.
Contributions
Spatial Subject Learning Module (SSLM)
The SSLM aims to enhance the model’s ability to capture and combine the appearance of novel subjects with other concepts. Current methods often update only cross-attention parameters, which limits the model's capacity to integrate new subjects effectively. SSLM instead updates both spatial cross-attention and self-attention layers. This approach leverages Low-Rank Adaptation (LoRA) to fine-tune the parameters, thereby maintaining the geometric and shape details central to concept combinations. The pluggable design of these modules allows for selective application during different stages of the denoising process, making them effective in both capturing new subject details and preserving concept combination abilities.
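The described update mechanism can be sketched as a LoRA wrapper around the attention projection layers. The code below is an illustrative reconstruction, not the authors' implementation: the class `LoRALinear`, the helper name, and the `to_q`/`to_k`/`to_v`/`to_out` projection naming (borrowed from common diffusion-model codebases) are all assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank residual:
    y = W x + scale * B(A(x)), with A: d_in -> r and B: r -> d_out."""
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # LoRA path starts as a no-op
        self.scale = scale

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

def add_lora_to_spatial_attention(model: nn.Module, rank: int = 4) -> nn.Module:
    """Wrap the q/k/v/out projections of attention blocks with LoRA.
    In a real VDM one would additionally restrict this to the *spatial*
    self- and cross-attention blocks, leaving temporal layers untouched;
    the name matching here is a guess at a typical layout."""
    for module in list(model.modules()):
        for child_name, child in list(module.named_children()):
            if isinstance(child, nn.Linear) and child_name in (
                "to_q", "to_k", "to_v", "to_out"
            ):
                setattr(module, child_name, LoRALinear(child, rank=rank))
    return model
```

Because the `up` projection is zero-initialized, the wrapped model initially reproduces the pretrained model exactly; only the small `down`/`up` matrices receive gradients, which is what makes the modules cheap to train and pluggable at sampling time.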
Dynamic Weighted Video Sampling Strategy (DWVSS)
DWVSS addresses the decline in motion generation observed in fine-tuning processes. Based on the observation that motion details form early in the denoising process, while appearance details are refined later, the strategy temporarily reduces the influence of the SSLM during the early stages. This preservation of motion generation ability is achieved by dynamically adjusting the weight of the LoRA layers between a smaller value (λs) for early denoising steps and a larger value (λl) for later steps. This adaptive approach ensures high-quality subject appearance details while maintaining fluid motion.
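The scheduling idea reduces to a step-dependent LoRA weight. A minimal sketch follows; the default values of λs, λl, and the switch point are illustrative placeholders, not the paper's reported settings, and `set_lora_scale` assumes LoRA modules expose a `scale` attribute.

```python
def dwvss_scale(step: int, total_steps: int,
                lambda_s: float = 0.2, lambda_l: float = 1.0,
                switch_frac: float = 0.4) -> float:
    """Weight for the SSLM's LoRA layers at a given denoising step:
    small (lambda_s) while motion and layout are forming early on,
    large (lambda_l) once appearance details are being refined."""
    return lambda_s if step < switch_frac * total_steps else lambda_l

def set_lora_scale(model, scale: float) -> None:
    """Push the current weight into every LoRA module in the model;
    assumes each such module stores its multiplier in `scale`."""
    for m in model.modules():
        if hasattr(m, "scale"):
            m.scale = scale
```

In a sampling loop one would call `set_lora_scale(unet, dwvss_scale(step, total_steps))` before each denoising step, so motion is laid out by the (mostly) pretrained weights and the subject's appearance is injected afterward.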
Experimental Validation
The paper demonstrates CustomCrafter's efficacy through extensive quantitative and qualitative comparisons against existing methods such as Custom Diffusion and DreamVideo. The evaluation uses metrics including CLIP-T, CLIP-I, DINO-I, and Temporal Consistency, on which CustomCrafter consistently shows superior performance, particularly in subject fidelity and concept combination. User studies further corroborate these findings, highlighting the method's enhanced ability to generate videos closely aligned with the textual prompt and the desired subject appearance.
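The CLIP-based metrics reduce to cosine similarities between embeddings. The sketch below assumes embeddings have already been computed (the real evaluation runs frames, reference images, and prompts through CLIP or DINO encoders); the function names are illustrative.

```python
import torch
import torch.nn.functional as F

def clip_t(frame_embs: torch.Tensor, text_emb: torch.Tensor) -> float:
    """CLIP-T: mean cosine similarity between each generated frame's
    CLIP image embedding (N, d) and the prompt's CLIP text embedding (d,).
    Measures how well the video follows the text."""
    sims = F.cosine_similarity(frame_embs, text_emb.unsqueeze(0), dim=-1)
    return sims.mean().item()

def clip_i(frame_embs: torch.Tensor, ref_embs: torch.Tensor) -> float:
    """CLIP-I (and, with a DINO encoder, DINO-I): mean pairwise cosine
    similarity between generated-frame embeddings (N, d) and reference
    subject-image embeddings (M, d). Measures subject fidelity."""
    a = F.normalize(frame_embs, dim=-1)
    b = F.normalize(ref_embs, dim=-1)
    return (a @ b.T).mean().item()
```

Temporal Consistency is typically computed analogously, as the average cosine similarity between CLIP embeddings of adjacent frames.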
Implications and Future Directions
The introduction of SSLM and DWVSS in CustomCrafter has practical and theoretical implications:
- Practical Implications:
  - User Convenience: By eliminating the need for additional video guidance for each new prompt, CustomCrafter simplifies the user experience without compromising the quality of generated videos.
  - Computation Efficiency: The reduced need for fine-tuning and video retrieval lowers computational overhead, making the approach more efficient and scalable.
- Theoretical Implications:
  - Attention Mechanisms: Extending fine-tuning to include self-attention as well as spatial cross-attention layers highlights the significant role these mechanisms play in high-quality, conceptually coherent video generation.
  - Denoising Process: The proposed sampling strategy enriches the understanding of the denoising process, presenting a novel way to balance motion and appearance recovery.
Conclusion
CustomCrafter substantively improves the state of customized video generation by integrating advanced mechanisms to preserve and enhance motion and concept combination abilities in VDMs. This framework not only enhances video quality and fluency but also simplifies the custom video generation process by reducing reliance on additional data and repetitive fine-tuning. Future research may explore further optimization of the SSLM and DWVSS parameters to adapt to a broader variety of subjects and prompts, potentially extending the framework's applicability to diverse generative tasks in AI.
References
- Tao Wu, Yong Zhang, Xintao Wang, Xianpan Zhou, Guangcong Zheng, Zhongang Qi, Ying Shan, Xi Li. "CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities."
- Related foundational and contemporaneous works as cited in the original paper.