
CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects (2401.09962v2)

Published 18 Jan 2024 in cs.CV

Abstract: Customized text-to-video generation aims to generate high-quality videos guided by text prompts and subject references. Current approaches for personalizing text-to-video generation suffer from tackling multiple subjects, which is a more challenging and practical scenario. In this work, our aim is to promote multi-subject guided text-to-video customization. We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects. To be specific, firstly, we encourage the co-occurrence of multiple subjects via composing them in a single image. Further, upon a basic text-to-video diffusion model, we design a simple yet effective attention control strategy to disentangle different subjects in the latent space of diffusion model. Moreover, to help the model focus on the specific area of the object, we segment the object from given reference images and provide a corresponding object mask for attention learning. Also, we collect a multi-subject text-to-video generation dataset as a comprehensive benchmark, with 63 individual subjects from 13 different categories and 68 meaningful pairs. Extensive qualitative, quantitative, and user study results demonstrate the superiority of our method compared to previous state-of-the-art approaches. The project page is https://kyfafyd.wang/projects/customvideo.

Authors (6)
  1. Zhao Wang (155 papers)
  2. Aoxue Li (22 papers)
  3. Lingting Zhu (20 papers)
  4. Yong Guo (67 papers)
  5. Qi Dou (163 papers)
  6. Zhenguo Li (195 papers)
Citations (21)

Summary

  • The paper introduces CustomVideo, a framework that preserves the identities of multiple subjects in text-to-video synthesis.
  • It employs an attention control strategy with ground-truth object masks and fine-tuning to harmonize subject co-occurrence during training.
  • Empirical results on the CustomStudio dataset show enhanced textual and image alignment with improved temporal consistency over prior methods.

Introduction

Text-to-video generation with multiple subjects is the focal point of this paper. Current text-to-video (T2V) models handle single subjects well, but complications arise when a video must feature several subjects at once: preserving each subject's identity and ensuring that all subjects actually appear together remain open challenges. To address them, the authors introduce CustomVideo, a framework that generates identity-preserving videos of multiple subjects in response to text prompts.

Novel Framework: CustomVideo

CustomVideo is a framework designed for multi-subject text-to-video generation. It advances over prior work by explicitly harmonizing multiple subjects within a single video: the co-occurrence of the subjects is enforced during training, which predisposes the model to preserve every subject's identity at inference time. On top of a base text-to-video diffusion model, CustomVideo adds an attention control strategy that uses ground-truth object masks to focus the model on the target regions, steering the cross-attention maps so that different subjects' identities are disentangled while their distinctive features are still captured.
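To make the attention control concrete, the following is a minimal PyTorch sketch (not the authors' released code) of a mask-guided cross-attention loss: each subject token is encouraged to attend inside its own object mask and penalized for attention that leaks outside it. The tensor shapes, the helper name `attention_mask_loss`, and the exact loss form are illustrative assumptions.

```python
# Minimal sketch of mask-guided cross-attention regularization (illustrative,
# not the paper's implementation). Assumptions: `attn_probs` are cross-attention
# maps from a text-to-video diffusion UNet with shape [batch, heads, h*w, tokens],
# `subject_token_ids` gives the prompt-token index of each subject word, and
# `subject_masks` are binary object masks resized to the latent resolution,
# shape [batch, num_subjects, h, w].
import torch


def attention_mask_loss(attn_probs: torch.Tensor,
                        subject_token_ids: list[int],
                        subject_masks: torch.Tensor) -> torch.Tensor:
    """Encourage each subject token to attend only inside its object mask."""
    b, heads, hw, _ = attn_probs.shape
    _, n_subj, h, w = subject_masks.shape
    assert hw == h * w, "masks must be resized to the attention resolution"

    loss = attn_probs.new_zeros(())
    for s, tok in enumerate(subject_token_ids):
        # Attention received by the s-th subject token at every spatial
        # location, averaged over heads: [batch, h*w].
        attn_s = attn_probs[:, :, :, tok].mean(dim=1)
        mask_s = subject_masks[:, s].reshape(b, hw).float()
        # Reward attention concentrated inside the mask, penalize leakage
        # outside it (a simple positive/negative contrast).
        inside = (attn_s * mask_s).sum(dim=1) / mask_s.sum(dim=1).clamp(min=1.0)
        outside = (attn_s * (1.0 - mask_s)).sum(dim=1) / \
                  (1.0 - mask_s).sum(dim=1).clamp(min=1.0)
        loss = loss + (outside - inside).mean()
    return loss / max(len(subject_token_ids), 1)
```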

Dataset Creation and Methodology

A significant contribution of the paper is a new dataset, CustomStudio, which comprises 63 individual subjects from 13 categories and 68 meaningful subject pairs, going well beyond existing benchmarks. Its broad range of subject categories makes it a comprehensive benchmark for multi-subject text-to-video generation. For training, the authors propose a fine-tuning procedure that teaches co-occurrence by composing multiple subjects into a single image. Precise object masks, produced by a segmentation model or human annotators, then supervise where the attention should be placed for each subject.
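The composition step can be illustrated with a short, hypothetical sketch: two reference subjects are cut out using their object masks and pasted side by side onto one canvas to form a co-occurrence training image. The file names, canvas size, and layout below are placeholders, and the masks are assumed to be precomputed (by a segmentation model or annotators) rather than produced in this snippet.

```python
# Hypothetical sketch of composing multiple reference subjects into a single
# training image using precomputed binary object masks.
from PIL import Image


def compose_subjects(image_paths: list[str],
                     mask_paths: list[str],
                     canvas_size: tuple[int, int] = (512, 512)) -> Image.Image:
    canvas = Image.new("RGB", canvas_size, color=(255, 255, 255))
    slot_w = canvas_size[0] // len(image_paths)
    for i, (img_p, msk_p) in enumerate(zip(image_paths, mask_paths)):
        subject = Image.open(img_p).convert("RGB")
        mask = Image.open(msk_p).convert("L")  # binary object mask
        # Fit each subject into its horizontal slot, preserving aspect ratio.
        subject.thumbnail((slot_w, canvas_size[1]))
        mask = mask.resize(subject.size)
        x = i * slot_w + (slot_w - subject.size[0]) // 2
        y = (canvas_size[1] - subject.size[1]) // 2
        canvas.paste(subject, (x, y), mask)  # mask keeps only the object pixels
    return canvas


# Example usage with placeholder file names:
# composite = compose_subjects(["cat.jpg", "dog.jpg"],
#                              ["cat_mask.png", "dog_mask.png"])
# composite.save("cat_and_dog_composite.jpg")
```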

Results and Comparative Analysis

CustomVideo is evaluated against state-of-the-art methods and outperforms them both qualitatively and quantitatively. The paper reports results on metrics such as Textual Alignment, Image Alignment, and Temporal Consistency, on which CustomVideo achieves the best scores. Qualitative examples show that competing approaches often fail to keep the subjects consistent or to capture accurate visual details, whereas CustomVideo delivers high-quality videos that align closely with both the text prompts and the subject identities.
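As an illustration of how such metrics are typically computed, the sketch below uses a CLIP model from the Hugging Face `transformers` library: Textual Alignment as frame-prompt CLIP similarity, Image Alignment as similarity between generated frames and reference images, and Temporal Consistency as similarity between consecutive frames. This is a common recipe, not necessarily the paper's exact evaluation code.

```python
# Common CLIP-based video metrics (illustrative recipe, assumed setup).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def video_metrics(frames, prompt, reference_images):
    """frames, reference_images: lists of PIL images; prompt: str."""
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    frame_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    ref_inputs = processor(images=reference_images, return_tensors="pt")
    ref_emb = model.get_image_features(pixel_values=ref_inputs["pixel_values"])

    frame_emb = torch.nn.functional.normalize(frame_emb, dim=-1)
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
    ref_emb = torch.nn.functional.normalize(ref_emb, dim=-1)

    textual_alignment = (frame_emb @ text_emb.T).mean().item()
    image_alignment = (frame_emb @ ref_emb.T).mean().item()
    temporal_consistency = (frame_emb[:-1] * frame_emb[1:]).sum(-1).mean().item()
    return textual_alignment, image_alignment, temporal_consistency
```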

Conclusion

In conclusion, CustomVideo is a robust method for multi-subject text-to-video generation that effectively addresses subject identity preservation and coherence in the generated videos. Its methodological innovations, together with the new dataset, pave the way for future research and applications in personalized video generation. The quantitative and qualitative comparisons presented in the paper support its position as a significant step forward in customized video synthesis.