
CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects (2401.09962v2)

Published 18 Jan 2024 in cs.CV

Abstract: Customized text-to-video generation aims to generate high-quality videos guided by text prompts and subject references. Current approaches for personalizing text-to-video generation suffer from tackling multiple subjects, which is a more challenging and practical scenario. In this work, our aim is to promote multi-subject guided text-to-video customization. We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects. To be specific, firstly, we encourage the co-occurrence of multiple subjects via composing them in a single image. Further, upon a basic text-to-video diffusion model, we design a simple yet effective attention control strategy to disentangle different subjects in the latent space of diffusion model. Moreover, to help the model focus on the specific area of the object, we segment the object from given reference images and provide a corresponding object mask for attention learning. Also, we collect a multi-subject text-to-video generation dataset as a comprehensive benchmark, with 63 individual subjects from 13 different categories and 68 meaningful pairs. Extensive qualitative, quantitative, and user study results demonstrate the superiority of our method compared to previous state-of-the-art approaches. The project page is https://kyfafyd.wang/projects/customvideo.

Authors (6)
  1. Zhao Wang (155 papers)
  2. Aoxue Li (22 papers)
  3. Lingting Zhu (20 papers)
  4. Yong Guo (67 papers)
  5. Qi Dou (163 papers)
  6. Zhenguo Li (195 papers)
Citations (21)

Summary

  • The paper introduces CustomVideo, a framework that preserves the identities of multiple subjects in text-to-video synthesis.
  • It employs an attention control strategy with ground-truth object masks and fine-tuning to harmonize subject co-occurrence during training.
  • Empirical results on the CustomStudio dataset show enhanced textual and image alignment with improved temporal consistency over prior methods.

Introduction

Text-to-video generation with multiple subjects is the focal point of this paper. Current text-to-video (T2V) models handle single subjects well, but complications arise when a video must feature several subjects at once: preserving each subject's identity and ensuring that all subjects actually appear together remain open challenges. To address them, the authors introduce CustomVideo, a framework that generates identity-preserving videos of multiple subjects in response to text prompts.

Novel Framework: CustomVideo

CustomVideo is a framework designed for multi-subject text-to-video generation. It advances over prior work by explicitly harmonizing multiple subjects within a single video: the co-occurrence of the subjects is enforced during training, which predisposes the model to preserve every subject's identity at inference time. On top of a base text-to-video diffusion model, CustomVideo adds an attention control strategy that uses ground-truth object masks to focus the model on the target regions, steering the cross-attention maps so that different subjects' identities are disentangled while their distinctive features are still captured.
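To make the attention control concrete, the following is a minimal PyTorch sketch (not the authors' released code) of a mask-guided cross-attention loss: each subject token is encouraged to attend inside its own object mask and penalized for attention that leaks outside it. The tensor shapes, the helper name `attention_mask_loss`, and the exact loss form are illustrative assumptions.

```python
# Minimal sketch of mask-guided cross-attention regularization (illustrative,
# not the paper's implementation). Assumptions: `attn_probs` are cross-attention
# maps from a text-to-video diffusion UNet with shape [batch, heads, h*w, tokens],
# `subject_token_ids` gives the prompt-token index of each subject word, and
# `subject_masks` are binary object masks resized to the latent resolution,
# shape [batch, num_subjects, h, w].
import torch


def attention_mask_loss(attn_probs: torch.Tensor,
                        subject_token_ids: list[int],
                        subject_masks: torch.Tensor) -> torch.Tensor:
    """Encourage each subject token to attend only inside its object mask."""
    b, heads, hw, _ = attn_probs.shape
    _, n_subj, h, w = subject_masks.shape
    assert hw == h * w, "masks must be resized to the attention resolution"

    loss = attn_probs.new_zeros(())
    for s, tok in enumerate(subject_token_ids):
        # Attention received by the s-th subject token at every spatial
        # location, averaged over heads: [batch, h*w].
        attn_s = attn_probs[:, :, :, tok].mean(dim=1)
        mask_s = subject_masks[:, s].reshape(b, hw).float()
        # Reward attention concentrated inside the mask, penalize leakage
        # outside it (a simple positive/negative contrast).
        inside = (attn_s * mask_s).sum(dim=1) / mask_s.sum(dim=1).clamp(min=1.0)
        outside = (attn_s * (1.0 - mask_s)).sum(dim=1) / \
                  (1.0 - mask_s).sum(dim=1).clamp(min=1.0)
        loss = loss + (outside - inside).mean()
    return loss / max(len(subject_token_ids), 1)
```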

Dataset Creation and Methodology

A significant contribution of the paper is a new dataset, CustomStudio, which comprises 63 individual subjects from 13 categories and 68 meaningful subject pairs, going well beyond existing benchmarks. Its broad range of subject categories makes it a comprehensive benchmark for multi-subject text-to-video generation. For training, the authors propose a fine-tuning procedure that teaches co-occurrence by composing multiple subjects into a single image. Precise object masks, produced by a segmentation model or human annotators, then supervise where the attention should be placed for each subject.
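The composition step can be illustrated with a short, hypothetical sketch: two reference subjects are cut out using their object masks and pasted side by side onto one canvas to form a co-occurrence training image. The file names, canvas size, and layout below are placeholders, and the masks are assumed to be precomputed (by a segmentation model or annotators) rather than produced in this snippet.

```python
# Hypothetical sketch of composing multiple reference subjects into a single
# training image using precomputed binary object masks.
from PIL import Image


def compose_subjects(image_paths: list[str],
                     mask_paths: list[str],
                     canvas_size: tuple[int, int] = (512, 512)) -> Image.Image:
    canvas = Image.new("RGB", canvas_size, color=(255, 255, 255))
    slot_w = canvas_size[0] // len(image_paths)
    for i, (img_p, msk_p) in enumerate(zip(image_paths, mask_paths)):
        subject = Image.open(img_p).convert("RGB")
        mask = Image.open(msk_p).convert("L")  # binary object mask
        # Fit each subject into its horizontal slot, preserving aspect ratio.
        subject.thumbnail((slot_w, canvas_size[1]))
        mask = mask.resize(subject.size)
        x = i * slot_w + (slot_w - subject.size[0]) // 2
        y = (canvas_size[1] - subject.size[1]) // 2
        canvas.paste(subject, (x, y), mask)  # mask keeps only the object pixels
    return canvas


# Example usage with placeholder file names:
# composite = compose_subjects(["cat.jpg", "dog.jpg"],
#                              ["cat_mask.png", "dog_mask.png"])
# composite.save("cat_and_dog_composite.jpg")
```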

Results and Comparative Analysis

CustomVideo is evaluated against state-of-the-art methods and outperforms them both qualitatively and quantitatively. The paper reports results on metrics such as Textual Alignment, Image Alignment, and Temporal Consistency, on which CustomVideo achieves the best scores. Qualitative examples show that competing approaches often fail to keep the subjects consistent or to capture accurate visual details, whereas CustomVideo delivers high-quality videos that align closely with both the text prompts and the subject identities.
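As an illustration of how such metrics are typically computed, the sketch below uses a CLIP model from the Hugging Face `transformers` library: Textual Alignment as frame-prompt CLIP similarity, Image Alignment as similarity between generated frames and reference images, and Temporal Consistency as similarity between consecutive frames. This is a common recipe, not necessarily the paper's exact evaluation code.

```python
# Common CLIP-based video metrics (illustrative recipe, assumed setup).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def video_metrics(frames, prompt, reference_images):
    """frames, reference_images: lists of PIL images; prompt: str."""
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    frame_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    ref_inputs = processor(images=reference_images, return_tensors="pt")
    ref_emb = model.get_image_features(pixel_values=ref_inputs["pixel_values"])

    frame_emb = torch.nn.functional.normalize(frame_emb, dim=-1)
    text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
    ref_emb = torch.nn.functional.normalize(ref_emb, dim=-1)

    textual_alignment = (frame_emb @ text_emb.T).mean().item()
    image_alignment = (frame_emb @ ref_emb.T).mean().item()
    temporal_consistency = (frame_emb[:-1] * frame_emb[1:]).sum(-1).mean().item()
    return textual_alignment, image_alignment, temporal_consistency
```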

Conclusion

In conclusion, CustomVideo is a robust method for multi-subject text-to-video generation that effectively addresses subject identity preservation and coherence in the generated videos. Its methodological innovations, together with the new dataset, pave the way for future research and applications in personalized video generation. The quantitative and qualitative comparisons presented in the paper support its position as a significant step forward in customized video synthesis.