CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects (2401.09962v2)
Abstract: Customized text-to-video generation aims to generate high-quality videos guided by text prompts and subject references. Current approaches to personalized text-to-video generation struggle with multiple subjects, which is a more challenging and practical scenario. In this work, we aim to advance multi-subject guided text-to-video customization. We propose CustomVideo, a novel framework that generates identity-preserving videos under the guidance of multiple subjects. Specifically, we first encourage the co-occurrence of multiple subjects by composing them into a single image. Then, on top of a basic text-to-video diffusion model, we design a simple yet effective attention control strategy that disentangles the different subjects in the latent space of the diffusion model. Moreover, to help the model focus on the specific region of each object, we segment the object from the given reference images and provide the corresponding object mask for attention learning. We also collect a multi-subject text-to-video generation dataset as a comprehensive benchmark, with 63 individual subjects from 13 categories and 68 meaningful subject pairs. Extensive qualitative, quantitative, and user-study results demonstrate the superiority of our method over previous state-of-the-art approaches. The project page is https://kyfafyd.wang/projects/customvideo.
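To make the mask-guided attention idea concrete, below is a minimal PyTorch sketch of a cross-attention mask loss for multiple subject tokens. The function name, tensor shapes, and the exact loss form (encouraging attention inside each subject's segmentation mask and penalizing attention outside it) are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def attention_mask_loss(cross_attn, subject_token_ids, object_masks):
    """Illustrative mask-guided attention loss (not the paper's exact loss).

    cross_attn:        (batch, num_text_tokens, h, w) cross-attention maps,
                       assumed already extracted and averaged over heads/layers
                       of the text-to-video diffusion U-Net.
    subject_token_ids: list of text-token indices, one per reference subject.
    object_masks:      (batch, num_subjects, h, w) binary masks from segmenting
                       each subject in its reference image.
    Returns a scalar that rewards attention inside each subject's own mask and
    penalizes attention elsewhere, pushing the subjects apart in latent space.
    """
    loss = 0.0
    for k, tok in enumerate(subject_token_ids):
        attn = cross_attn[:, tok]                                   # (batch, h, w)
        attn = attn / (attn.amax(dim=(-2, -1), keepdim=True) + 1e-6)
        mask = object_masks[:, k]                                   # (batch, h, w)
        # High attention inside the subject's own region ...
        inside = (attn * mask).sum(dim=(-2, -1)) / (mask.sum(dim=(-2, -1)) + 1e-6)
        # ... and low attention everywhere else (including the other subjects).
        outside = (attn * (1.0 - mask)).sum(dim=(-2, -1)) / ((1.0 - mask).sum(dim=(-2, -1)) + 1e-6)
        loss = loss + (1.0 - inside) + outside
    return (loss / len(subject_token_ids)).mean()

# Toy usage with random tensors standing in for real U-Net attention maps.
if __name__ == "__main__":
    attn = torch.rand(2, 77, 16, 16)                  # e.g. CLIP text length 77
    masks = (torch.rand(2, 2, 16, 16) > 0.5).float()  # two subjects
    print(attention_mask_loss(attn, subject_token_ids=[4, 9], object_masks=masks))
```

In this sketch the per-subject mask acts as the supervision signal for the corresponding token's attention map; in practice such a term would be added to the diffusion denoising objective during fine-tuning on the composed reference image.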
Authors: Zhao Wang, Aoxue Li, Lingting Zhu, Yong Guo, Qi Dou, Zhenguo Li