VideoStudio: Generating Consistent-Content and Multi-Scene Videos (2401.01256v2)
Abstract: The recent innovations and breakthroughs in diffusion models have significantly expanded the possibilities of generating high-quality videos for given prompts. Most existing works tackle the single-scene scenario, where only one video event occurs in a single background. Extending to multi-scene video generation is nevertheless not trivial: it requires carefully managing the logic between scenes while preserving the consistent visual appearance of key content across them. In this paper, we propose a novel framework, namely VideoStudio, for consistent-content and multi-scene video generation. Technically, VideoStudio leverages a large language model (LLM) to convert the input prompt into a comprehensive multi-scene script that benefits from the logical knowledge learned by the LLM. The script for each scene includes a prompt describing the event, the foreground/background entities, and the camera movement. VideoStudio identifies the entities common across the script and asks the LLM to detail each of them. The resulting entity descriptions are then fed into a text-to-image model to generate a reference image for each entity. Finally, VideoStudio outputs a multi-scene video by generating each scene via a diffusion process that takes the reference images, the descriptive prompt of the event, and the camera movement into account. The diffusion model incorporates the reference images as conditioning and alignment signals to strengthen the content consistency of multi-scene videos. Extensive experiments demonstrate that VideoStudio outperforms state-of-the-art (SOTA) video generation models in terms of visual quality, content consistency, and user preference. Source code is available at https://github.com/FuchenUSTC/VideoStudio.
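The abstract describes a four-step pipeline (script writing, entity description, reference-image generation, per-scene video diffusion). The sketch below restates that data flow as plain Python to make the dependencies between steps explicit. It is a minimal illustration, not the released implementation: the callables `write_script`, `describe_entity`, `text_to_image`, and `generate_scene` are hypothetical stand-ins for the LLM, text-to-image, and video diffusion components; see the repository linked above for the actual code.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Scene:
    event_prompt: str      # textual description of what happens in this scene
    entities: list[str]    # foreground/background entities appearing in the scene
    camera_movement: str   # e.g. "pan left", "zoom in", "static"


def videostudio(
    user_prompt: str,
    write_script: Callable[[str], list[Scene]],         # LLM: prompt -> multi-scene script
    describe_entity: Callable[[str], str],               # LLM: entity name -> detailed description
    text_to_image: Callable[[str], Any],                 # T2I model: description -> reference image
    generate_scene: Callable[[Scene, list[Any]], Any],   # video diffusion conditioned on references
) -> list[Any]:
    # 1) Expand the single input prompt into a multi-scene script.
    scenes = write_script(user_prompt)

    # 2) Collect entities shared across scenes and ask the LLM to detail each one.
    entity_names = {e for scene in scenes for e in scene.entities}
    descriptions = {e: describe_entity(e) for e in entity_names}

    # 3) Generate one reference image per entity, fixing its appearance everywhere.
    references = {e: text_to_image(desc) for e, desc in descriptions.items()}

    # 4) Generate each scene, conditioning the diffusion process on the event
    #    prompt, the camera movement, and the shared reference images.
    return [
        generate_scene(scene, [references[e] for e in scene.entities])
        for scene in scenes
    ]
```

Because every scene is conditioned on the same per-entity reference images, key content keeps a consistent appearance across scenes even though each scene is generated by a separate diffusion pass.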
Authors: Fuchen Long, Zhaofan Qiu, Ting Yao, Tao Mei