VideoStudio: Generating Consistent-Content and Multi-Scene Videos (2401.01256v2)
Abstract: The recent innovations and breakthroughs in diffusion models have significantly expanded the possibilities of generating high-quality videos for given prompts. Most existing works tackle the single-scene scenario, where only one video event occurs in a single background. Extending to multi-scene video generation is nevertheless not trivial: it requires carefully managing the logic between scenes while preserving the consistent visual appearance of key content across them. In this paper, we propose a novel framework, namely VideoStudio, for consistent-content and multi-scene video generation. Technically, VideoStudio leverages a large language model (LLM) to convert the input prompt into a comprehensive multi-scene script that benefits from the logical knowledge learned by the LLM. The script for each scene includes a prompt describing the event, the foreground/background entities, and the camera movement. VideoStudio identifies the entities common across the script and asks the LLM to detail each of them. The resulting entity descriptions are then fed into a text-to-image model to generate a reference image for each entity. Finally, VideoStudio outputs a multi-scene video by generating each scene via a diffusion process that takes the reference images, the descriptive prompt of the event, and the camera movement into account. The diffusion model incorporates the reference images as conditioning and alignment signals to strengthen the content consistency of multi-scene videos. Extensive experiments demonstrate that VideoStudio outperforms state-of-the-art (SOTA) video generation models in terms of visual quality, content consistency, and user preference. Source code is available at https://github.com/FuchenUSTC/VideoStudio.
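The abstract describes a four-step pipeline (script writing, entity description, reference-image generation, per-scene video diffusion). The sketch below restates that data flow as plain Python to make the dependencies between steps explicit. It is a minimal illustration, not the released implementation: the callables `write_script`, `describe_entity`, `text_to_image`, and `generate_scene` are hypothetical stand-ins for the LLM, text-to-image, and video diffusion components; see the repository linked above for the actual code.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Scene:
    event_prompt: str      # textual description of what happens in this scene
    entities: list[str]    # foreground/background entities appearing in the scene
    camera_movement: str   # e.g. "pan left", "zoom in", "static"


def videostudio(
    user_prompt: str,
    write_script: Callable[[str], list[Scene]],         # LLM: prompt -> multi-scene script
    describe_entity: Callable[[str], str],               # LLM: entity name -> detailed description
    text_to_image: Callable[[str], Any],                 # T2I model: description -> reference image
    generate_scene: Callable[[Scene, list[Any]], Any],   # video diffusion conditioned on references
) -> list[Any]:
    # 1) Expand the single input prompt into a multi-scene script.
    scenes = write_script(user_prompt)

    # 2) Collect entities shared across scenes and ask the LLM to detail each one.
    entity_names = {e for scene in scenes for e in scene.entities}
    descriptions = {e: describe_entity(e) for e in entity_names}

    # 3) Generate one reference image per entity, fixing its appearance everywhere.
    references = {e: text_to_image(desc) for e, desc in descriptions.items()}

    # 4) Generate each scene, conditioning the diffusion process on the event
    #    prompt, the camera movement, and the shared reference images.
    return [
        generate_scene(scene, [references[e] for e in scene.entities])
        for scene in scenes
    ]
```

Because every scene is conditioned on the same per-entity reference images, key content keeps a consistent appearance across scenes even though each scene is generated by a separate diffusion pass.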
Authors: Fuchen Long, Zhaofan Qiu, Ting Yao, Tao Mei