
VideoPoet: A Large Language Model for Zero-Shot Video Generation (2312.14125v4)

Published 21 Dec 2023 in cs.CV and cs.AI

Abstract: We present VideoPoet, an LLM capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs -- including images, videos, text, and audio. The training protocol follows that of LLMs, consisting of two stages: pretraining and task-specific adaptation. During pretraining, VideoPoet incorporates a mixture of multimodal generative objectives within an autoregressive Transformer framework. The pretrained LLM serves as a foundation that can be adapted for a range of video generation tasks. We present empirical results demonstrating the model's state-of-the-art capabilities in zero-shot video generation, specifically highlighting VideoPoet's ability to generate high-fidelity motions. Project page: http://sites.research.google/videopoet/

An Overview of VideoPoet: An LLM for Zero-Shot Video Generation

The paper introduces VideoPoet, an approach that leverages LLMs to accomplish a diverse set of video generation tasks conditioned on multimodal inputs, including image, video, text, and audio. The authors propose a framework that bridges LLMs and video generation, an area predominantly led by diffusion models.

Architectural Innovation

VideoPoet employs a decoder-only transformer architecture, a common choice in LLMs, to process and generate videos. The setup includes three core components (a minimal code sketch of the token pipeline follows the list):

  1. Tokenizers: Modality-specific tokenizers convert inputs into discrete tokens. The MAGVIT-v2 tokenizer is used for images and videos, while the SoundStream tokenizer manages audio inputs. This unified vocabulary allows the model to directly process different modalities.
  2. LLM Backbone: A prefix LLM forms the central component where task-specific prefixes guide the video generation process.
  3. Super-Resolution Module: This component refines the fidelity of generated video outputs, increasing spatial resolution and improving detail coherence.
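
To make the flow concrete, below is a minimal sketch of the unified-token pipeline described above: a task prefix of discrete tokens conditions a decoder-only transformer that autoregressively emits video tokens in the same shared vocabulary, which the video tokenizer would then decode back to pixels. All names, sizes, and the toy vocabulary are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch (not the released implementation) of the unified token interface.
import torch
import torch.nn as nn

VOCAB_SIZE = 4096   # toy size; the real shared text/visual/audio vocabulary is far larger
D_MODEL = 512

class DecoderOnlyLM(nn.Module):
    """Prefix LLM backbone: causal self-attention over the unified token stream."""
    def __init__(self, vocab_size: int, d_model: int, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.head(h)                              # next-token logits over the shared vocabulary

def generate_video_tokens(model, prefix_tokens, n_video_tokens):
    """Autoregressively extend a task prefix (text/image/audio tokens) with video tokens."""
    seq = prefix_tokens
    for _ in range(n_video_tokens):
        logits = model(seq)[:, -1]                       # logits at the last position
        next_tok = torch.multinomial(torch.softmax(logits, -1), num_samples=1)
        seq = torch.cat([seq, next_tok], dim=1)
    return seq[:, prefix_tokens.size(1):]                # newly generated video tokens

# Usage: a text prompt, already tokenized to discrete ids, serves as the prefix;
# in the real system the output ids would be decoded to frames by the MAGVIT-v2 tokenizer.
model = DecoderOnlyLM(VOCAB_SIZE, D_MODEL)
text_prefix = torch.randint(0, VOCAB_SIZE, (1, 16))      # stand-in for a tokenized prompt
video_tokens = generate_video_tokens(model, text_prefix, n_video_tokens=32)
```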

Pretraining and Task Adaptation

The training regime for VideoPoet involves two principal stages: pretraining on large-scale data with a mixture of multimodal generative objectives (for example, text-to-video, image-to-video, and video future prediction), followed by task-specific adaptation. Because every modality is represented in the same discrete token space, the single pretrained model serves as a foundation that can be prompted or fine-tuned for a range of downstream video generation tasks; a sketch of one mixed-objective training step appears below.
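
As a rough illustration of what such a mixed-objective pretraining step might look like, the sketch below samples a task from an assumed mixture, concatenates the task's conditioning prefix with its target tokens, and applies next-token cross-entropy only on the target span. The task names, weights, and data format are placeholders, not the paper's exact recipe.

```python
# Hedged sketch of a mixed-objective pretraining step; names and weights are illustrative.
import random
import torch
import torch.nn.functional as F

TASK_MIXTURE = {                 # assumed example weights, not the paper's mixture
    "text_to_video": 0.4,
    "frame_prediction": 0.3,
    "image_to_video": 0.2,
    "video_to_audio": 0.1,
}

def sample_task():
    tasks, weights = zip(*TASK_MIXTURE.items())
    return random.choices(tasks, weights=weights, k=1)[0]

def pretraining_step(model, example, optimizer):
    """One step: `example` maps task name -> (prefix_ids, target_ids), each of shape (1, L)."""
    prefix, target = example[sample_task()]
    seq = torch.cat([prefix, target], dim=1)
    logits = model(seq[:, :-1])                          # predict every next token
    labels = seq[:, 1:].clone()
    labels[:, : prefix.size(1) - 1] = -100               # no loss on the conditioning prefix
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1), ignore_index=-100)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```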

Experimental Demonstrations

The model showcases its versatility across various tasks such as text-to-video (T2V), image-to-video (I2V), video future prediction, and video stylization. Notably, the model demonstrates:

  • Coherent Long-Video Generation: By iteratively generating new video segments conditioned on its previous outputs, VideoPoet can extend content well beyond short clips (a sketch of this loop follows the list).
  • Zero-Shot Video Editing and Task Chaining: By chaining tasks, feeding the output of one capability (such as stylization or editing) back in as conditioning for another, the model performs combinations it was not explicitly pretrained for.
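
The long-video loop can be pictured as follows. This is a hedged sketch that reuses the hypothetical generate_video_tokens helper from the earlier snippet; the overlap length and helper names are assumptions for illustration rather than details taken from the paper.

```python
# Illustrative iterative-extension loop: each new segment is generated conditioned on
# tokens from the tail of the previous segment, so clips chain into a longer video.
import torch

def generate_long_video(model, text_prefix, n_segments, tokens_per_segment, overlap_tokens=8):
    """Autoregressively extend a video by re-conditioning on the tail of the last segment."""
    all_tokens = []
    context = text_prefix                                # first segment: text-only conditioning
    for _ in range(n_segments):
        segment = generate_video_tokens(model, context, tokens_per_segment)
        all_tokens.append(segment)
        # The next segment sees the text prompt plus the most recent video tokens,
        # which is what lets motion continue across segment boundaries.
        context = torch.cat([text_prefix, segment[:, -overlap_tokens:]], dim=1)
    return torch.cat(all_tokens, dim=1)                  # decode with the video tokenizer afterwards
```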

Comparative Evaluation

Performance evaluations reflect VideoPoet's competitive edge against existing methods on standard benchmarks such as MSR-VTT and UCF-101 without task-specific fine-tuning. Human evaluations further favor it for generating interesting and realistic motion compared with state-of-the-art diffusion models, although it still trails some of them on visual sharpness and text fidelity, which the authors attribute to training data choices.

Implications and Future Directions

VideoPoet positions LLMs as viable alternatives to diffusion models for multimodal video generation. By leveraging a unified token framework, it offers flexibility and scalability in handling diverse video tasks. Notable future directions include enhancing fine-grained detail in video outputs and addressing representational biases observed in the training datasets. Such a model could also be adapted to settings that demand intricate narrative construction and cross-domain video synthesis, widening its application scope.

In conclusion, VideoPoet exemplifies the adaptability of LLMs to video content generation, combining state-of-the-art fidelity with multi-task versatility and establishing a promising foundation for future developments in video-driven applications and AI systems.

References (86)
  1. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325, 2023.
  2. Alternating gradient descent and mixture-of-experts for integrated multimodal perception. arXiv preprint arXiv:2305.06324, 2023.
  3. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  4. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a.
  5. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR, pages 22563–22575, 2023b.
  6. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
  7. Instructpix2pix: Learning to follow image editing instructions. In CVPR, pages 18392–18402, 2023.
  8. Language models are few-shot learners. NeurIPS, 33:1877–1901, 2020.
  9. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018.
  10. Pix2video: Video editing using image diffusion. In CVPR, pages 23206–23217, 2023.
  11. Stablevideo: Text-driven consistency-aware diffusion video editing. In CVPR, pages 23040–23050, 2023.
  12. Maskgit: Masked generative image transformer. In CVPR, pages 11315–11325, 2022.
  13. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
  14. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023a.
  15. Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840, 2023b.
  16. Better may not be fairer: A study on subgroup discrepancy in image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4956–4966, 2023.
  17. PaLM: Scaling language modeling with pathways. arXiv:2204.02311, 2022.
  18. Flashattention: Fast and memory-efficient exact attention with io-awareness. In NeurIPS, 2022.
  19. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021.
  20. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
  21. GLaM: Efficient scaling of language models with mixture-of-experts. In ICML, 2022.
  22. Taming transformers for high-resolution image synthesis. In CVPR, pages 12868–12878, 2020.
  23. Structure and content-guided video synthesis with diffusion models. In CVPR, pages 7346–7356, 2023.
  24. Ccedit: Creative and controllable video editing via diffusion models. arXiv preprint arXiv:2309.16496, 2023.
  25. Preserve your own correlation: A noise prior for video diffusion models. In CVPR, pages 22930–22941, 2023.
  26. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023.
  27. The “something something” video database for learning and evaluating visual common sense. In ICCV, 2017.
  28. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
  29. Maskvit: Masked visual pre-training for video prediction. arXiv preprint arXiv:2206.11894, 2022.
  30. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2023.
  31. Cnn architectures for large-scale audio classification. In ICASSP, 2017.
  32. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  33. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
  34. Video diffusion models. arXiv:2204.03458, 2022b.
  35. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  36. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
  37. Gaia-1: A generative world model for autonomous driving. arXiv preprint arXiv:2309.17080, 2023.
  38. StarCoder: may the source be with you! arXiv:2305.06161, 2023.
  39. Magicedit: High-fidelity and temporally coherent video editing. arXiv preprint arXiv:2308.14749, 2023.
  40. Evalcrafter: Benchmarking and evaluating large video generation models. arXiv preprint arXiv:2310.11440, 2023.
  41. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
  42. Transframer: Arbitrary frame prediction with generative models. arXiv preprint arXiv:2203.09494, 2022.
  43. OpenAI. GPT-4 technical report. arXiv:2303.08774, 2023.
  44. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.
  45. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  46. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  47. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021.
  48. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  49. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE TPAMI, 44(3):1623–1637, 2020.
  50. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
  51. Audiopalm: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925, 2023.
  52. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 35:36479–36494, 2022.
  53. Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal gan. IJCV, 128(10):2586–2606, 2020.
  54. A step toward more inclusive people annotations for fairness. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, pages 916–925, 2021.
  55. Consensus and subjectivity of skin tone annotation for ML fairness. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
  56. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
  57. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  58. Disentangling architecture and training for optical flow. In ECCV, 2022.
  59. Any-to-any generation via composable diffusion. arXiv preprint arXiv:2305.11846, 2023.
  60. Ul2: Unifying language learning paradigms. In ICLR, 2022.
  61. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  62. Maxvit: Multi-axis vision transformer. In ECCV, pages 459–479, 2022.
  63. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
  64. Attention is all you need. NeurIPS, 30, 2017.
  65. Phenaki: Variable length video generation from open domain textual description. arXiv preprint arXiv:2210.02399, 2022.
  66. Mcvd-masked conditional video diffusion for prediction, generation, and interpolation. NeurIPS, 35:23371–23385, 2022.
  67. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023a.
  68. Zero-shot video editing using off-the-shelf image diffusion models. arXiv preprint arXiv:2303.17599, 2023b.
  69. Videofactory: Swap attention in spatiotemporal diffusions for text-to-video generation. arXiv preprint arXiv:2305.10874, 2023c.
  70. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023d.
  71. Godiva: Generating open-domain videos from natural descriptions. arXiv preprint arXiv:2104.14806, 2021.
  72. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016.
  73. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.
  74. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
  75. Magvit: Masked generative video transformer. In CVPR, pages 10459–10469, 2023a.
  76. Spae: Semantic pyramid autoencoder for multimodal generation with frozen llms. arXiv preprint arXiv:2306.17842, 2023b.
  77. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023c.
  78. Video probabilistic diffusion models in projected latent space. In CVPR, pages 18456–18466, 2023d.
  79. Soundstream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507, 2021.
  80. Make pixels dance: High-dynamic video generation. arXiv preprint arXiv:2311.10982, 2023.
  81. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023a.
  82. Adding conditional control to text-to-image diffusion models. In CVPR, pages 3836–3847, 2023b.
  83. Auditing gender presentation differences in text-to-image models. arXiv preprint arXiv:2302.03675, 2023c.
  84. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023.
  85. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022.
  86. RT-2: Vision-language-action models transfer web knowledge to robotic control. In CoRL, 2023.
Authors (31)
  1. Dan Kondratyuk (11 papers)
  2. Lijun Yu (22 papers)
  3. Xiuye Gu (17 papers)
  4. José Lezama (19 papers)
  5. Jonathan Huang (46 papers)
  6. Rachel Hornung (4 papers)
  7. Hartwig Adam (49 papers)
  8. Hassan Akbari (8 papers)
  9. Yair Alon (3 papers)
  10. Vighnesh Birodkar (16 papers)
  11. Yong Cheng (58 papers)
  12. Ming-Chang Chiu (11 papers)
  13. Josh Dillon (3 papers)
  14. Irfan Essa (91 papers)
  15. Agrim Gupta (26 papers)
  16. Meera Hahn (15 papers)
  17. Anja Hauth (6 papers)
  18. David Hendon (2 papers)
  19. Alonso Martinez (2 papers)
  20. David Minnen (19 papers)
Citations (147)