Phenaki: Variable-Length Video Generation from Open Domain Textual Descriptions
The paper introduces "Phenaki," a novel approach for generating realistic, variable-length videos conditioned on sequences of open-domain textual descriptions. It tackles a critical challenge in video synthesis: generating coherent temporal sequences from text, a problem far less explored to date than text-to-image synthesis. The core contribution of Phenaki lies in its ability to handle variable-length videos through a tokenizer that employs causal attention to compress videos into discrete token representations. This capability is particularly noteworthy because it distinguishes video from mere "moving images" and enables real-world creative applications.
Core Methodology
The architecture of Phenaki consists of two primary components: an encoder-decoder video tokenizer and a bidirectional masked transformer, inspired by recent advances in image synthesis such as DALL-E and Parti. The encoder-decoder model, termed C-ViViT, exploits temporal redundancy to encode video sequences into compact discrete token representations. This compression yields approximately 40% fewer tokens than baseline methods without sacrificing spatio-temporal coherence. C-ViViT's causal structure allows it to encode and decode videos of arbitrary length, since each frame's tokens depend only on preceding frames, making the tokenizer auto-regressive in time.
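To make the causal design concrete, the following is a minimal sketch of causal temporal attention, the mechanism that lets a tokenizer condition each frame only on earlier frames. The function name, shapes, and PyTorch implementation are illustrative assumptions, not the authors' code.

```python
import torch

def causal_temporal_attention(q, k, v):
    """Scaled dot-product attention over the time axis with a causal mask:
    position t can attend only to positions <= t. This is what allows the
    encoder to process, and the decoder to reconstruct, videos of arbitrary
    length one frame at a time.

    q, k, v: tensors of shape (batch, time, dim); all names are illustrative.
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (B, T, T)
    t = scores.shape[-1]
    # Boolean mask over the strict upper triangle blocks attention to the future.
    causal_mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal_mask, float("-inf"))
    return scores.softmax(dim=-1) @ v
```

Because nothing in this computation depends on future frames, a video can be extended indefinitely by appending frames and re-running only the new positions.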
The bidirectional masked transformer component, trained with masked visual token modeling, generates video tokens conditioned on pre-computed text tokens. This part of the architecture eschews typical auto-regressive token-by-token generation in favor of parallel prediction, considerably reducing computational overhead and allowing rapid video generation. The iterative prediction of masked tokens aligns with state-of-the-art practice in image generation, leveraging masking and sampling schedules to enhance quality.
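A MaskGIT-style decoding loop of the kind this description suggests can be sketched as follows: all video tokens start masked, every masked position is sampled in parallel at each step, and only the most confident predictions are committed, with the committed fraction growing under a cosine schedule. The `transformer` callable, `mask_id`, and all shapes are hypothetical placeholders, not the paper's API.

```python
import math
import torch

def parallel_masked_decode(transformer, text_tokens, num_tokens, mask_id,
                           steps=12):
    """Iterative parallel decoding (a sketch, assuming a MaskGIT-like schedule).
    `transformer(tokens, text_tokens)` is assumed to return logits of shape
    (1, num_tokens, vocab_size)."""
    tokens = torch.full((1, num_tokens), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = transformer(tokens, text_tokens)
        probs = logits.softmax(dim=-1)[0]                    # (N, vocab)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
        # Positions committed in earlier steps keep their token and are
        # given maximal confidence so they are never re-masked.
        fixed = tokens[0] != mask_id
        sampled = torch.where(fixed, tokens[0], sampled)
        confidence = probs[torch.arange(num_tokens), sampled]
        confidence[fixed] = float("inf")
        # Cosine schedule: fraction of tokens still masked after this step.
        masked_frac = math.cos(math.pi / 2 * (step + 1) / steps)
        num_commit = num_tokens - int(masked_frac * num_tokens)
        keep = confidence.topk(num_commit).indices
        new_tokens = torch.full_like(tokens, mask_id)
        new_tokens[0, keep] = sampled[keep]
        tokens = new_tokens
    return tokens
```

Under this scheme, all video tokens are produced in a small fixed number of forward passes (here a dozen) rather than one pass per token, which is the source of the speedup over auto-regressive generation described above.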
Experimental Results and Implications
Phenaki demonstrates its efficacy through a series of evaluations, including text-to-video generation, story-conditioned video synthesis, and image-conditioned video generation. These evaluations establish its utility across settings: it outperforms previous methods in maintaining temporal coherence, especially in longer videos. Joint training on both image and video datasets allows the model to generalize and compose new concepts not present in existing video datasets. This is crucial given the limited availability of high-quality video-text pairs compared to extensive image collections like LAION-400M.
Quantitatively, the model shows competitive performance on text-to-video tasks, closely matching fine-tuned models on Kinetics-400 despite being evaluated zero-shot. Furthermore, incorporating text-image data significantly enhances video generation capability, particularly in capturing diverse styles and narratives unattainable from video data alone. These results suggest that leveraging large image datasets improves video generation quality and broadens applicability across themes.
Implications and Future Directions
Phenaki represents an important stride towards flexible and scalable text-conditioned video generation. Its potential for creating dynamic visual narratives from textual stories opens new avenues in content creation and digital storytelling, fundamentally changing how visual content can be conceived and produced. The proposed framework paves the way for further research into more efficient and contextually aware video generation systems.
Future developments may delve into enhancing model interpretability to refine control over generated outputs, addressing biases inherited from training datasets, and exploring better fine-tuning strategies to improve generalization to unseen text inputs. Moreover, broadening dataset diversity to incorporate more real-world scenarios could further improve Phenaki's efficacy, extending its utility across applications from creative industries to educational tools.