Phenaki: Variable-Length Video Generation from Open Domain Textual Descriptions
The paper introduces "Phenaki," a novel approach for generating realistic, variable-length videos conditioned on sequences of open-domain textual descriptions. It tackles a critical challenge in video synthesis: generating coherent temporal sequences from text, a problem far less explored to date than text-to-image synthesis. The core contribution of Phenaki lies in its ability to handle variable-length videos through a tokenizer that employs causal attention to compress videos into discrete token representations. This capability is particularly noteworthy because it distinguishes video from mere "moving images" and enables real-world creative applications.
Core Methodology
The architecture of Phenaki consists of two primary components: an encoder-decoder video tokenizer and a bidirectional masked transformer, inspired by recent advances in image synthesis such as DALL-E and Parti. The encoder-decoder model, termed C-ViViT, exploits temporal redundancy to encode video sequences into compact discrete token representations. This compression yields approximately 40% fewer tokens than baseline methods without sacrificing spatio-temporal coherence. C-ViViT's causal structure allows it to encode and decode videos of arbitrary length, since each frame's tokens depend only on preceding frames, making the tokenizer auto-regressive in time.
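To make the causal design concrete, the following is a minimal sketch of causal temporal attention, the mechanism that lets a tokenizer condition each frame only on earlier frames. The function name, shapes, and PyTorch implementation are illustrative assumptions, not the authors' code.

```python
import torch

def causal_temporal_attention(q, k, v):
    """Scaled dot-product attention over the time axis with a causal mask:
    position t can attend only to positions <= t. This is what allows the
    encoder to process, and the decoder to reconstruct, videos of arbitrary
    length one frame at a time.

    q, k, v: tensors of shape (batch, time, dim); all names are illustrative.
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # (B, T, T)
    t = scores.shape[-1]
    # Boolean mask over the strict upper triangle blocks attention to the future.
    causal_mask = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal_mask, float("-inf"))
    return scores.softmax(dim=-1) @ v
```

Because nothing in this computation depends on future frames, a video can be extended indefinitely by appending frames and re-running only the new positions.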
The bidirectional masked transformer component, trained with masked visual token modeling, generates video tokens conditioned on pre-computed text tokens. This part of the architecture eschews typical auto-regressive token-by-token generation in favor of parallel prediction, considerably reducing computational overhead and allowing rapid video generation. The iterative prediction of masked tokens aligns with state-of-the-art practice in image generation, leveraging masking and sampling schedules to enhance quality.
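A MaskGIT-style decoding loop of the kind this description suggests can be sketched as follows: all video tokens start masked, every masked position is sampled in parallel at each step, and only the most confident predictions are committed, with the committed fraction growing under a cosine schedule. The `transformer` callable, `mask_id`, and all shapes are hypothetical placeholders, not the paper's API.

```python
import math
import torch

def parallel_masked_decode(transformer, text_tokens, num_tokens, mask_id,
                           steps=12):
    """Iterative parallel decoding (a sketch, assuming a MaskGIT-like schedule).
    `transformer(tokens, text_tokens)` is assumed to return logits of shape
    (1, num_tokens, vocab_size)."""
    tokens = torch.full((1, num_tokens), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = transformer(tokens, text_tokens)
        probs = logits.softmax(dim=-1)[0]                    # (N, vocab)
        sampled = torch.multinomial(probs, num_samples=1).squeeze(-1)
        # Positions committed in earlier steps keep their token and are
        # given maximal confidence so they are never re-masked.
        fixed = tokens[0] != mask_id
        sampled = torch.where(fixed, tokens[0], sampled)
        confidence = probs[torch.arange(num_tokens), sampled]
        confidence[fixed] = float("inf")
        # Cosine schedule: fraction of tokens still masked after this step.
        masked_frac = math.cos(math.pi / 2 * (step + 1) / steps)
        num_commit = num_tokens - int(masked_frac * num_tokens)
        keep = confidence.topk(num_commit).indices
        new_tokens = torch.full_like(tokens, mask_id)
        new_tokens[0, keep] = sampled[keep]
        tokens = new_tokens
    return tokens
```

Under this scheme, all video tokens are produced in a small fixed number of forward passes (here a dozen) rather than one pass per token, which is the source of the speedup over auto-regressive generation described above.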
Experimental Results and Implications
Phenaki demonstrates its efficacy through a series of evaluations, including text-to-video generation, story-conditioned video synthesis, and image-conditioned video generation. These evaluations establish its utility across settings: it outperforms previous methods in maintaining temporal coherence, especially in longer videos. Joint training on both image and video datasets allows the model to generalize and compose new concepts not present in existing video datasets. This is crucial given the limited availability of high-quality video-text pairs compared to extensive image collections like LAION-400M.
Quantitatively, the model shows competitive performance on text-to-video tasks, closely matching fine-tuned models on Kinetics-400 despite being evaluated zero-shot. Furthermore, incorporating text-image data significantly enhances video generation capability, particularly in capturing diverse styles and narratives unattainable from video data alone. These results suggest that leveraging large image datasets improves video generation quality and broadens applicability across themes.
Implications and Future Directions
Phenaki represents an important stride towards flexible and scalable text-conditioned video generation. Its potential for creating dynamic visual narratives from textual stories opens new avenues in content creation and digital storytelling, fundamentally changing how visual content can be conceived and produced. The proposed framework paves the way for further research into more efficient and contextually aware video generation systems.
Future developments may delve into enhancing model interpretability to refine control over generated outputs, addressing biases inherited from training datasets, and exploring better fine-tuning strategies to improve generalization to unseen text inputs. Moreover, broadening dataset diversity to incorporate more real-world scenarios could further improve Phenaki's efficacy, extending its utility across applications from creative industries to educational tools.