Overview of "VideoGPT: Video Generation using VQ-VAE and Transformers"
The paper "VideoGPT: Video Generation using VQ-VAE and Transformers" introduces a novel approach to video generation by leveraging the Vector Quantized Variational Autoencoder (VQ-VAE) and Transformers. The authors present a streamlined architecture that focuses on likelihood-based generative modeling, adapting these well-known methods to the more complex domain of videos.
Core Contributions
VideoGPT employs a two-phase approach (a minimal code sketch follows the list):
- Learning Latent Representations: The first phase trains a VQ-VAE that compresses raw video into a grid of discrete latent codes, using 3D convolutions and axial self-attention. The encoder downsamples both the spatial and temporal dimensions, which makes the subsequent autoregressive modeling tractable.
- Autoregressive Modeling: In the second phase, a GPT-like autoregressive transformer is trained to model the distribution of these latent codes, using spatio-temporal position encodings. New videos are generated by sampling codes from this prior and decoding them back to pixels with the VQ-VAE decoder.
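As an illustration of the pipeline, here is a minimal PyTorch-style sketch of the two phases. This is not the authors' implementation: the codebook size, channel widths, downsampling factors, and the use of a stock nn.TransformerEncoder with a causal mask in place of the paper's GPT-style prior (as well as the omission of axial attention and the decoder) are all simplifying assumptions.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour vector quantization with a straight-through gradient estimator."""
    def __init__(self, num_codes=1024, code_dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):                                   # z: (B, D, T, H, W)
        b, d, t, h, w = z.shape
        flat = z.permute(0, 2, 3, 4, 1).reshape(-1, d)      # (B*T*H*W, D)
        codes = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        z_q = self.codebook(codes).view(b, t, h, w, d).permute(0, 4, 1, 2, 3)
        z_q = z + (z_q - z).detach()                        # straight-through estimator
        return z_q, codes.view(b, t, h, w)

class Encoder3D(nn.Module):
    """Strided 3D convolutions downsampling time and space (factors are illustrative)."""
    def __init__(self, code_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 128, kernel_size=4, stride=2, padding=1),         # T,H,W -> T/2,H/2,W/2
            nn.ReLU(),
            nn.Conv3d(128, code_dim, kernel_size=4, stride=2, padding=1),  # -> T/4,H/4,W/4
        )

    def forward(self, video):                               # video: (B, 3, T, H, W)
        return self.net(video)

class LatentPrior(nn.Module):
    """Phase 2: an autoregressive transformer over the flattened grid of code indices."""
    def __init__(self, num_codes=1024, seq_len=4 * 16 * 16, dim=512, depth=8):
        super().__init__()
        self.tok = nn.Embedding(num_codes, dim)
        self.pos = nn.Embedding(seq_len, dim)                # learned spatio-temporal positions
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_codes)

    def forward(self, codes):                                # codes: (B, L) of latent ids
        L = codes.size(1)
        x = self.tok(codes) + self.pos(torch.arange(L, device=codes.device))
        causal = torch.triu(torch.full((L, L), float('-inf'), device=codes.device), diagonal=1)
        return self.head(self.blocks(x, mask=causal))        # next-code logits
```

With the illustrative 4x downsampling per axis above, a 16x64x64 clip becomes a 4x16x16 grid, i.e. 1,024 discrete tokens for the prior to model instead of 16x64x64x3 ≈ 200,000 pixel values, which is what makes autoregressive training affordable.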
Experimental Results
The architecture achieves video-generation quality competitive with state-of-the-art GAN-based methods. Key quantitative evaluations include:
- BAIR Robot Pushing Dataset: VideoGPT achieves a Fréchet Video Distance (FVD) of roughly 103, indicating that it generates realistic clips on this benchmark (a sketch of the Fréchet-distance computation behind this metric follows the list).
- Complex Video Datasets: VideoGPT produces high-quality samples on more diverse, natural-video datasets such as UCF-101 and TGIF, showcasing robustness in harder scenarios.
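FVD compares the statistics of features extracted from real and generated clips by a pretrained video classifier (I3D); once the per-set feature means and covariances are available, the score is the Fréchet distance between two Gaussians. The snippet below computes that distance from precomputed statistics; the feature extraction itself, and whatever preprocessing the official evaluation pipeline applies, is assumed to be handled elsewhere.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Frechet distance between Gaussians N(mu_r, sigma_r) and N(mu_g, sigma_g).

    mu_*: (D,) feature means; sigma_*: (D, D) feature covariances, e.g. computed
    from I3D activations over sets of real and generated videos.
    """
    diff = mu_r - mu_g
    covmean = sqrtm(sigma_r @ sigma_g)        # matrix square root of the covariance product
    if np.iscomplexobj(covmean):              # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```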
Additionally, VideoGPT adapts readily to action-conditional video generation, and it is computationally efficient because the autoregressive prior operates over a heavily downsampled latent grid rather than over raw pixels.
Architectural Insights and Ablations
Several ablation studies emphasize the impact of various design choices:
- Axial Attention Blocks: Adding axial attention to the VQ-VAE markedly improves reconstruction quality (a generic sketch of axial attention follows this list).
- Prior Network Capacity: Larger transformer models with more layers yield better performance metrics, underscoring the importance of model size.
- Latent Space Design: Balanced temporal-spatial downsampling in the latent space significantly influences generation quality.
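Axial attention replaces full self-attention over a T×H×W volume with attention along one axis at a time (time, then height, then width), which shortens the sequence each attention call has to process. The block below is a generic sketch built on nn.MultiheadAttention, not the paper's exact module; the channels-last layout and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class AxialAttention(nn.Module):
    """Self-attention along a single axis of a channels-last (B, T, H, W, C) tensor."""
    def __init__(self, dim, heads=4, axis=1):                # axis: 1=time, 2=height, 3=width
        super().__init__()
        self.axis = axis
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                     # x: (B, T, H, W, C)
        # Move the attended axis next to the channels; fold every other axis into the batch.
        x = x.movedim(self.axis, -2)                          # (..., L, C), L = size of attended axis
        lead, L, C = x.shape[:-2], x.shape[-2], x.shape[-1]
        seq = x.reshape(-1, L, C)                             # (B', L, C)
        out, _ = self.attn(seq, seq, seq)                     # attend only within that axis
        return out.reshape(*lead, L, C).movedim(-2, self.axis)

# One block per axis approximates full spatio-temporal attention at a fraction of the cost.
axial_stack = nn.Sequential(
    AxialAttention(dim=256, axis=1),                          # along time
    AxialAttention(dim=256, axis=2),                          # along height
    AxialAttention(dim=256, axis=3),                          # along width
)

feats = torch.randn(2, 4, 16, 16, 256)                        # e.g. a 4x16x16 latent grid
out = axial_stack(feats)                                      # same shape, mixed across all axes
```

For a 4×16×16 grid, each attention call then operates over at most 16 positions instead of 1,024, while stacking the three axes still lets information propagate across the whole volume.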
The research identifies a trade-off between latent-space size and transformer capacity: more aggressive downsampling eases the prior's modeling burden but discards visual detail, while larger latent grids preserve detail at a higher computational cost. The best results come from balancing the two within a fixed compute budget.
Implications and Future Directions
The implications of VideoGPT are manifold. Practically, it provides a reproducible framework for video generation tasks, offering a pathway to more scalable models that can efficiently manage high-dimensional video data. Theoretically, it enriches the discourse on autoregressive modeling in latent spaces, potentially influencing how future models approach high-dimensional generative tasks.
Speculation on future developments includes:
- Scaling to Higher Resolutions: Extending the approach to even higher resolutions and longer sequences could further enhance the utility in diverse applications such as video editing and content creation.
- Integration with Larger Datasets: Expanding the dataset scope could address overfitting challenges, as observed with UCF-101.
- Hybrid Architectures: Combining likelihood-based models with adversarial models might capture the best of both, leading to superior video quality and diversity.
Overall, VideoGPT stands as a significant contribution to the field, suggesting a promising trajectory for subsequent research in neural video generation.