Adversarial Video Generation on Complex Datasets: A Technical Overview
The paper presents significant advances in generative video modeling through the introduction of the Dual Video Discriminator GAN (DVD-GAN). Building on the foundations laid by generative models for static images, the work extends to video generation, where temporal dynamics and higher data dimensionality add substantial complexity. The proposed DVD-GAN architecture adapts the principles of Generative Adversarial Networks (GANs) to manage these complexities, demonstrating superior performance in generating high-fidelity video samples from complex datasets.
Key Contributions
The primary contributions of the paper can be summarized as follows:
- Introduction of DVD-GAN: The paper introduces a novel generative model, the DVD-GAN, which scales efficiently to longer and higher-resolution videos. It achieves this through a computationally efficient decomposition of the discriminator, allowing it to handle the extensive data inherent in high-resolution video.
- State-of-the-Art Performance: The model sets new benchmarks in video synthesis and prediction, achieving a new state of the art in Fréchet Inception Distance (FID) for prediction on the Kinetics-600 dataset and in Inception Score (IS) for synthesis on the UCF-101 dataset.
- Benchmark Establishment: The research establishes class-conditional video synthesis on the Kinetics-600 dataset as a new standard for evaluation in generative video modeling, providing a robust baseline with DVD-GAN results.
Technical Details
Video Synthesis and Prediction
The tasks of video synthesis and future video prediction are rigorously explored. Video synthesis involves generating a video clip from a given class label, while prediction requires the generation of future frames based on an initial set of frames. DVD-GAN addresses the challenges in these tasks by ensuring realistic temporal coherence and maintaining high visual quality across individual frames.
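The two tasks differ only in what the generator is conditioned on. The shape-level sketch below makes that interface explicit; all names and the specific sizes (clip length, resolution, latent dimension, number of conditioning frames) are illustrative choices of ours, not values taken from the paper.

```python
import numpy as np

# Illustrative clip dimensions: length, height, width, channels.
T, H, W, C = 16, 64, 64, 3

def synthesis_inputs(batch, num_classes=600, latent_dim=120, seed=0):
    """Class-conditional synthesis: a latent vector plus a class label."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((batch, latent_dim))
    y = rng.integers(0, num_classes, size=batch)
    return z, y  # the generator maps (z, y) to a (batch, T, H, W, C) clip

def prediction_inputs(batch, num_cond_frames=8, seed=0):
    """Future prediction: additionally condition on a clip's first frames."""
    z, y = synthesis_inputs(batch, seed=seed)
    rng = np.random.default_rng(seed + 1)
    cond = rng.standard_normal((batch, num_cond_frames, H, W, C))
    return z, y, cond  # the generator must continue `cond` plausibly
```

In synthesis the model invents the entire clip from the label alone; in prediction the conditioning frames pin down appearance and initial motion, so only their continuation is generated.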
Dual Discriminators
A pivotal innovation in DVD-GAN is the use of dual discriminators: a Spatial Discriminator (D_S) and a Temporal Discriminator (D_T). The spatial component critiques the content of individual, randomly sampled frames at full resolution, while the temporal component judges motion dynamics across the whole clip at reduced spatial resolution. This decomposition reduces computational load while preserving the ability to detect both spatial and temporal inconsistencies.
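The compute saving from this decomposition can be illustrated with simple arithmetic. The sketch below uses settings reported in the paper (D_S scores k = 8 randomly sampled full-resolution frames; D_T sees the whole clip spatially downsampled by a factor of 2), but the function name and the pixel-count proxy for discriminator cost are our own simplifications.

```python
def pixels_processed(T, H, W, k=8, downsample=2):
    """Compare pixels seen by one full-resolution video discriminator
    versus the dual spatial/temporal decomposition."""
    full = T * H * W                    # monolithic discriminator: every frame, full size
    spatial = k * H * W                 # D_S: k random frames at full resolution
    temporal = T * (H // downsample) * (W // downsample)  # D_T: whole clip, downsampled
    return full, spatial + temporal

# A 48-frame 256x256 clip: the dual scheme touches well under half the pixels.
full, dual = pixels_processed(T=48, H=256, W=256)
print(full, dual, dual / full)
```

The saving grows with clip length, since D_S's cost is independent of T while D_T's scales with T only at quarter spatial area.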
Dataset and Evaluation
The research employs extensive datasets, primarily Kinetics-600 and UCF-101, for training and evaluation. Kinetics-600, known for its diverse and unconstrained nature, presents a challenging benchmark for generative modeling. The evaluation metrics include Inception Score and Fréchet Inception Distance, computed from the features of an Inflated 3D ConvNet (I3D) trained for video classification.
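Once real and generated clips have been mapped to feature vectors (in the paper, by an I3D network), FID reduces to the Fréchet distance between two Gaussians fitted to those features. A minimal sketch of that final step, with a hypothetical helper name; the eigenvalue route to the matrix square root is valid because the product of two positive semi-definite covariances has real, non-negative eigenvalues:

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2))."""
    diff = mu1 - mu2
    # Eigenvalues of sigma1 @ sigma2 are real and >= 0 for PSD inputs;
    # clip tiny negative numerical noise before taking square roots.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    return diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt
```

Identical feature distributions give a distance of zero; shifting one mean by a unit vector under identity covariances gives exactly one, which makes the helper easy to sanity-check.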
Experimental Results
DVD-GAN achieves strong results across various configurations of video resolution and length, generating temporally coherent videos when assessed against state-of-the-art models on challenging datasets. On Kinetics-600, its performance is underscored by its capacity to handle a large, diverse dataset without the overfitting that typically plagues models trained on smaller datasets.
Implications and Future Directions
The implications of this work are significant for the field of AI-driven video content creation. Practically, the ability to generate high-fidelity, class-conditional videos has potential applications in media production, simulation, and training data generation. Theoretically, the introduction of dual discriminators and the successful training of DVD-GAN on the Kinetics-600 dataset provide fertile ground for future research. This methodology can inspire further work into scalable model architectures and efficient training strategies for generative models dealing with complex temporal data.
Conclusion
The paper advances the understanding and capability of GANs in the video domain, demonstrating that generating high-resolution, temporally coherent video sequences is feasible with scalable architectures like DVD-GAN. Future work is encouraged on improving model efficiency and extending such models to even broader and more diverse video datasets.