Overview of Dynamics-aware Implicit Generative Adversarial Networks for Video Generation
The paper proposes Dynamics-aware Implicit Generative Adversarial Networks (DIGAN), a framework for video generation with deep learning models. Video generation requires synthesizing continuous spatio-temporal signals, which is challenging because coherence must be maintained in both the spatial and temporal domains. Traditional approaches often model a video as a discrete grid of RGB values, which hinders scalability and the effective representation of continuous dynamics. DIGAN instead builds on implicit neural representations (INRs), which encode a video as a compact, continuous function parameterized by a neural network.
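To make the INR idea concrete, the sketch below shows a minimal coordinate-based network that maps a space-time coordinate (x, y, t) to an RGB value; a video is then simply this function evaluated on a coordinate grid. This is an illustrative toy, not the paper's architecture, and every name and hyperparameter in it is an assumption.

```python
import torch
import torch.nn as nn

class VideoINR(nn.Module):
    """Toy implicit neural representation: (x, y, t) -> (r, g, b).

    A video is represented by the weights of this network; frames of any
    resolution or time step are obtained by evaluating it on a coordinate grid.
    """
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, coords):          # coords: (N, 3) rows of (x, y, t) in [0, 1]
        return self.net(coords)

def render_frame(inr, t, height=64, width=64):
    """Evaluate the INR on a dense (x, y) grid at a single time t."""
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, height), torch.linspace(0, 1, width), indexing="ij"
    )
    ts = torch.full_like(xs, t)
    coords = torch.stack([xs, ys, ts], dim=-1).reshape(-1, 3)
    return inr(coords).reshape(height, width, 3)

frame = render_frame(VideoINR(), t=0.5)   # one 64x64 frame at time 0.5
print(frame.shape)                        # torch.Size([64, 64, 3])
```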
Key Contributions
- INR-based Video Generator:
  - The DIGAN generator is an INR that treats spatial and temporal coordinates differently, which is crucial for capturing motion dynamics while keeping frames coherent.
  - The generator decomposes the latent space into content (image) features and motion (dynamics) features; motion features are computed from both the content vector and an independent motion vector.
  - Temporal dynamics are modeled with smaller temporal frequencies than spatial frequencies, and an additional non-linear mapping is applied to the motion features to increase their expressiveness (see the sketch after this list).
- Motion Discriminator:
  - A motion discriminator built from 2D convolutional networks replaces the more computationally intensive 3D networks used by prior video GANs. This is possible because the INR can synthesize temporally correlated frames at arbitrary time points, so motion can be judged from a pair of frames sampled at arbitrary times (together with their time gap) rather than from a full clip (also illustrated in the sketch after this list).
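The sketch below shows one plausible way to wire up the two components above. It is a simplified stand-in rather than the official DIGAN implementation: class names, layer sizes, and the exact coordinate embeddings are assumptions. It keeps the key ideas from the list: content and motion latents, an extra non-linear mapping on the motion path, a smaller frequency scale for time than for space, and a 2D-convolutional motion critic that scores a pair of frames (with their time gap broadcast as an extra input channel, one plausible way to condition on it).

```python
import torch
import torch.nn as nn

class DIGANStyleGenerator(nn.Module):
    """Sketch of a content/motion-decomposed video INR generator."""
    def __init__(self, z_dim=128, hidden=256, spatial_freq=10.0, temporal_freq=1.0):
        super().__init__()
        # The temporal frequency is deliberately smaller than the spatial one,
        # reflecting that videos vary more slowly over time than over space.
        self.spatial_freq = spatial_freq
        self.temporal_freq = temporal_freq
        self.motion_mlp = nn.Sequential(      # extra non-linear mapping on motion
            nn.Linear(2 * z_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, z_dim),
        )
        self.inr = nn.Sequential(
            nn.Linear(3 + 2 * z_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, z_content, z_motion, coords):
        # coords: (N, 3) rows of (x, y, t) in [0, 1].
        # Motion features depend on the content vector and an independent motion vector.
        m = self.motion_mlp(torch.cat([z_content, z_motion], dim=-1))
        xy = torch.sin(self.spatial_freq * coords[:, :2])    # spatial embedding
        t = torch.sin(self.temporal_freq * coords[:, 2:3])   # temporal embedding
        latents = torch.cat([z_content, m], dim=-1).expand(coords.shape[0], -1)
        return self.inr(torch.cat([xy, t, latents], dim=-1))

class MotionDiscriminator(nn.Module):
    """2D-CNN motion critic: scores a pair of frames plus their time gap,
    avoiding 3D convolutions entirely."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(7, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 1),
        )

    def forward(self, frame_a, frame_b, dt):     # frames: (B, 3, H, W), dt: (B,)
        gap = dt.view(-1, 1, 1, 1).expand(-1, 1, *frame_a.shape[2:])
        return self.net(torch.cat([frame_a, frame_b, gap], dim=1))

# Illustrative usage.
G, D = DIGANStyleGenerator(), MotionDiscriminator()
coords = torch.rand(64 * 64, 3)                       # random (x, y, t) queries
rgb = G(torch.randn(128), torch.randn(128), coords)   # (4096, 3) RGB values
score = D(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64),
          torch.tensor([0.1, 0.3]))                   # (2, 1) realism scores
```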
Experimental Results and Metrics
DIGAN is evaluated on multiple datasets, including UCF-101, Tai-Chi-HD, Sky Time-lapse, and a subset of Kinetics-600. It improves substantially on existing video generation benchmarks, most notably in Fréchet Video Distance (FVD): on UCF-101, DIGAN reduces FVD by 30.7% relative to the previous state of the art. It also trains efficiently on videos of up to 128 frames, far longer than previous models could handle.
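FVD, the metric quoted above, is the Fréchet distance between Gaussians fitted to features of real and generated videos, typically extracted with a pretrained I3D video classifier. The snippet below shows only the Fréchet-distance step, assuming the feature matrices have already been computed; the function and variable names are illustrative, not taken from the paper or any particular FVD implementation.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """Fréchet distance between two Gaussians fitted to feature sets.

    real_feats, fake_feats: (num_videos, feature_dim) arrays, e.g. I3D features.
    """
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):      # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```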
Intriguing Properties and Capabilities
DIGAN exhibits several advantageous properties, most of which follow directly from its coordinate-based design (see the querying sketch after this list):
- Long Video Synthesis: Capable of generating high-resolution videos of extended length without excessive computational demands.
- Time Interpolation and Extrapolation: Enables smooth transitions and expansions beyond the original frame sequence.
- Non-autoregressive Generation: Facilitates faster inference and allows generation steps to occur non-sequentially, benefiting applications requiring speed and flexibility.
- Diverse Motion Sampling: Provides potential for generating different video dynamics from a shared initial frame.
- Spatial Interpolation and Extrapolation: Supports video upsampling and zooming out, maintaining temporal coherence.
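Most of these properties come down to how the trained generator is queried at inference time. The sketch below reuses the illustrative DIGANStyleGenerator from the earlier sketch (so all names and shapes remain assumptions) to show non-autoregressive frame generation at arbitrary times, denser spatial grids for upsampling, widened coordinate ranges for zooming out, and diverse motion from a shared content latent.

```python
import torch

def coord_grid(height, width, t, x_range=(0.0, 1.0), y_range=(0.0, 1.0)):
    """Dense (x, y, t) grid; widen the ranges past [0, 1] to extrapolate in space."""
    ys = torch.linspace(*y_range, height)
    xs = torch.linspace(*x_range, width)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    tt = torch.full_like(xx, t)
    return torch.stack([xx, yy, tt], dim=-1).reshape(-1, 3)

G = DIGANStyleGenerator()                      # illustrative generator defined earlier
z_c, z_m = torch.randn(128), torch.randn(128)

# Non-autoregressive generation: every frame is queried independently and in any
# order; times need not be integers (interpolation) or lie inside the training
# range (extrapolation / long video synthesis).
times = [0.25, 0.5, 0.75, 1.5]
frames = [G(z_c, z_m, coord_grid(64, 64, t)).reshape(64, 64, 3) for t in times]

# Spatial upsampling: the same video function evaluated on a denser grid.
hi_res = G(z_c, z_m, coord_grid(256, 256, 0.5)).reshape(256, 256, 3)

# Diverse motion from a shared content latent: resample only the motion vector.
other_motion = G(z_c, torch.randn(128), coord_grid(64, 64, 0.5)).reshape(64, 64, 3)
```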
Implications and Future Research
The introduction of DIGAN and its success in modeling continuous video signals indicate significant potential for both theoretical advances and practical applications in AI-driven video synthesis. Its efficient handling of the spatio-temporal dimensions could inspire new research directions in implicit neural representations and generative models, including applications in simulation, gaming, and visual content creation. Moreover, the computational efficiency and robustness of DIGAN suggest that INR-based architectures may offer valuable alternatives for large-scale generative tasks across various domains.
DIGAN sets a solid foundation for future research into leveraging INR-driven approaches for complex generative tasks, aiming for more realistic and diverse video outputs while keeping computational costs manageable.