
A Good Image Generator Is What You Need for High-Resolution Video Synthesis (2104.15069v1)

Published 30 Apr 2021 in cs.CV

Abstract: Image and video synthesis are closely related areas aiming at generating content from noise. While rapid progress has been demonstrated in improving image-based models to handle large resolutions, high-quality renderings, and wide variations in image content, achieving comparable video generation results remains problematic. We present a framework that leverages contemporary image generators to render high-resolution videos. We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator. Not only does such a framework render high-resolution videos, but it also is an order of magnitude more computationally efficient. We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled. With such a representation, our framework allows for a broad range of applications, including content and motion manipulation. Furthermore, we introduce a new task, which we call cross-domain video synthesis, in which the image and motion generators are trained on disjoint datasets belonging to different domains. This allows for generating moving objects for which the desired video data is not available. Extensive experiments on various datasets demonstrate the advantages of our methods over existing video generation techniques. Code will be released at https://github.com/snap-research/MoCoGAN-HD.

High-Resolution Video Synthesis Leveraging Image Generators

The paper "A Good Image Generator Is What You Need for High-Resolution Video Synthesis" addresses the persistent challenge of achieving high-quality video synthesis, leveraging advancements in image generation to enhance video synthesis capabilities. This paper presents a paradigm shift by proposing a novel framework that utilizes pre-trained image generators to synthesize videos in high resolution, formally redefining the problem of video synthesis as trajectory discovery in the latent space of a static image generator.

Key Contributions

The central contribution of the paper is the development of a video synthesis framework that leverages contemporary image generators, such as StyleGAN2 and BigGAN, for creating high-resolution videos. The framework is supported by the following innovations:

  1. Separation of Content and Motion: The video synthesis problem is deconstructed into two subproblems—content and motion—where the content is rendered through a pre-trained image generator, effectively leveraging high-quality static imagery, while dynamics are captured by introducing a novel motion generator.
  2. Motion Generator with Residual Learning: A motion generator, implemented as a pair of recurrent neural networks (RNNs), models motion as residual displacements along a path in the latent space. This disentangled motion representation lets the framework maintain temporal coherence across video frames (see the sketch following this list).
  3. Cross-Domain Video Synthesis: The framework introduces cross-domain video synthesis, where the image and motion generators are trained on distinct datasets from different domains. This capability allows for generating sequences with content for which direct video data might not be readily available.
  4. Practical and Computational Efficiency: By building on pre-trained image models and searching for latent trajectories in existing latent spaces, the proposed method significantly reduces computational demands when compared to traditional video synthesis techniques, offering an order of magnitude improvement in efficiency.
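
To make item 2 concrete, the sketch below shows one way a recurrent motion generator can emit residual displacements that are added to a fixed content code to form the latent trajectory. The class and parameter names (MotionGenerator, noise_dim, hidden_dim) and the cumulative-residual formulation are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MotionGenerator(nn.Module):
    """Illustrative residual motion model (not the authors' implementation).

    An LSTM consumes per-step noise and emits residual displacements in the
    image generator's latent space; the content code stays fixed while the
    accumulated residuals supply the motion.
    """

    def __init__(self, latent_dim=512, noise_dim=128, hidden_dim=512):
        super().__init__()
        self.rnn = nn.LSTM(noise_dim, hidden_dim, batch_first=True)
        self.to_residual = nn.Linear(hidden_dim, latent_dim)

    def forward(self, content_code, num_frames):
        # content_code: (B, latent_dim), sampled from the image generator's prior
        batch = content_code.size(0)
        noise = torch.randn(batch, num_frames, self.rnn.input_size,
                            device=content_code.device)
        hidden, _ = self.rnn(noise)              # (B, T, hidden_dim)
        residuals = self.to_residual(hidden)     # (B, T, latent_dim)
        # Trajectory = fixed content code plus cumulative motion residuals.
        trajectory = content_code.unsqueeze(1) + residuals.cumsum(dim=1)
        return trajectory                        # (B, T, latent_dim)
```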

Experimental Evaluation

The paper presents extensive experiments on datasets such as UCF-101, FaceForensics, and Sky Time-lapse to empirically validate the framework's efficacy against existing state-of-the-art methods. The results show that:

  • On the UCF-101 dataset, the proposed method sets a new standard of performance, achieving an Inception Score (IS) of 33.95, markedly outperforming previously reported results.
  • The approach also delivers improved synthesis quality on the FaceForensics dataset, maintaining identity coherence across frames, as evidenced by a lower Average Content Distance (ACD) than competing methods (a sketch of this metric follows the list).
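
As a rough illustration of the identity-coherence metric mentioned above, the sketch below computes an ACD-style score as the average pairwise distance between per-frame identity embeddings. Both the pairwise formulation and the choice of embedding model are assumptions made here for illustration; the paper's exact evaluation protocol may differ.

```python
import torch

def average_content_distance(frame_embeddings):
    """ACD-style identity-coherence score (illustrative formulation).

    frame_embeddings: (T, D) tensor of per-frame identity features, e.g. from
    a face recognition network (the embedding model is assumed, not specified
    here). Lower values indicate more consistent identity across frames.
    """
    # Average pairwise L2 distance between all frame embeddings,
    # excluding the zero diagonal.
    dists = torch.cdist(frame_embeddings, frame_embeddings)  # (T, T)
    T = frame_embeddings.size(0)
    return dists.sum() / (T * (T - 1))
```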

Implications and Future Directions

The implications of employing pre-trained image generators for video synthesis are multifaceted:

  1. Scalability: By defining the video synthesis problem as one of latent trajectory discovery in a fixed image latent space, the methodology is highly adaptable, providing opportunities to explore video synthesis in higher resolutions and more diverse domains.
  2. Cross-Domain Applications: Cross-domain synthesis not only demonstrates the robustness of disentangled representation learning but also opens avenues for creating educational content, animations, and enhanced simulations where direct video datasets may be nonexistent or limited.
  3. AI Content Creation: This work signifies a step towards more resource-efficient AI systems capable of generating synthetic video content, which can be applied in film, gaming, and virtual reality.

Conclusion

By reframing video synthesis through the lens of image generation and latent space traversal, the research illuminates a path forward for creating high-fidelity video outputs. The blend of quality, computational efficiency, and cross-domain capabilities sets a promising trajectory for future explorations in autonomous content generation and AI-driven media applications. As the authors continue to refine their methodologies, subsequent research could further enhance motion modeling, incorporate more fine-grained temporal dynamics, and address the inherent limitations in current latent space representations for holistic video synthesis.

Authors (7)
  1. Yu Tian (249 papers)
  2. Jian Ren (97 papers)
  3. Menglei Chai (37 papers)
  4. Kyle Olszewski (17 papers)
  5. Xi Peng (115 papers)
  6. Dimitris N. Metaxas (84 papers)
  7. Sergey Tulyakov (108 papers)
Citations (174)