StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation
The paper "StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation" presents a novel approach to generating high-quality, long-duration videos by leveraging a pretrained StyleGAN image generator. This paper addresses the inherent challenges of unconditional video generation, which include synthesizing coherent motion and handling significant memory demands associated with high-resolution video creation over extended timespans.
Key Contributions
Non-autoregressive Motion Generation
The paper introduces a non-autoregressive motion generator built around a learning-based GAN inversion network. Unlike autoregressive techniques, in which each future frame depends directly on previously generated frames, this approach synthesizes every frame independently given the initial frame's latent code and a target timestamp. That independence enables sparse training (only a few timestamps per clip need to be sampled at each step), which significantly reduces computational cost and removes the need for the heavy 3D convolutional discriminators typically required to maintain temporal consistency. A minimal sketch of this idea follows.
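The PyTorch sketch below illustrates the non-autoregressive setup under stated assumptions; the module, its dimensions, and names such as MotionLatentGenerator are illustrative placeholders, not the authors' implementation. Each frame latent is computed directly from the initial latent, a shared motion noise vector, and its own timestamp, so training can sample a sparse, unordered subset of timestamps per clip.

```python
# Minimal sketch (not the authors' code) of non-autoregressive frame-latent synthesis.
import torch
import torch.nn as nn

class MotionLatentGenerator(nn.Module):
    """Maps (initial latent, motion noise, timestamp) to a per-frame latent."""
    def __init__(self, w_dim=512, noise_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(w_dim + noise_dim + 1, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, w_dim),
        )

    def forward(self, w0, motion_noise, t):
        # t is a (batch,) tensor of timestamps; each frame depends only on w0,
        # the shared motion noise, and its own timestamp -- no recurrence.
        x = torch.cat([w0, motion_noise, t.unsqueeze(-1)], dim=-1)
        return self.net(x)

motion_gen = MotionLatentGenerator()
w0 = torch.randn(4, 512)           # initial content latent (e.g. from StyleGAN's W space)
z_motion = torch.randn(4, 128)     # one motion noise vector per clip

# Sparse training: sample a few random timestamps per clip instead of a dense sequence.
timestamps = torch.rand(4) * 10.0  # seconds, drawn independently and in no particular order
w_t = motion_gen(w0, z_motion, timestamps)
print(w_t.shape)                   # torch.Size([4, 512]) -- one latent per sampled frame
```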
Temporal Style Modulation
A central feature of StyleInV is its modulation of the GAN inversion network with temporal styles, each a combination of a motion code and the latent code of the initial frame. This design is motivated by the observation that when a StyleGAN is trained on video frames, the latent codes of frames from the same clip cluster tightly around the code of the initial frame. The inversion encoder exploits this implicit temporal prior, so the modulated latents it produces decode into temporally coherent video frames.
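As a rough illustration of temporal style modulation, the sketch below shows a FiLM/AdaIN-style modulated encoder block; the exact modulation mechanism in StyleInV may differ, and all module names and dimensions here are assumptions. A temporal style is formed from the initial frame's latent and a motion code, then used to scale and shift the inversion encoder's intermediate features.

```python
# Rough sketch (assumptions, not the paper's exact architecture) of a style-modulated
# encoder block: the temporal style rescales and shifts normalized feature maps.
import torch
import torch.nn as nn

class StyleModulatedBlock(nn.Module):
    def __init__(self, channels, style_dim):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_scale = nn.Linear(style_dim, channels)
        self.to_shift = nn.Linear(style_dim, channels)

    def forward(self, feat, style):
        h = self.norm(self.conv(feat))
        scale = self.to_scale(style).unsqueeze(-1).unsqueeze(-1)
        shift = self.to_shift(style).unsqueeze(-1).unsqueeze(-1)
        return h * (1.0 + scale) + shift

# Temporal style = f(initial latent w0, motion code m_t); names are illustrative.
style_dim, w_dim, m_dim = 512, 512, 128
to_style = nn.Linear(w_dim + m_dim, style_dim)

w0 = torch.randn(2, w_dim)          # latent of the initial frame
m_t = torch.randn(2, m_dim)         # motion code for timestamp t
style = to_style(torch.cat([w0, m_t], dim=-1))

block = StyleModulatedBlock(channels=64, style_dim=style_dim)
feat = torch.randn(2, 64, 32, 32)   # intermediate encoder feature map
out = block(feat, style)            # modulated features, same shape as the input
print(out.shape)
```

In the full pipeline, the modulated encoder outputs a latent code that the pretrained StyleGAN synthesis network decodes into the frame at the requested timestamp.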
Flexible Style Transfer
StyleInV supports style transfer through simple fine-tuning of the generator, a capability rooted in its use of a pretrained image generator. By freezing the mapping network and the lower-resolution synthesis layers during fine-tuning, the system adapts to a new image dataset and produces videos in the new visual style while preserving the original motion patterns.
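A hedged sketch of this freezing recipe follows, assuming a StyleGAN2-style generator whose parameters are named like mapping.* and synthesis.b{resolution}.*; the naming convention, the cutoff resolution, and the stand-in demo module are assumptions and will differ across codebases.

```python
# Sketch of selective freezing for style-transfer fine-tuning (assumed parameter names).
import re
import torch.nn as nn

def freeze_for_style_transfer(generator, freeze_up_to_res=32):
    for name, param in generator.named_parameters():
        if name.startswith("mapping"):
            param.requires_grad = False        # keep the mapping network fixed
            continue
        match = re.search(r"\.b(\d+)\.", name) # resolution tag, e.g. "synthesis.b16.conv.weight"
        if match and int(match.group(1)) <= freeze_up_to_res:
            param.requires_grad = False        # freeze low-resolution synthesis blocks
        else:
            param.requires_grad = True         # fine-tune the remaining (high-resolution) layers

# Tiny stand-in generator just to demonstrate the logic; a real run would load a
# pretrained StyleGAN checkpoint instead.
demo = nn.ModuleDict({
    "mapping": nn.Linear(512, 512),
    "synthesis": nn.ModuleDict({
        "b8": nn.Conv2d(512, 512, 3, padding=1),
        "b64": nn.Conv2d(256, 128, 3, padding=1),
    }),
})
freeze_for_style_transfer(demo, freeze_up_to_res=32)
print([(n, p.requires_grad) for n, p in demo.named_parameters()])
# Only parameters that still require gradients would be passed to the optimizer.
```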
Experimental Validation
Evaluations across several video benchmarks—DeeperForensics, FaceForensics, SkyTimelapse, and TaiChi—demonstrate StyleInV's competitive performance against existing state-of-the-art video generation models like MoCoGAN-HD, DIGAN, StyleGAN-V, and Long-Video-GAN.
- FID and FVD Scores: StyleInV achieves the best results in most settings, with lower FID scores indicating higher single-frame quality and lower FVD scores, computed on both short and long clips, indicating more coherent motion. Both metrics reduce to a Fréchet distance between feature statistics; see the sketch after this list.
- Qualitative Analysis: Generated videos preserve subject identity, expression, and overall style across frames with high fidelity, which is crucial for applications such as video editing and animation.
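For context, FID and FVD both reduce to the Fréchet distance between two Gaussians fitted to feature embeddings (Inception features for single frames in FID, I3D clip features in FVD). The sketch below computes that distance from precomputed feature matrices; it is a generic reference implementation, not tied to the paper's evaluation code.

```python
# Fréchet distance between Gaussian-fitted feature statistics, as used by FID/FVD.
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """feats_*: (num_samples, feature_dim) arrays of embedded real/generated samples."""
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real                 # numerical noise can introduce tiny imaginary parts
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

# Example with random features; real usage would embed frames with Inception (FID)
# or clips with I3D (FVD) before calling this function.
# print(frechet_distance(np.random.randn(500, 64), np.random.randn(500, 64)))
```

Because FVD embeds short clips rather than individual frames, it penalizes temporal artifacts that FID cannot detect.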
Implications and Future Developments
Beyond immediate applications in high-quality video generation and style transfer, StyleInV carries significant theoretical and practical implications:
- Generalizability: By decoupling motion generation from frame-to-frame dependency, the model can in principle scale to arbitrarily long video sequences without the error accumulation that destabilizes autoregressive methods.
- Integration Potential: As image GANs continue to advance, StyleInV can adopt more capable pretrained generators, which should improve both visual quality and generation speed.
Future research may incorporate larger and more diverse datasets to improve robustness to global motion dynamics and identity variety. It could also investigate combining the approach with diffusion models to better balance temporal consistency against inference efficiency.