Video-to-Video Synthesis (1808.06601v2)

Published 20 Aug 2018 in cs.CV, cs.GR, and cs.LG

Abstract: We study the problem of video-to-video synthesis, whose goal is to learn a mapping function from an input source video (e.g., a sequence of semantic segmentation masks) to an output photorealistic video that precisely depicts the content of the source video. While its image counterpart, the image-to-image synthesis problem, is a popular topic, the video-to-video synthesis problem is less explored in the literature. Without understanding temporal dynamics, directly applying existing image synthesis approaches to an input video often results in temporally incoherent videos of low visual quality. In this paper, we propose a novel video-to-video synthesis approach under the generative adversarial learning framework. Through carefully-designed generator and discriminator architectures, coupled with a spatio-temporal adversarial objective, we achieve high-resolution, photorealistic, temporally coherent video results on a diverse set of input formats including segmentation masks, sketches, and poses. Experiments on multiple benchmarks show the advantage of our method compared to strong baselines. In particular, our model is capable of synthesizing 2K resolution videos of street scenes up to 30 seconds long, which significantly advances the state-of-the-art of video synthesis. Finally, we apply our approach to future video prediction, outperforming several state-of-the-art competing systems.

Authors (7)
  1. Ting-Chun Wang (26 papers)
  2. Ming-Yu Liu (87 papers)
  3. Jun-Yan Zhu (80 papers)
  4. Guilin Liu (78 papers)
  5. Andrew Tao (40 papers)
  6. Jan Kautz (215 papers)
  7. Bryan Catanzaro (123 papers)
Citations (949)

Summary

  • The paper introduces a GAN-based framework that synthesizes photorealistic and temporally coherent videos from semantic segmentation masks using spatio-temporal adversarial objectives.
  • It employs a sequential generator with optical flow and multi-scale discriminators to ensure visual fidelity and temporal consistency, achieving superior FID scores and human evaluations.
  • The approach supports multimodal synthesis and semantic manipulation, outperforming baseline methods on benchmarks like Cityscapes and Apolloscape.

Video-to-Video Synthesis: An Overview

The paper "Video-to-Video Synthesis" authored by Ting-Chun Wang et al. addresses the significant yet less explored problem of video-to-video synthesis. This research aims to learn a mapping function from an input source video, such as a sequence of semantic segmentation masks, to an output photorealistic video. This task is inspired by the popular image-to-image translation problem but requires modeling temporal dynamics to ensure the generated videos are temporally coherent.

Problem Statement and Methodology

The main challenge in video-to-video synthesis lies in maintaining temporal coherence while achieving high visual fidelity in the generated frames. The authors propose a framework based on generative adversarial networks (GANs) to tackle this problem. The framework includes carefully designed generators and discriminators, and employs a spatio-temporal adversarial objective. Specifically, the model aims to match the conditional distribution of generated videos to that of real videos.
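
To make the distribution-matching goal concrete, a conditional GAN objective for this setting can be written as below. This is a simplified restatement rather than the paper's full loss (which includes additional terms such as a flow estimation loss): s denotes the source video, x the corresponding real video, F the generator, and D the discriminator.

```latex
% Simplified conditional GAN objective; s: source video, x: real video, F: generator, D: discriminator
\min_{F} \max_{D} \;
\mathbb{E}_{(x,\,s)}\!\left[\log D(x, s)\right]
+ \mathbb{E}_{s}\!\left[\log\!\left(1 - D\!\left(F(s),\, s\right)\right)\right]
```

At the optimum of this minimax game, the conditional distribution of synthesized videos given the source matches that of real videos, which is the stated goal.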

The generator is designed to be sequential, making a Markov assumption and generating each frame conditioned on the past few frames and the current input frame. This approach is complemented by an elaborate spatio-temporal GAN framework that includes:

  1. Sequential Generator with Flow-Based Synthesis: The generator uses estimated optical flow to warp the previously generated frame toward the current time step, while a hallucination network synthesizes the regions that cannot be obtained by warping, such as occluded or newly revealed areas (a minimal sketch follows this list).
  2. Multi-Scale Discriminators: Two types of discriminators are employed: a conditional image discriminator ensuring frame-wise photorealism and a conditional video discriminator ensuring temporal consistency across frames.
  3. Foreground-Background Prior: For videos with semantic segmentation masks, the generator uses separate networks for foreground and background synthesis, improving the quality and realism of the generated videos.
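
The flow-based composition in item 1 above can be illustrated with a short PyTorch-style sketch. This is a minimal sketch under stated assumptions, not the authors' implementation: `flow_net`, `mask_net`, and `hallucination_net` are hypothetical stand-ins for the sub-networks the paper describes, and conditioning is reduced to a single past frame for brevity.

```python
import torch
import torch.nn.functional as F


def warp(frame, flow):
    """Bilinearly warp a frame (N, C, H, W) by a dense pixel-space flow field (N, 2, H, W)."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=frame.device),
        torch.linspace(-1, 1, w, device=frame.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    offset = torch.stack(
        (flow[:, 0] / ((w - 1) / 2), flow[:, 1] / ((h - 1) / 2)), dim=-1
    )
    return F.grid_sample(frame, grid + offset, align_corners=True)


def next_frame(prev_frame, prev_seg, cur_seg, flow_net, mask_net, hallucination_net):
    """Blend a flow-warped previous frame with a hallucinated image via a soft occlusion mask."""
    flow = flow_net(prev_frame, prev_seg, cur_seg)           # predicted optical flow
    warped = warp(prev_frame, flow)                          # reuse content visible in the past frame
    hallucinated = hallucination_net(prev_frame, cur_seg)    # synthesize newly revealed regions
    mask = torch.sigmoid(mask_net(prev_frame, cur_seg))      # per-pixel blending weight in [0, 1]
    return (1.0 - mask) * warped + mask * hallucinated
```

The full model conditions on several past frames and segmentation maps and trains all components jointly with the spatio-temporal adversarial objective; the sketch only shows how warped and hallucinated content are blended by the occlusion mask.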

Experimental Results

The proposed method was evaluated on several datasets, including Cityscapes, Apolloscape, FaceForensics, and a curated dance video dataset. The results demonstrated that the proposed approach significantly advances the state-of-the-art in terms of both visual quality and temporal coherence.

Cityscapes Benchmark

On the Cityscapes dataset, the proposed method outperformed baseline methods such as frame-by-frame synthesis with pix2pixHD and a variant using optical flow-based video style transfer. The evaluation metrics included the Fréchet Inception Distance (FID), computed with pre-trained video recognition models (I3D and ResNeXt), and human preference scores collected through Amazon Mechanical Turk (AMT); a brief sketch of the FID computation appears after the results below.

  • FID Scores: The proposed method achieved an FID of 4.66, significantly better than the baselines (5.57 and 5.55).
  • Human Preference Scores: The method achieved a preference score of 0.87 for short sequences and 0.83 for long sequences, outperforming alternative methods substantially.
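
For reference, FID compares the Gaussian statistics of features extracted from real and synthesized videos. The sketch below shows the standard computation once per-video feature vectors (e.g., from a pre-trained I3D network) are available; feature extraction itself is omitted, and the function name is illustrative.

```python
import numpy as np
from scipy import linalg


def fid(real_feats, fake_feats):
    """Frechet Inception Distance between two feature matrices of shape (N, D)."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```

Lower values are better; because the features come from video recognition models, both per-frame appearance and temporal dynamics influence the score.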

Apolloscape Benchmark

Similarly, on the Apolloscape dataset, the proposed method demonstrated lower FID scores and higher human preference scores compared to the baselines, confirming its efficacy across different datasets and scenarios.

Multimodal and Semantic Manipulation

The method also supports multimodal synthesis and semantic manipulation, enabling the generation of diverse video outputs from the same input and facilitating high-level control over video generation. Examples include changing road surfaces and synthesizing videos with different appearances based on sampled feature vectors from a learned distribution.
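
A rough sketch of the appearance-sampling step is shown below. It is an illustrative assumption about the interface rather than the paper's exact procedure: it presumes per-class Gaussian statistics (`class_means`, `class_stds`) have already been fitted to appearance features from the training set, and that one sampled vector is broadcast over each semantic region before being fed to the generator alongside the segmentation input.

```python
import numpy as np


def sample_appearance_map(seg_map, class_means, class_stds, feat_dim=3):
    """Sample one appearance feature vector per semantic class and broadcast it spatially.

    seg_map: (H, W) integer class labels.
    class_means / class_stds: dicts mapping class id -> (feat_dim,) arrays.
    """
    h, w = seg_map.shape
    feat_map = np.zeros((h, w, feat_dim), dtype=np.float32)
    for cls in np.unique(seg_map):
        z = np.random.normal(class_means[cls], class_stds[cls])  # one draw per class
        feat_map[seg_map == cls] = z
    return feat_map  # concatenated with the semantic input before synthesis
```

Re-drawing the vectors yields different appearances (e.g., different road surfaces) for the same semantic input video.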

Extensions and Future Directions

An interesting extension of this work is the application to future video prediction. The authors proposed a two-stage approach: predicting future semantic segmentation masks followed by video synthesis using the proposed method. Quantitative and qualitative evaluations showed that the method outperformed state-of-the-art future video prediction techniques like PredNet and MCNet.
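
The two-stage pipeline can be summarized in a few lines; `segmentation_predictor` and `video_synthesizer` below are hypothetical placeholders for the mask-prediction network and the video-to-video generator, so this is a structural sketch rather than the authors' code.

```python
def predict_future_video(past_frames, past_segmentations, n_future,
                         segmentation_predictor, video_synthesizer):
    """Stage 1: predict future semantic masks. Stage 2: render them into photorealistic frames."""
    future_segs = segmentation_predictor(past_segmentations, n_future)
    # The synthesizer is conditioned on the predicted masks plus the observed past frames,
    # mirroring the conditional video-to-video setting.
    return video_synthesizer(future_segs, past_frames)
```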

Implications and Future Work

The implications of this research are significant for fields such as computer vision, robotics, and graphics. Practical applications include model-based reinforcement learning, video editing, and virtual reality. Future directions may include incorporating additional 3D cues such as depth maps, ensuring consistent object appearances across frames, and handling intricate label manipulations without introducing artifacts.

In conclusion, the proposed video-to-video synthesis framework marks a substantial progression in generating high-resolution, photorealistic, and temporally coherent videos from semantic inputs. The careful integration of GANs, optical flow, and spatio-temporal objectives sets a strong foundation for further advancements in this domain.
