Everybody Dance Now (1808.07371v2)

Published 22 Aug 2018 in cs.GR and cs.CV

Abstract: This paper presents a simple method for "do as I do" motion transfer: given a source video of a person dancing, we can transfer that performance to a novel (amateur) target after only a few minutes of the target subject performing standard moves. We approach this problem as video-to-video translation using pose as an intermediate representation. To transfer the motion, we extract poses from the source subject and apply the learned pose-to-appearance mapping to generate the target subject. We predict two consecutive frames for temporally coherent video results and introduce a separate pipeline for realistic face synthesis. Although our method is quite simple, it produces surprisingly compelling results (see video). This motivates us to also provide a forensics tool for reliable synthetic content detection, which is able to distinguish videos synthesized by our system from real data. In addition, we release a first-of-its-kind open-source dataset of videos that can be legally used for training and motion transfer.

Authors (4)
  1. Caroline Chan (5 papers)
  2. Shiry Ginosar (16 papers)
  3. Tinghui Zhou (14 papers)
  4. Alexei A. Efros (100 papers)
Citations (730)

Summary

  • The paper introduces a novel motion transfer technique that uses intermediate pose representations and adversarial networks to generate realistic videos of untrained subjects.
  • It employs a three-stage pipeline—pose detection, global normalization, and pose-to-video translation enhanced by a specialized face GAN—to ensure temporal coherence and high fidelity.
  • Experimental results show significant improvements over baselines, with the method preferred 95.1% of the time over a nearest-neighbor baseline in perceptual studies, demonstrating its practical impact in content creation and virtual environments.

An Insightful Overview of "Everybody Dance Now"

The paper "Everybody Dance Now" by Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A. Efros presents a novel approach for motion transfer, where given a video of a source dancer, the method can synthesize a video of a different, untrained target person performing the same motions. This technique is characterized by its simplicity and robustness, utilizing a video-to-video translation approach with pose as an intermediate representation.

Methodology

The proposed framework consists of a three-stage pipeline:

  1. Pose Detection: A state-of-the-art pose detector, such as OpenPose, extracts 2D joint coordinates from video frames, creating pose stick figures.
  2. Global Pose Normalization: The extracted poses are normalized to account for differences in body shape and position within the frame between the source and target subjects (a minimal sketch of this step follows the list).
  3. Pose to Video Translation: The core of the methodology involves learning a mapping from pose stick figures to images of the target subject using an adversarial neural network framework. This setup includes mechanisms for temporal smoothing to ensure coherence across frames and a specialized GAN for generating high-fidelity face details.
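
The paper describes this normalization only at a high level: a per-frame scale and translation are interpolated between the closest and farthest ankle positions observed in each video. The sketch below (Python/NumPy) is an illustrative approximation of that idea, not the authors' exact formula; the `src_stats`/`tgt_stats` dictionaries of ankle and height extremes are hypothetical helpers.

```python
import numpy as np

def normalize_pose(src_kpts, src_stats, tgt_stats):
    """Map one frame of source keypoints into the target's coordinate frame.

    src_kpts  : (J, 2) array of 2D joint (x, y) positions; image y grows downward.
    src_stats : dict of extremes measured on the source video, e.g.
                {"y_far": ..., "y_close": ..., "h_far": ..., "h_close": ...}
                (ankle y and body height when farthest from / closest to camera).
    tgt_stats : the same statistics measured on the target training video.
    """
    ankle_y = src_kpts[:, 1].max()  # lowest joint as a proxy for the ankle line

    # Where this frame sits between the source's far/close extremes (0 = far, 1 = close).
    t = np.clip((ankle_y - src_stats["y_far"]) /
                max(src_stats["y_close"] - src_stats["y_far"], 1e-6), 0.0, 1.0)

    # Interpolate the target ankle line and the source-to-target height ratio.
    tgt_ankle_y = (1 - t) * tgt_stats["y_far"] + t * tgt_stats["y_close"]
    scale = ((1 - t) * tgt_stats["h_far"] + t * tgt_stats["h_close"]) / \
            ((1 - t) * src_stats["h_far"] + t * src_stats["h_close"])

    # Scale about the ankle line, then translate the ankles onto the target's line.
    out = src_kpts.astype(float).copy()
    out[:, 1] = (out[:, 1] - ankle_y) * scale + tgt_ankle_y
    out[:, 0] = (out[:, 0] - out[:, 0].mean()) * scale + out[:, 0].mean()
    return out
```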

The training procedure employs a multi-scale discriminator and a perceptual reconstruction loss to improve realism and temporal coherence in the synthesized video.
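
Neither the loss weights nor the exact GAN formulation appear in this summary; the following is a minimal sketch of how such a generator objective could be assembled in PyTorch, combining an adversarial term over the discriminator scales with a VGG-feature reconstruction term. The LSGAN-style target values and the `lambda_perc` weight are assumptions, not the paper's values.

```python
import torch
import torch.nn.functional as F

def generator_loss(fake_logits_per_scale, vgg_feats_fake, vgg_feats_real,
                   lambda_perc=10.0):
    """Sketch of a generator objective: multi-scale adversarial + perceptual terms.

    fake_logits_per_scale : list of discriminator outputs on generated frames,
                            one tensor per discriminator scale.
    vgg_feats_fake/real   : lists of pretrained-VGG feature maps for the
                            generated and ground-truth frames.
    """
    # Adversarial term: push every scale's prediction toward "real" (value 1.0).
    adv = sum(F.mse_loss(logits, torch.ones_like(logits))
              for logits in fake_logits_per_scale) / len(fake_logits_per_scale)

    # Perceptual reconstruction term: L1 distance between VGG feature maps.
    perc = sum(F.l1_loss(f, r) for f, r in zip(vgg_feats_fake, vgg_feats_real))

    return adv + lambda_perc * perc
```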

Key Contributions

  1. Intermediate Pose Representation: Utilizing pose stick figures as an intermediate representation abstracts away subject identity, preserving motion signatures which can be universally applied to different individuals.
  2. Temporal Smoothing: Conditioning each generated frame on the previously synthesized one (the generator predicts two consecutive frames at a time) improves temporal coherence, which is crucial for realistic motion synthesis.
  3. Face GAN: Adding a specialized adversarial network for the face region enhances detail and realism, addressing one of the most perceptually critical regions in synthesized video (a combined sketch of the two-frame rollout and the face refinement follows this list).
  4. Open-Source Dataset: The authors release a novel dataset of legally usable videos for training and evaluating motion transfer methods, facilitating further research in this area.
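
The exact conditioning and cropping details are not spelled out in this summary; the sketch below (PyTorch) illustrates one plausible inference-time rollout in which each frame is generated from the current pose map plus the previous output, and a face generator adds a residual over a crop around the face. `G`, `G_face`, and `face_boxes` are hypothetical stand-ins, not the authors' interfaces.

```python
import torch

@torch.no_grad()
def synthesize_sequence(G, G_face, pose_maps, face_boxes):
    """Roll out a video: condition on the previous output, then refine the face.

    G          : generator mapping (pose map, previous RGB frame) -> RGB frame.
    G_face     : face generator producing a residual for the face crop.
    pose_maps  : tensor (T, C, H, W) of rendered pose stick figures.
    face_boxes : list of (top, left, size) square crops around the face per frame.
    """
    T, _, H, W = pose_maps.shape
    prev = torch.zeros(1, 3, H, W)            # black "previous frame" to start
    frames = []
    for t in range(T):
        pose = pose_maps[t:t + 1]
        # Conditioning on the previous output keeps consecutive frames coherent.
        frame = G(torch.cat([pose, prev], dim=1))

        # Refine the face region with a dedicated residual generator.
        top, left, s = face_boxes[t]
        crop = torch.cat([frame[..., top:top + s, left:left + s],
                          pose[..., top:top + s, left:left + s]], dim=1)
        frame[..., top:top + s, left:left + s] += G_face(crop)

        frames.append(frame)
        prev = frame
    return torch.cat(frames, dim=0)           # (T, 3, H, W) synthesized video
```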

Experimental Results

The authors conduct a comprehensive evaluation through perceptual studies and quantitative metrics such as SSIM and LPIPS. Results indicate that their method outperforms baseline approaches including nearest neighbors and PoseWarp on both perceptual quality and quantitative assessments. Additionally, the ablation studies demonstrate the effectiveness of each module, particularly the temporal smoothing and face GAN components.
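
The summary does not say which implementations of these metrics were used; a minimal per-frame sketch using the scikit-image and lpips packages (the package choices are an assumption, not the authors' evaluation code) looks like this:

```python
import lpips                                   # pip install lpips
import torch
from skimage.metrics import structural_similarity

lpips_net = lpips.LPIPS(net="alex")            # AlexNet backbone, one common choice

def frame_scores(real_uint8, fake_uint8):
    """SSIM and LPIPS between one real and one synthesized frame.

    real_uint8, fake_uint8 : (H, W, 3) uint8 RGB frames of the same size.
    Higher SSIM is better; lower LPIPS is better.
    """
    ssim = structural_similarity(real_uint8, fake_uint8, channel_axis=-1)

    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    def to_tensor(im):
        return torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0

    dist = lpips_net(to_tensor(real_uint8), to_tensor(fake_uint8)).item()
    return ssim, dist
```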

For instance, their method was preferred 95.1% of the time over nearest neighbors and 83.3% over PoseWarp in perceptual studies. Such results highlight the robustness and quality of the proposed approach.

Practical and Theoretical Implications

This work holds significant practical implications:

  • Content Creation: Letting amateur users appear to perform complex dance routines or martial-arts moves demonstrated by professionals.
  • Entertainment: Applications in film and gaming industries for generating realistic animations.
  • Virtual Reality: Enhancing user experiences in virtual environments by enabling realistic human motion transfer.

From a theoretical perspective, this research advances understanding in:

  • Pose Representation: Demonstrating the efficacy of pose as an identity-agnostic intermediate representation for transferring motion between subjects.
  • Temporal Coherence: Highlighting the importance of temporal dynamics in video synthesis.
  • GAN Frameworks: Extending adversarial training to handle not only spatial but also temporal aspects of video data.

Future Directions

Future research could delve into several promising areas:

  • Improved Pose Detection: Enhancing robustness, particularly for challenging scenarios like occlusions or extreme poses.
  • Multiview Consistency: Ensuring consistency across different camera angles for applications in multi-camera setups.
  • Real-time Synthesis: Optimizing the pipeline for real-time applications, crucial for interactive media and live performances.

Conclusion

The paper "Everybody Dance Now" provides a significant contribution to the field of motion transfer by introducing a simple yet effective method for synthesizing videos of target individuals performing complex motions presented by source subjects. The blend of an intermediate pose representation, temporal smoothing, and specialized face GAN culminates in a framework capable of producing high-quality, realistic video results, opening new avenues for practical applications and further research.
