- The paper introduces a two-stage system that uses a bidirectional GRU and dual discriminators to translate music into coherent human skeleton sequences.
- It incorporates a novel pose perceptual loss using a pre-trained ST-GCN, effectively handling noisy pose detection data for improved video synthesis.
- The evaluation employs BRISQUE metrics and user studies, demonstrating the approach's effectiveness in producing realistic, music-synchronized dance videos.
Music-oriented Dance Video Synthesis with Pose Perceptual Loss
The paper "Music-oriented Dance Video Synthesis with Pose Perceptual Loss" presents a comprehensive approach for generating dance videos aligned with a given piece of music. This involves synthesizing a human skeleton sequence derived from music and employing a learned pose-to-appearance mapping to produce the final dance video. The proposed method pivots on the incorporation of pose perceptual loss, allowing it to discern the connection between two modalities, music, and dance, thereby producing realistic and music-synchronized dance videos.
Methodology
The methodology is structured in two primary stages:
- Skeleton Sequence Generation:
- The audio input is first transformed into a sequence of feature vectors.
- The generator pairs a music encoder with a pose generator built around a bidirectional GRU, which captures temporal dependencies in both directions (see the sketch after this list).
- Two discriminators, a Local Temporal Discriminator and a Global Content Discriminator, enforce, respectively, smooth transitions between consecutive poses and harmony between the whole sequence and the music.
- Pose-to-Video Mapping:
- Extends the pose generation task into full video synthesis, transferring the articulated movements of the skeleton sequence onto a target performer via a learned pose-to-appearance mapping.
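To make the first stage concrete, here is a minimal PyTorch-style sketch of a generator in the spirit described above: a music encoder feeding a bidirectional GRU, followed by a pose decoder that emits per-frame 2D joint coordinates. The module names, dimensions, and 18-joint layout are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class MusicToPoseGenerator(nn.Module):
    """Sketch of the stage-1 generator: audio features -> skeleton sequence.

    Assumptions (not from the paper): per-frame audio feature dimension,
    hidden sizes, and an 18-joint OpenPose-style layout.
    """
    def __init__(self, audio_dim=28, hidden=256, n_joints=18):
        super().__init__()
        # Music encoder: per-frame projection of raw audio features.
        self.music_encoder = nn.Sequential(
            nn.Linear(audio_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Bidirectional GRU captures temporal context in both directions.
        self.gru = nn.GRU(hidden, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        # Pose decoder: GRU states -> (x, y) coordinates for each joint.
        self.pose_decoder = nn.Linear(2 * hidden, n_joints * 2)

    def forward(self, audio_feats):             # (B, T, audio_dim)
        h = self.music_encoder(audio_feats)     # (B, T, hidden)
        h, _ = self.gru(h)                      # (B, T, 2*hidden)
        poses = self.pose_decoder(h)            # (B, T, n_joints*2)
        return poses.view(*poses.shape[:2], -1, 2)  # (B, T, n_joints, 2)
```

In the full system, the Local Temporal Discriminator would then score short overlapping windows of the generated sequence for smooth transitions, while the Global Content Discriminator would score the entire sequence together with the music encoding.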
Experimental Design and Evaluation
The authors develop three significant components to enhance the dance video synthesis task:
- Pose Perceptual Loss: This novel loss, the paper's centerpiece, compares generated and detected skeleton sequences in the feature space of a pre-trained spatial-temporal graph convolutional network (ST-GCN), making training robust to the noisy, imperfect pose labels typical of OpenPose-derived datasets (see the first sketch after this list).
- Discriminators Design: Two tailored discriminators (Local Temporal and Global Content) scrutinize both individual pose transitions and the coherence of the overall sequence against the music's beat, rhythm, and mood.
- Cross-modal Evaluation Metric: Because judging how well a dance matches a piece of music is inherently subjective, the authors introduce a metric that aligns music embeddings with pose sequences, giving one evaluation protocol a quantitative basis (see the second sketch after this list).
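The pose perceptual loss can be pictured as feature matching through a frozen ST-GCN: both the generated and the detected (noisy) skeleton sequences are run through the pre-trained network, and distances between intermediate activations are accumulated. The sketch below is one plausible rendering; the `stgcn.blocks` attribute, layer selection, and L1 distance are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pose_perceptual_loss(stgcn, fake_poses, real_poses, layers=(1, 2, 3)):
    """Compare generated vs. detected poses in ST-GCN feature space.

    Assumes `stgcn` is a frozen pre-trained ST-GCN whose backbone exposes
    an iterable of blocks (`stgcn.blocks` is a hypothetical attribute);
    poses are tensors shaped (B, C, T, V) with C=2 coordinate channels,
    T frames, and V joints.
    """
    def features(x):
        feats, h = [], x
        for i, block in enumerate(stgcn.blocks):
            h = block(h)
            if i in layers:
                feats.append(h)
        return feats

    with torch.no_grad():
        real_feats = features(real_poses)   # targets: no gradient needed
    fake_feats = features(fake_poses)       # gradients flow to the generator
    # Matching high-level features tolerates noisy per-joint detections.
    return sum(F.l1_loss(f, r) for f, r in zip(fake_feats, real_feats))
```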
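The cross-modal metric can likewise be sketched as a similarity score in a shared embedding space; the encoders below are hypothetical stand-ins for whatever music and pose embedding networks the metric is built on.

```python
import torch
import torch.nn.functional as F

def cross_modal_similarity(music_encoder, pose_encoder, audio_feats, poses):
    """Illustrative cross-modal score: cosine similarity between a music
    embedding and a pose-sequence embedding in a shared space.

    `music_encoder` and `pose_encoder` are assumed to have been trained
    (e.g. on paired music/dance data) to embed both modalities into the
    same vector space; both names are hypothetical.
    """
    m = F.normalize(music_encoder(audio_feats), dim=-1)  # (B, D)
    p = F.normalize(pose_encoder(poses), dim=-1)         # (B, D)
    return (m * p).sum(dim=-1)  # higher = better music-dance alignment
```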
The approach was evaluated quantitatively, including BRISQUE scores for the visual quality of synthesized frames, and through user studies in which the synthesized results compared favorably against real dance sequences.
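For reference, BRISQUE is a no-reference quality score that can be computed per frame with off-the-shelf tooling. The snippet below uses the third-party piq library (our choice, not necessarily the authors' tooling) and averages over frames; lower scores indicate better perceptual quality.

```python
import torch
import piq  # third-party: pip install piq (our choice, not the paper's tooling)

def mean_brisque(frames: torch.Tensor) -> float:
    """Average no-reference BRISQUE score over sampled video frames.

    `frames`: (N, 3, H, W) tensor with values in [0, 1].
    """
    return piq.brisque(frames, data_range=1.0, reduction='mean').item()
```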
Implications and Future Work
The implications of this research are manifold. Practically, this approach enables the automatic generation of personalized music videos without requiring artist intervention, a domain with significant consumer and entertainment industry interest. Theoretically, it provides a foundational framework for exploring complex cross-modal generative tasks involving sequential data.
Moving forward, this work invites extensions such as stronger cross-modal translation techniques, better generalization across dance styles and music genres, and a more efficient skeleton-to-video transfer. Real-time synthesis is another promising direction that would broaden the method's reach in interactive media.
In summary, the paper offers a structured pipeline and thoughtful evaluation for synthesizing dance videos, achieving robust alignment between pose generation and musical input. It contributes meaningfully to AI-mediated creative processes and sets a benchmark in the interdisciplinary area of automated dance music video generation.