- The paper leverages a transformer-based diffusion model and Jukebox embeddings to synthesize dance moves that closely follow musical cues.
- The methodology introduces editability via joint-wise conditioning and in-betweening, and proposes a new Physical Foot Contact (PFC) score to measure physical realism.
- Evaluation shows EDGE outperforms baselines in aesthetics and physicality, offering practical benefits for animation and game development.
Editable Dance Generation From Music
The paper introduces a novel methodology for dance generation from music, termed Editable Dance GEneration (EDGE). The method advances computational choreography by coupling music understanding with movement generation. The EDGE framework synthesizes realistic, physically plausible dance movements that are tightly aligned with the input music. It pairs a transformer-based diffusion model with music features extracted from Jukebox, a large pretrained generative music model, and the diffusion formulation brings powerful editing capabilities to dance generation.
The primary advantage of EDGE lies in its adaptability and interactivity, which go beyond the capabilities of its predecessors. It supports joint-wise conditioning and in-betweening, making it a first-of-its-kind editable method for choreographic synthesis. The authors also propose a new metric for physical plausibility, the Physical Foot Contact (PFC) score, which assesses whether a motion's dynamics are consistent with physically plausible ground contact.
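As a rough illustration of how such editing can work in a diffusion model, the sketch below re-imposes known reference motion on masked joints or frames at every denoising step, in the spirit of diffusion inpainting. The function names, the simplified sampler, and the tensor shapes are illustrative assumptions, not the paper's code.

```python
import torch

def apply_editing_constraint(x_pred, x_ref, mask):
    """Keep the reference motion where mask == 1, the model's output elsewhere.

    x_pred: (batch, frames, features) model's current estimate of the clean motion
    x_ref:  (batch, frames, features) reference motion providing the constraints
    mask:   (batch, frames, features) 1.0 on constrained joints/frames
            (e.g. selected joints for joint-wise conditioning, or the first and
            last frames for in-betweening), 0.0 where the model is free
    """
    return mask * x_ref + (1.0 - mask) * x_pred


def sample_with_constraints(model, music_emb, x_ref, mask, alphas_cumprod):
    """Simplified constraint-preserving sampler (not the paper's exact sampler)."""
    x = torch.randn_like(x_ref)                          # start from pure noise
    for t in reversed(range(len(alphas_cumprod))):
        x0_hat = model(x, t, music_emb)                  # predict the clean motion
        x0_hat = apply_editing_constraint(x0_hat, x_ref, mask)
        if t > 0:                                        # re-noise down to step t-1
            a = alphas_cumprod[t - 1]
            noise = torch.randn_like(x0_hat)
            x = a.sqrt() * x0_hat + (1.0 - a).sqrt() * noise
        else:
            x = x0_hat
    return x
```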
Methodology and Architecture
EDGE employs a diffusion model that progressively refines random noise into coherent dance poses. The denoising network is a transformer that uses cross-attention to integrate detailed music information derived from Jukebox embeddings. The architecture also supports arbitrary-length sequence generation by enforcing temporal consistency across batches of shorter clips, which is crucial for applications demanding extended choreographies.
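To make the cross-attention conditioning concrete, here is a minimal sketch of one music-conditioned transformer layer: self-attention over the pose tokens followed by cross-attention over per-frame music embeddings. The layer layout, dimensions, and names are assumptions for illustration, not the paper's architecture verbatim.

```python
import torch
import torch.nn as nn

class MusicConditionedLayer(nn.Module):
    """Self-attention over pose tokens, then cross-attention over per-frame
    music features (e.g. Jukebox embeddings projected to the model width)."""

    def __init__(self, dim=512, heads=8, ff_mult=4, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, ff_mult * dim), nn.GELU(),
                                nn.Linear(ff_mult * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, pose_tokens, music_tokens):
        # pose_tokens:  (batch, frames, dim) embedded noisy poses (plus timestep info)
        # music_tokens: (batch, frames, dim) projected music embeddings
        h = self.norm1(pose_tokens)
        x = pose_tokens + self.self_attn(h, h, h)[0]                # attend over the motion itself
        h = self.norm2(x)
        x = x + self.cross_attn(h, music_tokens, music_tokens)[0]   # attend over the music
        return x + self.ff(self.norm3(x))
```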
The dance representation captures fine-grained skeletal motion in the SMPL format, combining joint rotations with binary foot-contact labels. The paper further introduces auxiliary loss functions to encourage physical realism, most notably a Contact Consistency Loss that suppresses foot sliding. Together, these architectural and methodological choices form a robust framework for synthesizing dance from music.
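A contact-consistency style auxiliary loss can be sketched as follows: foot velocities, obtained via forward kinematics, are penalized on exactly those frames the model itself predicts to be in contact with the ground. The tensor shapes and the averaging scheme below are assumptions; the paper's exact formulation may weight terms differently.

```python
import torch

def contact_consistency_loss(foot_pos, contact_prob):
    """Penalize foot motion on frames the model itself labels as in contact.

    foot_pos:     (batch, frames, n_feet, 3) foot positions from forward kinematics
    contact_prob: (batch, frames, n_feet)    predicted contact labels in [0, 1]
    """
    foot_vel = foot_pos[:, 1:] - foot_pos[:, :-1]             # finite-difference velocity
    gated = foot_vel * contact_prob[:, :-1].unsqueeze(-1)     # active only where contact is predicted
    return (gated ** 2).mean()                                # mean squared foot sliding
```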
Evaluation and Results
The authors conducted comprehensive evaluations through human studies and established quantitative metrics. In user studies, human judges significantly preferred EDGE over baseline methods such as FACT and Bailando in terms of both aesthetics and physical plausibility. Quantitatively, EDGE achieves better PFC scores than previous systems, indicating more physically credible motion. Moreover, the use of Jukebox embeddings plays a critical role in bridging musicality and movement, as evidenced by improved beat alignment scores.
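The intuition behind a PFC-style score can be sketched as follows: frames in which the center of mass accelerates while both feet are moving are physically implausible, because some foot must be planted to produce that acceleration. The normalization and the handling of vertical acceleration below are assumptions; the paper's exact definition may differ.

```python
import numpy as np

def pfc_style_score(com_pos, left_foot_pos, right_foot_pos, fps=30):
    """Sketch of a PFC-style plausibility score over (frames, 3) trajectories; lower is better."""
    dt = 1.0 / fps
    com_acc = np.diff(com_pos, n=2, axis=0) / dt ** 2          # (frames-2, 3) COM acceleration
    com_acc[:, 1] = np.maximum(com_acc[:, 1], 0.0)             # y-up assumed; gravity can explain downward accel.
    left_vel = np.linalg.norm(np.diff(left_foot_pos, axis=0)[:-1] / dt, axis=-1)
    right_vel = np.linalg.norm(np.diff(right_foot_pos, axis=0)[:-1] / dt, axis=-1)
    acc_mag = np.linalg.norm(com_acc, axis=-1)
    frame_scores = acc_mag * left_vel * right_vel              # high when COM accelerates with no planted foot
    return frame_scores.mean() / max(acc_mag.max(), 1e-8)      # normalize by peak acceleration
```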
Implications and Future Directions
The implications of EDGE are extensive. Practically, it offers animators and game developers a tool that can generate seamless, rhythmic dance sequences conditioned on specific music without costly motion capture. Theoretically, it sets a precedent for choreography generation that accounts for physical dynamics, opening avenues for further research into multi-agent choreography and culturally specific dance styles.
The paper suggests future research directions such as extending EDGE to multi-person and scene-aware dance generation. The editing capabilities also point toward interactive systems in which users specify dance constraints and customize choreography within virtual environments. This adaptability represents a pivotal step toward integrating AI-generated creativity into human artistic contexts, hinting at an evolving synergy between AI and expressive human art forms.