- The paper introduces FACT, a Full-Attention Cross-modal Transformer network that generates realistic 3D dance motion conditioned on music.
- The study presents the AIST++ dataset with 5.2 hours of synchronized 3D dance and music sequences, setting a new benchmark in motion analysis.
- Empirical results demonstrate superior motion realism and beat alignment compared to previous state-of-the-art methods.
An Overview of "AI Choreographer: Music Conditioned 3D Dance Generation with AIST++"
This paper introduces a novel approach to the conditional generation of 3D dance motion sequences, built on a newly proposed dataset, AIST++, and the Full-Attention Cross-modal Transformer (FACT) network. By extending transformer-based architectures with modifications tailored to cross-modal sequence generation, the work addresses the challenge of aligning expressive human dance motion with music, a task that requires capturing intricate temporal correlations between two dynamic modalities.
AIST++ Dataset
AIST++ is a pivotal contribution of this paper and one of the largest, most comprehensive datasets available for the study of 3D dance. It comprises 5.2 hours of synchronized 3D dance motion and music across 1,408 sequences, spanning diverse genres that include both Old School (Break, Pop, Lock, and Waack) and New School dance styles. The authors reconstruct 3D motion from multi-view videos and provide extensive annotations that facilitate further research in human motion analysis. Because the motion data includes both joint rotations and global translations, it is readily usable for applications such as motion retargeting.
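As a rough illustration of the motion representation, the sketch below packs one frame of SMPL-style joint rotations and a root translation into a flat feature vector. The joint count, rotation-matrix format, and resulting 219-dimensional size are assumptions based on common practice for this kind of data, not an official loading API.

```python
import numpy as np

NUM_SMPL_JOINTS = 24  # assumption: SMPL body joints, rotations stored as 3x3 matrices

def pack_motion_frame(joint_rotations: np.ndarray, root_translation: np.ndarray) -> np.ndarray:
    """Flatten one motion frame into a single feature vector.

    joint_rotations : (24, 3, 3) rotation matrices, one per SMPL joint
    root_translation: (3,) global translation of the root joint
    Returns a (219,) vector: 24 * 9 rotation values + 3 translation values.
    """
    assert joint_rotations.shape == (NUM_SMPL_JOINTS, 3, 3)
    assert root_translation.shape == (3,)
    return np.concatenate([joint_rotations.reshape(-1), root_translation])
```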
FACT Network Architecture
The FACT network, a Full-Attention Cross-modal Transformer, is central to the paper's 3D dance generation methodology. It improves on traditional sequence models by applying deep cross-modal attention that respects the temporal and contextual structure of both music and motion. Two design choices stand out: a full-attention scheme, which lets the model attend over its entire temporal context rather than the causal masks used in conventional autoregressive transformers, and future-N supervision, in which the model is trained to predict N future frames instead of only the next one.
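The following PyTorch sketch shows one way such a layout could be wired up: separate motion and audio transformers feed a cross-modal transformer that attends over both streams with full (non-causal) attention and predicts a block of future motion frames. Layer counts, hidden sizes, feature dimensions, and the omission of positional embeddings are simplifying assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class FACTSketch(nn.Module):
    def __init__(self, motion_dim=219, audio_dim=35, d_model=800, n_future=20):
        super().__init__()

        def encoder(num_layers):
            # Full attention: no causal mask is ever passed to these encoders.
            layer = nn.TransformerEncoderLayer(d_model, nhead=10, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=num_layers)

        self.motion_proj = nn.Linear(motion_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.motion_tf = encoder(2)    # motion transformer (layer count is an assumption)
        self.audio_tf = encoder(2)     # audio transformer (layer count is an assumption)
        self.cross_tf = encoder(12)    # cross-modal transformer over the fused sequence
        self.head = nn.Linear(d_model, motion_dim)
        self.n_future = n_future

    def forward(self, motion_seed, audio):
        # motion_seed: (B, T_m, motion_dim); audio: (B, T_a, audio_dim)
        m = self.motion_tf(self.motion_proj(motion_seed))
        a = self.audio_tf(self.audio_proj(audio))
        fused = self.cross_tf(torch.cat([m, a], dim=1))   # attend across both modalities
        return self.head(fused[:, :self.n_future])        # predict N future motion frames
```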
Through extensive evaluations, the authors demonstrate that their model outperforms previous state-of-the-art methods in generating coherent and realistic 3D dance motions conditioned on music. The FACT network effectively mitigates known failure modes in motion sequence generation, such as freezing and drift, delivering stable long sequences that remain faithful to the structure of the input music.
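A hedged sketch of the kind of long-sequence roll-out this implies: a seed of motion frames plus a sliding window of music features is fed to the model, a block of future frames is predicted, and only the first predicted frame is kept before advancing the window. The window lengths (120 motion frames, 240 audio frames) and the frame-by-frame roll-out are assumptions about the test-time procedure, reusing the hypothetical FACTSketch interface above.

```python
import torch

@torch.no_grad()
def generate(model, motion_seed, audio, motion_win=120, audio_win=240):
    """Roll out a long motion sequence one frame at a time.

    motion_seed: (1, motion_win, motion_dim) seed motion
    audio:       (1, T, audio_dim) music features for the whole piece
    """
    motion = motion_seed
    for t in range(audio.shape[1] - audio_win):
        window = audio[:, t:t + audio_win]                    # sliding music context
        next_frame = model(motion[:, -motion_win:], window)[:, :1]
        motion = torch.cat([motion, next_frame], dim=1)       # keep only the first predicted frame
    return motion[:, motion_seed.shape[1]:]                   # generated continuation
```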
Empirical Results and Evaluations
A robust quantitative framework is employed to validate the model's performance. Metrics such as the Fréchet Inception Distance (FID), adapted to motion features, assess the quality and realism of generated motion, while a novel Beat Alignment Score quantifies music-motion synchronization. Notably, FACT achieves superior results in motion diversity and motion-music correlation, a testament to its cross-modal feature integration.
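As an illustration of what a beat-alignment style metric can look like, the sketch below scores each kinematic beat (e.g., a local minimum of joint velocity) by its distance to the nearest music beat under a Gaussian kernel. The exact beat-extraction procedure and kernel width used by the authors may differ; treat this as an approximation of the idea rather than the paper's metric.

```python
import numpy as np

def beat_alignment_score(kinematic_beats, music_beats, sigma=3.0):
    """Average Gaussian score of each kinematic beat against its nearest music beat.

    kinematic_beats, music_beats: 1-D arrays of beat times (in frames);
    sigma is an assumed kernel width.
    """
    music_beats = np.asarray(music_beats, dtype=float)
    scores = []
    for kb in kinematic_beats:
        nearest = np.min(np.abs(music_beats - kb))            # distance to closest music beat
        scores.append(np.exp(-(nearest ** 2) / (2 * sigma ** 2)))
    return float(np.mean(scores)) if scores else 0.0
```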
User studies corroborate these findings, attesting to the perceptual quality and music-matching ability of the generated dance sequences. Compared against several baselines, the model shows a marked improvement in perceived motion quality and alignment with the music.
Implications and Future Directions
This work has significant implications for practical applications ranging from virtual character animation to interactive entertainment. By extending transformer-based learning to nuanced cross-modal tasks, it enriches the theoretical landscape of sequence-to-sequence generation and highlights the potential of advanced temporal modeling in multimedia synthesis.
Future work could enhance the physical realism of generated motion, explore unsupervised approaches for greater generative diversity, and expand the dataset to cover additional dance styles or interactive scenarios. Such advances could enable richer human-AI collaboration in creative fields, connecting dance, music, and artificial intelligence in new ways.