DanceChat: Large Language Model-Guided Music-to-Dance Generation (2506.10574v1)

Published 12 Jun 2025 in cs.CV, cs.MM, cs.SD, and eess.AS

Abstract: Music-to-dance generation aims to synthesize human dance motion conditioned on musical input. Despite recent progress, significant challenges remain due to the semantic gap between music and dance motion, as music offers only abstract cues, such as melody, groove, and emotion, without explicitly specifying the physical movements. Moreover, a single piece of music can produce multiple plausible dance interpretations. This one-to-many mapping demands additional guidance, as music alone provides limited information for generating diverse dance movements. The challenge is further amplified by the scarcity of paired music and dance data, which restricts the model's ability to learn diverse dance patterns. In this paper, we introduce DanceChat, an LLM-guided music-to-dance generation approach. We use an LLM as a choreographer that provides textual motion instructions, offering explicit, high-level guidance for dance generation. This approach goes beyond implicit learning from music alone, enabling the model to generate dance that is both more diverse and better aligned with musical styles. Our approach consists of three components: (1) an LLM-based pseudo instruction generation module that produces textual dance guidance based on music style and structure, (2) a multi-modal feature extraction and fusion module that integrates music, rhythm, and textual guidance into a shared representation, and (3) a diffusion-based motion synthesis module together with a multi-modal alignment loss, which ensures that the generated dance is aligned with both musical and textual cues. Extensive experiments on AIST++ and human evaluations show that DanceChat outperforms state-of-the-art methods both qualitatively and quantitatively.

Summary

  • The paper introduces a novel LLM-guided framework that transforms music cues into textual dance instructions to direct diffusion-based motion synthesis.
  • It employs multi-modal feature extraction that integrates rhythmic cues, musical structure, and semantic instructions to achieve realistic, rhythmically aligned dance sequences.
  • The model outperforms baselines on the AIST++ dataset by achieving lower Physical Foot Contact scores and enhanced motion diversity.

Comprehensive Analysis of "DanceChat: LLM-Guided Music-to-Dance Generation"

The paper "DanceChat: LLM-Guided Music-to-Dance Generation" proposes an innovative methodology for generating dance movements from musical inputs using LLMs as intermediaries to bridge the semantic gap between music and dance. This research introduces an approach that leverages textual choreography instructions generated from music captions to guide the dance synthesis process. The proposed DanceChat model addresses key challenges in music-to-dance generation: the abstract nature of musical cues and the limited dataset availability, by employing LLMs to provide accessible, high-level guidance through natural language.

Key Contributions and Methodology

The authors delineate a system composed of three primary components:

  1. Pseudo Instruction Generation uses an LLM to interpret musical features, such as tempo and chord progressions, and translate them into textual dance instructions. This module acts as a choreographic bridge, turning abstract musical elements into specific, actionable directives that then guide dance creation (see the prompt-construction sketch after this list).
  2. Multi-modal Feature Extraction and Fusion combines music, beat, and textual guidance into a unified, multi-modal representation. The fusion is hierarchical: rhythmic cues and musical structure provide the temporal context for dance generation, enriched by the semantic instructions from the LLM (see the fusion sketch after this list).
  3. Diffusion-based Motion Synthesis employs a denoising diffusion probabilistic model to iteratively refine motion sequences toward realistic, musically aligned dance. The loss scheme combines kinematic terms with a multi-modal alignment term, improving the coherence and expressiveness of the generated movements (an assumed form of this loss is sketched after this list).
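
To make the first component concrete, here is a minimal sketch of how an LLM choreographer prompt could be assembled from extracted music features. The feature fields (tempo_bpm, genre, chords, energy) and the query_llm callable are illustrative assumptions, not the paper's exact prompt template or LLM interface.

```python
# Minimal sketch of LLM-based pseudo instruction generation (illustrative only).
# The feature fields and the query_llm helper are assumptions; the paper's exact
# prompt template and LLM interface are not reproduced here.

from dataclasses import dataclass
from typing import List


@dataclass
class MusicFeatures:
    tempo_bpm: float          # e.g. estimated with a beat tracker
    genre: str                # e.g. predicted by a music tagger
    chords: List[str]         # coarse chord progression
    energy: str               # "low" / "medium" / "high"


def build_choreographer_prompt(feat: MusicFeatures, num_segments: int = 4) -> str:
    """Turn abstract music cues into a request for explicit dance instructions."""
    return (
        "You are a professional choreographer. "
        f"The music is {feat.genre} at {feat.tempo_bpm:.0f} BPM with {feat.energy} energy; "
        f"the chord progression is {' -> '.join(feat.chords)}. "
        f"Describe, in {num_segments} short numbered steps, the body movements a dancer "
        "should perform (arms, legs, torso, footwork), one step per musical phrase."
    )


def generate_pseudo_instructions(feat: MusicFeatures, query_llm) -> str:
    """query_llm is any callable that maps a prompt string to LLM text output."""
    return query_llm(build_choreographer_prompt(feat))
```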
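
For the fusion and alignment components, the following PyTorch sketch shows one plausible design under assumed dimensions: cross-attention fuses text guidance into the audio stream, and a cosine-based term pulls motion embeddings toward both music and text embeddings. This is an assumed formulation, not the paper's exact architecture or loss.

```python
# Sketch of multi-modal fusion and a multi-modal alignment loss (assumed design,
# not the paper's exact architecture). Music, beat, and text features are fused
# with cross-attention; generated motion is encouraged to match both conditions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiModalFusion(nn.Module):
    def __init__(self, music_dim=128, beat_dim=16, text_dim=512, d_model=256, n_heads=4):
        super().__init__()
        self.music_proj = nn.Linear(music_dim + beat_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, music_feat, beat_feat, text_feat):
        # music_feat: (B, T, music_dim), beat_feat: (B, T, beat_dim), text_feat: (B, L, text_dim)
        audio = self.music_proj(torch.cat([music_feat, beat_feat], dim=-1))  # (B, T, d_model)
        text = self.text_proj(text_feat)                                     # (B, L, d_model)
        fused, _ = self.cross_attn(query=audio, key=text, value=text)        # text-conditioned audio
        return fused + audio                                                 # residual fusion


def multimodal_alignment_loss(motion_emb, music_emb, text_emb, w_music=1.0, w_text=1.0):
    """Encourage generated-motion embeddings to stay close to both condition embeddings."""
    music_term = 1.0 - F.cosine_similarity(motion_emb, music_emb, dim=-1).mean()
    text_term = 1.0 - F.cosine_similarity(motion_emb, text_emb, dim=-1).mean()
    return w_music * music_term + w_text * text_term
```

In training, a loss of this kind would be added to the standard diffusion denoising objective and any kinematic regularizers, weighting each term to balance realism against condition fidelity.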

Evaluation and Results

The evaluation comprises a robust set of experiments on the AIST++ dataset, demonstrating that DanceChat outperforms several contemporary approaches in generating physically plausible, rhythmically aligned, and diverse dance motions. The model achieves a lower Physical Foot Contact (PFC) score than baseline methods, indicating greater physical realism. Additionally, DanceChat's diversity metrics in kinematic space are closer to the ground-truth distributions, showing an enhanced capability to generate varied and lifelike dance sequences. While the Beat Alignment Score (BAS) reflects strong but not top-tier rhythmic conformity, the user study reveals a significant preference for DanceChat over competing models such as FACT and EDGE, further attesting to its effectiveness.
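
For reference, the Beat Alignment Score mentioned above is commonly computed by averaging a Gaussian kernel over the distance from each kinematic (dance) beat to its nearest music beat. The NumPy sketch below follows that common formulation with an assumed sigma and may differ in detail from the evaluation code used in the paper.

```python
# Common formulation of the Beat Alignment Score (BAS): average, over dance
# (kinematic) beats, of a Gaussian kernel on the distance to the nearest music
# beat. Sigma and the beat-extraction method are assumptions for illustration.

import numpy as np


def beat_alignment_score(dance_beats, music_beats, sigma=3.0):
    """dance_beats, music_beats: 1-D arrays of beat times (e.g. frame indices)."""
    dance_beats = np.asarray(dance_beats, dtype=float)
    music_beats = np.asarray(music_beats, dtype=float)
    if len(dance_beats) == 0 or len(music_beats) == 0:
        return 0.0
    # distance from every dance beat to its nearest music beat
    dists = np.abs(dance_beats[:, None] - music_beats[None, :]).min(axis=1)
    return float(np.mean(np.exp(-(dists ** 2) / (2 * sigma ** 2))))
```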

Implications and Future Research

The proposed DanceChat model has significant implications for the fields of entertainment, virtual avatar development, and human-computer interaction, where automated generation of dance from music is increasingly applicable. By injecting LLM-derived semantics into the generation loop, this paper underscores the transformational potential of integrating LLMs into non-linguistic domains like motion synthesis. Practically, such advancements could enrich virtual reality experiences with more authentic avatars, enhance choreographic exploration, and facilitate creative AI applications.

Moving forward, possibilities for further enhancing DanceChat could include developing specialized LLMs tailored for dance instruction, optimizing temporal alignment and choreography realism, and expanding training datasets to include diverse dance genres and non-Western forms. The paper hints at these possibilities, suggesting a robust path for integrating AI-driven choreography into broader cultural and creative contexts. Future research could also explore the seamless integration of real-time user feedback into the dance generation pipeline, allowing for interactive creative sessions between humans and AI choreographers.

In summary, the paper sets a credible foundation for future AI-driven dance synthesis, tying together music, language, and dance within a cohesive generative paradigm. Such a framework could catalyze further innovation in artistic AI applications, offering new pathways to explore the intersection of technology and artistic expression.
