BiTDiff: Fine-Grained 3D Conducting Motion Generation via BiMamba-Transformer Diffusion

Published 6 Apr 2026 in cs.CV and cs.MM | (2604.04395v1)

Abstract: 3D conducting motion generation aims to synthesize fine-grained conductor motions from music, with broad potential in music education, virtual performance, digital human animation, and human-AI co-creation. However, this task remains underexplored due to two major challenges: (1) the lack of large-scale fine-grained 3D conducting datasets and (2) the absence of effective methods that can jointly support long-sequence generation with high quality and efficiency. To address the data limitation, we develop a quality-oriented 3D conducting motion collection pipeline and construct CM-Data, a fine-grained SMPL-X dataset with about 10 hours of conducting motion data. To the best of our knowledge, CM-Data is the first and largest public dataset for 3D conducting motion generation. To address the methodological limitation, we propose BiTDiff, a novel framework for 3D conducting motion generation, built upon a BiMamba-Transformer hybrid model architecture for efficient long-sequence modeling and a Diffusion-based generative strategy with human-kinematic decomposition for high-quality motion synthesis. Specifically, BiTDiff introduces auxiliary physical-consistency losses and a hand-/body-specific forward-kinematics design for better fine-grained motion modeling, while leveraging BiMamba for memory-efficient long-sequence temporal modeling and Transformer for cross-modal semantic alignment. In addition, BiTDiff supports training-free joint-level motion editing, enabling downstream human-AI interaction design. Extensive quantitative and qualitative experiments demonstrate that BiTDiff achieves state-of-the-art (SOTA) performance for 3D conducting motion generation on the CM-Data dataset. Code will be available upon acceptance.

Abstract PDF Upgrade to Chat

Authors (7)

Summary

The paper introduces the BiTDiff framework, a groundbreaking model leveraging BiMamba-Transformer diffusion to generate fine-grained, 3D conducting motions.
It presents the CM-Data dataset, the first extensive 3D motion collection with detailed hand, body, and facial expressions for music-synchronized performances.
Experimental results show that BiTDiff significantly outperforms baselines in fidelity, diversity, and beat alignment while efficiently synthesizing long-horizon sequences.

Fine-Grained 3D Conducting Motion Generation via BiMamba-Transformer Diffusion

Introduction

The automation of expressive 3D conducting motion, synchronized with music, is technologically significant for music education, digital human animation, and human–AI co-creation. Previous approaches struggle with two dominant challenges: the lack of a comprehensive, high-fidelity dataset capturing fine-grained hand, body, and facial expressions of conductors in 3D, and the absence of efficient, high-quality generative models supporting temporally consistent long-sequence motion synthesis. This paper presents BiTDiff, which addresses both data and methodology deficiencies through the introduction of the CM-Data dataset and a BiMamba-Transformer Diffusion model supporting joint-level motion editing and state-of-the-art generation fidelity.

Dataset: CM-Data Construction and Properties

A major contribution is CM-Data—the first large-scale, fine-grained, public 3D conducting motion dataset. CM-Data is built on a robust, quality-oriented acquisition pipeline that utilizes multi-stage model fusion (PromptHMR for coarse body, HaPTIC for articulate hands, and SPECTRE for facial dynamics), resulting in temporally stable SMPL-X sequences with full-body coverage, detailed hand articulation, and facial expression:

Figure 1: The CM-Data pipeline captures both coarse and fine motion details, including hand and facial expressivity, with accurate temporal stability visualized in multiple views.

The dataset contains roughly 1,500 curated and processed samples totaling about 10 hours, spanning a diversity of conductors, genres, and scenarios (orchestral, choral, solo performances). It provides detailed per-frame annotation (root translations, 6D joint rotations), supporting scalable benchmarks for modeling expressive and realistic conducting.

Methodology: BiTDiff Framework

Generative Model

The generative pipeline is formalized as a conditional diffusion process mapping music features to SMPL-X motion sequences. The model leverages a hybrid BiMamba–Transformer architecture:

Figure 2: System overview—BiTDiff’s architecture (left) combines BiMamba for intra-modal temporal modeling and a Transformer block for cross-modal music-motion alignment; the right panel visualizes flexible inference with diffusion steps.

BiMamba processes intra-modal dependencies (i.e., fine-grained temporal structure in motion or music) in both forward and backward directions, enhancing context modeling for non-causal tasks and supplying long-sequence modeling efficiencies that outperform classical autoregressive or transformer-only schemes.
Transformer is employed as a cross-modal attention module, integrating global musical context to refine alignment between musical structure (beat, chroma, MFCCs) and expressive motion cues.

Auxiliary training objectives include hand/body-specific forward kinematics (FK) loss terms, promoting precise physical consistency, particularly for articulate hand and upper-body movements, and a velocity loss to encourage temporal smoothness. Joint training with classifier-free guidance enables musically controllable sampling.

Non-Autoregressive, Efficient Long-Horizon Synthesis

BiTDiff generates full-length sequences (up to 4,096 frames) non-autoregressively, avoiding exposure bias and inefficient segment-level inpainting. The bidirectional Mamba backbone ensures efficient memory scaling and uninterrupted long-term coherence.

Motion Editing Capabilities

The model supports fine-grained, training-free motion editing via masked denoising—for instance, temporal in-betweening, streaming continuation, lower-body synthesis conditional on upper-body input, and body-to-hand/face enrichment—all of which facilitate downstream interactive or creative workflows.

Experimental Results

Quantitative and Qualitative Performance

BiTDiff is empirically validated against leading baselines in 2D conducting, 3D dance, and gesture generation on CM-Data, demonstrating clear superiority across fidelity (FID), diversity (DIV), and beat alignment (BAS):

Figure 3: Qualitative results highlight the expressivity, temporal stability, and fine-grained detail of BiTDiff-generated motions compared to SOTA baselines.

Notably, BiTDiff reduces average FID by 18.1, increases DIV by 2.5, and improves BAS by 2.0 relative to the best prior method (Lodge). Latency analysis shows that BiTDiff achieves the fastest inference times, especially advantageous for long-sequence practical deployment. User studies with conducting experts confirm the model’s superiority in synchronization, fidelity, and especially in creativity (4.17/5).

Ablation Studies

Ablation indicates strong contributions from both the velocity loss (improved body dynamics) and the hand/body FK decomposition (marked improvement in hand motion realism). The hybrid architecture (BiMamba-Transformer) provides the optimal trade-off between quality and efficiency: unidirectional or pure Transformer backbones yield either degraded expressivity or increased inference cost, respectively.

Motion Editing

Figure 4: Motion editing visualization—BiTDiff supports temporally and spatially localized manipulation of generated sequences for flexible creative control.

BiTDiff exhibits robust inpainting, continuation, and enrichment capabilities, supporting real-time, user-guided joint-level motion control, which is not supported by prior methods.

Implications and Future Work

BiTDiff’s architecture demonstrates that bidirectional SSMs (Selective State Space Models) combined with transformative cross-modal attention enable highly efficient and expressive music-to-motion mappings at unprecedented levels of detail and control. The introduction of CM-Data sets a new standard for benchmarking, facilitating further advances in fine-grained multimodal motion synthesis. Future directions include integration of text-based control paradigms, expansion to multimodal editing, and practical deployments in digital performance or human-AI conducting ensembles.

Conclusion

This work fills a pivotal gap in fine-grained 3D conducting motion generation by introducing both the largest available dataset (CM-Data) and a hybrid generative framework (BiTDiff) that sets a new state-of-the-art in expressivity, temporal coherence, controllability, and efficiency. The principles established here—bidirectional SSMs for long-form generation and hierarchical control for joint-level editing—are likely to inform future research at the intersection of music, motion, and human-centered generative AI.

Markdown Report Issue