MUSIC: Learning Muscle-Driven Dexterous Hand Control

Published 26 Apr 2026 in cs.GR and cs.AI | (2604.23886v1)

Abstract: We present a data-driven approach for physics-based, muscle-driven dexterous control that enables musculoskeletal hands to perform precise piano playing for novel pieces of music outside the reference dataset. Our approach combines high-frequency muscle-level control with low-frequency latent-space coordination in a hierarchical architecture. At the low level, general single-hand policies are trained via reinforcement learning to generate dynamic muscle-tendon activations while tracking trajectories from a large reference motion dataset. The resulting tracking policies are then distilled into variational autoencoder (VAE) models, yielding smooth and structured latent spaces that abstract away low-level muscle dynamics. For the high level, we train piece-specific policies to operate in this latent space, coordinating bimanual motions based on specific goals, denoted by note events extracted from given musical scores, to synthesize performances beyond the reference data. In addition, we present an enhanced musculoskeletal hand model that supports fine control of fingers for accurate low-level motion tracking and diverse high-level motion synthesis. We evaluate the control pipeline of our approach on a diverse piano repertoire spanning multiple musical styles and technical demands. Results demonstrate that our approach can synthesize coordinated bimanual motions with accurate key presses, and achieve the state-of-the-art performance of piano playing in physics-based dexterous control. We also show that our musculoskeletal hand model demonstrates superior biomechanical stability and tracking precision compared to the existing model, and validate that our musculoskeletal hand model and muscle-driven controller can generate physiologically plausible activation patterns that align with human electromyography (EMG) recordings.


Authors (6)

Summary

The paper introduces a hierarchical reinforcement learning framework that leverages muscle-driven musculoskeletal models to achieve millimeter-scale tracking precision and accurate piano key strikes.
The paper demonstrates innovative techniques including adaptive sampling and latent space distillation to enable robust, emergent bimanual coordination with physiologically validated muscle activations.
The paper validates its approach with strong empirical results, achieving a mean F1 score of 0.94 on novel pieces and alignment with human EMG data for nuanced, biomechanically realistic hand control.

Hierarchical Muscle-Driven Dexterous Hand Control for Music Performance

Introduction and Motivation

The paper "MUSIC: Learning Muscle-Driven Dexterous Hand Control" (2604.23886) presents a framework for synthesizing physiologically plausible, high-precision hand motions for piano performance using muscle-driven musculoskeletal hand models. The system targets the challenges of controlling over-actuated, nonlinear dynamical systems inherent in biological hands, particularly under the demanding conditions of elite piano playing—requiring rapid, bimanual coordination and precise timing, all constrained by biomechanical realism.

Rather than relying on kinematic or joint-level control (which is limited in both expressiveness and biological plausibility), the approach combines reinforcement learning, hierarchical latent-space abstraction, and anatomical model enhancement to deliver physics-based dexterous control. The method generalizes to novel sheet music outside the reference data, produces emergent fingerings without explicit fingering annotation, and achieves state-of-the-art (SOTA) F1 scores for piano key strikes with physiologically validated muscle activations. In doing so, it narrows the gap between computational motion synthesis and real human dexterity.

System Architecture and Methodology

The control pipeline comprises three main stages, structured as a hierarchical architecture to separate muscle-level execution from music-conditioned decision making. This enables sample-efficient learning, generalizable control, and biomechanically realistic actuation.

Low-Level Muscle Tracking (Stage 1):

Single-hand RL policies run at 480 Hz, mapping proprioception and target pose embeddings to muscle activation vectors. They are trained to track pose trajectories from a roughly 10-hour dataset of elite piano motion with a long-tailed distribution, using adaptive MCMC-based sampling to focus learning on challenging trajectory segments.

Latent Space Distillation (Stage 2):

To abstract away high-frequency muscle dynamics, the tracking policies are distilled into a variational autoencoder (VAE). The VAE’s low-dimensional latent, updated at 60 Hz, feeds a deterministic decoder that reproduces the tracking policy’s muscle activations, which regularizes exploration for the higher-level controller.
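The two rates quoted above imply that each latent is held for 480 / 60 = 8 muscle-level steps. The sketch below illustrates that time-scale separation only. The callables `high_level`, `decoder`, and `step_sim` are hypothetical stand-ins, and conditioning the decoder on the current state is an assumption rather than a detail confirmed by the text.

```python
from typing import Callable, List

LOW_LEVEL_HZ = 480  # muscle-level control rate stated in the text
LATENT_HZ = 60      # latent update rate stated in the text
STEPS_PER_LATENT = LOW_LEVEL_HZ // LATENT_HZ  # 8 muscle steps per latent

Vector = List[float]


def run_hierarchy(
    high_level: Callable[[int], Vector],         # hypothetical: tick index -> latent
    decoder: Callable[[Vector, Vector], Vector],  # hypothetical: (state, latent) -> activations
    step_sim: Callable[[Vector], Vector],         # hypothetical: activations -> next state
    initial_state: Vector,
    n_latent_steps: int,
) -> List[Vector]:
    """Hold each latent fixed while the decoder runs at the muscle-level rate."""
    state = initial_state
    activations: List[Vector] = []
    for tick in range(n_latent_steps):
        latent = high_level(tick)  # one high-level decision per 60 Hz tick
        for _ in range(STEPS_PER_LATENT):
            raw = decoder(state, latent)
            # Muscle activations are clamped to the valid range [0, 1].
            act = [min(1.0, max(0.0, x)) for x in raw]
            state = step_sim(act)
            activations.append(act)
    return activations
```

With this layout the high-level policy makes one decision for every eight simulation steps, which is what shortens its effective horizon relative to acting directly on muscles.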

High-Level Music-Conditioned Control (Stage 3):

Over the VAE latent space, bimanual policies are trained (one per hand) for music-conditioned synthesis. Inputs are note-event sequences extracted from scores, represented as temporally extended indicator vectors, coupled with shared inter-hand state for coordinated motion. High-level control is posed as decentralized multi-agent RL with adversarial imitation objectives, leveraging multi-objective PPO and discriminator ensembles for stable transfer of reference motion features.
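One plausible reading of "temporally extended indicator vectors" is a lookahead window of binary key masks. The sketch below builds such a window. The 88-key layout, the `(key, onset, offset)` event tuple, and the `horizon` parameter are all illustrative assumptions, not the paper's specification.

```python
from typing import List, Sequence, Tuple

N_KEYS = 88  # standard piano keyboard size; assumed here

# Hypothetical event format: (key_index, onset_frame, offset_frame), offset exclusive.
NoteEvent = Tuple[int, int, int]


def goal_window(
    events: Sequence[NoteEvent], frame: int, horizon: int
) -> List[List[int]]:
    """Return a horizon x N_KEYS binary matrix.

    Row h marks the keys that should be held down at frame + h.
    """
    window = [[0] * N_KEYS for _ in range(horizon)]
    for key, onset, offset in events:
        if not 0 <= key < N_KEYS:
            raise ValueError(f"key index out of range: {key}")
        start = max(onset, frame)
        stop = min(offset, frame + horizon)
        for t in range(start, stop):
            window[t - frame][key] = 1
    return window
```

Flattening the returned matrix gives a fixed-size goal vector that can be concatenated with the shared inter-hand state as policy input.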

Training uses adaptive sampling driven by key-strike recall scores to concentrate learning on difficult musical passages.
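As a rough illustration of recall-driven sampling, the sketch below weights each training segment by its miss rate. This is a plain proportional scheme chosen for clarity, not the paper's sampler, whose exact form is not given here. The `floor` parameter and the segment identifiers are hypothetical.

```python
import random
from typing import Dict, List


def sampling_weights(recall: Dict[str, float], floor: float = 0.1) -> Dict[str, float]:
    """Weight each segment by (1 - recall) plus a floor, then normalize.

    The floor keeps well-learned segments in rotation so they are not forgotten.
    """
    for seg, r in recall.items():
        if not 0.0 <= r <= 1.0:
            raise ValueError(f"recall for {seg!r} must lie in [0, 1]")
    raw = {seg: (1.0 - r) + floor for seg, r in recall.items()}
    total = sum(raw.values())
    return {seg: w / total for seg, w in raw.items()}


def sample_segments(
    recall: Dict[str, float], n: int, rng: random.Random, floor: float = 0.1
) -> List[str]:
    """Draw n segment identifiers with probability proportional to their weight."""
    weights = sampling_weights(recall, floor)
    segments = list(weights)
    return rng.choices(segments, weights=[weights[s] for s in segments], k=n)
```

Segments with low recall are drawn more often, and the weights can be refreshed whenever recall is re-estimated during training.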

Figure 1: System overview illustrating the three-stage training pipeline: single-hand tracking, VAE latent distillation, and decentralized high-level music-conditioned policy learning.

The architectural separation allows the low-level controller to focus on biomechanically precise execution, while the high-level controller plans goal-directed, coordinated actions over an abstracted control interface, improving both efficiency and robustness.

Musculoskeletal Hand Model Enhancement

The underlying anatomical model builds on MyoHand, with significant augmentation: five additional muscles (FPB, APB, AdP, FDM, ADM) are included to recover precise abduction, adduction, and flexion in the thumb and pinky. This results in a model with 44 muscle-tendon actuators and 50 controllable DoF per hand, enabling nuanced independent finger dynamics critical for piano techniques (e.g., extended chords, glissandi, complex crossings).

Figure 2: Visualization of the enhanced musculoskeletal hand model, with additional musculotendon units highlighted, supporting refined thumb and pinky control.

Biomechanics-oriented simulation is provided by MuJoCo, enforcing activation and deactivation time constants consistent with physiological evidence, with root actuation placed at the elbow, allowing full-range, high-fidelity hand dynamics.
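To make the role of the activation and deactivation time constants concrete, the sketch below integrates the first-order activation model described in MuJoCo's muscle actuator documentation with explicit Euler steps at 480 Hz. The constants are MuJoCo's default `timeconst` values (0.01 s and 0.04 s), used here as an assumption since the paper's settings are not stated in this summary, and the hard switch between the two regimes ignores any smoothing option.

```python
from typing import List, Sequence


def activation_step(
    act: float,
    ctrl: float,
    dt: float,
    tau_act: float = 0.01,    # assumed: MuJoCo default activation constant (s)
    tau_deact: float = 0.04,  # assumed: MuJoCo default deactivation constant (s)
) -> float:
    """One Euler step of d(act)/dt = (ctrl - act) / tau(ctrl, act)."""
    ctrl = min(1.0, max(0.0, ctrl))
    scale = 0.5 + 1.5 * act
    # Activation slows as act grows; deactivation speeds up as act grows.
    tau = tau_act * scale if ctrl > act else tau_deact / scale
    return act + dt * (ctrl - act) / tau


def simulate(
    ctrl_sequence: Sequence[float], dt: float = 1.0 / 480.0, act0: float = 0.0
) -> List[float]:
    """Return the activation after each control input in the sequence."""
    act = act0
    trace: List[float] = []
    for ctrl in ctrl_sequence:
        act = activation_step(act, ctrl, dt)
        trace.append(act)
    return trace
```

Because deactivation is slower than activation, a muscle that is switched off still exerts some force for tens of milliseconds, which is one reason the controller must time agonist and antagonist activity carefully.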

Numerical Results and Empirical Evaluation

The system was evaluated on 15 diverse music excerpts, each outside the training repertoire, spanning durations of 15-32 seconds with up to 350 note events. Key evaluation metrics include tracking error (mm-scale) for muscle-driven imitation and F1 score for successful piano key strikes.
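For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it over sets of `(frame, key)` pairs. Treating a key strike as an exact per-frame match is an assumption made for illustration; the paper's matching criterion is not described in this summary.

```python
from typing import Iterable, Tuple

KeyFrame = Tuple[int, int]  # hypothetical format: (frame_index, key_index)


def key_strike_f1(
    target: Iterable[KeyFrame], played: Iterable[KeyFrame]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for played keys against target keys."""
    target_set, played_set = set(target), set(played)
    true_pos = len(target_set & played_set)
    precision = true_pos / len(played_set) if played_set else 0.0
    recall = true_pos / len(target_set) if target_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Precision penalizes spurious key presses and recall penalizes missed notes, so a high F1 score requires both to be rare.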

Low-level Tracking Precision:

The enhanced model achieves $<4$ mm error for all fingertips/wrist, significantly outperforming the original MyoHand (thumb error: $2.3\pm3.3$ mm vs. $12.1\pm8.0$ mm). Adaptive sampling delivers robust performance across challenging trajectories, reducing tracking variability (Table 1).
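Figures of the form mean $\pm$ deviation can be read as statistics of per-frame Euclidean distances between simulated and reference positions. The sketch below computes them in millimeters. The meter-valued inputs and the use of the population standard deviation are assumptions, since the summary does not say how the deviation was computed.

```python
import math
import statistics
from typing import Sequence, Tuple

Point = Tuple[float, float, float]  # position in meters; assumed unit


def tracking_error_mm(
    simulated: Sequence[Point], reference: Sequence[Point]
) -> Tuple[float, float]:
    """Return (mean, population std) of per-frame position error in mm."""
    if len(simulated) != len(reference) or not simulated:
        raise ValueError("trajectories must be non-empty and of equal length")
    errors = [1000.0 * math.dist(s, r) for s, r in zip(simulated, reference)]
    return statistics.fmean(errors), statistics.pstdev(errors)
```

A deviation larger than the mean, as in the thumb figures above, is possible because distances are non-negative and their distribution can be strongly right-skewed.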

High-level Bimanual Music Synthesis:

The muscle-driven policy consistently exceeds an F1 score of $0.9$ across all novel music pieces (mean $=0.94$), matching the performance of a joint-driven PD baseline on several pieces.

Figure 3: F1 scores per musical piece for muscle-driven and joint-driven models, illustrating SOTA performance for the proposed muscle-control policy.

Emergent Dexterity:

The system exhibits emergent fingering allocation, coordinated finger occupancy across temporally extended goals, and manages non-trivial behaviors—arpeggios, large hand leaps, and inter-hand overlap.

Figure 4: The policy infers finger assignments dynamically for overlapping target keys, preventing premature release and ensuring efficient transitions.

Biomechanical Realism:

EMG measurements obtained from human subjects are strongly aligned with the sparse, staggered muscle activation patterns generated by the controller during motion tracking and piano performance, extending even to fine-grained agonist–antagonist alternation at the individual muscle group level.

Figure 5: Comparison of model-generated muscle activations to human EMG during representative hand movements, validating physiological fidelity.

Ablations and Sensitivity Analysis

The study provides thorough analyses:

Adaptive Sampling: Substantially reduces error and accelerates convergence for both tracking and synthesis.
Latent Control Space: VAE-based latent controllers outperform normalized tracking embeddings and joint-space tracking in both F1 and subjective stability.
Multi-agent vs. Centralized Control: Decentralized MARL accelerates training (by ~30%) and yields higher steady-state performance than a monolithic centralized policy.
Conditional VAE Priors: Vanilla VAE design is favored due to time-scale separation in hierarchical control.

Figure 6: MARL learning curves versus centralized controller, highlighting faster convergence and superior terminal accuracy for the decentralized approach.

Implications, Applications, and Future Directions

Practical Implications

The integration of hierarchical RL over physiologically plausible models makes the framework directly relevant for:

AI-powered digital character animation in interactive media, benefiting from both realism and generalizability.
Biomechanical analysis and motor learning—the alignment with EMG data suggests insight into human strategy and fatigue modeling.
Biorobotics and prosthetic design—techniques for robust, efficient muscle-driven control transfer to hardware control and rehabilitation.

Theoretical Developments

Methodologically, the work:

Validates the efficacy of latent-variable abstraction for bridging sample complexity and exploration barriers in high-dimensional nonlinear control.
Establishes decentralized multi-agent RL as a tractable and performant paradigm for bimanual dexterous manipulation with strict temporal and spatial coupling.
Demonstrates controller composability across drastically different simulation and goal spaces (motion tracking $\to$ music synthesis), which may generalize to other complex, temporally structured tasks (e.g., multi-finger manipulation, skilled throwing, or complex sports tasks).

Limitations and Future Prospects

The system presently operates under binary key-press sound triggering, limiting expressive nuance. Advancements could incorporate velocity-based or physically grounded key-actuation models to synthesize more natural sound dynamics and articulation. Additionally, the occasional emergence of stiff or abrupt motions calls for higher-level regularization (e.g., muscle fatigue, hierarchical trajectory priors). Inclusion of more comprehensive arm/forearm motion data and refinement of hand assignment heuristics (beyond the clef-based rule) would further close the gap toward fully human-like performance inference.

Logical and promising next steps include extending this pipeline to other forms of dexterous manipulation, to whole-body muscle-driven synthesis, and to adaptive transfer onto robotic hardware.

Conclusion

The proposed hierarchical, muscle-driven control pipeline achieves high-fidelity, physiologically plausible hand control for bimanual piano performance, leveraging hierarchical RL, anatomical model enhancements, and adaptive training strategies. The strong empirical results (fingertip and wrist tracking errors below 4 mm, SOTA F1 scores, and validated EMG congruence) demonstrate that high-precision, muscle-based actuation is viable for complex creative tasks far beyond simplified manipulation. The framework constitutes a robust foundation for next-generation physics-based digital avatars and contributes valuable insights into biological motor control, with clear potential for impactful extensions in AI, biomechanics, and robotics.
