- The paper introduces a hierarchical reinforcement learning framework that leverages muscle-driven musculoskeletal models to achieve millimeter-scale tracking precision (under 4 mm) for piano key strikes.
- The paper demonstrates innovative techniques including adaptive sampling and latent space distillation to enable robust, emergent bimanual coordination with physiologically validated muscle activations.
- The paper validates its approach with strong empirical results, achieving a mean F1 score of 0.94 on unseen pieces and alignment with human EMG data for nuanced, biomechanically realistic hand control.
Introduction and Motivation
The paper "MUSIC: Learning Muscle-Driven Dexterous Hand Control" (2604.23886) presents a framework for synthesizing physiologically plausible, high-precision hand motions for piano performance using muscle-driven musculoskeletal hand models. The system targets the challenges of controlling over-actuated, nonlinear dynamical systems inherent in biological hands, particularly under the demanding conditions of elite piano playing—requiring rapid, bimanual coordination and precise timing, all constrained by biomechanical realism.
Rather than relying on kinematic or joint-level control (which is limited in both expressiveness and biological plausibility), the approach unifies reinforcement learning, hierarchical latent-space abstraction, and anatomical model enhancement to deliver physics-based dexterous control. The method generalizes to novel sheet music outside the reference data, produces emergent fingerings without explicit fingering annotation, and achieves state-of-the-art (SOTA) F1 scores for piano key strikes with physiologically validated muscle activations. The system thereby narrows the gap between computational motion synthesis and real human dexterity.
System Architecture and Methodology
The control pipeline comprises three main stages, structured as a hierarchical architecture to separate muscle-level execution from music-conditioned decision making. This enables sample-efficient learning, generalizable control, and biomechanically realistic actuation.
- Low-Level Muscle Tracking (Stage 1):
Single-hand RL policies operate at 480 Hz, mapping proprioception and target pose embeddings to muscle activation vectors. They are trained to track pose trajectories from a 10-hour-scale dataset of elite piano motion with a long-tailed distribution, using adaptive MCMC-based sampling to focus learning on challenging trajectory segments.
- VAE Latent Distillation (Stage 2):
To abstract away high-frequency muscle dynamics, the tracking policies are distilled into a variational autoencoder (VAE). The VAE's low-dimensional latent (updated at 60 Hz) feeds a deterministic decoder that replicates the tracking policy's muscle activations, which regularizes exploration for the higher-level controller.
- High-Level Music-Conditioned Control (Stage 3):
Over the VAE latent space, two policies are trained (one per hand) for music-conditioned bimanual synthesis. Inputs are note-event sequences extracted from scores, represented as temporally extended indicator vectors, coupled with shared inter-hand state for coordinated motion. High-level control is posed as decentralized multi-agent RL with adversarial imitation objectives, leveraging multi-objective PPO and discriminator ensembles for stable transfer of reference motion features.
The high-level policies are trained with adaptive sampling based on key-strike recall scores, so that difficult musical passages are covered efficiently.
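The recall-based adaptive sampling mentioned above can be illustrated with a minimal sketch. The weighting rule (weight equal to one minus recall, plus a small floor) and the names `sampling_probs`, `sample_passage`, and `floor` are assumptions made for illustration; the paper's actual scheme, including the MCMC-based variant used at the tracking stage, may differ.

```python
import random

def sampling_probs(recalls, floor=0.05):
    """Turn per-passage key-strike recall into sampling probabilities.

    Passages with low recall (the hard ones) receive more weight.
    `floor` is a hypothetical constant that keeps well-learned
    passages from dropping out of training entirely.
    """
    weights = [max(1.0 - r, 0.0) + floor for r in recalls]
    total = sum(weights)
    return [w / total for w in weights]

def sample_passage(recalls, rng=random):
    """Draw the index of the next training passage."""
    probs = sampling_probs(recalls)
    return rng.choices(range(len(recalls)), weights=probs, k=1)[0]

# Toy recall scores for three passages (illustrative numbers only).
recalls = [0.98, 0.60, 0.85]
probs = sampling_probs(recalls)
```

In this sketch the second passage, which has the lowest recall, is drawn most often, while the floor keeps the first passage in rotation.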
Figure 1: System overview illustrating the three-stage training pipeline: single-hand tracking, VAE latent distillation, and decentralized high-level music-conditioned policy learning.
The architectural separation allows the low-level controller to focus on biomechanically precise execution, while the high-level controller plans goal-directed, coordinated actions over an abstracted control interface, improving both efficiency and robustness.
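The two control rates imply that each latent is held for 480 / 60 = 8 muscle-level steps. The loop below is a rough sketch of that timing only; `high_policy`, `decoder`, and `env_step` are hypothetical callables, and passing the observation to the decoder is an assumption rather than a detail confirmed by the source.

```python
LOW_HZ = 480    # muscle-level control rate (from the text)
HIGH_HZ = 60    # latent update rate (from the text)
STEPS_PER_LATENT = LOW_HZ // HIGH_HZ  # 8 low-level steps per latent

def run_hierarchy(high_policy, decoder, env_step, obs, n_high_steps):
    """Hold each high-level latent for STEPS_PER_LATENT low-level steps."""
    activations = []
    for _ in range(n_high_steps):
        z = high_policy(obs)              # music-conditioned latent at 60 Hz
        for _ in range(STEPS_PER_LATENT):
            a = decoder(obs, z)           # muscle activation at 480 Hz
            obs = env_step(obs, a)        # advance the simulation one step
            activations.append(a)
    return obs, activations

# Scalar stand-ins so the sketch runs without a simulator.
final_obs, acts = run_hierarchy(
    high_policy=lambda obs: 0.5,
    decoder=lambda obs, z: min(max(z - obs, 0.0), 1.0),
    env_step=lambda obs, a: obs + 0.1 * a,
    obs=0.0,
    n_high_steps=2,
)
```

Because the high-level policy only acts once every eight simulation steps, its exploration happens in the compact latent space rather than directly over raw muscle activations.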
Musculoskeletal Hand Model Enhancement
The underlying anatomical model builds on MyoHand, with significant augmentation: five additional muscles (FPB, APB, AdP, FDM, ADM) are included to recover precise abduction, adduction, and flexion in the thumb and pinky. This results in a model with 44 muscle-tendon actuators and 50 controllable DoF per hand, enabling nuanced independent finger dynamics critical for piano techniques (e.g., extended chords, glissandi, complex crossings).

Figure 2: Visualization of the enhanced musculoskeletal hand model, with additional musculotendon units highlighted, supporting refined thumb and pinky control.
Biomechanics-oriented simulation is provided by MuJoCo, enforcing activation and deactivation time constants consistent with physiological evidence, with root actuation placed at the elbow, allowing full-range, high-fidelity hand dynamics.
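To make the role of activation and deactivation time constants concrete, the sketch below integrates a first-order activation model with explicit Euler steps. The functional form is the one commonly used for muscle actuators in biomechanics simulation and is similar to the muscle activation model documented for MuJoCo; the time constants (10 ms and 40 ms) are illustrative defaults, not values reported by the paper.

```python
def activation_step(act, ctrl, dt, tau_act=0.01, tau_deact=0.04):
    """One Euler step of first-order muscle activation dynamics.

    d(act)/dt = (ctrl - act) / tau, where tau depends on whether the
    muscle is activating (ctrl > act) or deactivating.
    tau_act and tau_deact are assumed, illustrative values in seconds.
    """
    scale = 0.5 + 1.5 * act
    tau = tau_act * scale if ctrl > act else tau_deact / scale
    act = act + dt * (ctrl - act) / tau
    return min(max(act, 0.0), 1.0)  # keep activation in [0, 1]

dt = 1.0 / 480.0  # low-level control step from the text

# Rise from rest under full excitation, and decay from full activation.
a_up, a_down = 0.0, 1.0
for _ in range(10):
    a_up = activation_step(a_up, 1.0, dt)
    a_down = activation_step(a_down, 0.0, dt)
```

After the same number of steps, activation has risen further than it has decayed, reflecting the asymmetry between the faster activation and slower deactivation of muscle. This lag is one reason muscle-level control is harder than direct joint torque control.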
Numerical Results and Empirical Evaluation
The system was evaluated on 15 diverse music excerpts, each outside the training repertoire, spanning durations of 15-32 seconds with up to 350 note events. Key evaluation metrics include tracking error (mm-scale) for muscle-driven imitation and F1 score for successful piano key strikes.
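For reference, the F1 score combines precision and recall of key strikes. The sketch below assumes strikes are compared as exact (frame, key) pairs; this matching criterion and the name `key_strike_f1` are illustrative assumptions, and the paper may instead use a timing tolerance or a per-frame definition.

```python
def key_strike_f1(target, played):
    """Precision, recall, and F1 over sets of (frame, key) strike events.

    Exact matching is assumed here for simplicity.
    """
    target, played = set(target), set(played)
    tp = len(target & played)  # strikes that match the score
    precision = tp / len(played) if played else 0.0
    recall = tp / len(target) if target else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: one of four target notes is replaced by a wrong key.
target = [(0, 60), (10, 64), (20, 67), (30, 72)]
played = [(0, 60), (10, 64), (20, 65), (30, 72)]
precision, recall, f1 = key_strike_f1(target, played)
```

Here three of four strikes match, so precision, recall, and F1 all equal 0.75. A wrong key counts against both precision (an extra strike) and recall (a missed target).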
- Low-level Tracking Precision:
The enhanced model achieves <4 mm error for all fingertips and the wrist, substantially outperforming the original MyoHand (thumb error: 2.3±3.3 mm vs. 12.1±8.0 mm).
Adaptive sampling delivers robust performance across challenging trajectories, reducing tracking variability (Table 1).
- High-level Bimanual Music Synthesis:
The muscle-driven policy consistently surpasses F1 = 0.9 across all novel music pieces (mean = 0.94), matching the performance of a joint-driven PD baseline on several pieces.
Figure 3: F1 scores per musical piece for muscle-driven and joint-driven models, illustrating SOTA performance for the proposed muscle-control policy.
The system exhibits emergent fingering allocation and coordinated finger occupancy across temporally extended goals, and it handles non-trivial behaviors such as arpeggios, large hand leaps, and inter-hand overlap.
Figure 4: The policy infers finger assignments dynamically for overlapping target keys, preventing premature release and ensuring efficient transitions.
EMG measurements obtained from human subjects are strongly aligned with the sparse, staggered muscle activation patterns generated by the controller during motion tracking and piano performance, extending even to fine-grained agonist–antagonist alternation at the individual muscle group level.
Figure 5: Comparison of model-generated muscle activations to human EMG during representative hand movements, validating physiological fidelity.
Ablations and Sensitivity Analysis
The study also reports ablation and sensitivity analyses.
Implications, Applications, and Future Directions
Practical Implications
The integration of hierarchical RL over physiologically plausible models makes the framework directly relevant for:
- AI-powered digital character animation in interactive media, benefiting from both realism and generalizability.
- Biomechanical analysis and motor learning: the alignment with EMG data may offer insight into human motor strategy and fatigue modeling.
- Biorobotics and prosthetic design: techniques for robust, efficient muscle-driven control may transfer to hardware control and rehabilitation.
Theoretical Developments
Methodologically, the work:
- Validates the efficacy of latent-variable abstraction for bridging sample complexity and exploration barriers in high-dimensional nonlinear control.
- Establishes decentralized multi-agent RL as a tractable and performant paradigm for bimanual dexterous manipulation with strict temporal and spatial coupling.
- Demonstrates controller composability across drastically different simulation and goal spaces (motion tracking → music synthesis), which may generalize to other complex, temporally structured tasks (e.g., multi-finger manipulation, skilled throwing, or complex sports tasks).
Limitations and Future Prospects
The system presently operates under binary key-press sound triggering, limiting expressive nuance. Advancements could incorporate velocity-based or physically grounded key-actuation models to synthesize more natural sound dynamics and articulation. Additionally, the occasional emergence of stiff or abrupt motions calls for higher-level regularization (e.g., muscle fatigue, hierarchical trajectory priors). Inclusion of more comprehensive arm/forearm motion data and refinement of hand assignment heuristics (beyond the clef-based rule) would further close the gap toward fully human-like performance inference.
Logical and promising directions include extending this pipeline to other forms of dexterous manipulation, to whole-body muscle-driven synthesis, and to adaptive transfer onto robotic hardware.
Conclusion
The proposed hierarchical, muscle-driven control pipeline achieves high-fidelity, physiologically plausible hand control for bimanual piano performance by combining latent-space RL, anatomical model enhancements, and adaptive training strategies. The empirical results (fingertip and wrist tracking error under 4 mm, a mean F1 score of 0.94 on unseen pieces, and muscle activations consistent with human EMG) indicate that high-precision, muscle-based actuation is viable for complex creative tasks well beyond simplified manipulation. The framework offers a foundation for next-generation physics-based digital avatars and contributes insights into biological motor control, with potential extensions in AI, biomechanics, and robotics.