PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination

Published 14 Apr 2026 in cs.CV | (2604.12856v2)

Abstract: Audio-driven bimanual piano motion generation requires precise modeling of complex musical structures and dynamic cross-hand coordination. However, existing methods often rely on acoustic-only representations lacking symbolic priors, employ inflexible interaction mechanisms, and are limited to computationally expensive short-sequence generation. To address these limitations, we propose PianoFlow, a flow-matching framework for precise and coordinated bimanual piano motion synthesis. Our approach strategically leverages MIDI as a privileged modality during training, distilling these structured musical priors to achieve deep semantic understanding while maintaining audio-only inference. Furthermore, we introduce an asymmetric role-gated interaction module to explicitly capture dynamic cross-hand coordination through role-aware attention and temporal gating. To enable real-time streaming generation for arbitrarily long sequences, we design an autoregressive flow continuation scheme that ensures seamless cross-chunk temporal coherence. Extensive experiments on the PianoMotion10M dataset demonstrate that PianoFlow achieves superior quantitative and qualitative performance, while accelerating inference by over 9× compared to previous methods.

Authors (5)

Summary

The paper introduces PianoFlow, a two-stage flow-matching framework that uses cross-modal distillation from MIDI for wrist-trajectory generation, followed by coordinated bimanual gesture synthesis.
It demonstrates state-of-the-art performance with FID=2.674 and over 9× faster inference, ensuring high-fidelity, temporally coherent streaming motion.
The methodology features an Asymmetric Role-Gated Interaction module and autoregressive flow continuation, enabling fine-grained, real-time bimanual coordination.

Motivation and Problem Definition

Audio-driven human motion generation for instrument playing—specifically, high-fidelity bimanual piano motion—poses unique challenges, including fine-grained finger articulation, real-time inter-hand coordination, and dense semantic alignment with musical structure. Prevailing methods, predominantly diffusion or Transformer-based, exhibit three critical limitations: they lack symbolic (MIDI-based) musical priors, employ rigid cross-hand interactions, and are confined to resource-intensive, non-streaming, short-sequence generation. These bottlenecks undermine accurate and efficient synthesis of coordinated, physically plausible piano motions.

Methodological Framework

PianoFlow introduces a two-stage flow-matching architecture that leverages privileged multimodal distillation and an explicit role-gated bimanual interaction scheme, yielding music-aware wrist and gesture synthesis deployable for real-time streaming applications.

Figure 1: PianoFlow enables real-time, audio-driven bimanual piano motion synthesis, distilling structured musical priors via MIDI and introducing ARGI for adaptive bimanual coordination.

Stage 1: Music-Aware Wrist Trajectory Generation

A decoupled cross-modal distillation framework is adopted. A multimodal teacher, ingesting both MIDI and audio, encodes hierarchical harmonic structure via a Harmonic Perceiver, integrating octave/interval relationships with pitch-wise positional encodings using attention-based pooling. The student receives only audio, which is processed through a pre-trained MuQ encoder. Cross-modal alignment is enforced at both encoder and decoder, with progressive distillation to circumvent noisy early teacher signals.
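The progressive distillation described above can be pictured as a weight that starts at zero and grows as the teacher matures. The following is a minimal sketch under stated assumptions: the linear ramp, the mean-squared alignment term, and the names `distill_weight`, `student_loss`, and `warmup_steps` are all hypothetical illustrations, since the summary does not specify the schedule or the alignment loss actually used.

```python
def distill_weight(step, warmup_steps, max_weight=1.0):
    """Linearly ramp the distillation weight from 0 to max_weight (assumed schedule)."""
    if warmup_steps <= 0:
        return max_weight
    return max_weight * min(1.0, step / warmup_steps)


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def student_loss(task_loss, student_enc, teacher_enc,
                 student_dec, teacher_dec, step, warmup_steps):
    """Task loss plus encoder- and decoder-level alignment, scaled by the ramp.

    Early in training the weight is near zero, so noisy teacher features
    contribute little to the student's gradient.
    """
    weight = distill_weight(step, warmup_steps)
    alignment = mse(student_enc, teacher_enc) + mse(student_dec, teacher_dec)
    return task_loss + weight * alignment
```

With this assumed schedule, `student_loss(...)` equals the plain task loss at step 0 and includes the full alignment penalty once `step` reaches `warmup_steps`.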

Critically, each hand is modeled independently (with unshared parameters) so that hand-specific kinematics do not collapse into a shared average; an ablation study shows significant degradation when this decoupling is removed.

Stage 2: Bimanual Coordinated Gesture Generation

A conditional flow-matching backbone—factorized into spatial Transformers (intra-hand) and temporal U-Nets—synthesizes fine-grained gesture kinematics from predicted wrist trajectories and learned audio representations. Bimanual dependencies are explicitly modeled at the bottleneck via the Asymmetric Role-Gated Interaction (ARGI) module. ARGI integrates hand-identity cues through dedicated embeddings and a frame-wise, role-aware temporal gating mechanism that modulates cross-hand information exchange adaptively.
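To make the idea of role-aware attention with frame-wise gating concrete, here is a toy sketch in plain Python. It is not the paper's ARGI implementation: the function `role_gated_update`, the additive role embeddings, the sigmoid gate, and the use of separate gate parameters per direction (to express the asymmetry) are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def role_gated_update(own, other, role_own, role_other, gate_w, gate_b):
    """Update one hand's frames with gated context from the other hand.

    own, other: lists of T feature vectors, one per frame.
    role_own, role_other: hypothetical hand-identity embeddings.
    gate_w, gate_b: hypothetical gate parameters for this direction.
    """
    queries = [add(frame, role_own) for frame in own]
    keys = [add(frame, role_other) for frame in other]
    scale = math.sqrt(len(role_own))
    updated = []
    for query, frame in zip(queries, own):
        # Role-aware attention: this frame attends over every frame of the other hand.
        weights = softmax([dot(query, key) / scale for key in keys])
        context = [sum(w * vec[d] for w, vec in zip(weights, other))
                   for d in range(len(frame))]
        # Frame-wise gate in (0, 1) controls how much cross-hand context is admitted.
        gate = 1.0 / (1.0 + math.exp(-(dot(gate_w, query) + gate_b)))
        updated.append([x + gate * c for x, c in zip(frame, context)])
    return updated


# Asymmetric use: each direction has its own (hypothetical) gate parameters.
left = [[1.0, 0.0], [0.0, 1.0]]
right = [[2.0, 2.0], [2.0, 2.0]]
role_left, role_right = [0.1, 0.0], [0.0, 0.1]
left_out = role_gated_update(left, right, role_left, role_right, [0.5, 0.5], 0.0)
right_out = role_gated_update(right, left, role_right, role_left, [-0.5, 0.2], 0.0)
```

A strongly negative gate bias leaves a hand's features almost unchanged, which corresponds to the two hands moving independently for that frame; a gate near one lets the other hand's context through in full.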

Inference leverages ODE solvers operating over the learned velocity fields. Statistical optimal transport provides a deterministic path for generation, ensuring stable and coherent solutions across samples.
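As a rough illustration of sampling by integrating a velocity field, the sketch below uses a fixed-step Euler solver. The solver choice, the function name `euler_sample`, and the toy velocity field are assumptions; the summary does not state which ODE solver or step count PianoFlow uses, and in practice the velocity would come from the trained network rather than a closed-form expression.

```python
def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for a learned model: a field whose trajectories are straight
# lines ending at `target`, as on a linear noise-to-data path.
target = [0.5, -1.0, 2.0]


def toy_velocity(x, t):
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]


sample = euler_sample(toy_velocity, [0.0, 0.0, 0.0], num_steps=8)
```

Because the integration is deterministic given the starting noise, the same `x0` always yields the same output, which is the property the text refers to as a deterministic generation path.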

Figure 2: The PianoFlow architecture: cross-modal distillation for wrist trajectory (Stage 1) and gesture synthesis via conditional flow-matching with ARGI for dynamic bimanual coordination (Stage 2).

Arbitrary-Length Streaming via Autoregressive Flow Continuation (AFC)

Typical sequence generation pipelines fail to deliver temporally coherent, real-time motion for long performances. AFC addresses this by employing a sliding-window approach, causally anchoring the tail of the previous window via optimal transport blending between historical and noise-initiated candidates. Integration is performed at every ODE step using a cosine mask for boundary smoothing, achieving temporal consistency in arbitrarily long motion streams—crucial for real-world interactive applications.
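The sliding-window scheme can be sketched as follows, with heavy simplification. This is one plausible reading of the description, not the paper's algorithm: each frame is a single scalar, the anchor is taken to be the point on a straight noise-to-history path at the current ODE time, and the names `cosine_mask`, `generate_window`, and `stream` are hypothetical.

```python
import math


def cosine_mask(window_len, overlap):
    """1.0 on the first overlapping frame, decaying to 0.0 past the overlap."""
    return [0.5 * (1.0 + math.cos(math.pi * i / overlap)) if i < overlap else 0.0
            for i in range(window_len)]


def generate_window(velocity, noise, history_tail, num_steps):
    """Euler-integrate one window, re-anchoring the overlap at every ODE step."""
    overlap = len(history_tail)
    mask = cosine_mask(len(noise), overlap)
    x = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t_next = (k + 1) * dt
        for i in range(overlap):
            # Point on the straight path from this frame's noise to its history value.
            anchor = (1.0 - t_next) * noise[i] + t_next * history_tail[i]
            # Blend the anchored trajectory with the freely generated candidate.
            x[i] = mask[i] * anchor + (1.0 - mask[i]) * x[i]
    return x


def stream(velocity, noise_windows, overlap, num_steps):
    """Generate windows in sequence, emitting only frames not already emitted."""
    history_tail, output = [], []
    for noise in noise_windows:
        window = generate_window(velocity, noise, history_tail, num_steps)
        output.extend(window[len(history_tail):])
        history_tail = window[-overlap:]
    return output
```

In this sketch the first frame of each new window is pinned exactly to the last emitted motion, while later overlap frames are progressively released to the model, which is what smooths the chunk boundary.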

Experimental Evaluation

Experiments were conducted on the PianoMotion10M dataset, which provides synchronized audio, MIDI, and 3D MANO-based hand motion for large-scale piano performances.

Quantitative Results

PianoFlow yields state-of-the-art performance across all key metrics:

On short sequence generation, PianoFlow surpasses all baselines in Fréchet Inception Distance (FID = 2.674), Fréchet Gesture Distance (FGD), and spatio-temporal metrics (PD, Smoothness), with a Real-Time Factor (RTF) of 0.186—over 9× faster inference than prior approaches, and the only method supporting real-time, high-fidelity streaming.
For long sequences, PianoFlow with AFC outperforms sliding-window fusion on both global and hand-specific metrics, effectively preserving high-frequency motion (FDE) critical for expressive finger articulation and temporal realism.
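For readers unfamiliar with the Real-Time Factor, the short sketch below uses the common definition of RTF as processing time divided by the duration of the generated content, so values below 1 mean faster than real time. The paper's exact measurement protocol is not given in this summary, and the timing numbers are hypothetical values chosen only to reproduce the reported RTF of 0.186.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent generating / duration of the audio being animated."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timing: 1.86 s of compute for a 10 s clip gives RTF = 0.186.
rtf = real_time_factor(1.86, 10.0)
speed_vs_real_time = 1.0 / rtf  # seconds of motion produced per second of compute
```

Under this definition, an RTF of 0.186 corresponds to producing roughly 5.4 seconds of motion per second of compute, which is what leaves headroom for streaming.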

Qualitative Results

Qualitative comparisons with S2C and PianoMotion underscore PianoFlow’s superior bimanual alignment and handling of difficult keyboard stretches. The model generates both precise and diverse fingering strategies for given passages, maintaining temporal and physical plausibility consistently across extended, complex musical segments.

Figure 3: Qualitative comparison: PianoFlow exhibits superior kinematic accuracy and cross-hand coordination relative to prior methods.

Ablation Studies

Replacing MuQ with HuBERT or removing MIDI distillation both degrade FID and kinematic alignment, underscoring the necessity of structured musical priors.
Removing decoupling or spatial modules impairs temporal stability and spatial accuracy.
Substituting or removing ARGI leads to less effective coordination, revealing the advantages of role-aware, temporally gated interaction over both independent modeling and vanilla cross-attention.

Broader Implications and Future Directions

PianoFlow provides a modular solution extensible to multimodal generative tasks that demand fine-motion control with strong semantic alignment (e.g., robotic music performance, digital human animation, AI-assisted music pedagogy, controllable audio-to-video synthesis). The use of privileged modalities and of explicit, interpretable control modules such as ARGI offers a promising template for future advances in cross-modal distillation, dynamic interaction modeling, and streaming generative systems.

Potential future work includes scaling architectures to even more complex polyphonic and multi-instrument scenarios, integrating physical simulation for manipulation tasks, and leveraging video-conditioned models for expressive full-body generation beyond hand kinematics.

Conclusion

PianoFlow establishes a new state of the art for real-time, audio-driven bimanual piano motion synthesis by combining music-aware cross-modal distillation with role-gated, adaptive inter-hand coordination and efficient streaming via autoregressive flow continuation. It demonstrates gains in both motion fidelity and computational efficiency, with broad potential for impact across animation, virtual performance, and AI-augmented music interaction domains (2604.12856).
