PianoVAM: Multimodal Piano Dataset
- PianoVAM is a multimodal piano performance dataset offering synchronized audio, video, MIDI, hand-pose, and fingering-label data for detailed analysis.
- It employs innovative cross-modal synchronization and semi-automated fingering annotation to ensure high annotation precision and ecological validity.
- The dataset serves as a benchmark for audio-only and audio-visual transcription tasks, advancing research in music information retrieval.
PianoVAM is a multimodal piano performance dataset designed to facilitate advanced research in music information retrieval (MIR), piano transcription, performance gesture analysis, and multimodal learning. The dataset is distinguished by its comprehensive capture of solo piano practice sessions, including synchronized video, audio, symbolic MIDI, annotated hand landmarks, fingering labels, and rich contextual metadata. It was recorded using a Disklavier piano and amateur performers in realistic daily practice environments, emphasizing practical variability and ecological validity. PianoVAM introduces robust methodologies for cross-modal synchronization, fingering annotation, and benchmarking for both audio-only and audio-visual transcription tasks.
1. Composition and Modalities
PianoVAM comprises the following modalities for each solo piano performance:
- Video: Top-view at 1080p, 60 fps, capturing full-body and hand motions throughout real practice sessions.
- Audio: Acoustic output at 44.1 kHz mono, recorded concurrently via dedicated microphone and Disklavier hardware.
- MIDI: Symbolic performance data (51 channels) recorded directly from the Disklavier’s onboard sensors, enabling precise event-level alignment.
- Hand Landmarks: Per-frame hand-pose keypoints extracted using MediaPipe Hands, providing fine-grained 2D coordinates for each joint (see the extraction sketch below).
- Fingering Labels: Semi-automated fingering annotations correlating hand landmarks and MIDI note events, supplemented by manual correction through a GUI in case of ambiguity (approx. 20% of notes).
- Metadata: Performer identity, self-rated skill level, piece composer/style, and session type (“DailyPractice”).
All recordings were acquired in unsupervised, user-controlled sessions, with QR codes for individual identification and session management.
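As a reference for the hand-landmark modality listed above, the following is a minimal sketch of per-frame keypoint extraction with MediaPipe Hands (legacy Solutions API). The file name and detector settings are illustrative assumptions; the dataset's actual extraction configuration may differ.

```python
# Sketch: per-frame 2D hand landmarks from a top-view practice video.
# "practice_session.mp4" and the detector thresholds are placeholders.
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

frame_landmarks = []
cap = cv2.VideoCapture("practice_session.mp4")
with mp_hands.Hands(static_image_mode=False, max_num_hands=2,
                    min_detection_confidence=0.5) as hands:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV decodes frames as BGR.
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        landmarks = []
        if results.multi_hand_landmarks:
            for hand in results.multi_hand_landmarks:
                # 21 keypoints per hand; x/y are normalized to frame width/height.
                landmarks.append([(lm.x, lm.y) for lm in hand.landmark])
        frame_landmarks.append(landmarks)
cap.release()
```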
2. Data Acquisition and Alignment
Data collection incorporated the following workflow:
- Pianists registered as users and controlled session flow using QR codes visible to the overhead camera.
- OBS Studio captured synchronized video and audio; Logic Pro recorded MIDI and parallel audio.
- A control script coordinated the launch and termination of recording software.
- Cross-modal synchronization leveraged a shared audio track recorded on both systems. Alignment was refined by:
- Down-mixing audio to mono.
- Resampling audio to 22.05 kHz.
- Synthesizing MIDI signals via FluidSynth.
- Computing Constant-Q Transform representations on both audio streams.
- Applying Dynamic Time Warping (DTW) within a Sakoe-Chiba band of ±2.5 seconds.
This ensures precise correspondence of events across audio, MIDI, and video.
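A minimal sketch of this alignment step is given below, assuming librosa for the CQT and DTW and pretty_midi/FluidSynth for rendering the MIDI; the file names, SoundFont path, and hop size are placeholders, and the ±2.5 s Sakoe-Chiba band is converted into librosa's fractional band radius.

```python
# Sketch of the cross-modal alignment: CQT features on both audio streams,
# then DTW with a global Sakoe-Chiba constraint.
import librosa
import numpy as np
import pretty_midi

SR = 22050          # both streams are resampled to 22.05 kHz
HOP = 512           # assumed analysis hop size
BAND_SECONDS = 2.5  # Sakoe-Chiba band of +/- 2.5 s

def cqt_features(y, sr=SR, hop=HOP):
    """Log-magnitude CQT, one column per frame."""
    C = np.abs(librosa.cqt(y, sr=sr, hop_length=hop))
    return librosa.amplitude_to_db(C, ref=np.max)

# OBS-side audio, down-mixed to mono and resampled on load.
obs_audio, _ = librosa.load("obs_session.wav", sr=SR, mono=True)

# MIDI synthesized with FluidSynth to obtain a comparable audio stream.
midi = pretty_midi.PrettyMIDI("logic_session.mid")
midi_audio = midi.fluidsynth(fs=SR, sf2_path="disklavier.sf2")

X = cqt_features(obs_audio)
Y = cqt_features(midi_audio)

# librosa expresses the band radius as a fraction of the sequence length,
# so the +/- 2.5 s window is converted to frames first.
band_frames = BAND_SECONDS * SR / HOP
band_rad = band_frames / min(X.shape[1], Y.shape[1])
D, wp = librosa.sequence.dtw(X, Y, metric="cosine",
                             global_constraints=True, band_rad=band_rad)

# wp maps OBS-audio frames to MIDI-synthesis frames (returned end-to-start);
# convert to seconds to align MIDI events with video timestamps.
warp_seconds = np.asarray(wp)[::-1] * HOP / SR
```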
3. Hand Landmark and Fingering Annotation
Hand pose annotation is performed on top-view video streams using pretrained MediaPipe Hands. Fingering labels are generated via a hybrid semi-automatic approach:
For each MIDI note onset, the algorithm identifies candidate fingers by inspecting video frames where a fingertip overlaps the target key region, aggregating over temporal windows.
Fingering scores are computed per candidate: a score of 1 indicates perfect fingertip-key overlap, while partial overlaps receive a correction factor.
Candidates are classified as “normal” (>50% frame overlap) or “strong” (>80%). If one strong candidate exists, automatic assignment is performed; otherwise, manual selection is prompted via a custom GUI.
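The thresholding logic can be summarized in a short sketch; the data structures and helper names below are illustrative rather than the dataset's actual tooling, and only the 50%/80% rules come from the text.

```python
# Sketch of the candidate-finger classification and assignment rule.
from dataclasses import dataclass

@dataclass
class Candidate:
    finger_id: int        # e.g. 1 = thumb ... 5 = pinky (per hand)
    overlap_ratio: float  # fraction of frames in the onset window overlapping the key

def classify_candidates(candidates):
    """Split candidates into 'strong' (>80% overlap) and 'normal' (>50%)."""
    strong = [c for c in candidates if c.overlap_ratio > 0.8]
    normal = [c for c in candidates if 0.5 < c.overlap_ratio <= 0.8]
    return strong, normal

def assign_fingering(candidates):
    """Assign automatically only when exactly one strong candidate exists;
    otherwise defer to manual selection in the annotation GUI."""
    strong, _normal = classify_candidates(candidates)
    if len(strong) == 1:
        return strong[0].finger_id, "automatic"
    return None, "manual_review"
```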
Disambiguation is supported by a 3D hand “floating” check. Using pixel-to-space projection and scalar multipliers for wrist, index, and ring metacarpals:
$$
\begin{aligned}
\left| (x_I t,\, y_I t,\, t) - (x_W u,\, y_W u,\, u) \right| &\approx |I_0 W_0|, \\
\left| (x_W u,\, y_W u,\, u) - (x_R v,\, y_R v,\, v) \right| &\approx |W_0 R_0|, \\
\left| (x_R v,\, y_R v,\, v) - (x_I t,\, y_I t,\, t) \right| &\approx |R_0 I_0|.
\end{aligned}
$$
With the physical distances $|I_0 W_0|$, $|W_0 R_0|$, and $|R_0 I_0|$ known, Powell’s dog-leg algorithm estimates the scalars $(t, u, v)$; the mean of the estimated scalars flags a hand as “floating” when it crosses a preset threshold.
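A hedged sketch of this check follows, using SciPy's "dogbox" least-squares solver in place of a hand-rolled Powell dog-leg implementation; variable names are illustrative, and the threshold comparison (whose exact value is not given above) is left to the caller.

```python
# Sketch of the "floating hand" check: recover scalars (t, u, v) that lift the
# projected index (I), wrist (W), and ring (R) metacarpal landmarks into 3D so
# that pairwise distances match the known lengths |I0 W0|, |W0 R0|, |R0 I0|.
import numpy as np
from scipy.optimize import least_squares

def _residuals(scales, pI, pW, pR, d_IW, d_WR, d_RI):
    t, u, v = scales
    I = np.array([pI[0] * t, pI[1] * t, t])
    W = np.array([pW[0] * u, pW[1] * u, u])
    R = np.array([pR[0] * v, pR[1] * v, v])
    return [np.linalg.norm(I - W) - d_IW,
            np.linalg.norm(W - R) - d_WR,
            np.linalg.norm(R - I) - d_RI]

def mean_hand_scale(pI, pW, pR, d_IW, d_WR, d_RI):
    """Return the mean of (t, u, v); the pipeline compares this against a
    threshold (value elided in the text) to flag floating hands."""
    sol = least_squares(_residuals, x0=[1.0, 1.0, 1.0], method="dogbox",
                        args=(pI, pW, pR, d_IW, d_WR, d_RI))
    return float(np.mean(sol.x))
```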
This methodology yields fingering annotation precision above 95%, with error predominantly in cases of adjacent finger ambiguity.
4. Signal Normalization and Visual Disambiguation
- Loudness Normalization: Acoustic variability across practice sessions is corrected using `pyloudnorm` (a minimal sketch follows this list):
  - MIDI is synthesized with a Disklavier SoundFont.
  - Integrated loudness (LUFS) is measured.
  - A gain offset is applied to reach a target global average of −23 LUFS.
- Visual Ambiguities: Motion blur, shadows, and hand overlap are mitigated by combining z-depth filtering with manual GUI review when the landmark-to-key correspondence is unclear.
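The loudness-normalization step maps directly onto the pyloudnorm API; in the sketch below, the input path is a placeholder for the FluidSynth-rendered audio.

```python
# Sketch: measure integrated loudness and apply the gain offset needed to
# reach the -23 LUFS target. File names are placeholders.
import soundfile as sf
import pyloudnorm as pyln

TARGET_LUFS = -23.0

data, rate = sf.read("synthesized_session.wav")   # MIDI rendered via FluidSynth

meter = pyln.Meter(rate)                          # ITU-R BS.1770 loudness meter
loudness = meter.integrated_loudness(data)        # measured integrated loudness (LUFS)

normalized = pyln.normalize.loudness(data, loudness, TARGET_LUFS)
sf.write("normalized_session.wav", normalized, rate)
```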
5. Benchmarking and Performance Analysis
Benchmark experiments are conducted for both audio-only and audio-visual piano transcription:
- Models: The “Onsets and Frames” architecture was trained on PianoVAM, MAESTROv3, and the union of both.
- Performance: Training with PianoVAM yields superior Note and Velocity F1 scores relative to MAESTROv3; the improvement is statistically validated via Friedman and post-hoc Wilcoxon tests for selected metrics.
- Audio-Visual Pipeline:
- Predicted onsets trigger retrieval of corresponding video frames.
- Perspective transformation standardizes keyboard view using known corner coordinates.
- Candidate pitch regions are extracted from fingertip x-coordinates (±2 white keys around the predicted location); a sketch of this step follows the list.
- Under adverse SNR or reverberation, visual filtering consistently improves precision and overall F1.
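The sketch below illustrates the keyboard rectification and the ±2-white-key filtering with OpenCV; the corner coordinates, canonical size, and key geometry are placeholder assumptions rather than values from the dataset.

```python
# Sketch: warp the keyboard to a canonical rectangle using known corner
# coordinates, then keep only white keys within +/- 2 of a fingertip's x position.
import cv2
import numpy as np

N_WHITE_KEYS = 52            # white keys on a standard 88-key piano
CANON_W, CANON_H = 1040, 120 # canonical keyboard size in pixels (placeholder)

# Keyboard corners in the original top-view frame (placeholder pixel values),
# ordered top-left, top-right, bottom-right, bottom-left.
src = np.float32([[210, 640], [1710, 640], [1725, 780], [195, 780]])
dst = np.float32([[0, 0], [CANON_W, 0], [CANON_W, CANON_H], [0, CANON_H]])
H = cv2.getPerspectiveTransform(src, dst)

def warp_frame(frame):
    """Rectify the keyboard region to the canonical view."""
    return cv2.warpPerspective(frame, H, (CANON_W, CANON_H))

def candidate_white_keys(fingertip_xy):
    """Map a fingertip (original frame coordinates) to a +/- 2 white-key window."""
    pt = cv2.perspectiveTransform(np.float32([[fingertip_xy]]), H)[0, 0]
    key_idx = int(pt[0] / (CANON_W / N_WHITE_KEYS))
    return range(max(0, key_idx - 2), min(N_WHITE_KEYS, key_idx + 3))
```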
A plausible implication is that the inclusion of hand pose and fingering annotation provides a robust prior for transcription models, particularly in noisy environments.
6. Applications and Research Directions
PianoVAM’s multimodal richness enables several advanced MIR tasks:
- Pedagogical Analysis: Enables systematic study of fingering strategies across skill levels and piece genres.
- Gesture-Based Musical Analysis: Facilitates research into expressiveness, body language, and gesture-performance relationships in practical settings.
- Audio-Visual Source Separation: Provides substrate for models exploiting hand movement cues to disambiguate acoustic streams.
- Comparative Performance Studies: Permits analysis of differences in practice versus concert conditions using matched multimodal signals.
This suggests that PianoVAM may be used not only to advance transcription accuracy but also to model performance nuances and support cross-modal learning paradigms.
7. Context and Significance in Multimodal MIR Research
PianoVAM fills a notable gap in existing datasets by providing highly synchronized, richly annotated multimodal piano performance data acquired under authentic practice scenarios. Unlike datasets that focus solely on concert or studio performances, PianoVAM prioritizes ecological validity and practical variability. Its processing and annotation pipelines directly address well-known challenges in cross-modality alignment, fingering annotation, and transcription benchmarking.
PianoVAM is suited for methodological research in multimodal music analysis, data-driven piano pedagogy, and MIR benchmarking. It complements symbolic-only datasets (such as Aria-MIDI (Bradshaw et al., 21 Apr 2025)) and multi-annotator resources (such as PIAST (Bang et al., 4 Nov 2024)) by introducing video, pose, and gesture data in highly controlled, synchronized form. The dataset’s open structure provides substantial opportunities for new algorithm development in multimodal neural learning, audio-visual transcription, and cross-domain retrieval.