URMP Dataset: Multi-Modal Music Performance Corpus

Updated 29 June 2026

The URMP dataset is a multi-modal music performance collection providing synchronized audio, video, scores, and ground-truth annotations.
It employs a sophisticated recording protocol that aligns individual tracks at millisecond precision while capturing expressive performance nuances.
The dataset supports diverse MIR tasks such as multi-pitch estimation, score-informed source separation, and visually informed analysis with validated metrics.

The University of Rochester Multi-Modal Music Performance (URMP) Dataset is a curated corpus designed to enable rigorous analysis of music performance using both audio and visual modalities. It consists of 44 chamber music pieces of varying ensemble size, assembled from individually recorded tracks with ensemble-level synchronization and extensive ground-truth annotations. The URMP dataset aims to provide a benchmark for both established music information retrieval (MIR) tasks and emergent multi-modal research at the audio-visual interface, with its design emphasizing millisecond-level alignment, high-fidelity isolated tracks, and performance expressiveness (Li et al., 2016).

1. Composition and Structure

The URMP collection spans 44 classical chamber works, partitioned as 11 duets, 12 trios, 14 quartets, and 7 quintets. Instrumentation covers strings (violin, viola, cello, double bass) and extends to woodwinds and brass in mixed quintets (e.g., flute, oboe, clarinet, horn, trumpet). For each piece, the dataset includes:

MIDI and PDF scores for machine-readable and human reference.
Individual “dry” audio tracks per instrument (WAV, 48 kHz/24-bit) and high-quality ensemble mixes.
Full ensemble video (MP4, H.264, 1920×1080 at 29.97 fps) composited from isolated recordings.
Frame-level and note-level ground-truth annotation files.

Each element is temporally synchronized to a global timeline.

Content	Format / Specification	Provided Per Piece
Isolated instrument audio	WAV, 48kHz/24-bit	Yes
Ensemble mixed audio	WAV, 48kHz/24-bit	Yes
Assembled ensemble video	MP4, H.264, 1920×1080 @ 29.97 fps	Yes
MIDI + engraved score	.mid, .pdf	Yes
Ground-truth F0, notes	ASCII	Yes

2. Annotations and Ground Truth

Two primary ground-truth annotation levels are provided and meticulously verified:

Frame-level pitch trajectories: Extracted at 10 ms intervals (using pYIN/Tony), supplied in Hertz with unquantized continuous trajectories. Each entry encodes $(t_i, f_{0,i})$ per frame (or “NaN” for unvoiced).
Note-level transcriptions: Each line specifies onset (s), offset (s), and pitch (Hz), again without MIDI quantization.

Both levels undergo manual correction for errors in detected onset/offsets and note events, referencing both spectrograms and musical scores. This renders URMP suitable for precise algorithmic evaluation requiring continuous pitch targets.
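As a minimal sketch of how the two annotation levels described above might be loaded, the following assumes whitespace-separated text files: one `time f0` pair per line for frame-level data (with `NaN` marking unvoiced frames) and one `onset offset pitch` triple per line for note-level data. The exact column layout, the function names, and the `Note` type are assumptions for illustration; the files shipped with the dataset should be checked before relying on this.

```python
import math
from typing import List, NamedTuple, Optional, Tuple


class Note(NamedTuple):
    onset: float     # seconds
    offset: float    # seconds
    pitch_hz: float  # unquantized pitch in Hz


def parse_f0(text: str) -> List[Tuple[float, Optional[float]]]:
    """Parse 'time f0' lines; unvoiced frames (NaN) become None."""
    frames = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        t, f0 = float(fields[0]), float(fields[1])
        frames.append((t, None if math.isnan(f0) else f0))
    return frames


def parse_notes(text: str) -> List[Note]:
    """Parse 'onset offset pitch' lines (assumed column order)."""
    notes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        notes.append(Note(float(fields[0]), float(fields[1]), float(fields[2])))
    return notes


def hz_to_midi(f_hz: float) -> float:
    """Convert Hz to a fractional MIDI number (A4 = 440 Hz = 69)."""
    return 69.0 + 12.0 * math.log2(f_hz / 440.0)


# Tiny in-memory example in the assumed layout.
f0_text = "0.00\tNaN\n0.01\t440.0\n0.02\t441.2\n"
notes_text = "0.01\t0.50\t440.0\n"
frames = parse_f0(f0_text)
notes = parse_notes(notes_text)
```

Keeping pitch in Hz and converting to fractional MIDI only when needed preserves the continuous, unquantized targets that the annotations provide.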

3. Synchronization and Recording Methodology

A critical challenge addressed in URMP is achieving both expressive musical timing (including tempo fluctuations and dynamics) and tight cross-performer alignment. The adopted “A7” protocol comprises:

Studio recording of a conductor + pianist video, wherein tempo rubato and nuanced expressive elements are encoded by the conductor’s gestures and pianist’s realization.
Individual instrumentalists then record their parts in isolation, guided visually and audibly by the conductor-piano video played back over earphones.

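For synchronization, the raw video and audio recordings are first aligned automatically via cross-correlation, choosing the lag that maximizes the correlation between two signals $x$ and $y$: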

$\tau^* = \arg\max_{\tau}\, \sum_{n} x[n]\,y[n+\tau]$

Subsequently, fine-grained manual adjustments based on transient onsets bring the performances into millisecond-level synchrony. The final presentation composites the green-screened performer videos and mixes the audio tracks.
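The automatic cross-correlation step can be illustrated with a brute-force lag search. This is only a sketch under simple assumptions (short lists of samples, a bounded search range, the hypothetical function name `best_lag`); it is not the dataset authors' tooling, and a practical implementation would use an FFT-based correlation on full-length recordings.

```python
from typing import Sequence


def best_lag(x: Sequence[float], y: Sequence[float], max_lag: int) -> int:
    """Return tau in [-max_lag, max_lag] maximizing sum_n x[n] * y[n + tau]."""
    best_tau, best_score = 0, float("-inf")
    for tau in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, xn in enumerate(x):
            m = n + tau
            if 0 <= m < len(y):  # ignore samples that fall outside y
                score += xn * y[m]
        if score > best_score:
            best_tau, best_score = tau, score
    return best_tau


# Toy signals: y is x delayed by three samples.
x = [0, 0, 1, 3, 1, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 1, 3, 1, 0, 0]
lag = best_lag(x, y, max_lag=5)
```

Dividing the resulting lag by the sample rate gives the offset in seconds; at 48 kHz one sample corresponds to roughly 0.02 ms, which is why the remaining manual adjustment concerns musical onsets rather than sample resolution.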

4. Quality Evaluation

Quantitative Assessment

Onset deviation for $P\geq2$ nominally simultaneous notes is measured as the largest pairwise onset difference:

$\delta_{\max} = \max_{i,j\in\{1\dots P\}}|t_i^{\rm onset} - t_j^{\rm onset}|$

URMP achieves a median $\delta_{\max}\approx 14$ ms (25th–75th percentile: ~20–60 ms), outperforming datasets such as Bach10 (≥60 ms) and approaching chamber rehearsal realism (WWQ ≤10 ms).

Mean absolute timing error for $N$ events:

$\Delta t = \frac{1}{N}\sum_{k=1}^{N}|t_k^{\rm ref} - t_k^{\rm rec}|$
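Both timing measures above reduce to a few lines of arithmetic. The sketch below uses hypothetical function names and takes onset times in seconds; note that the maximum pairwise difference equals the spread between the latest and earliest onset.

```python
from typing import Sequence


def max_onset_deviation(onsets: Sequence[float]) -> float:
    """delta_max: largest pairwise onset difference among P >= 2 notes."""
    if len(onsets) < 2:
        raise ValueError("need at least two simultaneous notes")
    # max over all pairs |t_i - t_j| is simply max(t) - min(t)
    return max(onsets) - min(onsets)


def mean_abs_timing_error(ref: Sequence[float], rec: Sequence[float]) -> float:
    """Delta t: mean absolute difference between reference and recorded times."""
    if len(ref) != len(rec) or not ref:
        raise ValueError("ref and rec must be non-empty and equally long")
    return sum(abs(a - b) for a, b in zip(ref, rec)) / len(ref)


# Three players hitting the same score onset (seconds).
chord_deviation = max_onset_deviation([1.000, 1.012, 0.995])
# Two events compared against their reference times.
timing_error = mean_abs_timing_error([0.0, 1.0], [0.01, 0.98])
```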

Subjective Assessment

8 non-expert listeners ranked URMP’s mixture quality against alternatives in 32 triplet comparisons: URMP was ranked first in 9, second in 17, and third in 6 cases.

5. MIR Benchmark Tasks

Multi-Pitch Analysis (Audio Only)

Benchmarks using the methods of Duan et al. (2010, 2014) for multi-pitch estimation (MPE) and multi-pitch streaming (MPS) report frame-wise accuracy:

$\mathrm{Accuracy} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$

Results for quartets: URMP MPE median ≈ 60%, MPS ≈ 55% (vs. Bach10: MPE ≈ 80%, MPS ≈ 75%).
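The frame-wise accuracy above can be computed by matching estimated pitches to reference pitches in each frame and pooling the counts. The sketch below assumes a 50-cent (quarter-tone) matching tolerance and a greedy one-to-one matching; both are illustrative choices, not necessarily the exact evaluation protocol used in the benchmark.

```python
import math
from typing import List, Sequence, Tuple


def _cents(f_a: float, f_b: float) -> float:
    """Absolute pitch distance in cents."""
    return abs(1200.0 * math.log2(f_a / f_b))


def frame_counts(ref: Sequence[float], est: Sequence[float],
                 tol_cents: float = 50.0) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) for one frame using greedy one-to-one matching."""
    unmatched = list(ref)
    tp = 0
    for f in est:
        # closest reference pitch that has not been matched yet
        best = min(unmatched, key=lambda r: _cents(f, r), default=None)
        if best is not None and _cents(f, best) <= tol_cents:
            unmatched.remove(best)
            tp += 1
    return tp, len(est) - tp, len(unmatched)


def accuracy(ref_frames: List[Sequence[float]],
             est_frames: List[Sequence[float]],
             tol_cents: float = 50.0) -> float:
    """Accuracy = TP / (TP + FP + FN), pooled over all frames."""
    tp = fp = fn = 0
    for ref, est in zip(ref_frames, est_frames):
        a, b, c = frame_counts(ref, est, tol_cents)
        tp, fp, fn = tp + a, fp + b, fn + c
    total = tp + fp + fn
    # Convention chosen here: no pitches anywhere counts as perfect agreement.
    return tp / total if total else 1.0


ref_frames = [[440.0, 660.0], [440.0], []]
est_frames = [[442.0, 700.0], [440.0, 880.0], []]
acc = accuracy(ref_frames, est_frames)
```

In this toy case there are two correct pitches, two spurious ones, and one miss, so the accuracy is 2/5.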

Score-Informed Source Separation

Score alignment employs DTW on chroma features:

$C(i,j) = d(\chi_i, \chi_j') + \min\{C(i-1,j), C(i,j-1), C(i-1,j-1)\}$
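To make the recurrence concrete, here is a small dynamic-programming sketch that fills the accumulated cost matrix and backtracks the alignment path. Cosine distance is assumed for $d(\cdot,\cdot)$, and the two-dimensional "chroma" vectors are toy stand-ins for real 12-bin chroma features; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    if na == 0.0 or nb == 0.0:
        return 1.0  # treat silent frames as maximally dissimilar
    return 1.0 - dot / (na * nb)


def dtw(score_feats: List[Sequence[float]],
        audio_feats: List[Sequence[float]]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return total alignment cost and the (score_idx, audio_idx) path."""
    n, m = len(score_feats), len(audio_feats)
    inf = float("inf")
    # C has a padding row/column so that C[0][0] = 0 starts the recursion.
    C = [[inf] * (m + 1) for _ in range(n + 1)]
    C[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(score_feats[i - 1], audio_feats[j - 1])
            C[i][j] = d + min(C[i - 1][j], C[i][j - 1], C[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((C[i - 1][j - 1], i - 1, j - 1),
                      (C[i - 1][j], i - 1, j),
                      (C[i][j - 1], i, j - 1))
    path.reverse()
    return C[n][m], path


# Score has two frames; the audio holds the first one twice as long.
score = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
cost, path = dtw(score, audio)
```

The returned path maps score frame 0 to audio frames 0 and 1, and score frame 1 to audio frame 2, which is the kind of time mapping a score-informed separator needs.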

Harmonic masking is applied for separation. Quality is measured via Signal-to-Distortion Ratio (SDR) and relative improvement:

$\mathrm{SDR} = 10\log_{10} \frac{\|s_{\rm target}\|^2}{\|e_{\rm total}\|^2} \quad,\quad \Delta\mathrm{SDR} = \mathrm{SDR}_{\rm sep} - \mathrm{SDR}_{\rm mix}$

Quartet median $\Delta\mathrm{SDR}$ is ≈ 4 dB (comparable to Bach10 at ≈ 4.5 dB).
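The SDR and $\Delta\mathrm{SDR}$ definitions can be illustrated with a simplified calculation in which the total error is taken as the plain difference between the estimate and the target. This is an assumption made for brevity: standard BSS Eval style toolkits first decompose the estimate into target, interference, and artifact components, so their numbers will generally differ from this sketch.

```python
import math
from typing import Sequence


def sdr_db(target: Sequence[float], estimate: Sequence[float]) -> float:
    """Simplified SDR in dB, with e_total = estimate - target."""
    signal = sum(s * s for s in target)
    error = sum((e - s) ** 2 for s, e in zip(target, estimate))
    if signal == 0.0:
        raise ValueError("target has zero energy")
    if error == 0.0:
        return float("inf")  # perfect reconstruction
    return 10.0 * math.log10(signal / error)


def delta_sdr(target: Sequence[float], separated: Sequence[float],
              mixture: Sequence[float]) -> float:
    """Improvement of the separated estimate over using the raw mixture."""
    return sdr_db(target, separated) - sdr_db(target, mixture)


# Toy example: the mixture adds a constant interferer of 0.5,
# and the (hypothetical) separator reduces it to 0.05.
target = [1.0, -1.0, 1.0, -1.0]
mixture = [s + 0.5 for s in target]
separated = [s + 0.05 for s in target]
improvement = delta_sdr(target, separated, mixture)
```

Reducing the residual amplitude by a factor of ten lowers its energy by a factor of one hundred, so the improvement in this toy case is 20 dB.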

URMP’s synchronized multi-camera video and high-resolution audio enable tasks not addressable with audio datasets alone.

Visually Informed Multi-Pitch Analysis

By extracting bowing motion magnitude from optical flow (Sun et al., 2010), a “play/non-play” (P/NP) status is inferred for each frame, using thresholding or SVM classification, and this status constrains audio-based pitch tracking. Reported accuracy improvements over audio-only analysis are +5–12% for MPE and +2–8% for MPS.
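One simple way to picture the thresholding variant is sketched below: each player's per-frame motion magnitude is thresholded into play/non-play flags, and the number of active players caps how many pitch candidates are kept in that frame. The threshold value, the salience-ranked candidate format, and the capping rule are all assumptions for illustration; the source does not specify how the constraint is applied.

```python
from typing import List, Sequence, Tuple


def play_flags(motion: Sequence[float], threshold: float) -> List[bool]:
    """Per-frame play/non-play decision from motion magnitude."""
    return [m > threshold for m in motion]


def constrain_pitches(candidates: List[List[Tuple[float, float]]],
                      flags_per_player: List[List[bool]]) -> List[List[float]]:
    """Keep at most as many pitches per frame as there are active players.

    candidates[t] is a list of (f0_hz, salience) pairs for frame t.
    """
    constrained = []
    for t, cands in enumerate(candidates):
        active = sum(flags[t] for flags in flags_per_player)
        ranked = sorted(cands, key=lambda c: c[1], reverse=True)
        constrained.append([f0 for f0, _ in ranked[:active]])
    return constrained


# Two players, three frames; motion values are made up.
flags_a = play_flags([0.1, 0.9, 0.8], threshold=0.5)
flags_b = play_flags([0.05, 0.1, 0.7], threshold=0.5)
candidates = [
    [(440.0, 0.2)],
    [(440.0, 0.9), (660.0, 0.3)],
    [(440.0, 0.8), (660.0, 0.7)],
]
pitches = constrain_pitches(candidates, [flags_a, flags_b])
```

In the first frame nobody is playing, so the spurious candidate is dropped; in the second only one player is active, so only the most salient pitch survives.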

Polyphonic Vibrato Analysis

Two subtasks are demonstrated:

(A) Vibrato note detection: Ground-truth labels are assigned by auto-correlation and thresholding of reference pitch contours. Detection evaluated via precision, recall, and F1 score:

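$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \quad,\quad \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \quad,\quad F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$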

A video-based SVM on left-hand motion holds up at polyphony up to 4, whereas audio-only detection degrades rapidly as polyphony increases.
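The auto-correlation labeling described in (A) can be sketched on a single note's pitch contour as follows. The contour is converted to cents around its mean, a normalized auto-correlation is searched for a peak at lags corresponding to plausible vibrato rates, and the note is labeled as vibrato if the peak is strong enough. The 4–8 Hz search range, the correlation threshold, and the minimum extent are assumptions chosen for illustration, and the function name is hypothetical; the 100 Hz frame rate matches the 10 ms annotation hop.

```python
import math
from typing import Sequence, Tuple


def detect_vibrato(f0_hz: Sequence[float], frame_rate: float = 100.0,
                   rate_range: Tuple[float, float] = (4.0, 8.0),
                   min_corr: float = 0.5,
                   min_extent_cents: float = 5.0) -> Tuple[bool, float, float]:
    """Return (is_vibrato, rate_hz, extent_cents) for one note's contour."""
    cents = [1200.0 * math.log2(f) for f in f0_hz]
    mean = sum(cents) / len(cents)
    dev = [c - mean for c in cents]
    extent = (max(dev) - min(dev)) / 2.0  # half the peak-to-peak swing
    if extent < min_extent_cents:
        return False, 0.0, extent  # essentially flat pitch
    energy = sum(d * d for d in dev)
    lag_min = max(1, math.ceil(frame_rate / rate_range[1]))
    lag_max = min(len(dev) - 1, math.floor(frame_rate / rate_range[0]))
    best_lag, best_corr = 0, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        corr = sum(dev[n] * dev[n + lag] for n in range(len(dev) - lag)) / energy
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    if best_lag == 0 or best_corr < min_corr:
        return False, 0.0, extent
    return True, frame_rate / best_lag, extent


# Synthetic one-second note: 6 Hz vibrato, +/-30 cents around 440 Hz.
vib = [440.0 * 2.0 ** (30.0 * math.sin(2.0 * math.pi * 6.0 * n / 100.0) / 1200.0)
       for n in range(100)]
flat = [440.0] * 100
is_vib, rate_hz, extent_cents = detect_vibrato(vib)
```

Because the lag is an integer number of 10 ms frames, the estimated rate is quantized; interpolating around the auto-correlation peak would refine it.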

(B) Vibrato parameter estimation: Left-hand principal motion is extracted from video (PCA on optical flow), amplitude normalized to local pitch, and vibrato rate/extent computed. Absolute errors are:

$\Delta_{\rm rate} = |r^{\rm ref} - r^{\rm est}| \quad,\quad \Delta_{\rm extent} = |e^{\rm ref} - e^{\rm est}|$

Mean errors are reported in Hz for vibrato rate and in cents for vibrato extent; 90% of errors fall within 1 Hz and 10 cents, respectively.

7. Future Research Directions

URMP enables research in several emergent areas at the intersection of audio, vision, and music performance:

Visually informed source separation, using motion-guided masks.
Audio–visual source association, mapping detected audio events to visual performers.
Cross-modal generation (spectrogram-to-video and vice versa) via deep generative models (e.g., GANs, seq2seq).
Performer technique recognition (bow direction, embouchure, articulation).
Visual MIR for wind and brass instruments via subtle motion cues.

This suggests ongoing expansion of performance analysis from isolated auditory or visual perspectives toward holistic multi-modal frameworks. URMP’s granular annotation and high-fidelity multi-track design make it an archetype for both robust MIR evaluation and cross-modal algorithm development (Li et al., 2016).

References (1)

Li et al. (2016). Creating A Multi-track Classical Musical Performance Dataset for Multimodal Music Analysis: Challenges, Insights, and Applications.
