MoisesDB Multitrack Music Dataset
- MoisesDB is a comprehensive multitrack dataset with hierarchically organized raw audio tracks, supporting granular music source separation.
- The dataset employs a two-level taxonomy to group audio tracks into meaningful stems, facilitating both aggregate and fine-grained source separation experiments.
- MoisesDB offers a Python API and benchmark evaluations using oracle methods and neural models, enabling detailed performance analysis via SDR and related metrics.
MoisesDB is a large-scale, publicly available multitrack dataset developed to advance music source separation research beyond the traditional four-stem paradigm. Comprising 240 stereo songs from 47 distinct artists and spanning 12 high-level genres, MoisesDB addresses the data scarcity that historically limited separation systems to vocals, drums, bass, and "other" stems. Each song includes individually recorded raw audio tracks organized into a two-level hierarchical stem taxonomy, facilitating fine-grained, configurable separation strategies and enabling research at increased stem granularities (Pereira et al., 2023).
1. Dataset Composition
MoisesDB contains 240 stereo songs with a total duration of 14 hours, 24 minutes, and 46 seconds. The dataset encompasses 47 distinct artists across 12 genres including, but not limited to, Rock, Pop, Jazz, Electronic, and Folk. The genre distribution exhibits a power-law characteristic, with a small number of genres accounting for the majority of tracks.
Each song comprises "raw" audio tracks (e.g., snare drum, acoustic guitar, cello), which are semantically grouped into stems per the designated taxonomy. The number of stems per track ranges from three to ten, reflecting the diversity of instrumentation and production styles. "Vocals," "drums," and "bass" stems are present in nearly all songs. In contrast, stems such as "wind" and "other plucked" are comparatively rare. This natural imbalance reproduces real-world catalog characteristics, providing a realistic and challenging environment for source separation systems.
2. Hierarchical Stem Taxonomy
MoisesDB implements a two-level hierarchical taxonomy to group its raw audio tracks into musically and operationally meaningful stems. There are 11 top-level stems, each further subdivided into specific sub-stems. This structure mirrors the workflow of practical mixing sessions and enables both granular and aggregate source separation experiments.
| Top-Level Stem | Selected Sub-Stems |
|---|---|
| Bass | Bass Guitar, Bass Synthesizer, Contrabass |
| Bowed Strings | Cello, Cello Section, String Section, Viola Solo |
| Drums | Cymbals, Drum Machine, Kick Drum, Snare Drum, Toms |
| Guitar | Acoustic Guitar, Clean Electric, Distorted Electric |
| Other | Fx |
| Other Keys | Organ, Electric Organ, Synth Lead, Synth Pad |
| Other Plucked | Banjo, Mandolin, Ukulele, Harp |
| Percussion | Pitched Percussion, A-Tonal Percussion |
| Piano | Electric Piano, Grand Piano |
| Vocals | Lead Female Singer, Lead Male Singer, Background |
| Wind | Brass, Flutes, Reeds, Other Wind |
The taxonomy's granularity supports separation models capable of distinguishing, for example, between "guitar" and "other plucked" sources, and offers a realistic testbed for error analysis across related classes.
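To illustrate, raw tracks or top-level stems can be regrouped into user-defined targets with a simple mapping; the grouping below is a hypothetical example for sketching the idea, not part of the MoisesDB package:

```python
import numpy as np

# Hypothetical user-defined grouping of top-level stems into coarser
# targets; group names and membership are illustrative only.
STEM_GROUPS = {
    "rhythm_section": ["drums", "percussion", "bass"],
    "harmony": ["guitar", "piano", "other_keys", "other_plucked"],
    "vocals": ["vocals"],
}

def build_custom_stems(stems, groups=STEM_GROUPS):
    """Sum (2, samples) stem arrays into user-defined groups,
    skipping stems that are absent from a given song."""
    out = {}
    for target, members in groups.items():
        present = [stems[m] for m in members if m in stems]
        if not present:
            continue
        # Trim to a common length before summing
        min_len = min(a.shape[-1] for a in present)
        out[target] = np.sum([a[..., :min_len] for a in present], axis=0)
    return out
```

Because some stems are absent from many songs, any such grouping should tolerate missing keys, as the sketch does.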
3. Data Access and Usage
MoisesDB is distributed with an accompanying Python package available from the Python Package Index (PyPI), which handles metadata parsing, stem construction, mixing, and I/O. Installation is performed via:
```bash
pip install moisesdb
```
The API enables downloading, inspecting, and processing tracks:
```python
from moisesdb.dataset import MoisesDB

db = MoisesDB(data_path='./moises-db-data', download=True)
print(f"Total tracks available: {len(db)}")

track0 = db[0]
mix = track0.audio    # numpy array: (2, samples)
stems = track0.stems  # e.g. {'vocals': array, 'drums': array, ...}
track0.save_stems('my_output/track_000')
```
Preprocessing for machine learning pipelines typically involves stacking stem sources into tensors or computing short-time Fourier transforms (STFTs) on the fly. The dataset structure supports rapid prototyping for a variety of separation tasks.
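A minimal preprocessing sketch along these lines, assuming a stems dict as returned by `track.stems` and using SciPy (not part of the MoisesDB package) for the STFT:

```python
import numpy as np
from scipy.signal import stft

def stems_to_stft_tensor(stems, n_fft=2048, hop=512):
    """Stack a stems dict into a single complex STFT tensor.

    stems maps name -> (2, samples) arrays, as returned by
    track.stems; output shape is (n_stems, 2, n_fft // 2 + 1, frames).
    """
    # Align all stems to a common length before stacking
    min_len = min(a.shape[-1] for a in stems.values())
    stacked = np.stack([a[..., :min_len] for a in stems.values()])
    _, _, spec = stft(stacked, nperseg=n_fft, noverlap=n_fft - hop, axis=-1)
    return spec
```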
4. Baseline Performance and Evaluation Metrics
MoisesDB includes benchmark performance results using three oracle methods—Ideal Binary Mask (IBM), Ideal Ratio Mask (IRM), and Multichannel Wiener Filter (MWF)—as well as two open-source neural architectures: HT-Demucs and Spleeter. The primary evaluation metric is Source-to-Distortion Ratio (SDR):
$$\mathrm{SDR}(s, \hat{s}) = 10 \log_{10} \frac{\sum_{n} s(n)^2 + \epsilon}{\sum_{n} \bigl(s(n) - \hat{s}(n)\bigr)^2 + \epsilon}$$

where $s$ is the reference signal, $\hat{s}$ the estimate, and $\epsilon$ a small constant; the related Source-to-Interference Ratio (SIR) and Source-to-Artifact Ratio (SAR) metrics are also cited for comprehensive evaluation.
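As a minimal NumPy sketch of this formula (independent of evaluation toolkits such as museval):

```python
import numpy as np

def sdr(reference, estimate, eps=1e-8):
    """Source-to-Distortion Ratio in dB, per the formula above.

    reference and estimate are arrays of identical shape,
    e.g. (2, samples); eps guards against division by zero on silence.
    """
    num = np.sum(reference ** 2) + eps
    den = np.sum((reference - estimate) ** 2) + eps
    return 10.0 * np.log10(num / den)
```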
Representative SDR values (mean ± standard deviation, with median in parentheses) for the primary stem configurations; N is the number of songs evaluated:

| Stems | Model/Oracle | SDR (dB) | N |
|---|---|---|---|
| 4 (vocals, drums, bass, other) | HT-Demucs | 9.91 ± 3.27 (9.69) | 235 |
| | Spleeter | 6.29 ± 2.47 (6.24) | |
| | IBM | 7.14 ± 2.28 (6.99) | |
| | IRM | 8.97 ± 2.16 (8.81) | |
| | MWF | 9.08 ± 2.15 (8.87) | |
| 5 (+ piano) | Spleeter | 4.66 ± 3.20 (5.02) | 104 |
| | IBM | 5.12 ± 2.81 (4.87) | |
| | IRM | 7.65 ± 2.66 (7.60) | |
| | MWF | 7.81 ± 2.66 (7.83) | |
| 6 (+ guitar) | HT-Demucs | 6.24 ± 5.17 (6.05) | 88 |
| | IBM | 5.12 ± 2.81 (4.87) | |
| | IRM | 6.91 ± 2.70 (6.69) | |
| | MWF | 7.06 ± 2.73 (6.89) | |
Notably, HT-Demucs outperforms some oracle-based upper bounds (e.g., IBM) on bass and drums in both the 4- and 6-stem configurations, underscoring how far neural separation models have advanced.
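For context, the oracle baselines compute masks from the ground-truth stems. A minimal sketch of a magnitude-domain Ideal Ratio Mask follows; the exact mask variant used in the benchmarks (e.g., power-based) may differ:

```python
import numpy as np

def ideal_ratio_mask_estimate(source_specs, target, eps=1e-8):
    """Oracle IRM estimate of one source in the STFT domain.

    source_specs: dict of complex STFTs (identical shapes) for all
    ground-truth sources; target: key of the source to isolate.
    """
    mags = {name: np.abs(spec) for name, spec in source_specs.items()}
    mask = mags[target] / (sum(mags.values()) + eps)
    mixture = sum(source_specs.values())  # linear mix in the STFT domain
    return mask * mixture
```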
5. Dataset Analysis and Characteristics
MoisesDB mirrors real-world catalog imbalances: a subset of genres (e.g., pop/rock) and artists are overrepresented. Almost all tracks include "vocals," "drums," and "bass," with underrepresented classes such as "wind" and "other plucked" providing opportunities to evaluate rare-class separation.
Track durations average 3 minutes 36 seconds (standard deviation 66 seconds). The dataset consists of unmastered mixes, which exhibit lower loudness (approximately –15 LUFS) and higher dynamic range than typical commercial releases. Models trained on MoisesDB may therefore need adaptation strategies for deployment on mastered audio.
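For example, integrated loudness can be measured with the third-party pyloudnorm package (an illustrative choice, not part of the MoisesDB tooling); note that it expects audio shaped (samples, channels), while the API returns (channels, samples):

```python
import pyloudnorm as pyln

# mix: (2, samples) array from track.audio, assumed to be 44.1 kHz
meter = pyln.Meter(44100)                    # ITU-R BS.1770 loudness meter
loudness = meter.integrated_loudness(mix.T)  # transpose to (samples, channels)
print(f"Integrated loudness: {loudness:.1f} LUFS")
```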
6. Research and Application Scenarios
MoisesDB enables diverse research directions:
- Fine-grained source separation: Models can be trained and evaluated on granularities ranging from three to ten stems, moving beyond canonical four-stem paradigms.
- On-the-fly data augmentation: By grouping raw tracks into user-defined stems, novel mixtures and training examples can be synthesized programmatically (see the sketch after this list).
- Error analysis: The hierarchical taxonomy allows for targeted confusion analysis between close instrument families (e.g., "guitar" vs. "other plucked").
- Cross-task integration: Separated stems can be used for downstream music information retrieval (MIR) tasks including chord estimation, melody extraction, and educational applications such as karaoke or play-along track creation.
- Domain adaptation: Because the mixes are unmastered, the dataset facilitates investigations into the impact of mastering, as well as hybrid training with mastered datasets.
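A minimal sketch of such stem-level augmentation, assuming a stems dict of (2, samples) arrays as returned by the API; the dropout probability and gain range are illustrative choices, not prescribed by the dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_mixture(stems, drop_prob=0.2, gain_db=(-6.0, 6.0)):
    """Synthesize a randomized training mixture from a stems dict.

    Each stem is dropped with probability drop_prob and otherwise
    scaled by a random gain; returns (mixture, per-stem targets).
    """
    targets = {}
    for name, audio in stems.items():
        if rng.random() < drop_prob:
            continue
        gain = 10.0 ** (rng.uniform(*gain_db) / 20.0)
        targets[name] = gain * audio
    if not targets:  # keep at least one stem so the mixture is non-silent
        name, audio = next(iter(stems.items()))
        targets[name] = audio
    # Trim to a common length, then sum into the mixture
    min_len = min(a.shape[-1] for a in targets.values())
    targets = {n: a[..., :min_len] for n, a in targets.items()}
    mixture = np.sum(list(targets.values()), axis=0)
    return mixture, targets
```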
MoisesDB's combination of scale, taxonomic detail, and public availability positions it as a valuable foundation for next-generation source separation system development and evaluation (Pereira et al., 2023).