MoisesDB: Multi-Stem Music Dataset
- MoisesDB is a publicly available multitrack dataset offering high-fidelity audio, granular metadata, and a hierarchical stem taxonomy that reflects real-world studio mixing.
- It comprises 240 songs across 12 genres with a two-level annotation system, enabling both coarse-grained and fine-grained source separation analyses.
- The dataset provides a dedicated Python API and standardized benchmark protocols, facilitating reproducible experiments and advanced system evaluations.
MoisesDB is a publicly available, high-fidelity multitrack dataset designed to advance research in musical source separation by providing granular annotations and a flexible stem taxonomy that goes beyond the typical four-stem (vocals, drums, bass, other) paradigm. Comprising 240 songs from 47 artists across 12 popular genres, MoisesDB delivers raw instrument tracks, semantically labeled stems in a two-level hierarchy, and a complementary Python API, thereby establishing itself as a comprehensive resource for evaluating and building fine-grained source separation systems (Pereira et al., 2023).
1. Dataset Composition and Metadata
MoisesDB contains 240 songs with a total duration of 14 hours, 24 minutes, and 46 seconds; the average track length is 3 minutes and 36 seconds (±66 seconds). The tracks span 12 “high-level” genres (Pop, Rock, Jazz, Hip-Hop, R&B, Electronic, Folk, Latin, Reggae, Blues, Classical, and Other), with a power-law distribution in both genre and artist representation. Each song includes a JSON metadata file describing artist, title, genre, duration (in seconds), sampling rate, raw recorded tracks, and the mapping of these tracks to high-level stems.
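The per-song metadata can be inspected with the standard library alone. The record below is a minimal illustrative sketch; the exact field names in a real metadata.json may differ, so treat this schema as an assumption:

```python
import json

# Hypothetical metadata record mirroring the fields described above
# (artist, title, genre, duration, sampling rate, source-to-stem mapping).
raw = """
{
  "artist": "Artist Name",
  "title": "Track Title",
  "genre": "rock",
  "duration": 216.5,
  "sample_rate": 44100,
  "stems": {
    "drums": ["Kick_Drum.wav", "Snare_Drum.wav"],
    "vocals": ["Lead_Female.wav"]
  }
}
"""

meta = json.loads(raw)
print(meta["genre"], meta["duration"])            # rock 216.5
n_sources = sum(len(v) for v in meta["stems"].values())
print(n_sources)                                  # 3 raw tracks mapped to stems
```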
The dataset's design reflects the varied nature of recorded music: instrumentation diversity means the number and combination of stems differ from song to song. Common stems (e.g., “vocals,” “drums,” “bass”) are present in nearly every track; rare categories, such as “wind,” have minimal representation, underlining stem imbalance.
2. Two-Level Hierarchical Stem Taxonomy
Each song in MoisesDB includes both raw audio sources and their grouping into a two-level stem hierarchy that mirrors studio mixing workflows. The Level 1 stem categories and representative Level 2 track types appear as follows:
| Level 1 Category | Example Level 2 Types | Typical Occurrence |
|---|---|---|
| Bass | Bass Guitar, Bass Synthesizer, Contrabass | Nearly all songs |
| Bowed Strings | Cello, Viola Section, String Section | Variable |
| Drums | Snare Drum, Kick Drum, Drum Machine | Nearly all songs |
| Guitar | Acoustic, Electric (Clean/Distorted) | Frequent |
| Other | Fx | Occasional |
| Other Keys | Organ, Electric Organ, Synth Lead | Variable |
| Other Plucked | Banjo, Mandolin, Ukulele, Harp | Rare |
| Percussion | A-Tonal, Pitched Percussion | Frequent |
| Piano | Grand Piano, Electric Piano | Frequent |
| Vocals | Lead Female/Male, Background, Other | Nearly all songs |
| Wind | Brass, Flutes, Reeds, Other Wind | Very rare |
The taxonomy enables evaluation and modeling at arbitrary granularity, supporting both coarse-grained and fine-grained source separation scenarios. The number of stems per song is non-uniform, requiring dynamic strategies for model training and evaluation.
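Because the source-to-stem mapping varies per song, grouping code must be dynamic rather than hard-coded. A minimal sketch of Level 2 → Level 1 grouping by linear summation; the mapping dictionary below is hypothetical (the real mapping comes from each track's metadata):

```python
import numpy as np

# Hypothetical Level-2 -> Level-1 mapping for one track.
SOURCE_TO_STEM = {
    "Kick_Drum": "drums",
    "Snare_Drum": "drums",
    "Bass_Guitar": "bass",
    "Lead_Female": "vocals",
}

def group_sources(sources: dict[str, np.ndarray]) -> dict[str, np.ndarray]:
    """Linearly sum raw Level-2 sources into their Level-1 stems."""
    stems: dict[str, np.ndarray] = {}
    for name, audio in sources.items():
        stem = SOURCE_TO_STEM.get(name, "other")  # unmapped sources fall back to "other"
        stems[stem] = stems.get(stem, 0) + audio
    return stems

# Toy stereo sources: 2 channels, 4 samples each.
sources = {name: np.ones((2, 4)) for name in SOURCE_TO_STEM}
stems = group_sources(sources)
print(sorted(stems))          # ['bass', 'drums', 'vocals']
print(stems["drums"][0, 0])   # 2.0 (kick + snare summed)
```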
3. Data Formats, Organization, and Naming Conventions
All audio material is un-mastered, delivered as stereo WAV files at 44.1 kHz sampling rate, with no compression or heavy processing. The directory structure for each track is standardized:
```
data_path/
├─ track_XXXX/
│  ├─ metadata.json      # Metadata and mappings
│  ├─ mixture.wav        # Final stereo mix
│  ├─ stems/
│  │  ├─ vocals.wav
│  │  ├─ drums.wav
│  │  └─ …
│  └─ sources/
│     ├─ Snare_Drum.wav
│     ├─ Kick_Drum.wav
│     └─ …
```
Stems are named with ASCII-safe identifiers (e.g., “vocals.wav,” “guitar.wav”), while individual source tracks reflect Level 2 typing (e.g., “Flutes.wav”). This organization facilitates flexible data access, robust labeling, and reproducibility in experimental protocols.
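Human-readable Level 2 labels can be recovered from the ASCII-safe source filenames with a small helper; this assumes underscores encode spaces, as in the examples above:

```python
def source_label(filename: str) -> str:
    """Recover a human-readable Level-2 type from an ASCII-safe filename."""
    stem = filename.rsplit(".", 1)[0]  # drop the extension
    return stem.replace("_", " ")      # underscores stand in for spaces

print(source_label("Snare_Drum.wav"))  # Snare Drum
print(source_label("Flutes.wav"))      # Flutes
```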
4. Python API and Data Access
A dedicated Python package, “moisesdb” (available at https://github.com/moises-ai/moises-db), supports downloading, preprocessing, and dataset manipulation. Installation is via `pip install moisesdb`. The API allows access to track-level metadata, mixture and stem audio as NumPy arrays, automated mixing of raw sources, track iteration, and on-disk stem saving. Example usage:
```python
from moisesdb.dataset import MoisesDB

db = MoisesDB(data_path='./moises-db-data')

track = db[0]
mixture = track.audio   # (2, N) numpy array
stems = track.stems     # {'vocals': array, 'drums': array, ...}
track.save_stems('./output/track_0')
```
This structure supports traditional experimentation as well as new use cases requiring dynamic source grouping or re-mixing.
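Dynamic re-mixing of this kind reduces to a linear sum over a chosen subset of stems. A sketch with toy arrays standing in for the API's NumPy stems (e.g., building an accompaniment mix by keeping everything except vocals):

```python
import numpy as np

def remix(stems: dict[str, np.ndarray], keep: set[str]) -> np.ndarray:
    """Linearly sum a chosen subset of stems into a custom mixture."""
    selected = [audio for name, audio in stems.items() if name in keep]
    return np.sum(selected, axis=0)

# Toy stems: 2 channels, 4 samples each.
stems = {
    "vocals": np.full((2, 4), 0.1),
    "drums": np.full((2, 4), 0.2),
    "bass": np.full((2, 4), 0.3),
}
accompaniment = remix(stems, keep={"drums", "bass"})
print(accompaniment.shape)   # (2, 4)
print(accompaniment[0, 0])   # 0.5
```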
5. Baseline Evaluation Protocol and Benchmarks
MoisesDB provides standardized reference protocols for benchmarking source separation systems. Evaluations are conducted at 4-stem, 5-stem, and 6-stem granularity. Only tracks containing at least the required stem set are included (N=235 for 4-stem, N=104 for 5-stem, N=88 for 6-stem). Surplus stems are linearly summed into an “other” category.
System performance is measured using the BSS-Eval Source-to-Distortion Ratio (SDR) metric:

$$\mathrm{SDR}(s, \hat{s}) = 10 \log_{10} \frac{\lVert s \rVert^2 + \epsilon}{\lVert s - \hat{s} \rVert^2 + \epsilon}$$

Here, $s$ is the reference source, $\hat{s}$ is the estimate, and $\epsilon$ is a small constant that prevents division by zero.
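The SDR can be computed directly with NumPy. A minimal sketch, assuming the simplified formulation with a stabilizing constant ε (full BSS-Eval additionally decomposes the error into interference and artifact terms):

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """Simplified Source-to-Distortion Ratio in dB."""
    num = np.sum(reference ** 2) + eps
    den = np.sum((reference - estimate) ** 2) + eps
    return float(10 * np.log10(num / den))

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 44100))   # 1 s of toy stereo audio

print(sdr(s, s) > 100)                # perfect estimate -> very high SDR
print(round(sdr(s, 0.5 * s), 2))      # 6.02 (halved amplitude costs ~6 dB)
```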
Oracle methods benchmarked:
- Ideal Binary Mask (IBM)
- Ideal Ratio Mask (IRM)
- Multichannel Wiener Filter (MWF)
Open-source models evaluated:
- Spleeter (4- and 5-stem configurations)
- Hybrid-Transformer Demucs (HT-Demucs; 4- and 6-stem configurations)
Key results:
| Configuration | Model | SDR in dB (mean ± std; median where reported) |
|---|---|---|
| 4-Stem | HT-Demucs | 9.91 ± 3.27 (9.69 med) |
| 4-Stem | MWF | 9.08 ± 2.15 (8.87 med) |
| 4-Stem | IRM | 8.97 ± 2.16 |
| 4-Stem | IBM | 7.14 ± 2.28 |
| 4-Stem | Spleeter | 6.29 ± 2.47 |
| 5-Stem | MWF | 7.81 ± 2.66 (7.83 med) |
| 5-Stem | IRM | 7.65 ± 2.66 |
| 5-Stem | IBM | 5.12 ± 2.81 |
| 5-Stem | Spleeter | 4.66 ± 3.20 |
| 6-Stem | MWF | 7.06 ± 2.73 |
| 6-Stem | IRM | 6.91 ± 2.70 |
| 6-Stem | IBM | 5.12 ± 2.81 |
| 6-Stem | HT-Demucs | 6.24 ± 5.17 |
Two notable patterns emerge: (1) IRM and MWF oracle methods show similar performance across stem groupings, and (2) HT-Demucs surpasses oracle methods on “bass” and “drums” stems in the 4- and 6-stem settings, underscoring the efficacy of modern deep learning architectures for certain instrument categories.
6. Insights, Limitations, and Research Utility
MoisesDB’s stem taxonomy, which is aligned with real-world mixing processes (raw tracks → stems → mixture), enables detailed analysis of error patterns and the construction of highly granular separation models. The inherently imbalanced distribution of stems—especially the scarcity of “wind” or specific “plucked” instrument tracks—poses data sufficiency challenges for rare-instrument modeling and potentially affects generalization.
Since all audio is un-mastered and possesses higher dynamic range (DR14) and lower loudness (LUFS) than typical commercial releases, there may be a domain shift when applying models trained on MoisesDB to mastered material.
Baseline findings indicate that state-of-the-art deep learning systems such as HT-Demucs now rival or even outperform best-case oracle masking techniques (IRM, MWF) on certain stems but experience performance drop-off as more granular separation is requested—especially for less-represented instrument types such as piano and guitar.
7. Summary and Significance
MoisesDB provides, for the first time, a multitrack dataset with a structured, hierarchical taxonomy for stems, supporting up to 12 instrument classes and enabling research in arbitrarily fine-grained source separation. With comprehensive annotations, unprocessed high-dynamic-range audio, and a dedicated Python API, MoisesDB addresses the limitations of four-stem datasets and establishes itself as an important benchmark for evaluating traditional and modern source separation systems at multiple granularities (Pereira et al., 2023).