
YM2413-MDB: FM Video Game Music Dataset

Updated 4 March 2026
  • YM2413-MDB is a rigorously curated multi-instrument FM video game music dataset focusing on authentic YM2413 chip tracks from Sega and MSX systems of the late 1980s.
  • The dataset provides three aligned representations per track—audio (WAV), MIDI, and binary VGM commands—which facilitate detailed analysis and emotion-aware music retrieval.
  • It supports key MIR tasks such as emotion recognition and symbolic generation, offering a valuable benchmark for research in retro game music and FM synthesis.

YM2413-MDB is a rigorously curated multi-instrumental FM video game music dataset focused on pieces originating from the Yamaha YM2413 sound chip, widely utilized in Sega and MSX games of the late 1980s. The collection comprises 669 tracks with paired audio (WAV), symbolic (MIDI), and binary (VGM register command) representations, accompanied by systematically constructed multi-label emotion annotations. The dataset, designed to address limitations of prior resources—such as genre bias and lack of high-level mood labels—enables new research in emotion-aware music information retrieval (MIR), symbolic modeling, and emotion-conditioned music generation using authentic retro game music sources (Choi et al., 2022).

1. Source Material and Data Representation

YM2413-MDB’s corpus was selected to ensure consistency in instrumentation and to leverage the distinctive properties of FM synthesis as realized in the YM2413 chip (OPLL). Music was sourced from SMSpower.org and VGM Rips, focusing exclusively on titles employing the YM2413 and excluding files too brief or lacking musical content. VGM files encode low-level register operations of the chip, including note on/off, program change, volume, custom tone parameters, and timing via "wait" commands.
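The register-write and "wait" structure described above can be illustrated with a minimal sketch of a VGM command-stream reader. It assumes the file header has already been stripped and handles only the commands relevant here (YM2413 writes, PSG writes, waits, end of data), with opcodes taken from the public VGM format specification. The function name `parse_vgm_commands` is hypothetical and is not part of the dataset's tooling.

```python
from typing import List, Tuple

def parse_vgm_commands(data: bytes) -> List[Tuple[int, int, int]]:
    """Return (sample_time, register, value) for each YM2413 write.

    Assumption: `data` holds only the command stream (header removed).
    Times are counted in VGM samples (the format uses a 44100 Hz clock).
    """
    events = []
    t = 0  # elapsed time in samples
    i = 0
    while i < len(data):
        cmd = data[i]
        if cmd == 0x51:                 # YM2413 write: register, value
            events.append((t, data[i + 1], data[i + 2]))
            i += 3
        elif cmd == 0x50:               # PSG (SN76489) write: skipped here
            i += 2
        elif cmd == 0x61:               # wait n samples (16-bit little-endian)
            t += data[i + 1] | (data[i + 2] << 8)
            i += 3
        elif cmd == 0x62:               # wait 735 samples (1/60 s)
            t += 735
            i += 1
        elif cmd == 0x63:               # wait 882 samples (1/50 s)
            t += 882
            i += 1
        elif 0x70 <= cmd <= 0x7F:       # short wait of (low nibble + 1) samples
            t += (cmd & 0x0F) + 1
            i += 1
        elif cmd == 0x66:               # end of sound data
            break
        else:
            raise ValueError(f"unhandled VGM command 0x{cmd:02X} at offset {i}")
    return events
```

Dividing the returned sample times by 44100 gives seconds, which is the timeline that any later note extraction would work on.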

Each file was processed to extract three primary representations per song:

  • VGM (binary): Captures raw YM2413 register writes.
  • MIDI: Event sequences mapped to 16 channels—15 melodic preset instruments and one drum channel—using nearest MIDI pitch mapping and pitch-bend for residual detuning.
  • Audio (WAV): Rendered through VGMPlay or equivalent FM synthesizer emulation.

A total of 16 possible instrument "voices" are present: 15 fixed monophonic presets (e.g., clarinet, trumpet, square lead) and one rhythm/drum channel. In hardware, up to 9 distinct tones can be played simultaneously.

2. Preprocessing and Symbolic Alignment

Raw VGM files, lacking explicit beat or tempo annotations, require specialized alignment for MIR and generative modeling. Audio tracks rendered from VGM were analyzed with madmom’s beat/downbeat tracker to estimate temporal structure. The interpolated median tempo from detected beats enabled insertion of standardized MIDI tempo events.
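As a rough illustration of the tempo step, the sketch below derives a single tempo from a list of beat times such as a beat tracker would return. It uses a plain median of inter-beat intervals, which is an assumption; the exact interpolation used for the dataset is not specified here. Both function names are hypothetical.

```python
from statistics import median

def median_tempo_bpm(beat_times):
    """Estimate tempo (BPM) from beat times in seconds via the median interval."""
    intervals = [b - a for a, b in zip(beat_times, beat_times[1:]) if b > a]
    if not intervals:
        raise ValueError("need at least two distinct beat times")
    return 60.0 / median(intervals)

def midi_tempo_microseconds(bpm: float) -> int:
    """Convert BPM to the microseconds-per-quarter-note value of a MIDI tempo event."""
    return round(60_000_000 / bpm)
```

Using the median rather than the mean keeps one mistracked beat from shifting the estimate.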

Symbolic data (MIDI) was quantized to 48 time steps per bar to ensure consistent alignment for event-based modeling approaches. Inclusion criteria further mandated minimum track length and data integrity, discarding corrupted or non-musical files.
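A minimal sketch of the 48-steps-per-bar grid is shown below. It assumes a constant tempo and a fixed number of beats per bar (4 by default), which is a simplification; the function name and parameters are illustrative only.

```python
def quantize_onset(onset_sec: float, bpm: float,
                   beats_per_bar: int = 4, steps_per_bar: int = 48):
    """Snap an onset time to the nearest grid step; return (bar, step_in_bar).

    Assumptions: constant tempo and a fixed meter over the whole track.
    """
    seconds_per_bar = beats_per_bar * 60.0 / bpm
    total_steps = round(onset_sec / seconds_per_bar * steps_per_bar)
    return divmod(total_steps, steps_per_bar)
```

With 48 steps per bar in 4/4, each beat holds 12 steps, so both sixteenth notes (3 steps) and eighth-note triplets (4 steps) fall exactly on the grid.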

3. Emotion Tagging Protocol and Label Statistics

Emotion annotations in YM2413-MDB comprise a multi-stage process combining free-listening, clustering, and structured tagging:

  • Free-form listening sessions yielded 35 preliminary adjectives, pruned and grouped to a final vocabulary of 19 tags (e.g., cheerful, calm, boring, cute, creepy, bizarre, tense, touching, depressed, serious, cold, fluttered, grand, frustrating, comic, faint, speedy, and two additional items).
  • Two dedicated annotators assigned top tag(s) per track following personal listening.
  • Three additional verifiers (from a panel of ten native Korean speakers with domain experience) reviewed all candidate tags, eliminating any lacking at least two supporting votes. In instances where both original tags were removed, the highest agreement alternative was selected.
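The verification rule in the last bullet can be sketched as a small voting function. This is an interpretation of the protocol as described above, not the authors' code: candidate tags need at least two verifier votes to survive, and if none survive, the alternative tag with the most votes is used. Ties are broken alphabetically here, which is an assumption.

```python
from collections import Counter

def verify_tags(candidate_tags, verifier_votes, min_votes=2):
    """Filter annotator tags by verifier agreement.

    candidate_tags: tags proposed by the annotators for one track.
    verifier_votes: one collection of endorsed tags per verifier.
    """
    counts = Counter(tag for votes in verifier_votes for tag in set(votes))
    kept = [tag for tag in candidate_tags if counts[tag] >= min_votes]
    if kept:
        return kept
    # Fallback: no candidate survived, so take the best-supported alternative.
    alternatives = sorted(tag for tag in counts if tag not in candidate_tags)
    if not alternatives:
        return []
    return [max(alternatives, key=lambda tag: counts[tag])]
```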

Tag distribution analysis reveals notable imbalance; tags such as cheerful, tense, touching, bizarre, and depressed dominate. Co-occurrence matrices identify associative trends (e.g., speedy ↔ tense, cute ↔ cheerful, boring ↔ calm), confirming structured relationships within the emotion space.
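Co-occurrence counts of this kind can be computed directly from per-track tag lists. The sketch below is a generic illustration; the function name is hypothetical and any input passed to it would be the reader's own data, not values from the dataset.

```python
from collections import Counter
from itertools import combinations

def tag_cooccurrence(track_tags):
    """Count how often each unordered pair of tags appears on the same track."""
    counts = Counter()
    for tags in track_tags:
        # Sort so that (a, b) and (b, a) are stored under one key.
        for pair in combinations(sorted(set(tags)), 2):
            counts[pair] += 1
    return counts
```

Normalizing each count by the frequency of the rarer tag in the pair is one way to keep dominant tags such as "cheerful" from swamping the matrix.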

4. Baseline Tasks and Model Architectures

Two primary supervised tasks were designed: (i) symbolic-domain and (ii) audio-domain emotion recognition. Target labels included (a) Russell’s four affective quadrants (4Q), (b) binary arousal and binary valence, and (c) binary top-tag labels (“tense” vs. rest, “cheerful” vs. rest).
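The relation between the 4Q target and the two binary targets can be made explicit with a small mapping. The quadrant numbering below (Q1 = high valence and high arousal, proceeding counter-clockwise) is a common convention in music emotion recognition and is assumed here; the source does not spell out its numbering.

```python
def quadrant(valence_high: bool, arousal_high: bool) -> str:
    """Map binary valence/arousal to a quadrant label (assumed numbering)."""
    if arousal_high:
        return "Q1" if valence_high else "Q2"
    return "Q4" if valence_high else "Q3"

def binary_targets(q: str):
    """Recover (valence_high, arousal_high) from a quadrant label."""
    return q in ("Q1", "Q4"), q in ("Q1", "Q2")
```

Under this convention, binary arousal and binary valence are simply two different ways of merging the four quadrant classes into two.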

Input representations:

  • Symbolic: Multi-Track Music Machine (MMM) event encoding—8-bar, 6-track concurrent, 48 time steps per bar.
  • Audio: Short-chunk mel-spectrograms (e.g., 128 mel bands), chunked into fixed-length frames.
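The chunking of a spectrogram into fixed-length pieces can be sketched as below. Frames are represented as a plain Python list (one entry per time frame) to keep the example dependency-free; the chunk length and hop are illustrative parameters, not values taken from the source.

```python
def split_into_chunks(frames, chunk_len, hop=None):
    """Split a sequence of frames into fixed-length chunks.

    hop defaults to chunk_len (non-overlapping); a trailing remainder
    shorter than chunk_len is dropped.
    """
    if hop is None:
        hop = chunk_len
    last_start = len(frames) - chunk_len
    return [frames[s:s + chunk_len] for s in range(0, last_start + 1, hop)]
```

A track-level prediction can then be formed by aggregating the per-chunk outputs, for example by averaging them.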

Modeling approaches:

  • Logistic Regression (feature average-pooling with binary/categorical heads).
  • LSTM+Attention (MMM sequences with contextual classification; cf. Lin et al., 2017).
  • Short-chunk ResNet (spectrogram-based; as in Won et al., 2020) with frame aggregation.

Training regimen:

The data were split 80/10/10 with stratification by top tag. Models were trained with the Adam(W) optimizer (learning rate ~1e-4–1e-3) and batch sizes of 32–64, using categorical or binary cross-entropy loss and early stopping on validation loss.
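A stratified 80/10/10 split can be sketched with the standard library alone. The version below splits each label group separately so that the label proportions are roughly preserved in every partition; the seed, rounding rule, and function name are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_split(items, labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split items into train/val/test while preserving label proportions."""
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append(item)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_label):          # sorted for reproducibility
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(ratios[0] * len(group))
        n_val = round(ratios[1] * len(group))
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]     # remainder goes to test
    return train, val, test
```

For very rare tags a group may contribute nothing to the validation or test partition, which is one practical consequence of the label imbalance noted earlier.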

Evaluation metrics:

| Metric | Formula | Description |
|---|---|---|
| Hamming Loss | $\text{HL} = \frac{1}{NL}\sum_{i=1}^{N}\sum_{j=1}^{L} \mathbb{I}[\,y_{ij} \neq \hat{y}_{ij}\,]$ | Per-label error averaged across samples and labels |
| Subset Accuracy | $\text{SA} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}[\,y_i = \hat{y}_i\,]$ | Multi-label exact-match accuracy |
| Macro Precision/Recall | $P = \frac{1}{N}\sum_i P_i,\ R = \frac{1}{N}\sum_i R_i$ | Sample-wise macro-averaged precision and recall |
| Macro F1 | $F_1 = \frac{2\,P\,R}{P + R}$ | Macro-averaged $F_1$ across all samples |
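The four metrics can be written out for binary label matrices (one row per sample, one column per label) as in the sketch below. Treating precision or recall as 0 for a sample with no predicted or no true labels is an assumption; other conventions exist.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label decisions that are wrong."""
    n, l = len(y_true), len(y_true[0])
    wrong = sum(t != p for yt, yp in zip(y_true, y_pred) for t, p in zip(yt, yp))
    return wrong / (n * l)

def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    return sum(list(yt) == list(yp) for yt, yp in zip(y_true, y_pred)) / len(y_true)

def samplewise_prf(y_true, y_pred):
    """Sample-averaged precision P and recall R, and F1 = 2PR / (P + R)."""
    precisions, recalls = [], []
    for yt, yp in zip(y_true, y_pred):
        tp = sum(1 for t, p in zip(yt, yp) if t and p)
        n_pred, n_true = sum(yp), sum(yt)
        precisions.append(tp / n_pred if n_pred else 0.0)  # assumed convention
        recalls.append(tp / n_true if n_true else 0.0)     # assumed convention
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```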

Results demonstrate that in the symbolic domain, logistic regression outperformed LSTM+Attention on MMM input (e.g., valence accuracy approximately 0.59 vs. 0.56). In the audio domain, ResNet architectures substantially exceeded logistic regression baselines, achieving 4Q accuracy near 0.65 (vs. 0.38).

5. Emotion-Conditioned Symbolic Generation

Emotion-conditioned generation experiments addressed the modeling of musical structure associated with distinct affective states. Because of the significant label imbalance, only the “cheerful” and “depressed” subsets (286 songs) were used, and the model was additionally pre-trained on ∼10,000 filtered Lakh MIDI Dataset (LMD) files (longer than 3 seconds, non-SFX content).

Model backbone: GPT-2–style Transformer (12 layers, hidden size 512, 8 heads, context window 1024).

  • Input: MMM event tokens, prefixed with a single emotion tag token.
  • Conditioning: In-attention mechanism per MuseMorphose (cf. Wu et al.)—an emotion embedding (Word2Vec-initialized, 512-dim) inserted into self-attention key/query space at each layer.

Training objective: Autoregressive next-token cross-entropy,

$$\mathcal{L}_{\text{CE}} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}, e)$$

where $e$ is the emotion-tag embedding, with teacher forcing for sequence supervision.
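To make the objective concrete, the sketch below builds an input sequence with a leading emotion token and evaluates the negative log-likelihood from the probabilities a model assigned to each correct next token. The token strings are hypothetical placeholders, not the actual MMM vocabulary, and the probabilities would come from a trained model.

```python
import math

def build_input(emotion: str, event_tokens):
    """Prefix an event sequence with one emotion token (hypothetical naming)."""
    return [f"EMOTION_{emotion}"] + list(event_tokens)

def sequence_nll(correct_token_probs):
    """Sum of -log p over the sequence, i.e. the cross-entropy objective.

    correct_token_probs[t] is the model probability of the true token x_t
    given the preceding true tokens (teacher forcing) and the emotion tag.
    """
    return -sum(math.log(p) for p in correct_token_probs)
```

A model that predicts every token with certainty has zero loss, and each halving of a token's probability adds ln 2 to the total.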

Evaluation:

  • Quantitative: Note-density and note-duration distributions were compared using kernel density estimates. Cheerful conditioning yielded denser, shorter notes, while depressed conditioning yielded sparser, longer patterns.
  • Qualitative: Informal listening highlighted issues in onset precision (attributable to training data’s delayed/unison motifs) and unreliable control of tonality (major/minor).
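The kernel density comparison mentioned in the quantitative bullet can be illustrated with a small Gaussian KDE. The bandwidth is an assumed free parameter, and the function would be applied to per-sample statistics such as notes per bar; no values from the actual experiments are reproduced here.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density function estimated from samples with Gaussian kernels."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)

    return density
```

Estimating one density per emotion condition and plotting them on a shared axis shows whether, for instance, the note-density distribution shifts upward under cheerful conditioning.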

6. Insights, Use Cases, and Limitations

YM2413-MDB supports systematic MIR and generation studies specific to FM video game music, furnishing the following observations:

  • Key detection (Krumhansl–Schmuckler method) correlates major keys with positive, minor with negative emotion tags.
  • Rhythmic features (note density/duration) differentiate fast/tense vs. slow/sad tracks.
  • Audio-based models (ResNet on mel-spectrograms) reach ∼65% 4Q accuracy; symbolic approaches are challenged by MMM distributional imbalance.
  • Transformer-based emotion-conditioned symbolic generation is feasible, yet current data scale limits expressive capabilities such as precise tonality control.
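The Krumhansl–Schmuckler method named in the first bullet correlates a piece's pitch-class duration histogram with 24 rotated key profiles and picks the best match. The sketch below uses the Krumhansl–Kessler profile values; it is a generic illustration of the method, not the analysis code behind the reported correlation.

```python
import math

# Krumhansl-Kessler key profiles, indexed by scale degree in semitones.
MAJOR = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]
MINOR = [6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17]
NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def estimate_key(pitch_class_durations):
    """Return (tonic name, mode) for a 12-bin pitch-class duration histogram."""
    best = None
    for tonic in range(12):
        for mode, profile in (("major", MAJOR), ("minor", MINOR)):
            # Rotate the profile so that its first entry aligns with the tonic.
            rotated = [profile[(pc - tonic) % 12] for pc in range(12)]
            r = pearson(pitch_class_durations, rotated)
            if best is None or r > best[0]:
                best = (r, NAMES[tonic], mode)
    return best[1], best[2]
```

The resulting mode (major or minor) is the feature that would then be compared against positive and negative emotion tags.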

Potential research directions include expanding to other FM chips and eras (e.g., YM2612/Mega Drive), developing more balanced multi-label generation frameworks, improving beat-accurate symbolic alignment, incorporating velocity dynamics, and exploring contrastive objectives or increased model capacity for finer emotion modulation. These findings position YM2413-MDB as a new benchmark for emotion-aware MIR, a resource for musicological studies of FM game music, and a foundation for multi-instrumental, emotion-conditioned symbolic generation systems (Choi et al., 2022). The dataset, code, and pre-trained models are publicly accessible at https://github.com/jech2/YM2413-MDB, with audio demonstrations available at https://jech2.github.io/YM2413-MDB/.
