SynTheory Dataset: A Comprehensive Overview
- SynTheory is a synthetic dataset that isolates fundamental Western music theory concepts—tempo, scales, chords, and more—while holding all other musical attributes invariant.
- It programmatically generates single-track MIDI files and rendered WAV audio to systematically control and annotate isolated musical elements.
- The dataset enables rigorous benchmarking and interpretability analysis of music foundation models in tasks like tempo estimation, chord classification, and symbolic retrieval.
SynTheory is a fully synthetic, copyright-free dataset designed to isolate and encode fundamental Western music theory concepts for use in music foundation model analysis and music information retrieval. Each dataset sample represents a canonical musical concept—tempo, time signature, note, interval, scale, chord, or chord progression—with all other musical attributes held invariant. The dataset's construction is optimized for probing how music generation models, such as Jukebox and MusicGen, encode and internalize musical structure in their latent representations (Wei et al., 1 Oct 2024). SynTheory provides both symbolic (MIDI) and rendered audio (WAV) forms, with extensive annotations for benchmarking, analysis, and interpretability in machine learning and computational musicology workflows.
1. Dataset Construction and Concept Isolation
SynTheory programmatically generates all content as single-track MIDI files to ensure both systematic coverage and low-level concept control. SoundFont-based synthesis is used for rendering to audio, employing TimGM6mb.sf2 through tools such as TiMidity++. Tonal concepts—notes, intervals, scales, chords, chord progressions—are voiced across 92 distinct General MIDI instruments (excluding polyphonic, sound-effect, or highly-articulated patches). Rhythmic datasets (tempos, time signatures) use five percussive timbres to discourage overfitting to artifact patterns.
Each sample contains precisely one “active” concept, preventing confounding influence from other dimensions. For example, a tempo sample varies only in BPM and onset, while timbre, dynamic, and pitch are fixed. Chord and scale samples are fully specified in both symbolic and audio form, according to enumerated rules.
The key classes of concept instances are:
- Notes: 12 pitch classes across nine octaves ($12 \times 9 = 108$ notes per instrument).
- Intervals: For each root pitch class and each half-step size $k \in \{1, \dots, 12\}$, all three play styles (unison, melodic ascending, melodic descending).
- Scales: Seven diatonic modes on every root, in both ascending and descending variants.
- Chords: Triads for all 12 roots, four qualities (major, minor, augmented, diminished), each in three inversions.
- Chord Progressions: 19 common four-chord progressions (10 major, 9 minor), transposed to all 12 keys.
- Tempo: Integer BPM values between 50 and 210 (in 4/4), five onset offsets per BPM.
- Time Signatures: Eight meters at 120 BPM, three reverb levels, five timbres, ten onset offsets.
No mixing or multi-track blending occurs; all signals are “clean” and concept-focused. Audio files are rendered at 44.1 kHz, 16-bit mono, 4 seconds per sample, and can optionally be resampled to 32 kHz for model compatibility.
2. Mathematical Definitions and Encodings
SynTheory encodes all fundamental Western theoretical constructs used in the dataset, following MIDI and music theory standards.
Pitch Classes
The pitch class of a MIDI note number $n$ is computed as

$$\mathrm{pc}(n) = n \bmod 12,$$

with $\mathrm{pc}(n) \in \{0, 1, \dots, 11\}$.
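In code, the pitch-class mapping is a one-liner; a minimal sketch in Python (the function name is illustrative, not from the SynTheory codebase):

```python
def pitch_class(midi_note: int) -> int:
    """Map a MIDI note number (0-127) to its pitch class (0-11), where 0 = C."""
    return midi_note % 12

# MIDI 60 (middle C), 72 (C5), and 48 (C3) all share pitch class 0.
print(pitch_class(60), pitch_class(72), pitch_class(48))  # 0 0 0
```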
Intervals
The (signed) interval in semitones between two notes with MIDI numbers $n_1$ and $n_2$ is

$$i = n_2 - n_1$$

(often taken modulo 12). All diatonic intervals ($i \in \{1, \dots, 12\}$) are labeled explicitly, e.g., minor 2nd ($i = 1$), major 2nd ($i = 2$), through octave ($i = 12$).
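Both the signed and the octave-reduced forms are trivial to compute; a sketch with illustrative names:

```python
def interval(n1: int, n2: int) -> int:
    """Signed interval in semitones from MIDI note n1 up to n2."""
    return n2 - n1

def interval_class(n1: int, n2: int) -> int:
    """Interval reduced modulo 12 (octave equivalence)."""
    return (n2 - n1) % 12

print(interval(60, 67))        # 7 -> perfect 5th
print(interval_class(60, 79))  # 7 -> 19 semitones reduces to a 5th
```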
Scales
A scale is represented by its semitone offsets from the root. Ionian (major):

$$S_{\text{Ionian}} = \{0, 2, 4, 5, 7, 9, 11\}$$
The seven diatonic modes (Ionian, Dorian, Phrygian, Lydian, Mixolydian, Aeolian, Locrian) are programmed as their respective interval sets.
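All seven modes can be derived by rotating the Ionian offsets and re-rooting at zero; a dependency-free sketch (function name is illustrative):

```python
IONIAN = [0, 2, 4, 5, 7, 9, 11]
MODE_NAMES = ["Ionian", "Dorian", "Phrygian", "Lydian",
              "Mixolydian", "Aeolian", "Locrian"]

def mode_offsets(degree: int) -> list[int]:
    """Rotate the Ionian scale to start on the given degree, re-rooted at 0."""
    rotated = IONIAN[degree:] + [step + 12 for step in IONIAN[:degree]]
    return [step - rotated[0] for step in rotated]

for name, degree in zip(MODE_NAMES, range(7)):
    print(name, mode_offsets(degree))
```

For example, `mode_offsets(5)` yields `[0, 2, 3, 5, 7, 8, 10]`, the Aeolian (natural minor) interval set.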
Chord Qualities
Triads are specified by root semitone offsets:
- Major: $\{0, 4, 7\}$
- Minor: $\{0, 3, 7\}$
- Augmented: $\{0, 4, 8\}$
- Diminished: $\{0, 3, 6\}$
Inversions are generated by cyclically raising voices by an octave (e.g., $\{4, 7, 12\}$ for the first inversion of a major triad).
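The quality and inversion rules above can be sketched directly (names are illustrative, not SynTheory's API):

```python
TRIAD_OFFSETS = {
    "major":      [0, 4, 7],
    "minor":      [0, 3, 7],
    "augmented":  [0, 4, 8],
    "diminished": [0, 3, 6],
}

def invert(triad: list[int], n: int) -> list[int]:
    """Apply n cyclic inversions: move the lowest voice up an octave each time."""
    voices = list(triad)
    for _ in range(n):
        voices = voices[1:] + [voices[0] + 12]
    return voices

print(invert(TRIAD_OFFSETS["major"], 1))  # [4, 7, 12]  (first inversion)
print(invert(TRIAD_OFFSETS["minor"], 2))  # [7, 12, 15] (second inversion)
```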
3. Dataset Statistics and Parameterization
SynTheory comprehensively enumerates permutations of each concept, summarized as follows:
| Concept | Number of Examples | Key Parameters |
|---|---|---|
| Tempo | 4,025 | 161 BPM × 5 timbres × 5 offsets |
| Time Signatures | 1,200 | 8 meters × 3 reverb × 5 timbres × 10 offsets |
| Notes | 9,936 (9,848 non-silent) | 12 pitch classes × 9 octaves × 92 instruments |
| Intervals | 39,744 | 12 roots × 12 sizes × 92 instruments × 3 styles |
| Scales | 15,456 | 7 modes × 12 roots × 92 instruments × 2 directions |
| Chords | 13,248 | 12 roots × 4 types × 92 instruments × 3 inversions |
| Chord Progressions | 20,976 | 19 seq × 12 keys × 92 instruments |
All concept classes (tempo, meter, mode, progression, etc.) are exhaustively enumerated and class-balanced: every discrete combination of BPM, meter, root, quality, instrument, and style parameters appears, rather than being randomly sampled.
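The totals in the table follow directly from the parameter grids; a quick arithmetic check (the tempo total of 4,025 implies that the five percussive timbres multiply in alongside the five offsets):

```python
counts = {
    "tempo":              161 * 5 * 5,       # BPM values x timbres x offsets
    "time_signatures":    8 * 3 * 5 * 10,    # meters x reverb x timbres x offsets
    "notes":              12 * 9 * 92,       # pitch classes x octaves x instruments
    "intervals":          12 * 12 * 92 * 3,  # roots x sizes x instruments x styles
    "scales":             7 * 12 * 92 * 2,   # modes x roots x instruments x directions
    "chords":             12 * 4 * 92 * 3,   # roots x qualities x instruments x inversions
    "chord_progressions": 19 * 12 * 92,      # progressions x keys x instruments
}
for concept, n in counts.items():
    print(f"{concept}: {n}")
```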
4. Formats, Annotation, and Data Access
Each SynTheory sample comprises:
- Standard MIDI file (.mid)
- Corresponding WAV file (44.1 kHz, 16-bit, mono, 4 seconds)
- CSV/JSON metadata file containing:
  - Sample ID
  - Concept label (e.g., "Minor 6th")
  - Numeric class label (e.g., interval or mode index)
  - Root note or BPM
  - Instrument or timbre ID
  - Play style or inversion (if relevant)
  - Reverb level and onset offset (if relevant)
This systematic annotation permits direct supervised benchmark construction as well as model-oriented “probing” experiments.
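A metadata record for one sample might look like the following; the field names here are hypothetical stand-ins mirroring the list above, not SynTheory's actual schema:

```python
import json

# Hypothetical metadata record; actual SynTheory field names may differ.
record = {
    "sample_id": "interval_000123",
    "concept": "interval",
    "label": "Minor 6th",
    "class_index": 8,            # minor 6th = 8 semitones
    "root_note": 60,             # MIDI middle C
    "instrument_id": 24,         # General MIDI program number
    "play_style": "melodic_ascending",
    "onset_offset": 0.1,         # seconds
}

# Serialize alongside the .mid/.wav pair, then reload for probing experiments.
serialized = json.dumps(record)
assert json.loads(serialized) == record
print(serialized)
```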
5. Benchmarking, Probing, and Analytic Use Cases
SynTheory functions as a core benchmark and probing dataset for audio-based machine learning tasks:
- Probing music foundation models: The dataset delivers isolated, perfectly labeled stimuli for extracting and quantifying latent encodings in deep audio models. One can train lightweight classifiers (linear, MLP) on model activations to assess encoding of tempo, key, pitch class, chord quality, and other theory concepts.
- MIR and theory analysis: SynTheory provides abundant, noise-free data for evaluating core music information retrieval (MIR) domains: tempo estimation, time-signature detection, symbolic note identification, interval/scale/chord classification, and chord inversion recognition.
- Model interpretability and control: By mapping which transformer layers or heads best encode specific theory concepts, SynTheory enables construction of “concept editors”—tools or interventions that manipulate latent musical features (e.g. changing a scale or retuning a progression) at the representation level. This suggests possible strategies for controllable and fine-grained symbolic/harmonic editing in generative audio models.
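The probing setup in the first bullet can be illustrated with a toy linear classifier. The dependency-free sketch below trains a perceptron-style probe on synthetic "activations"; a real probe would instead fit a linear or MLP classifier (e.g., via scikit-learn) on frozen foundation-model activations, neither of which is shown here:

```python
def train_linear_probe(X, y, lr=0.1, epochs=50):
    """Perceptron-style linear probe: predicts class 1 when w.x + b > 0.
    Stands in for the lightweight classifiers trained on model activations."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0
            err = yi - pred
            w = [wj + lr * err * xj for wj, xj in zip(w, xi)]
            b += lr * err
    return w, b

# Toy 2-D "activations": two linearly separable groups standing in for
# embeddings of, say, major vs. minor chord samples.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0, 0, 0, 1]
w, b = train_linear_probe(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0 for xi in X]
print(preds)  # matches y once the probe separates the two groups
```

High probe accuracy on such isolated stimuli is the evidence that a given layer's representation linearly encodes the concept.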
6. Context and Significance for Music AI Research
SynTheory is designed to address deficiencies in real-world musical model interpretability by providing a controlled, exhaustive, and scalable evaluation corpus. Its fully synthetic nature eliminates confounds associated with performance style, acoustics, and compositional intent, focusing analysis strictly on the specified concept.
Its design enables rigorous benchmarking for both analytic (e.g. computational musicology, pitch detection, inversion recognition) and generative (e.g. transformer-based music-LMs, controllable generation) research. SynTheory thereby advances the study of how neural audio models internalize and represent low-level symbolic knowledge, and serves as a foundational resource for interpretability and music information retrieval research (Wei et al., 1 Oct 2024).