
SynTheory Dataset: A Comprehensive Overview

Updated 18 November 2025
  • SynTheory is a synthetic dataset that isolates fundamental Western music theory concepts—tempo, scales, chords, and more—with invariant musical attributes.
  • It programmatically generates single-track MIDI files and rendered WAV audio to systematically control and annotate isolated musical elements.
  • The dataset enables rigorous benchmarking and interpretability analysis of music foundation models in tasks like tempo estimation, chord classification, and symbolic retrieval.

SynTheory is a fully synthetic, copyright-free dataset designed to isolate and encode fundamental Western music theory concepts for use in music foundation model analysis and music information retrieval. Each dataset sample represents a canonical musical concept—tempo, time signature, note, interval, scale, chord, or chord progression—with all other musical attributes held invariant. The dataset's construction is optimized for probing how music generation models, such as Jukebox and MusicGen, encode and internalize musical structure in their latent representations (Wei et al., 1 Oct 2024). SynTheory provides both symbolic (MIDI) and rendered audio (WAV) forms, with extensive annotations for benchmarking, analysis, and interpretability in machine learning and computational musicology workflows.

1. Dataset Construction and Concept Isolation

SynTheory programmatically generates all content as single-track MIDI files to ensure both systematic coverage and low-level control over each concept. Audio is rendered by SoundFont-based synthesis, using TimGM6mb.sf2 with tools such as TiMidity++. Tonal concepts (notes, intervals, scales, chords, chord progressions) are voiced across 92 distinct General MIDI instruments, excluding polyphonic, sound-effect, and highly articulated patches. Rhythmic concepts (tempos, time signatures) use five percussive timbres to discourage overfitting to the artifacts of any single timbre.

Each sample contains precisely one “active” concept, preventing confounding influence from other dimensions. For example, tempo samples differ only in BPM, onset offset, and percussive timbre, while dynamics and pitch are fixed. Chord and scale samples are fully specified in both symbolic and audio form, according to enumerated rules.

The key classes of concept instances are:

  • Notes: 12 pitch classes across nine octaves (MIDI note numbers $m \in \{12, \ldots, 119\}$).
  • Intervals: For each root pitch class and each interval size of $h \in \{1, \ldots, 12\}$ half steps, all play styles (unison, melodic ascending, melodic descending).
  • Scales: Seven diatonic modes on every root, in both ascending and descending variants.
  • Chords: Triads for all 12 roots, four qualities (major, minor, augmented, diminished), each in three positions (root position, first inversion, second inversion).
  • Chord Progressions: 19 common four-chord progressions (10 major, 9 minor), transposed to all 12 keys.
  • Tempo: Integer BPM values from 50 to 210 inclusive (161 values, in 4/4), each rendered with five percussive timbres and five onset offsets.
  • Time Signatures: Eight meters at 120 BPM, three reverb levels, five timbres, ten onset offsets.

No mixing or multi-track blending occurs; all signals are “clean” and concept-focused. Audio files are rendered at 44.1 kHz, 16-bit mono, 4 seconds per sample, and can optionally be resampled to 32 kHz for model compatibility.
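To make the tempo parameterization concrete, the following sketch lists the click times that fit inside a 4-second clip for a given BPM. The function name `click_onsets` is hypothetical, and treating the onset offset as a delay in seconds before the first click is an assumption; the source does not define how offsets are measured.

```python
def click_onsets(bpm: int, offset_s: float = 0.0, duration_s: float = 4.0) -> list[float]:
    """Return click times in seconds for a steady pulse at `bpm`.

    Assumption: `offset_s` delays the first click. The dataset's actual
    offset convention may differ.
    """
    if bpm <= 0:
        raise ValueError("bpm must be positive")
    period = 60.0 / bpm  # seconds between consecutive beats
    onsets = []
    k = 0
    # Keep every click that starts strictly before the end of the clip.
    while offset_s + k * period < duration_s:
        onsets.append(round(offset_s + k * period, 6))
        k += 1
    return onsets


if __name__ == "__main__":
    print(click_onsets(120))      # clicks spaced 0.5 s apart
    print(len(click_onsets(50)))  # slowest tempo in the stated BPM range
```

At the slow end of the range only a handful of clicks fit in 4 seconds, so a model has few inter-onset intervals from which to infer tempo.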

2. Mathematical Definitions and Encodings

SynTheory encodes all fundamental Western theoretical constructs used in the dataset, following MIDI and music theory standards.

Pitch Classes

The pitch class $p$ of a MIDI note number $m$ is computed as

$$p = m \bmod 12,\quad p \in \{0, \ldots, 11\}$$

with $C=0,\ C\sharp/D\flat=1,\ \ldots,\ B=11$.
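As a minimal sketch of this mapping, the snippet below reduces a MIDI note number to its pitch class and looks up a name. The helper names `pitch_class` and `note_name` and the `C#/Db` spelling of enharmonic pairs are illustrative choices, not part of the dataset.

```python
# Pitch-class names indexed 0..11, with C = 0 and B = 11.
NOTE_NAMES = ["C", "C#/Db", "D", "D#/Eb", "E", "F",
              "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B"]


def pitch_class(midi_note: int) -> int:
    """Return p = m mod 12 for a MIDI note number m in 0..127."""
    if not 0 <= midi_note <= 127:
        raise ValueError("MIDI note numbers lie in 0..127")
    return midi_note % 12


def note_name(midi_note: int) -> str:
    """Return the pitch-class name of a MIDI note number."""
    return NOTE_NAMES[pitch_class(midi_note)]


if __name__ == "__main__":
    for m in (12, 60, 61, 119):
        print(m, pitch_class(m), note_name(m))
```

Over the stated note range 12..119, each pitch class occurs exactly nine times, once per octave.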

Intervals

The (signed) interval in semitones between two notes of pitch classes $p_i,\ p_j$ is

$$\Delta p = p_j - p_i$$

(often taken modulo 12). All diatonic intervals ($h \in \{1, \ldots, 12\}$) are labeled explicitly, e.g., minor 2nd (1), major 2nd (2), through octave (12).
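The interval arithmetic can be sketched as follows. The label table uses standard interval names for 1 to 12 semitones. The exact label strings used in the dataset are an assumption, apart from those quoted in the text.

```python
# Standard names for interval sizes of 1..12 semitones.
INTERVAL_NAMES = {
    1: "minor 2nd", 2: "major 2nd", 3: "minor 3rd", 4: "major 3rd",
    5: "perfect 4th", 6: "tritone", 7: "perfect 5th", 8: "minor 6th",
    9: "major 6th", 10: "minor 7th", 11: "major 7th", 12: "octave",
}


def signed_interval(p_i: int, p_j: int) -> int:
    """Signed difference p_j - p_i in semitones."""
    return p_j - p_i


def interval_mod12(p_i: int, p_j: int) -> int:
    """Interval reduced modulo 12, always in 0..11."""
    return (p_j - p_i) % 12


def interval_label(semitones: int) -> str:
    """Name of an interval of 1..12 semitones."""
    return INTERVAL_NAMES[semitones]


if __name__ == "__main__":
    print(signed_interval(9, 0), interval_mod12(9, 0))  # A to C
    print(interval_label(72 - 60))                      # MIDI 60 to MIDI 72
```

Reducing modulo 12 sends the octave to 0, so distinguishing label 12 requires the difference of MIDI note numbers rather than of pitch classes.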

Scales

A scale is represented by its semitone offsets from the root. Ionian (major):

$$\{0, 2, 4, 5, 7, 9, 11\}$$

The seven diatonic modes (Ionian, Dorian, Phrygian, Lydian, Mixolydian, Aeolian, Locrian) are programmed as $R +$ their respective interval sets, where $R$ is the root.
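One way to obtain all seven interval sets is to rotate the Ionian pattern, a standard music theory identity. This is a sketch only: the dataset code may store each mode explicitly. The function names are hypothetical, and the assumption that a scale sample has seven notes with no repeated octave is not confirmed by the source.

```python
IONIAN = (0, 2, 4, 5, 7, 9, 11)
MODE_NAMES = ("Ionian", "Dorian", "Phrygian", "Lydian",
              "Mixolydian", "Aeolian", "Locrian")


def mode_offsets(degree: int) -> tuple[int, ...]:
    """Semitone offsets of the mode built on scale degree `degree` (0 = Ionian)."""
    start = IONIAN[degree]
    rotated = IONIAN[degree:] + IONIAN[:degree]
    # Re-express every pitch relative to the new starting note.
    return tuple((p - start) % 12 for p in rotated)


def scale_notes(root_midi: int, mode: str, descending: bool = False) -> list[int]:
    """MIDI notes R + offsets for the named mode, optionally reversed."""
    offsets = mode_offsets(MODE_NAMES.index(mode))
    notes = [root_midi + o for o in offsets]
    return notes[::-1] if descending else notes


if __name__ == "__main__":
    for i, name in enumerate(MODE_NAMES):
        print(name, mode_offsets(i))
```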

Chord Qualities

Triads are specified by root semitone offsets:

  • Major: $\{0, 4, 7\}$
  • Minor: $\{0, 3, 7\}$
  • Augmented: $\{0, 4, 8\}$
  • Diminished: $\{0, 3, 6\}$

Inversions are generated by cyclically raising voices by an octave (e.g., $\{4, 7, 12\}$ for first inversion of major).
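The inversion rule can be sketched in a few lines: move the lowest voice up an octave, once per inversion. The names `TRIADS`, `invert`, and `chord_notes` are illustrative and not taken from the dataset code.

```python
TRIADS = {
    "major": (0, 4, 7),
    "minor": (0, 3, 7),
    "augmented": (0, 4, 8),
    "diminished": (0, 3, 6),
}


def invert(offsets: tuple[int, ...], n: int) -> tuple[int, ...]:
    """Apply n inversions by raising the lowest voice an octave each time."""
    voices = list(offsets)
    for _ in range(n):
        voices = voices[1:] + [voices[0] + 12]
    return tuple(voices)


def chord_notes(root_midi: int, quality: str, inversion: int = 0) -> list[int]:
    """MIDI notes of a triad on `root_midi` in the given inversion (0, 1, or 2)."""
    return [root_midi + o for o in invert(TRIADS[quality], inversion)]


if __name__ == "__main__":
    print(invert(TRIADS["major"], 1))   # first inversion of a major triad
    print(chord_notes(60, "minor", 2))  # second inversion on MIDI note 60
```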

3. Dataset Statistics and Parameterization

SynTheory exhaustively enumerates the parameter combinations of each concept, summarized as follows:

| Concept | Number of Examples | Key Parameters |
|---|---|---|
| Tempo | 4,025 | 161 BPM × 5 timbres × 5 offsets |
| Time Signatures | 1,200 | 8 meters × 3 reverb levels × 5 timbres × 10 offsets |
| Notes | 9,936 (9,848 non-silent) | 12 pitch classes × 9 octaves × 92 instruments |
| Intervals | 39,744 | 12 roots × 12 sizes × 92 instruments × 3 styles |
| Scales | 15,456 | 7 modes × 12 roots × 92 instruments × 2 directions |
| Chords | 13,248 | 12 roots × 4 types × 92 instruments × 3 inversions |
| Chord Progressions | 20,976 | 19 progressions × 12 keys × 92 instruments |

All concept classes (tempo, meter, mode, progression, etc.) are exhaustively enumerated and class-balanced. BPM and meter are sampled uniformly, with all discrete combinations present.
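The totals can be checked by multiplying the parameter counts, as in the sketch below. The tempo total of 4,025 is reproduced only when the five percussive timbres described in Section 1 are included as a factor (161 × 5 × 5). The dictionary layout and names are illustrative.

```python
from math import prod

# Parameter counts per concept, as stated in the dataset description.
CONCEPT_FACTORS = {
    "tempo": {"bpm": 161, "timbres": 5, "offsets": 5},
    "time_signatures": {"meters": 8, "reverb": 3, "timbres": 5, "offsets": 10},
    "notes": {"pitch_classes": 12, "octaves": 9, "instruments": 92},
    "intervals": {"roots": 12, "sizes": 12, "instruments": 92, "styles": 3},
    "scales": {"modes": 7, "roots": 12, "instruments": 92, "directions": 2},
    "chords": {"roots": 12, "types": 4, "instruments": 92, "inversions": 3},
    "chord_progressions": {"progressions": 19, "keys": 12, "instruments": 92},
}

# Totals reported for each concept.
EXPECTED = {
    "tempo": 4025, "time_signatures": 1200, "notes": 9936,
    "intervals": 39744, "scales": 15456, "chords": 13248,
    "chord_progressions": 20976,
}


def example_count(factors: dict[str, int]) -> int:
    """Number of samples in a fully enumerated grid of parameters."""
    return prod(factors.values())


if __name__ == "__main__":
    for name, factors in CONCEPT_FACTORS.items():
        print(name, example_count(factors), example_count(factors) == EXPECTED[name])
```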

4. Formats, Annotation, and Data Access

Each SynTheory sample consists of:

  • Standard MIDI file (.mid)
  • Corresponding WAV file (44.1 kHz, 16-bit, mono, 4 seconds)
  • CSV/JSON metadata file:
    • Sample ID
    • Concept type (e.g. "Minor 6th")
    • Numeric class label (e.g. interval or mode index)
    • Root note or BPM
    • Instrument or timbre ID
    • Play style or inversion (if relevant)
    • Reverb level and onset offset (if relevant)

This systematic annotation permits direct supervised benchmark construction as well as model-oriented “probing” experiments.
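A metadata record with these fields might be modeled as below. This is a sketch only: every field name, the ID format, and the JSON layout are hypothetical, since the source lists the kinds of fields but not their exact keys.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SampleRecord:
    """Hypothetical per-sample metadata; field names are assumptions."""
    sample_id: str
    concept: str                      # e.g. "Minor 6th"
    class_label: int                  # e.g. interval or mode index
    root_note: Optional[str] = None
    bpm: Optional[int] = None
    instrument_id: Optional[int] = None
    play_style: Optional[str] = None
    inversion: Optional[int] = None
    reverb_level: Optional[int] = None
    onset_offset: Optional[int] = None


def to_json(record: SampleRecord) -> str:
    """Serialize a record, omitting fields that do not apply to the concept."""
    data = {k: v for k, v in asdict(record).items() if v is not None}
    return json.dumps(data, sort_keys=True)


def from_json(text: str) -> SampleRecord:
    """Rebuild a record; omitted fields fall back to None."""
    return SampleRecord(**json.loads(text))


if __name__ == "__main__":
    rec = SampleRecord("interval_0001", "Minor 6th", 8,
                       root_note="C", instrument_id=0, play_style="ascending")
    print(to_json(rec))
```

Keeping concept-specific fields optional lets one schema cover both tonal samples (root, instrument, play style) and rhythmic samples (BPM, reverb, offset).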

5. Benchmarking, Probing, and Analytic Use Cases

SynTheory functions as a core benchmark and probing dataset for audio-based machine learning tasks:

  • Probing music foundation models: The dataset delivers isolated, perfectly labeled stimuli for extracting and quantifying latent encodings in deep audio models. One can train lightweight classifiers (linear, MLP) on model activations to assess encoding of tempo, key, pitch class, chord quality, and other theory concepts.
  • MIR and theory analysis: SynTheory provides abundant, noise-free data for evaluating core music information retrieval (MIR) domains: tempo estimation, time-signature detection, symbolic note identification, interval/scale/chord classification, and chord inversion recognition.
  • Model interpretability and control: By mapping which transformer layers or heads best encode specific theory concepts, SynTheory enables construction of “concept editors”—tools or interventions that manipulate latent musical features (e.g. changing a scale or retuning a progression) at the representation level. This suggests possible strategies for controllable and fine-grained symbolic/harmonic editing in generative audio models.
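The probing setup in the first bullet can be illustrated with a small linear probe. This sketch uses logistic regression trained by plain gradient descent on synthetic stand-in "activations". It does not load any real model, and the data generator, function names, and hyperparameters are all assumptions chosen for illustration. In practice the feature vectors would be activations extracted from a layer of the model under study.

```python
import math
import random


def make_synthetic_activations(n_per_class=100, dim=8, seed=0):
    """Stand-in for model activations: two classes separated along feature 0."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        center = 1.5 if label == 1 else -1.5
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 0.5) for _ in range(dim)]
            row[0] += center
            X.append(row)
            y.append(label)
    return X, y


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(X, y, lr=0.5, epochs=200):
    """Fit weights w and bias b by full-batch gradient descent on log loss."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for row, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - label
            for i, xi in enumerate(row):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, X, y):
    """Fraction of rows whose predicted class matches the label."""
    hits = 0
    for row, label in zip(X, y):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, row)) + b > 0 else 0
        hits += pred == label
    return hits / len(X)


if __name__ == "__main__":
    X_train, y_train = make_synthetic_activations(seed=0)
    X_test, y_test = make_synthetic_activations(seed=1)
    w, b = train_linear_probe(X_train, y_train)
    print("held-out probe accuracy:", accuracy(w, b, X_test, y_test))
```

High held-out accuracy from such a simple classifier indicates that the concept is linearly decodable from the representation. Comparing accuracy across layers is one way to locate where a concept is encoded.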

6. Context and Significance for Music AI Research

SynTheory is designed to address deficiencies in real-world musical model interpretability by providing a controlled, exhaustive, and scalable evaluation corpus. Its fully synthetic nature eliminates confounds associated with performance style, acoustics, and compositional intent, focusing analysis strictly on the specified concept.

Its design enables rigorous benchmarking for both analytic (e.g. computational musicology, pitch detection, inversion recognition) and generative (e.g. transformer-based music language models, controllable generation) research. SynTheory thereby advances the study of how neural audio models internalize and represent low-level symbolic knowledge, and serves as a foundational resource for interpretability and music information retrieval research (Wei et al., 1 Oct 2024).
