MusicBench: Text-to-Music Generation Dataset

Updated 30 June 2025
  • MusicBench is a comprehensive dataset that pairs 10-second music clips with detailed natural language captions and explicit musical controls like chord, key, and tempo.
  • It employs advanced augmentation methods, including pitch shifts, tempo modulation, and caption paraphrasing, to enhance dataset diversity and robustness.
  • The dataset supports state-of-the-art text-to-music generation models by enabling fine-grained control and objective evaluation of musical features.

MusicBench is a large-scale, publicly available benchmark dataset developed for training and evaluating controllable text-to-music generation systems. Specifically designed to address the gap in datasets pairing music audio with rich, musically meaningful natural language descriptions, MusicBench enables fine-grained conditional generation tasks that are not possible with earlier, smaller, or less structured resources. It forms the foundation of controllable music generation models such as Mustango, supporting state-of-the-art performance and robust objective evaluation protocols.

1. Dataset Construction and Composition

MusicBench comprises 52,768 audio–text pairs, each consisting of a 10-second music clip and an associated multi-sentence textual description. The audio clips are sourced and expanded from the MusicCaps dataset, which supplied expert-authored captions for approximately 5,500 diverse music excerpts; MusicBench expands this roughly tenfold through a rigorous augmentation pipeline.

Each entry contains:

  • Music Audio: 10-second clip (from original or augmented sources)
  • Multi-sentence Caption: Describing mood, instrumentation, style, and high-level character
  • Music-Theory-Based Control Sentences: Explicit, machine-readable information on chord progression, key, tempo (either BPM or descriptive Italian tempo markings), beat count/time signature, and dynamic features (e.g., crescendo/decrescendo)
  • Augmentation Metadata: Indicating any modification to pitch, tempo, or volume relative to the base sample

A typical instance blends a natural, human-readable caption (“Smooth jazz with saxophone and brushed drums...”) with appended structured control sentences (e.g., “The chord progression in this song is Dm7, G7, Cmaj7. The tempo of this song is Adagio. The key of this song is C major.”).
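For concreteness, a hypothetical record with this structure might look as follows in Python. The field names are illustrative for the sketch, not the dataset's exact schema.

```python
# Illustrative only: a hypothetical MusicBench-style record mirroring the fields above.
example_entry = {
    "audio_path": "data/musicbench/clip_00042.wav",  # 10-second clip
    "caption": "Smooth jazz with saxophone and brushed drums, relaxed and mellow in character.",
    "control_sentences": [
        "The chord progression in this song is Dm7, G7, Cmaj7.",
        "The tempo of this song is Adagio.",
        "The key of this song is C major.",
        "The beat counts to 4.",
    ],
    # Augmentation metadata relative to the base sample
    "augmentation": {"pitch_shift_semitones": -2, "tempo_factor": 1.0, "volume_envelope": None},
}

# A training prompt is typically the caption followed by (a subset of) the control sentences.
prompt = example_entry["caption"] + " " + " ".join(example_entry["control_sentences"])
```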

2. Data Augmentation and Description Enrichment Techniques

The dataset employs a multifaceted augmentation strategy to increase diversity, enhance robustness, and equip the resource for conditional generation:

  • Pitch Shifts: Each original clip is transposed up to ±3 semitones (yielding 6 additional pitch variants per track).
  • Tempo Modulation: Speeds are altered by ±(5–25)% on each audio excerpt, generating rhythmic and expressive variants.
  • Volume Envelopes: Crescendo or decrescendo is applied algorithmically, expanding the dataset's dynamic coverage.
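A minimal sketch of these three audio augmentations is shown below, assuming librosa and soundfile are available; the actual MusicBench pipeline may differ in gain ranges and implementation details.

```python
# Sketch of the three audio augmentations listed above (pitch, tempo, volume envelope).
import numpy as np
import librosa
import soundfile as sf

def augment_clip(path, out_path, semitones=0, tempo_change=0.0, crescendo=None):
    """semitones in [-3, 3]; tempo_change as a fraction in [-0.25, 0.25];
    crescendo=True for gradually louder, False for softer, None for unchanged."""
    y, sr = librosa.load(path, sr=None)
    if semitones:
        y = librosa.effects.pitch_shift(y, sr=sr, n_steps=semitones)
    if tempo_change:
        y = librosa.effects.time_stretch(y, rate=1.0 + tempo_change)
    if crescendo is not None:
        lo, hi = (0.3, 1.0) if crescendo else (1.0, 0.3)  # assumed gain range
        y = y * np.linspace(lo, hi, num=len(y))
    sf.write(out_path, y, sr)

# e.g. transpose down 2 semitones, speed up 10%, apply a decrescendo:
# augment_clip("clip.wav", "clip_aug.wav", semitones=-2, tempo_change=0.10, crescendo=False)
```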

These augmentations are mirrored in the captions: the corresponding control sentences accurately reflect the musical changes (e.g., updates to the BPM, or to the key after a pitch shift). Textual diversity is further boosted through paraphrasing with LLMs (such as ChatGPT), reducing caption repetition and simulating real-world prompt variability. For robustness in downstream models, random omission and reordering of control sentences are introduced during prompt construction, ensuring that models learn to handle incomplete or noisy input.
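The prompt-side noise can be sketched as follows; the 50% keep probability is an assumed value for illustration, not a documented parameter of the dataset.

```python
# Sketch of prompt construction with random dropout and reordering of control sentences.
import random

def build_prompt(caption, control_sentences, keep_prob=0.5, shuffle=True, seed=None):
    rng = random.Random(seed)
    kept = [s for s in control_sentences if rng.random() < keep_prob]  # random omission
    if shuffle:
        rng.shuffle(kept)                                              # random reordering
    return " ".join([caption] + kept)

controls = [
    "The chord progression in this song is Dm7, G7, Cmaj7.",
    "The tempo of this song is Adagio.",
    "The key of this song is C major.",
]
print(build_prompt("Smooth jazz with saxophone and brushed drums.", controls, seed=0))
```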

Four canonical MIR (Music Information Retrieval) toolchains—BeatNet (beats/downbeats), Chordino (chords), Essentia KeyExtractor (key), and custom tempo estimation—are used to extract musicological features. These are verbalized using standardized templates (e.g., “The time signature is 3/4”, “The chord progression is Am, F, C, G”).
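The verbalization step itself amounts to filling fixed templates with the extracted values. The helper below is an illustrative sketch using the template wording quoted in this article; the extraction of chords, key, and tempo is assumed to have been done by the MIR tools named above.

```python
# Sketch: render MIR-extracted features into standardized control sentences.
def verbalize_controls(chords, key, bpm, beat_count, time_signature):
    return [
        f"The chord progression is {', '.join(chords)}.",
        f"This song is in the key of {key}.",
        f"The bpm is {bpm}.",
        f"The beat counts to {beat_count}.",
        f"The time signature is {time_signature}.",
    ]

print(verbalize_controls(["Am", "F", "C", "G"], "A minor", 92, 3, "3/4"))
```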

3. Explicit Musical Control in Text Descriptions

A defining feature of MusicBench is the inclusion of highly explicit, musically grounded control signals within the text descriptions. Unlike earlier captioning datasets that restrict themselves to mood, genre, or “vibe,” MusicBench prompts can condition models on:

  • Chord Progression: Enumerated in standard labeling (e.g., “Am, F, C, G”)
  • Key: Specified as root note and mode (“A minor”, “G major”)
  • Tempo: Either numerically (BPM) or musically (“Allegro,” “Adagio”)
  • Time Signature/Beat Count: e.g., “The beat counts to 3,” “Time signature is 6/8”
  • Dynamic Features: e.g., “This song gets gradually louder,” “Volume decreases over time”

This configuration enables training and evaluation of models for genuinely conditional music generation with harmonic, rhythmic, and structural guidance, a level of granularity unmatched by prior public datasets.
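As an illustration of how such prompts can be consumed programmatically, the sketch below recovers the requested controls from a MusicBench-style prompt, e.g., so they can later be compared against features extracted from the generated audio. The regular expressions assume the template wording quoted in this article and are illustrative, not an official parser.

```python
# Sketch: parse requested controls (chords, key, bpm, beat count) from a prompt.
import re

def parse_controls(prompt):
    controls = {}
    m = re.search(r"chord progression (?:in this song )?is ([^.]+)\.", prompt)
    if m:
        controls["chords"] = [c.strip() for c in m.group(1).split(",")]
    m = re.search(r"key of this song is ([^.]+)\.", prompt)
    if m:
        controls["key"] = m.group(1).strip()
    m = re.search(r"bpm (?:of this song )?is (\d+)", prompt)
    if m:
        controls["bpm"] = int(m.group(1))
    m = re.search(r"beat counts to (\d+)", prompt)
    if m:
        controls["beats"] = int(m.group(1))
    return controls

prompt = ("Smooth jazz with saxophone. The chord progression in this song is Dm7, G7, Cmaj7. "
          "The key of this song is C major. The bpm is 70. The beat counts to 4.")
print(parse_controls(prompt))
```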

4. Distinctions and Comparative Analysis

MusicBench represents a significant advance in size, coverage, and functional richness compared to prior datasets:

| Dataset | Size | Audio | Caption Detail | Control Info | Public? |
|---|---|---|---|---|---|
| MusicCaps | ~5,500 | 10 s clips | Rich, natural | No | Yes |
| AudioCaps | ~46,000 | 10 s (mixed) | Concise, general | No | Yes |
| MusicBench | 52,768 | 10 s (augmented) | Rich & paraphrased | Chord, key, tempo, etc. | Yes |
| MusicGen | Private | 10 s+ | General, modeled | Weak (melody only) | No (training data) |
| Mustango | 52,768 | 10 s (augmented) | Rich, musically precise | Full control | Yes |

Compared to MusicCaps, MusicBench is roughly ten times larger; relative to both MusicCaps and AudioCaps, it adds explicit, structured musical controls and robust paraphrasing and augmentation for improved generalization. Unlike the proprietary training data used by some state-of-the-art models (e.g., MusicGen), MusicBench is fully publicly released, allowing transparent benchmarking and reproducible research in controllable text-to-music generation.

5. Applications and Evaluation Protocols

MusicBench is intended for multiple research and development scenarios:

  • Text-to-Music Generation: Training and evaluation of generative models capable of obeying both qualitative (mood, genre) and quantitative (chords, tempo) textual prompts.
  • Fine-Grained Controllability: Benchmarking and systematic assessment of models’ ability to follow explicit structural instructions, e.g., generating audio with a specified chord progression or in a specified key.
  • Robustness Testing: Allowing assessment of generation under incomplete, noisy, or conflicting prompts due to its deliberate prompt dropout and paraphrase variance.
  • Music Information Retrieval: Providing data for secondary tasks such as chord/key extraction and MIR tool evaluation, enabled by its synthetically varied and annotated audio.

Performance evaluation commonly employs both human and objective metrics. For objective evaluation, the alignment between requested and generated features (such as tempo, key, and chord accuracy) is computed using MIR tools. For overall generation quality, standard metrics are employed, including Fréchet Audio Distance (FAD) and Kullback–Leibler (KL) divergence between distributions of generated and real audio feature embeddings. These metrics have already facilitated in-depth studies of the tradeoffs between model compressibility and generation quality for end-to-end text-to-music architectures using MusicBench as a core benchmark.
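For the FAD term, the computation reduces to the Fréchet distance between two Gaussians fitted to embedding sets. Below is a minimal numpy/scipy sketch; the choice of embedding model (e.g., VGGish) and any preprocessing are assumptions left outside the function.

```python
# Sketch: Fréchet Audio Distance between real and generated audio embeddings.
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(real_emb, gen_emb):
    """real_emb, gen_emb: arrays of shape (n_clips, emb_dim)."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can produce a complex result
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# fad = frechet_audio_distance(real_embeddings, generated_embeddings)
```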

6. Technical Details and Example Formulas

The extraction and encoding of control features are operationalized as follows:

  • Beat Extraction: $b \in \mathbb{R}^{L_\text{beats} \times 2}$, where each row encodes beat type and timestamp.
  • Chord Extraction: $c \in \mathbb{R}^{L_\text{chords} \times 4}$, structured as (root, type, inversion, timestamp).
  • Chord Encoder Input:

$$\mathrm{Enc}^c(c) := W_c \big(\mathrm{FME}(c[:,0]) \oplus \mathrm{OH}_t(c[:,1]) \oplus \mathrm{OH}_i(c[:,2]) \oplus \mathrm{MPE}(c[:,3])\big)$$

where $\mathrm{FME}$ denotes a music-specific (fundamental music) embedding of the chord root, $\mathrm{OH}_t$ and $\mathrm{OH}_i$ are one-hot encodings of chord type and inversion, $\mathrm{MPE}$ is a music positional encoding of chord timing, $\oplus$ denotes concatenation, and $W_c$ is a learned linear projection.
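To make the encoder structure concrete, the following is a minimal PyTorch sketch with the same shape as $\mathrm{Enc}^c$ above. It is not Mustango's implementation: a learned embedding stands in for FME, a sinusoidal encoding of chord onset time stands in for MPE, and the dimensions (n_types, d_out, etc.) are placeholder values.

```python
# Hedged sketch of a chord encoder with the structure of Enc^c above.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChordEncoderSketch(nn.Module):
    def __init__(self, n_roots=12, n_types=16, n_inversions=4,
                 d_root=64, d_time=64, d_out=512):
        super().__init__()
        self.root_emb = nn.Embedding(n_roots, d_root)  # stand-in for FME
        self.n_types, self.n_inversions, self.d_time = n_types, n_inversions, d_time
        self.W_c = nn.Linear(d_root + n_types + n_inversions + d_time, d_out)  # W_c

    def music_positional_encoding(self, t):
        # Sinusoidal encoding of chord onset times (stand-in for MPE).
        half = self.d_time // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
        angles = t.unsqueeze(-1) * freqs                # (L_chords, d_time/2)
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

    def forward(self, chords):
        # chords: (L_chords, 4) rows of (root_idx, type_idx, inversion_idx, onset_time)
        root = self.root_emb(chords[:, 0].long())
        ctype = F.one_hot(chords[:, 1].long(), self.n_types).float()
        inv = F.one_hot(chords[:, 2].long(), self.n_inversions).float()
        time = self.music_positional_encoding(chords[:, 3].float())
        return self.W_c(torch.cat([root, ctype, inv, time], dim=-1))

# Example: encode the progression Am, F, C, G at beats 0, 4, 8, 12.
enc = ChordEncoderSketch()
chords = torch.tensor([[9, 0, 0, 0.0], [5, 0, 0, 4.0], [0, 0, 0, 8.0], [7, 0, 0, 12.0]])
print(enc(chords).shape)  # torch.Size([4, 512])
```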

Control sentences are templated for systematic processing in both training and inference:

  • "The chord progression is ss."
  • "This song is in the key of kk."
  • "The bpm is vv."
  • "The beat counts to bb."

7. Practical Impact and Future Directions

MusicBench closes a critical gap in music AI by delivering a public dataset at a scale and musical depth sufficient for state-of-the-art, control-focused text-to-music generation and evaluation. Its design supports research in prompt controllability, generalization under prompt sparsity, MIR feature extraction, and frame-accurate conditional composition.

Potential future expansions include longer audio segments, broader genre/language inclusion, and higher-level musical control (e.g., structure templates, melodic anchors). As models and requirements in symbolic and audio music generation evolve, the structural foundations laid by MusicBench position it as a reference dataset for both academic benchmarking and practical creative tool development in controllable music AI.