
SonicMaster Dataset for Music Restoration

Updated 7 August 2025
  • SonicMaster dataset is a large-scale, text-conditioned corpus that pairs degraded and high-quality music recordings with natural language prompts for restoration and mastering.
  • It employs 19 degradation functions to create 7 derived versions per clip, simulating diverse artifacts across multiple genres.
  • The dataset underpins a flow-matching generative training paradigm, yielding robust restoration outcomes validated by both objective and subjective evaluation metrics.

The SonicMaster dataset is a large-scale, text-conditioned corpus created explicitly to support research and development in controllable music restoration and mastering. It comprises paired degraded and high-quality musical recordings, constructed to enable both automated and text-guided correction of a broad range of audio artifacts through a unified generative modeling approach. The dataset was developed to serve as the foundation for the SonicMaster model, which executes both restoration and mastering tasks on diverse musical material with fine-grained specificity (Melechovsky et al., 5 Aug 2025).

1. Dataset Construction and Structure

The SonicMaster dataset was compiled starting from a pool of approximately 580,000 musical recordings sourced from Jamendo. From this source material, curators selected roughly 25,000 high-quality, 30-second excerpts representing ten major genre clusters, including but not limited to rock, hip-hop, classical, and electronic music.

Each clean clip in the dataset is systematically degraded using a suite of 19 distinct degradation functions, which are organized into five enhancement categories: equalization (EQ), dynamics, reverb, amplitude, and stereo. For every clean segment, seven derived instances are generated—comprising four single degradations, two double-degradation composites, and a single triple-degradation instance—ensuring a diverse sampling of artifact types and co-occurrences.
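The derivation scheme can be sketched as follows. Only the counts (19 degradation functions in five classes, seven derived versions in a 4/2/1 split) come from the dataset description; the function names, the registry layout, and the random sampling policy below are illustrative assumptions, not the authors' released pipeline.

```python
import random

# Partial, illustrative registry of the 19 degradation functions, grouped into
# the five enhancement classes named in the dataset (names are assumptions).
DEGRADATIONS = {
    "eq": ["brighten", "darken", "airy", "boomy", "muddy", "warm",
           "vocal_presence", "low_pass", "mic_coloration", "multiband_eq"],
    "dynamics": ["over_compress", "over_punch"],
    "reverb": ["reverb_small", "reverb_large", "reverb_mixed", "reverb_openair_ir"],
    "amplitude": ["hard_clip", "volume_drop"],
    "stereo": ["destereo"],
}
ALL_FUNCS = [name for group in DEGRADATIONS.values() for name in group]

def derive_versions(clean_clip, apply_fn, rng=random):
    """Return the 7 derived versions of one clean 30-second clip:
    four single, two double, and one triple degradation.
    `apply_fn(clip, name)` is a stand-in for running an actual degradation."""
    versions = []
    for n_degradations in [1, 1, 1, 1, 2, 2, 3]:
        chosen = rng.sample(ALL_FUNCS, n_degradations)
        degraded = clean_clip
        for name in chosen:
            degraded = apply_fn(degraded, name)
        versions.append({"audio": degraded, "degradations": chosen})
    return versions
```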

  • Clean clips: ~25,000
  • Genre groups: 10
  • Degradation functions: 19
  • Enhancement classes: 5
  • Derived versions per clip: 7

This multi-dimensional augmentation schema results in a corpus that is both broad in genre and deep in artifact variability, positioning it uniquely for all-in-one audio restoration and mastering research.

2. Degradation Taxonomy and Simulation

The dataset’s primary technical distinction is its defined taxonomy of audio degradations, mapped systematically to real-world scenarios encountered in amateur music production.

  • EQ (Equalization): Simulated spectral imbalances include brightness, darkness, airiness, boominess, muddiness, warmth, vocal presence (midband accentuation), reduced clarity (via low-pass filtering), microphone coloration (convolution with 20 variants of Poliphone transfer functions), and parametrized multi-band EQ curves.
  • Dynamics: Compression and punch effects are artificially induced, yielding flattened transient profiles or overemphasized attacks.
  • Reverb: Reverberation is synthesized in four modes—three via Pyroomacoustics (emulating small, large, and mixed acoustic environments) and one using empirical impulse responses from the openAIR library.
  • Amplitude: Includes hard clipping (peak limiting causing harmonic distortion) and volume reduction (increased noise floor).
  • Stereo: Destereo functionality collapses stereo width, implemented by merging the left and right channels contingent on an initial width assessment (the amplitude and stereo degradations are sketched below).
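As a concrete illustration of the two simplest classes, the sketch below implements hard clipping (amplitude) and destereo (stereo) with NumPy. The clipping threshold and the stereo-width test are assumptions for illustration, not the dataset's exact parameters.

```python
import numpy as np

def hard_clip(audio: np.ndarray, threshold: float = 0.3) -> np.ndarray:
    """Amplitude degradation: peak-limit the waveform, introducing harmonic
    distortion. The threshold value is illustrative, not the dataset's setting."""
    return np.clip(audio, -threshold, threshold)

def destereo(audio: np.ndarray, width_threshold: float = 0.05) -> np.ndarray:
    """Stereo degradation: collapse stereo width by averaging the channels.
    `audio` has shape (2, num_samples); the width test is an assumption."""
    left, right = audio[0], audio[1]
    side = 0.5 * (left - right)               # side signal carries the stereo width
    if np.sqrt(np.mean(side ** 2)) < width_threshold:
        return audio                          # already (near-)mono: nothing to collapse
    mono = 0.5 * (left + right)
    return np.stack([mono, mono])             # both channels receive the mid signal
```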

Each artifacted audio segment is paired with a canonical, natural-language restoration prompt, reflecting either explicit correction instructions (e.g., “reduce the strong echo”) or requests for particular sonic traits. Prompts are tokenized and encoded for use in conditioning generative models.

3. Text Conditioning for Controlled Restoration and Mastering

A central innovation of the SonicMaster dataset is the pairing of each degraded audio sample with a corresponding natural language instruction. These text prompts are processed using a frozen FLAN-T5 encoder, producing text embeddings that are consumed by the restoration model. The language-conditioned corpus enables training of models that not only generalize across audio degradation types but can also execute user-specified enhancements, such as “make the vocals stand out” or “add more brightness.” When no prompt is provided, the model defaults to unsupervised, perceptually guided restoration and mastering. This text-to-audio mapping makes the dataset particularly suitable for research into human-in-the-loop and adaptive content-aware audio processing.
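A minimal sketch of this conditioning path follows, using the Hugging Face transformers API. The paper specifies a frozen FLAN-T5 encoder; the particular checkpoint and the use of per-token embeddings here are assumptions.

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# The checkpoint size is an assumption; the paper does not state which FLAN-T5 variant is used.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
text_encoder = T5EncoderModel.from_pretrained("google/flan-t5-base")
text_encoder.eval()
for p in text_encoder.parameters():
    p.requires_grad_(False)                   # frozen encoder: no gradient updates

def encode_prompt(prompt: str) -> torch.Tensor:
    """Return per-token text embeddings used to condition the restoration model."""
    tokens = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = text_encoder(**tokens)
    return out.last_hidden_state              # shape: (1, num_tokens, hidden_dim)

c_text = encode_prompt("Reduce the strong echo and make the vocals stand out.")
```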

4. Generative Training Paradigm

The SonicMaster model—a multimodal implementation based on the DiT architecture—is trained on the dataset using a flow-matching generative framework (rectified flow paradigm). Both clean and degraded audio are encoded into a latent space using a VAE. Training proceeds as follows:

  • Latents from degraded (x₁) and clean (x₀) pairs are linearly interpolated:

x_t = t \cdot x_1 + (1 - t) \cdot x_0

where t is a scalar sampled from the non-uniform distribution

p(t) = 0.5 \cdot U(t) + 0.5 \cdot t

(Equation 2 from (Melechovsky et al., 5 Aug 2025)), emphasizing extreme degradations.

  • The supervisory signal is the velocity v_t = x_0 - x_1.
  • The model f_\theta predicts this velocity from x_t, conditioned on t and the text embedding c_text:

L(\theta) = \mathbb{E}_{t, x_1, x_0} \left\| f_\theta(x_t, t, c_{\text{text}}) - v_t \right\|_2^2

  • During inference, the predicted velocities are integrated (e.g., with forward Euler steps) to map the degraded latent back to a cleaned, mastered version corresponding to the specified condition (one training step and this integration are sketched below).
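The PyTorch sketch below implements one training step and the Euler integration implied by the equations above. The model call signature, the latent shapes, and the way t is drawn from the mixture p(t) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def sample_t(batch_size: int) -> torch.Tensor:
    """Draw t from the non-uniform mixture described above: with probability 0.5
    sample uniformly, otherwise from a ramp density favoring t near 1
    (inverse-CDF sampling via the square root of a uniform draw).
    This is one interpretation of Equation 2, stated here as an assumption."""
    uniform = torch.rand(batch_size)
    ramp = torch.sqrt(torch.rand(batch_size))
    use_ramp = torch.rand(batch_size) < 0.5
    return torch.where(use_ramp, ramp, uniform)

def training_step(model, optimizer, x0, x1, c_text):
    """x0: clean latents, x1: degraded latents, both (B, C, T); c_text: prompt embeddings."""
    t = sample_t(x0.shape[0]).to(x0.device).view(-1, 1, 1)
    x_t = t * x1 + (1.0 - t) * x0            # linear interpolation of the latent pair
    v_target = x0 - x1                        # velocity pointing from degraded to clean
    v_pred = model(x_t, t.flatten(), c_text)  # assumed call signature
    loss = F.mse_loss(v_pred, v_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def restore(model, x1, c_text, num_steps: int = 50):
    """Forward-Euler integration from the degraded latent (t = 1) toward the
    clean latent (t = 0) using the predicted velocities; decode with the VAE afterwards."""
    x = x1
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0],), 1.0 - i * dt, device=x.device)
        v = model(x, t, c_text)
        x = x + dt * v                        # step along the predicted velocity field
    return x
```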

This training regime is designed to facilitate accurate restoration and enhancement even for heavily degraded inputs, as the interpolated training samples densely cover the artifact continuum.

5. Evaluation Protocols and Results

The effectiveness of restoration and enhancement is assessed by both objective and perceptual criteria:

  • Objective Metrics:
    • Fréchet Audio Distance (FAD) computed on CLAP embeddings to quantify generative fidelity relative to undistorted reference distributions.
    • Kullback–Leibler (KL) divergence and Structural Similarity Index (SSIM) on mel-spectrograms for signal-level accuracy.
    • Production Quality (PQ) derived using the Audiobox Aesthetics toolbox, evaluating mastering characteristics.
    • Artifact-aware metrics for class-specific restoration quality: cosine distance for EQ, frame-level RMS for dynamics, Euclidean modulation spectrum distance for reverb, spectral flatness for amplitude artifacts, and RMS stereo ratios for spatial imaging (two of the objective measures are sketched after this list).
  • Subjective Metrics:
    • Human listening tests engaging domain experts, assessing text relevance, audio quality, enhancement consistency, and explicit preference on a 7-point Likert scale.
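As an illustration, two of the objective measures, SSIM on log-mel spectrograms and spectral flatness, can be computed with librosa and scikit-image. Parameter choices here (sample rate, mel bands) are illustrative rather than the paper's exact settings.

```python
import librosa
import numpy as np
from skimage.metrics import structural_similarity

def mel_ssim(reference: np.ndarray, restored: np.ndarray, sr: int = 44100) -> float:
    """Signal-level accuracy: SSIM between log-mel spectrograms of reference and output."""
    def log_mel(y):
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
        return librosa.power_to_db(mel, ref=np.max)
    a, b = log_mel(reference), log_mel(restored)
    data_range = max(a.max() - a.min(), b.max() - b.min())
    return structural_similarity(a, b, data_range=data_range)

def mean_spectral_flatness(y: np.ndarray) -> float:
    """Amplitude-artifact proxy: clipping raises spectral flatness, so a restored
    clip should move back toward the reference clip's flatness."""
    return float(np.mean(librosa.feature.spectral_flatness(y=y)))
```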

Reported results indicate consistent improvement across all degradation classes, with listeners preferring SonicMaster-processed outputs over their degraded counterparts. The integration of text conditioning enables nuanced improvements, and the model performs robustly on both general and highly targeted restoration tasks.

6. Implications and Research Applications

The SonicMaster dataset provides a unified platform for developing and benchmarking text-conditioned, all-in-one music restoration and mastering systems. Its construction, comprehensive artifact diversity, and prompt-conditioned supervision paradigm enable research into controllable audio transformation, automation of high-level music production tasks, and context-aware perceptual restoration.

The dataset’s design facilitates several lines of inquiry:

  • Comparative studies of generative restoration models across various degradation and genre combinations.
  • Investigation of prompt efficacy and granularity for user-driven sound engineering.
  • Transfer learning scenarios within music and broader audio domains, leveraging the scale and multimodal conditioning of the dataset.

A plausible implication is that datasets with this degree of artifact diversity and prompt alignment may accelerate the convergence of automated audio engineering systems toward human-level flexibility and specificity in restoration and mastering.

7. Context within Audio and Machine Listening Research

The SonicMaster dataset and associated methodology represent a marked evolution relative to datasets designed for classification, separation, or single-artifact restoration. Its flow-matching training regime and prompt-aligned data bridge generative modeling (notably DiT-inspired architectures with VAE backbones) with real-world music production needs. In combination with its rigorous objective and subjective evaluation framework, SonicMaster establishes a new standard for scalable research in mixing, mastering, and artifact correction. Its development situates it at the intersection of machine listening, generative modeling, and text-conditioned audio control (Melechovsky et al., 5 Aug 2025).
