- The paper integrates diffusion and consistency models into a deterministic source extraction architecture to enhance the quality of separated audio sources.
- The methodology incorporates a diffusion model to enhance sources and uses Consistency Distillation for single-step, high-speed inference.
- Key results show the Consistency Distillation model achieved significant quality improvements, yielding up to a 6.5 dB SI-SDR gain over baseline source extraction methods.
The paper addresses the problem of musical source extraction by integrating a score-matching diffusion model into a deterministic architecture to enhance audio quality. The authors focus on improving the performance of deterministic models, which often suffer from imperfect separation and source leakage due to the intricate relationships between musical sources and masking effects. To mitigate the slow iterative sampling inherent in diffusion models, the paper employs Consistency Distillation (CD) to reduce the sampling process to a single step, achieving comparable or superior performance.
Key aspects of the paper include:
- Problem Statement: Isolating specific sound sources from a mixture of audio signals, emphasizing the challenges posed by music's interdependent sources and masking effects.
- Methodology:
- A deterministic model $f_\theta$ is trained using a U-Net encoder-decoder architecture conditioned on the mixture signal $x_{\text{mix}}$ and the instrument label $s$. The loss function is defined as:
$$\mathcal{L}(\theta) = \mathbb{E}_{s,\, x_{\text{mix}}} \left\| x_s - \hat{x}_s^{\text{det}} \right\|_2^2$$
where:
- $\mathcal{L}(\theta)$ is the loss function.
- $x_s$ is the true source.
- $\hat{x}_s^{\text{det}} = f_\theta(x_{\text{mix}}, s)$ is the prediction for source $s$.
- $f_\theta$ is the deterministic source extraction model.
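To make this objective concrete, below is a minimal PyTorch-style sketch of the deterministic training loss, assuming a waveform-domain U-Net callable on the mixture and a one-hot instrument label; the function name, conditioning interface, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch.nn.functional as F

def deterministic_loss(f_theta, x_mix, x_src, inst_onehot):
    """MSE objective L(theta) = E || x_s - f_theta(x_mix, s) ||_2^2.

    x_mix:       (B, 1, N) mixture waveforms
    x_src:       (B, 1, N) ground-truth source waveforms for instrument s
    inst_onehot: (B, S)    one-hot instrument labels (assumed conditioning format)
    f_theta:     U-Net-style extractor, assumed callable as f_theta(x_mix, inst_onehot)
    """
    x_hat = f_theta(x_mix, inst_onehot)  # predicted source waveform, (B, 1, N)
    return F.mse_loss(x_hat, x_src)
```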
- A denoising score-matching diffusion model is then incorporated to enhance the extracted sources. The diffusion model $g_\phi$ takes four inputs: a noisy version of $x_s$, the target label $s$, the noise scale $\sigma$, and intermediate features $\bar{x}_s^{\text{det}}$ extracted by the deterministic model. The model is trained with the Denoising Score Matching (DSM) loss:
$$\mathcal{L}_{\text{DSM}}(\phi) = \mathbb{E}_{s,\, x_{\text{mix}},\, t} \left\| x_s - g_\phi\!\left(x_s + \sigma_t \epsilon,\; s,\; \sigma_t,\; \bar{x}_s^{\text{det}}\right) \right\|_2^2$$
where:
- $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise.
- $\sigma_t := \sigma(t)$ is a monotonically increasing function defining the noise scale at step $t$.
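As an illustration of the DSM objective, the sketch below (PyTorch) noises the clean source at a sampled scale and regresses the diffusion model's output back to it, conditioned on the label and on features from the frozen deterministic model; the log-uniform noise draw and the `g_phi` call signature are assumptions standing in for the paper's schedule $\sigma(t)$ and architecture.

```python
import math
import torch
import torch.nn.functional as F

def dsm_loss(g_phi, det_features, x_mix, x_src, inst_onehot,
             sigma_min=0.002, sigma_max=1.0):
    """Denoising score-matching loss: recover x_src from a noised copy."""
    b = x_src.shape[0]
    with torch.no_grad():  # deterministic model acts as a frozen feature extractor
        feats = det_features(x_mix, inst_onehot)
    # Sample a per-example noise scale (log-uniform stand-in for sigma(t)).
    log_sigma = torch.empty(b, device=x_src.device).uniform_(
        math.log(sigma_min), math.log(sigma_max))
    sigma_t = log_sigma.exp().view(b, 1, 1)  # broadcast over (B, 1, N)
    eps = torch.randn_like(x_src)
    x_noisy = x_src + sigma_t * eps
    x_denoised = g_phi(x_noisy, inst_onehot, sigma_t, feats)
    return F.mse_loss(x_denoised, x_src)
```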
- Consistency Distillation (CD) is applied to accelerate the generation process, training a consistency model $g_\omega$ using the pre-trained diffusion model $g_\phi$ as a teacher. The CD loss is:
$$\mathcal{L}_{\text{CD}}(\omega) = \mathbb{E}_{t,\, h} \Big\| \underbrace{g_{\text{sg}(\omega)}\!\left(\hat{x}_{s,\, t-h}^{\text{dif}},\; s,\; \sigma_{t-h},\; \bar{x}_s^{\text{det}}\right)}_{\text{target}} - \underbrace{g_\omega\!\left(x_s + \epsilon \sigma_t,\; s,\; \sigma_t,\; \bar{x}_s^{\text{det}}\right)}_{\text{prediction}} \Big\|_2^2$$
where:
- $h \in [1, t]$ is the number of ODE steps used in the distillation process.
- $\hat{x}_{s,\, t-h}^{\text{dif}}$ is obtained by running $h$ ODE solver steps of the teacher $g_\phi$ from the noisy sample $x_s + \epsilon \sigma_t$ down to noise level $\sigma_{t-h}$.
- $\text{sg}(\omega)$ denotes the stop-gradient running EMA of $\omega$.
- The final loss function combines the CD loss with the DSM loss:
$$\mathcal{L}(\omega) = \mathcal{L}_{\text{CD}}(\omega) + \lambda_{\text{DSM}}\, \mathcal{L}_{\text{DSM}}(\omega)$$
where $\lambda_{\text{DSM}}$ is a balancing term.
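The sketch below illustrates one consistency-distillation update under the combined loss, assuming PyTorch, a Karras-style probability-flow ODE with Euler steps for the teacher, and an EMA copy of the student as the stop-gradient target; the network interfaces, noise schedule, and solver are assumptions rather than the paper's exact recipe.

```python
import random
import torch
import torch.nn.functional as F

def ema_update(g_ema, g_omega, decay=0.999):
    """Running EMA of the student weights, used as the stop-gradient target sg(omega)."""
    with torch.no_grad():
        for p_ema, p in zip(g_ema.parameters(), g_omega.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1 - decay)

def cd_loss(g_omega, g_ema, g_phi, x_src, inst_onehot, feats, sigmas, lam_dsm=1.0):
    """Combined L_CD + lambda_DSM * L_DSM for one sampled (t, h) pair.

    g_omega: student consistency model (trainable)
    g_ema:   EMA copy of the student (kept frozen here)
    g_phi:   frozen pre-trained diffusion teacher
    sigmas:  increasing list of noise scales, sigmas[0] close to 0
    """
    t = random.randint(1, len(sigmas) - 1)   # noise index
    h = random.randint(1, t)                 # number of teacher ODE steps
    sigma_t, sigma_th = sigmas[t], sigmas[t - h]

    eps = torch.randn_like(x_src)
    x_t = x_src + sigma_t * eps              # noisy input at level sigma_t

    # Teacher: h Euler steps of the probability-flow ODE from sigma_t down to sigma_{t-h}.
    x_cur = x_t
    with torch.no_grad():
        for k in range(h):
            s_hi, s_lo = sigmas[t - k], sigmas[t - k - 1]
            denoised = g_phi(x_cur, inst_onehot, s_hi, feats)
            d = (x_cur - denoised) / s_hi
            x_cur = x_cur + (s_lo - s_hi) * d
        target = g_ema(x_cur, inst_onehot, sigma_th, feats)  # stop-gradient target

    pred = g_omega(x_t, inst_onehot, sigma_t, feats)          # single-step prediction
    loss_cd = F.mse_loss(pred, target)
    loss_dsm = F.mse_loss(pred, x_src)       # auxiliary denoising term on the student
    return loss_cd + lam_dsm * loss_dsm
```

After each optimizer step on $\omega$, `ema_update` would be called so that the target network tracks a slow-moving average of the student.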
- Dataset: The models were trained and tested on the Slakh2100 dataset, focusing on bass, drums, guitar, and piano.
- Baselines: The proposed model is compared against Demucs, Demucs+Gibbs, and MSDM.
- Evaluation Metric: Scale-invariant SDR improvement (SI-SDRI) is used to evaluate performance, computed over 4-second chunks with a 2-second overlap.
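For reference, a minimal PyTorch sketch of SI-SDR and its chunk-wise improvement over the unprocessed mixture is given below; the SI-SDR formula is standard, while the simple mean over chunks and the use of the mixture as the baseline signal are assumptions of this sketch.

```python
import torch

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB for 1-D waveforms (standard definition)."""
    ref = ref - ref.mean()
    est = est - est.mean()
    alpha = torch.dot(est, ref) / (torch.dot(ref, ref) + eps)  # optimal scaling
    target = alpha * ref
    noise = est - target
    return 10 * torch.log10(target.pow(2).sum() / (noise.pow(2).sum() + eps))

def si_sdri_chunked(est, ref, mix, sr=22050, chunk_s=4.0, hop_s=2.0):
    """SI-SDR improvement averaged over 4 s chunks with 2 s overlap."""
    chunk, hop = int(chunk_s * sr), int(hop_s * sr)
    scores = []
    for start in range(0, ref.shape[0] - chunk + 1, hop):
        sl = slice(start, start + chunk)
        scores.append(si_sdr(est[sl], ref[sl]) - si_sdr(mix[sl], ref[sl]))
    return torch.stack(scores).mean()
```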
Key results include:
- The deterministic model outperformed the baseline models, which the authors attribute to its use of self-attention instead of LSTM layers.
- Incorporating a diffusion model further improved separation quality by approximately 1.2 dB.
- The CD model accelerated inference and further improved quality, outperforming the diffusion model without adversarial training. Specifically, a four-step CD model achieved a 3 dB improvement over the deterministic model and a 6.5 dB improvement over the best baseline.
The paper claims three main contributions: bridging deterministic and generative methods, introducing a consistency framework to the raw audio domain, and achieving significant improvements in musical source extraction on the Slakh2100 dataset.
The authors conclude that mixed deterministic and generative models hold strong potential for advancing source extraction and separation, with future work focused on reducing audio segment lengths and model size to enable lightweight, real-time applications.
The paper uses the Slakh2100 dataset, comprising 2100 tracks with a split of 1500/375/225 for training, validation, and testing, respectively. The audio is downsampled to 22 kHz, and segments of $N = 262144$ samples (approximately 11.9 seconds) are used. The models focus on bass (94.7% presence), drums (99.3%), guitar (100.0%), and piano (99.3%).
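A small sketch of this preprocessing step is shown below, assuming torchaudio for loading and resampling (22 kHz is taken as 22050 Hz), averaging to mono, and non-overlapping segmentation; the authors' actual pipeline may differ.

```python
import torchaudio

def make_segments(path, target_sr=22050, n=262144):
    """Load a track, resample to ~22 kHz, and cut N = 262144-sample segments (~11.9 s)."""
    wav, sr = torchaudio.load(path)                                   # (channels, samples)
    wav = torchaudio.functional.resample(wav, sr, target_sr).mean(0)  # mono, (samples,)
    n_full = wav.shape[0] // n
    return wav[: n_full * n].reshape(n_full, n)                       # (num_segments, N)
```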
The deterministic, diffusion, and consistency models are all based on a U-Net backbone. The U-Net for the deterministic model was modified to accept a one-hot encoded instrument label. The diffusion model incorporates the deterministic model as a feature extractor. The consistency model is initialized using the pre-trained weights of the diffusion model. The Adam optimizer is used for training the deterministic and diffusion models, while the Rectified Adam optimizer is used for the consistency model.
The inference time, number of parameters, and real-time factors (RTF) are compared against the baselines. The deterministic model is the fastest, with an inference time of 0.114 seconds and an RTF of 0.009. The diffusion model, while slower, outperforms the generative baselines. The CD model with $T = 1$ sampling step achieves an RTF of 0.023.
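The real-time factor can be read as wall-clock inference time divided by the duration of the processed audio (lower is faster); the sketch below assumes PyTorch and a model callable on a mixture and an instrument label.

```python
import time
import torch

def real_time_factor(model, x_mix, inst_onehot, sr=22050):
    """RTF = inference wall-clock time / audio duration in seconds."""
    duration = x_mix.shape[-1] / sr
    start = time.perf_counter()
    with torch.no_grad():
        model(x_mix, inst_onehot)
    return (time.perf_counter() - start) / duration
```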
The authors also explored the performance of the models on the MUSDB18 dataset.