- The paper integrates diffusion and consistency models into a deterministic source extraction architecture to enhance the quality of separated audio sources.
- The methodology incorporates a diffusion model to enhance sources and uses Consistency Distillation for single-step, high-speed inference.
- Key results show the Consistency Distillation model achieved significant quality improvements, yielding up to a 6.5 dB SI-SDR gain over baseline source extraction methods.
The paper addresses the problem of musical source extraction by integrating a score-matching diffusion model into a deterministic architecture to enhance audio quality. The authors focus on improving the performance of deterministic models, which often suffer from imperfect separation and source leakage due to the intricate relationships between musical sources and masking effects. To mitigate the slow iterative sampling inherent in diffusion models, the paper employs Consistency Distillation (CD) to reduce the sampling process to a single step, achieving comparable or superior performance.
Key aspects of the paper include:
- Problem Statement: Isolating specific sound sources from a mixture of audio signals, emphasizing the challenges posed by music's interdependent sources and masking effects.
- Methodology:
- A deterministic model $f_\theta$ is trained using a U-Net encoder-decoder architecture conditioned on the mixture signal $x_{\text{mix}}$ and the instrument label $s$. The loss function is defined as:
$$\mathcal{L}(\theta) = \mathbb{E}_{s,\, x_{\text{mix}}} \left\| x_s - \hat{x}_s^{\text{det}} \right\|_2^2$$
where:
- $\mathcal{L}(\theta)$ is the loss function.
- $x_s$ is the true source.
- $\hat{x}_s^{\text{det}} = f_\theta(x_{\text{mix}}, s)$ is the prediction for source $s$.
- $f_\theta$ is the deterministic source extraction model.
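To make this objective concrete, below is a minimal PyTorch-style sketch of the deterministic training loss, assuming a waveform-domain U-Net callable on the mixture and a one-hot instrument label; the function name, conditioning interface, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch.nn.functional as F

def deterministic_loss(f_theta, x_mix, x_src, inst_onehot):
    """MSE objective L(theta) = E || x_s - f_theta(x_mix, s) ||_2^2.

    x_mix:       (B, 1, N) mixture waveforms
    x_src:       (B, 1, N) ground-truth source waveforms for instrument s
    inst_onehot: (B, S)    one-hot instrument labels (assumed conditioning format)
    f_theta:     U-Net-style extractor, assumed callable as f_theta(x_mix, inst_onehot)
    """
    x_hat = f_theta(x_mix, inst_onehot)  # predicted source waveform, (B, 1, N)
    return F.mse_loss(x_hat, x_src)
```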
- A denoising score-matching diffusion model is then incorporated to enhance the extracted sources. The diffusion model $g_\phi$ takes four inputs: a noisy version of $x_s$, the target label $s$, the noise scale $\sigma$, and intermediate features $\bar{x}_s^{\text{det}}$ extracted by the deterministic model. The model is trained with the Denoising Score Matching (DSM) loss:
$$\mathcal{L}_{\text{DSM}}(\phi) = \mathbb{E}_{s,\, x_{\text{mix}},\, t} \left\| x_s - g_\phi\!\left(x_s + \sigma_t \epsilon,\; s,\; \sigma_t,\; \bar{x}_s^{\text{det}}\right) \right\|_2^2$$
where:
- $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise.
- $\sigma_t := \sigma(t)$ is a monotonically increasing function defining the noise scale at step $t$.
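As an illustration of the DSM objective, the sketch below (PyTorch) noises the clean source at a sampled scale and regresses the diffusion model's output back to it, conditioned on the label and on features from the frozen deterministic model; the log-uniform noise draw and the `g_phi` call signature are assumptions standing in for the paper's schedule $\sigma(t)$ and architecture.

```python
import math
import torch
import torch.nn.functional as F

def dsm_loss(g_phi, det_features, x_mix, x_src, inst_onehot,
             sigma_min=0.002, sigma_max=1.0):
    """Denoising score-matching loss: recover x_src from a noised copy."""
    b = x_src.shape[0]
    with torch.no_grad():  # deterministic model acts as a frozen feature extractor
        feats = det_features(x_mix, inst_onehot)
    # Sample a per-example noise scale (log-uniform stand-in for sigma(t)).
    log_sigma = torch.empty(b, device=x_src.device).uniform_(
        math.log(sigma_min), math.log(sigma_max))
    sigma_t = log_sigma.exp().view(b, 1, 1)  # broadcast over (B, 1, N)
    eps = torch.randn_like(x_src)
    x_noisy = x_src + sigma_t * eps
    x_denoised = g_phi(x_noisy, inst_onehot, sigma_t, feats)
    return F.mse_loss(x_denoised, x_src)
```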
- Consistency Distillation (CD) is applied to accelerate the generation process, training a consistency model $g_\omega$ using the pre-trained diffusion model $g_\phi$ as a teacher. The CD loss is:
$$\mathcal{L}_{\text{CD}}(\omega) = \mathbb{E}_{t,\, h} \Big\| \underbrace{g_{\text{sg}(\omega)}\!\left(\hat{x}_{s,\, t-h}^{\text{dif}},\; s,\; \sigma_{t-h},\; \bar{x}_s^{\text{det}}\right)}_{\text{target}} - \underbrace{g_\omega\!\left(x_s + \epsilon \sigma_t,\; s,\; \sigma_t,\; \bar{x}_s^{\text{det}}\right)}_{\text{prediction}} \Big\|_2^2$$
where:
- $h \in [1, t]$ is the number of ODE steps used in the distillation process.
- $\hat{x}_{s,\, t-h}^{\text{dif}}$ is obtained by running $h$ ODE solver steps of the teacher $g_\phi$ from the noisy sample $x_s + \epsilon \sigma_t$ down to noise level $\sigma_{t-h}$.
- $\text{sg}(\omega)$ denotes the stop-gradient running EMA of $\omega$.
- The final loss function combines the CD loss with the DSM loss:
$$\mathcal{L}(\omega) = \mathcal{L}_{\text{CD}}(\omega) + \lambda_{\text{DSM}}\, \mathcal{L}_{\text{DSM}}(\omega)$$
where $\lambda_{\text{DSM}}$ is a balancing term.
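The sketch below illustrates one consistency-distillation update under the combined loss, assuming PyTorch, a Karras-style probability-flow ODE with Euler steps for the teacher, and an EMA copy of the student as the stop-gradient target; the network interfaces, noise schedule, and solver are assumptions rather than the paper's exact recipe.

```python
import random
import torch
import torch.nn.functional as F

def ema_update(g_ema, g_omega, decay=0.999):
    """Running EMA of the student weights, used as the stop-gradient target sg(omega)."""
    with torch.no_grad():
        for p_ema, p in zip(g_ema.parameters(), g_omega.parameters()):
            p_ema.mul_(decay).add_(p, alpha=1 - decay)

def cd_loss(g_omega, g_ema, g_phi, x_src, inst_onehot, feats, sigmas, lam_dsm=1.0):
    """Combined L_CD + lambda_DSM * L_DSM for one sampled (t, h) pair.

    g_omega: student consistency model (trainable)
    g_ema:   EMA copy of the student (kept frozen here)
    g_phi:   frozen pre-trained diffusion teacher
    sigmas:  increasing list of noise scales, sigmas[0] close to 0
    """
    t = random.randint(1, len(sigmas) - 1)   # noise index
    h = random.randint(1, t)                 # number of teacher ODE steps
    sigma_t, sigma_th = sigmas[t], sigmas[t - h]

    eps = torch.randn_like(x_src)
    x_t = x_src + sigma_t * eps              # noisy input at level sigma_t

    # Teacher: h Euler steps of the probability-flow ODE from sigma_t down to sigma_{t-h}.
    x_cur = x_t
    with torch.no_grad():
        for k in range(h):
            s_hi, s_lo = sigmas[t - k], sigmas[t - k - 1]
            denoised = g_phi(x_cur, inst_onehot, s_hi, feats)
            d = (x_cur - denoised) / s_hi
            x_cur = x_cur + (s_lo - s_hi) * d
        target = g_ema(x_cur, inst_onehot, sigma_th, feats)  # stop-gradient target

    pred = g_omega(x_t, inst_onehot, sigma_t, feats)          # single-step prediction
    loss_cd = F.mse_loss(pred, target)
    loss_dsm = F.mse_loss(pred, x_src)       # auxiliary denoising term on the student
    return loss_cd + lam_dsm * loss_dsm
```

After each optimizer step on $\omega$, `ema_update` would be called so that the target network tracks a slow-moving average of the student.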
- Dataset: The models were trained and tested on the Slakh2100 dataset, focusing on bass, drums, guitar, and piano.
- Baselines: The proposed model is compared against Demucs, Demucs+Gibbs, and MSDM.
- Evaluation Metric: Scale-invariant SDR improvement (SI-SDRI) is used to evaluate performance, computed over 4-second chunks with a 2-second overlap.
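For reference, a minimal PyTorch sketch of SI-SDR and its chunk-wise improvement over the unprocessed mixture is given below; the SI-SDR formula is standard, while the simple mean over chunks and the use of the mixture as the baseline signal are assumptions of this sketch.

```python
import torch

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB for 1-D waveforms (standard definition)."""
    ref = ref - ref.mean()
    est = est - est.mean()
    alpha = torch.dot(est, ref) / (torch.dot(ref, ref) + eps)  # optimal scaling
    target = alpha * ref
    noise = est - target
    return 10 * torch.log10(target.pow(2).sum() / (noise.pow(2).sum() + eps))

def si_sdri_chunked(est, ref, mix, sr=22050, chunk_s=4.0, hop_s=2.0):
    """SI-SDR improvement averaged over 4 s chunks with 2 s overlap."""
    chunk, hop = int(chunk_s * sr), int(hop_s * sr)
    scores = []
    for start in range(0, ref.shape[0] - chunk + 1, hop):
        sl = slice(start, start + chunk)
        scores.append(si_sdr(est[sl], ref[sl]) - si_sdr(mix[sl], ref[sl]))
    return torch.stack(scores).mean()
```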
Key results include:
- The deterministic model outperformed the baseline models, which the authors attribute to its use of self-attention instead of LSTM layers.
- Incorporating a diffusion model further improved separation quality by approximately 1.2 dB.
- The CD model accelerated inference and further improved quality, outperforming the diffusion model without adversarial training. Specifically, a four-step CD model achieved a 3 dB improvement over the deterministic model and a 6.5 dB improvement over the best baseline.
The paper claims three main contributions: bridging deterministic and generative methods, introducing a consistency framework to the raw audio domain, and achieving significant improvements in musical source extraction on the Slakh2100 dataset.
The authors conclude that mixed deterministic and generative models hold strong potential for advancing source extraction and separation, with future work focused on reducing audio segment lengths and model size to enable lightweight, real-time applications.
The paper uses the Slakh2100 dataset, comprising 2100 tracks with a split of 1500/375/225 for training, validation, and testing, respectively. The audio is downsampled to 22 kHz, and segments of $N = 262144$ samples (approximately 11.9 seconds) are used. The models focus on bass (94.7% presence), drums (99.3%), guitar (100.0%), and piano (99.3%).
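A small sketch of this preprocessing step is shown below, assuming torchaudio for loading and resampling (22 kHz is taken as 22050 Hz), averaging to mono, and non-overlapping segmentation; the authors' actual pipeline may differ.

```python
import torchaudio

def make_segments(path, target_sr=22050, n=262144):
    """Load a track, resample to ~22 kHz, and cut N = 262144-sample segments (~11.9 s)."""
    wav, sr = torchaudio.load(path)                                   # (channels, samples)
    wav = torchaudio.functional.resample(wav, sr, target_sr).mean(0)  # mono, (samples,)
    n_full = wav.shape[0] // n
    return wav[: n_full * n].reshape(n_full, n)                       # (num_segments, N)
```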
The deterministic, diffusion, and consistency models are all based on a U-Net backbone. The U-Net for the deterministic model was modified to accept a one-hot encoded instrument label. The diffusion model incorporates the deterministic model as a feature extractor. The consistency model is initialized using the pre-trained weights of the diffusion model. The Adam optimizer is used for training the deterministic and diffusion models, while the Rectified Adam optimizer is used for the consistency model.
The inference time, number of parameters, and real-time factors (RTF) are compared against the baselines. The deterministic model is the fastest, with an inference time of 0.114 seconds and an RTF of 0.009. The diffusion model, while slower, outperforms the generative baselines. The CD model with $T = 1$ sampling step achieves an RTF of 0.023.
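The real-time factor can be read as wall-clock inference time divided by the duration of the processed audio (lower is faster); the sketch below assumes PyTorch and a model callable on a mixture and an instrument label.

```python
import time
import torch

def real_time_factor(model, x_mix, inst_onehot, sr=22050):
    """RTF = inference wall-clock time / audio duration in seconds."""
    duration = x_mix.shape[-1] / sr
    start = time.perf_counter()
    with torch.no_grad():
        model(x_mix, inst_onehot)
    return (time.perf_counter() - start) / duration
```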
The authors also explored the performance of the models on the MUSDB18 dataset.