Conditional Latent Diffusion-Based Speech Enhancement Via Dual Context Learning (2501.10052v1)

Published 17 Jan 2025 in cs.SD and eess.AS

Abstract: Recently, the application of diffusion probabilistic models has advanced speech enhancement through generative approaches. However, existing diffusion-based methods have focused on the generation process in high-dimensional waveform or spectral domains, leading to increased generation complexity and slower inference speeds. Additionally, these methods have primarily modelled clean speech distributions, with limited exploration of noise distributions, thereby constraining the discriminative capability of diffusion models for speech enhancement. To address these issues, we propose a novel approach that integrates a conditional latent diffusion model (cLDM) with dual-context learning (DCL). Our method utilizes a variational autoencoder (VAE) to compress mel-spectrograms into a low-dimensional latent space. We then apply cLDM to transform the latent representations of both clean speech and background noise into Gaussian noise by the DCL process, and a parameterized model is trained to reverse this process, conditioned on noisy latent representations and text embeddings. By operating in a lower-dimensional space, the latent representations reduce the complexity of the generation process, while the DCL process enhances the model's ability to handle diverse and unseen noise environments. Our experiments demonstrate the strong performance of the proposed approach compared to existing diffusion-based methods, even with fewer iterative steps, and highlight the superior generalization capability of our models to out-of-domain noise datasets (https://github.com/modelscope/ClearerVoice-Studio).

Summary

  • The paper's main contribution, cLDM+DCL, enhances speech with a conditional latent diffusion model combined with dual-context learning.
  • It reduces computational complexity by operating in a low-dimensional latent space learned by a variational autoencoder, and uses text embeddings to guide the separate generation of speech and noise.
  • Experiments demonstrate superior performance over baselines, particularly in unseen noise environments, validated by metrics including PESQ, ESTOI, SI-SDR, WV-MOS, and DNSMOS.

The paper introduces a novel speech enhancement method using a conditional latent diffusion model (cLDM) with dual-context learning (DCL). The approach addresses limitations in existing diffusion-based methods, such as high computational complexity and insufficient modeling of noise distributions. The cLDM operates in a lower-dimensional latent space, achieved through a variational autoencoder (VAE), to reduce the complexity of the generation process. The DCL scheme enhances the model's ability to handle diverse and unseen noise environments by modeling both clean speech and background noise distributions.

The paper details the system architecture, which includes a VAE, a cLDM, and a vocoder. The VAE encoder projects mel-spectrograms of noisy speech $y$, clean speech $x$, and noise $n$ into low-dimensional latent representations $z_Y$, $z_X$, and $z_N$, respectively. The cLDM then learns the distributions of $z_X$ and $z_N$, guided by a text embedding $\tau$. For inference, the cLDM generates the speech prior $z_X$ conditioned on $z_Y$ and $\tau$, which is then decoded by the VAE decoder and converted back into the waveform domain by the vocoder.
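As a concrete illustration, here is a minimal sketch of that inference path in Python, assuming hypothetical `vae`, `cldm`, `text_encoder`, and `vocoder` objects with the interfaces shown (the names and signatures are illustrative, not taken from the paper's code):

```python
import torch

@torch.no_grad()
def enhance(noisy_mel, vae, cldm, text_encoder, vocoder, num_steps=50):
    """Hypothetical end-to-end inference: mel -> latent -> cLDM -> mel -> waveform."""
    z_y = vae.encode(noisy_mel)                # compress to the latent space
    tau = text_encoder("Speech enhancement")   # guiding text embedding
    # Reverse diffusion generates the clean-speech prior z_x from Gaussian noise,
    # conditioned on the noisy latent z_y and the text embedding tau.
    z_x = cldm.sample(cond_latent=z_y, text_emb=tau, steps=num_steps)
    enhanced_mel = vae.decode(z_x)             # back to a mel-spectrogram
    return vocoder(enhanced_mel)               # waveform synthesis (BigVGAN)
```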

The cLDM employs forward and reverse processes to approximate the conditional data distribution $q(z_0|z_Y, \tau)$ with the learned model distribution $p_\theta(z_0|z_Y, \tau)$. The forward process gradually transforms the data distribution into a standard Gaussian distribution through a Markov chain, with transition probability defined as:

$$q(z_t|z_{t-1}) = \mathcal{N}(z_t; \sqrt{1 - \beta_t}\,z_{t-1}, \beta_t\mathbf{I})$$

$$q(z_t|z_0) = \mathcal{N}(z_t; \sqrt{\bar{\alpha}_t}\,z_0, (1 - \bar{\alpha}_t)\mathbf{I})$$

where:

  • $z_t$ is the latent variable at time step $t$
  • $\beta_t$ is the noise schedule
  • $\mathbf{I}$ is the identity matrix
  • $\bar{\alpha}_t := \prod_{i=1}^{t}(1 - \beta_i)$ is the cumulative noise level at step $t$
  • $\mathbf{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is the injected noise
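Because the marginal $q(z_t|z_0)$ has a closed form, $z_t$ can be drawn from $z_0$ in a single step. A minimal sketch, assuming a linear $\beta$ schedule (the paper's exact schedule is not specified in this summary):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # assumed linear noise schedule beta_t
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod_{i<=t} (1 - beta_i)

def q_sample(z0, t, eps=None):
    """Sample z_t ~ q(z_t | z_0) = N(sqrt(alpha_bar_t) z_0, (1 - alpha_bar_t) I)."""
    if eps is None:
        eps = torch.randn_like(z0)           # injected noise epsilon ~ N(0, I)
    ab = alpha_bars[t].view(-1, *([1] * (z0.dim() - 1)))  # broadcast over batch dims
    return ab.sqrt() * z0 + (1.0 - ab).sqrt() * eps
```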

The reverse process refines the speech through successive iterations $z_{[0:T-1]}$ based on learned conditional transition distributions:

$$p_\theta(z_{t-1}|z_t, z_Y, \tau) = \mathcal{N}(z_{t-1}; \mathbf{\mu}_\theta^{(t)}(z_t, z_Y, \tau), \sigma_t^2\mathbf{I})$$

where the mean $\mathbf{\mu}_\theta^{(t)}(z_t, z_Y, \tau)$ and variance $\sigma_t^2$ are parameterized as:

$$\mathbf{\mu}_\theta^{(t)}(z_t, z_Y, \tau) = \frac{1}{\sqrt{\alpha_t}} \left( z_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \hat{\mathbf{\epsilon}}_\theta^{(t)}(z_t, z_Y, \tau) \right)$$

$$\sigma_t^2 = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$$

where:

  • $\alpha_t = 1 - \beta_t$
  • $\hat{\mathbf{\epsilon}}_\theta^{(t)}(z_t, z_Y, \tau)$ is the noise estimate produced by a parameterized U-Net model
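A single reverse step follows directly from the mean and variance above. The sketch below reuses the schedule tensors from the forward-process sketch; `eps_model` stands in for the conditional U-Net and is an assumed interface:

```python
def p_sample(eps_model, z_t, t, z_y, tau):
    """Draw z_{t-1} ~ p_theta(z_{t-1} | z_t, z_Y, tau); t is a Python int."""
    beta_t, alpha_t, ab_t = betas[t], alphas[t], alpha_bars[t]
    eps_hat = eps_model(z_t, t, z_y, tau)      # U-Net noise estimate
    # mu_theta = (z_t - beta_t / sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_t)
    mean = (z_t - beta_t / (1.0 - ab_t).sqrt() * eps_hat) / alpha_t.sqrt()
    if t == 0:
        return mean                            # no noise added at the final step
    var = (1.0 - alpha_bars[t - 1]) / (1.0 - ab_t) * beta_t
    return mean + var.sqrt() * torch.randn_like(z_t)
```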

The U-Net model is optimized using the following reweighted training loss:

$$\mathcal{L}_{cLDM} = \sum_{t=1}^{T} \gamma_t\, \mathbb{E}_{\mathbf{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),\, z_0} \left\| \mathbf{\epsilon}_t - \hat{\mathbf{\epsilon}}_\theta^{(t)}(z_t, z_Y, \tau) \right\|$$

where:

  • $\gamma_t$ denotes the weight of reverse step $t$
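A sketch of one training step for this objective, reusing `q_sample` from the forward-process sketch. Sampling $t$ uniformly, setting $\gamma_t = 1$, and using the squared-error form of the norm are common choices assumed here, not stated by the paper:

```python
import torch
import torch.nn.functional as F

def training_loss(eps_model, z0, z_y, tau):
    """Noise-prediction loss || eps - eps_hat_theta(z_t, z_Y, tau) || (squared form)."""
    t = torch.randint(0, T, (z0.shape[0],))    # uniform time step per sample
    eps = torch.randn_like(z0)                 # target noise epsilon_t
    z_t = q_sample(z0, t, eps)                 # forward-process sample
    eps_hat = eps_model(z_t, t, z_y, tau)      # U-Net noise estimate
    return F.mse_loss(eps_hat, eps)            # gamma_t = 1 assumed
```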

The DCL scheme trains a shared cLDM to generate both the speech prior $z_X$ and the noise prior $z_N$ using generated noisy-clean data $\mathcal{D}_{y,x}$ and noisy-noise data $\mathcal{D}_{y,n}$. The texts "Speech enhancement" and "Background noise estimation" guide the generation process for speech and noise, respectively, and are converted into embeddings using a pre-trained T5 model.
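Since the two prompts are fixed, their embeddings can be computed once and cached. A sketch using the Hugging Face `transformers` T5 encoder (the specific checkpoint `t5-base` is an assumption; the paper only says a pre-trained T5 model is used):

```python
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-base")   # checkpoint choice assumed
encoder = T5EncoderModel.from_pretrained("t5-base")

@torch.no_grad()
def embed_prompt(text):
    """Return T5 encoder hidden states used as the conditioning embedding tau."""
    tokens = tokenizer(text, return_tensors="pt")
    return encoder(**tokens).last_hidden_state       # (1, seq_len, hidden_dim)

tau_speech = embed_prompt("Speech enhancement")
tau_noise = embed_prompt("Background noise estimation")
```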

The VAE model consists of an encoder and decoder built with stacked convolutional modules and is retrained on clean speech, noisy speech, and background noise data. BigVGAN is employed as the vocoder to generate speech samples from the enhanced mel-spectrogram and is retrained using only clean speech.
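The paper's exact VAE configuration is not given here; as a structural illustration only, a minimal convolutional VAE over mel-spectrograms might look like the following (channel counts, depth, and activations are assumptions):

```python
import torch
import torch.nn as nn

class MelVAE(nn.Module):
    """Minimal convolutional VAE over mel-spectrograms (structural sketch only)."""
    def __init__(self, latent_channels=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 2 * latent_channels, 3, padding=1),  # mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def encode(self, mel):
        mu, logvar = self.encoder(mel).chunk(2, dim=1)
        return mu + (0.5 * logvar).exp() * torch.randn_like(mu)  # reparameterization

    def decode(self, z):
        return self.decoder(z)
```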

Experiments were conducted using the LibriSpeech corpus for clean speech and the AudioSet corpus for noise data. The training set comprised 360 hours of speech and 250 hours of noise. Five noise types (laughing, gunshot, singing, car engine, and rain) were reserved as unseen noises for testing. Noisy-clean pairs and noisy-noise pairs were generated with varying Signal-to-Noise Ratio (SNR) levels. Performance was evaluated using Perceptual Evaluation of Speech Quality (PESQ), extended short-term objective intelligibility (ESTOI), scale-invariant signal-to-distortion ratio (SI-SDR), Wav2Vec MOS (WV-MOS), and Deep Noise Suppression MOS (DNSMOS). The proposed method, cLDM+DCL, was compared against several baselines, including CDiffuSE, SGMSE+, StoRM, NASE, Conv-TasNet, and MetricGAN+.
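Among these metrics, SI-SDR has a simple closed form, sketched below in NumPy; PESQ, ESTOI, WV-MOS, and DNSMOS require dedicated packages or pre-trained predictors:

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB (1-D signals)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Scaled projection of the estimate onto the reference defines the target.
    s_target = (estimate @ reference) / (reference @ reference) * reference
    e_noise = estimate - s_target
    return 10.0 * np.log10((s_target @ s_target) / (e_noise @ e_noise))
```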

Ablation studies examined the impact of the number of reverse process steps $T$ and the DCL scheme. Results showed that cLDM+DCL benefits from an increased number of reverse diffusion steps while maintaining a low real-time factor (RTF). Performance comparisons on seen-noise and unseen-noise test sets demonstrated that cLDM+DCL outperforms other diffusion-based methods, particularly on unseen noises, highlighting the effectiveness of learning noise distributions.
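The RTF is wall-clock processing time divided by the duration of the processed audio, so values below 1 indicate faster-than-real-time operation. A trivial measurement sketch (the `enhance_fn` interface is assumed):

```python
import time

def real_time_factor(enhance_fn, noisy_audio, sample_rate):
    """RTF = wall-clock processing time / duration of the processed audio."""
    start = time.perf_counter()
    enhance_fn(noisy_audio)
    return (time.perf_counter() - start) / (len(noisy_audio) / sample_rate)
```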

The paper concludes that the cLDM+DCL framework improves speech enhancement by operating in a low-dimensional latent space and effectively handling diverse noise environments through the DCL scheme. Experimental results validate the effectiveness of the approach on both seen and unseen noise conditions, as well as on out-of-domain datasets such as VoiceBank+DEMAND and DNS Challenge 2020.
