Conditional Latent Diffusion-Based Speech Enhancement Via Dual Context Learning (2501.10052v1)

Published 17 Jan 2025 in cs.SD and eess.AS

Abstract: Recently, the application of diffusion probabilistic models has advanced speech enhancement through generative approaches. However, existing diffusion-based methods have focused on the generation process in high-dimensional waveform or spectral domains, leading to increased generation complexity and slower inference speeds. Additionally, these methods have primarily modelled clean speech distributions, with limited exploration of noise distributions, thereby constraining the discriminative capability of diffusion models for speech enhancement. To address these issues, we propose a novel approach that integrates a conditional latent diffusion model (cLDM) with dual-context learning (DCL). Our method utilizes a variational autoencoder (VAE) to compress mel-spectrograms into a low-dimensional latent space. We then apply cLDM to transform the latent representations of both clean speech and background noise into Gaussian noise by the DCL process, and a parameterized model is trained to reverse this process, conditioned on noisy latent representations and text embeddings. By operating in a lower-dimensional space, the latent representations reduce the complexity of the generation process, while the DCL process enhances the model's ability to handle diverse and unseen noise environments. Our experiments demonstrate the strong performance of the proposed approach compared to existing diffusion-based methods, even with fewer iterative steps, and highlight the superior generalization capability of our models to out-of-domain noise datasets (https://github.com/modelscope/ClearerVoice-Studio).

Summary

  • The paper's main contribution, cLDM+DCL, enhances speech with a conditional latent diffusion model combined with dual-context learning.
  • It reduces computational complexity by operating in a low-dimensional latent space learned by a variational autoencoder, and uses text embeddings to guide the separate generation of speech and noise.
  • Experiments demonstrate superior performance over baselines, particularly in unseen noise environments, validated by metrics including PESQ, ESTOI, SI-SDR, WV-MOS, and DNSMOS.

The paper introduces a novel speech enhancement method using a conditional latent diffusion model (cLDM) with dual-context learning (DCL). The approach addresses limitations in existing diffusion-based methods, such as high computational complexity and insufficient modeling of noise distributions. The cLDM operates in a lower-dimensional latent space, achieved through a variational autoencoder (VAE), to reduce the complexity of the generation process. The DCL scheme enhances the model's ability to handle diverse and unseen noise environments by modeling both clean speech and background noise distributions.

The paper details the system architecture, which includes a VAE, a cLDM, and a vocoder. The VAE encoder projects mel-spectrograms of noisy speech $y$, clean speech $x$, and noise $n$ into low-dimensional latent representations $z_Y$, $z_X$, and $z_N$, respectively. The cLDM then learns the distributions of $z_X$ and $z_N$, guided by a text embedding $\tau$. For inference, the cLDM generates the speech prior $z_X$ conditioned on $z_Y$ and $\tau$, which is then decoded by the VAE decoder and converted back into the waveform domain by the vocoder.
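As a concrete illustration, here is a minimal sketch of that inference path in Python, assuming hypothetical `vae`, `cldm`, `text_encoder`, and `vocoder` objects with the interfaces shown (the names and signatures are illustrative, not taken from the paper's code):

```python
import torch

@torch.no_grad()
def enhance(noisy_mel, vae, cldm, text_encoder, vocoder, num_steps=50):
    """Hypothetical end-to-end inference: mel -> latent -> cLDM -> mel -> waveform."""
    z_y = vae.encode(noisy_mel)                # compress to the latent space
    tau = text_encoder("Speech enhancement")   # guiding text embedding
    # Reverse diffusion generates the clean-speech prior z_x from Gaussian noise,
    # conditioned on the noisy latent z_y and the text embedding tau.
    z_x = cldm.sample(cond_latent=z_y, text_emb=tau, steps=num_steps)
    enhanced_mel = vae.decode(z_x)             # back to a mel-spectrogram
    return vocoder(enhanced_mel)               # waveform synthesis (BigVGAN)
```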

The cLDM employs forward and reverse processes to approximate the conditional data distribution $q(z_0|z_Y, \tau)$ with the learned model distribution $p_\theta(z_0|z_Y, \tau)$. The forward process gradually transforms the data distribution into a standard Gaussian distribution through a Markov chain, with transition probability defined as:

$$q(z_t|z_{t-1}) = \mathcal{N}(z_t; \sqrt{1 - \beta_t}\,z_{t-1}, \beta_t\mathbf{I})$$

$$q(z_t|z_0) = \mathcal{N}(z_t; \sqrt{\bar{\alpha}_t}\,z_0, (1 - \bar{\alpha}_t)\mathbf{I})$$

where:

  • $z_t$ is the latent variable at time step $t$
  • $\beta_t$ is the noise schedule
  • $\mathbf{I}$ is the identity matrix
  • $\bar{\alpha}_t := \prod_{i=1}^{t}(1 - \beta_i)$ is the cumulative noise level at step $t$
  • $\mathbf{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is the injected noise
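Because the marginal $q(z_t|z_0)$ has a closed form, $z_t$ can be drawn from $z_0$ in a single step. A minimal sketch, assuming a linear $\beta$ schedule (the paper's exact schedule is not specified in this summary):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # assumed linear noise schedule beta_t
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod_{i<=t} (1 - beta_i)

def q_sample(z0, t, eps=None):
    """Sample z_t ~ q(z_t | z_0) = N(sqrt(alpha_bar_t) z_0, (1 - alpha_bar_t) I)."""
    if eps is None:
        eps = torch.randn_like(z0)           # injected noise epsilon ~ N(0, I)
    ab = alpha_bars[t].view(-1, *([1] * (z0.dim() - 1)))  # broadcast over batch dims
    return ab.sqrt() * z0 + (1.0 - ab).sqrt() * eps
```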

The reverse process refines the speech through successive iterations $z_{[0:T-1]}$ based on learned conditional transition distributions:

$$p_\theta(z_{t-1}|z_t, z_Y, \tau) = \mathcal{N}(z_{t-1}; \mathbf{\mu}_\theta^{(t)}(z_t, z_Y, \tau), \sigma_t^2\mathbf{I})$$

where the mean $\mathbf{\mu}_\theta^{(t)}(z_t, z_Y, \tau)$ and variance $\sigma_t^2$ are parameterized as:

$$\mathbf{\mu}_\theta^{(t)}(z_t, z_Y, \tau) = \frac{1}{\sqrt{\alpha_t}} \left( z_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \hat{\mathbf{\epsilon}}_\theta^{(t)}(z_t, z_Y, \tau) \right)$$

$$\sigma_t^2 = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$$

where:

  • $\alpha_t = 1 - \beta_t$
  • $\hat{\mathbf{\epsilon}}_\theta^{(t)}(z_t, z_Y, \tau)$ is the noise estimate produced by a parameterized U-Net model
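A single reverse step follows directly from the mean and variance above. The sketch below reuses the schedule tensors from the forward-process sketch; `eps_model` stands in for the conditional U-Net and is an assumed interface:

```python
def p_sample(eps_model, z_t, t, z_y, tau):
    """Draw z_{t-1} ~ p_theta(z_{t-1} | z_t, z_Y, tau); t is a Python int."""
    beta_t, alpha_t, ab_t = betas[t], alphas[t], alpha_bars[t]
    eps_hat = eps_model(z_t, t, z_y, tau)      # U-Net noise estimate
    # mu_theta = (z_t - beta_t / sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_t)
    mean = (z_t - beta_t / (1.0 - ab_t).sqrt() * eps_hat) / alpha_t.sqrt()
    if t == 0:
        return mean                            # no noise added at the final step
    var = (1.0 - alpha_bars[t - 1]) / (1.0 - ab_t) * beta_t
    return mean + var.sqrt() * torch.randn_like(z_t)
```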

The U-Net model is optimized using the following reweighted training loss:

$$\mathcal{L}_{cLDM} = \sum_{t=1}^{T} \gamma_t\, \mathbb{E}_{\mathbf{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),\, z_0} \left\| \mathbf{\epsilon}_t - \hat{\mathbf{\epsilon}}_\theta^{(t)}(z_t, z_Y, \tau) \right\|$$

where:

  • $\gamma_t$ denotes the weight of reverse step $t$
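A sketch of one training step for this objective, reusing `q_sample` from the forward-process sketch. Sampling $t$ uniformly, setting $\gamma_t = 1$, and using the squared-error form of the norm are common choices assumed here, not stated by the paper:

```python
import torch
import torch.nn.functional as F

def training_loss(eps_model, z0, z_y, tau):
    """Noise-prediction loss || eps - eps_hat_theta(z_t, z_Y, tau) || (squared form)."""
    t = torch.randint(0, T, (z0.shape[0],))    # uniform time step per sample
    eps = torch.randn_like(z0)                 # target noise epsilon_t
    z_t = q_sample(z0, t, eps)                 # forward-process sample
    eps_hat = eps_model(z_t, t, z_y, tau)      # U-Net noise estimate
    return F.mse_loss(eps_hat, eps)            # gamma_t = 1 assumed
```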

The DCL scheme trains a shared cLDM to generate both the speech prior $z_X$ and the noise prior $z_N$ using generated noisy-clean data $\mathcal{D}_{y,x}$ and noisy-noise data $\mathcal{D}_{y,n}$. The texts "Speech enhancement" and "Background noise estimation" guide the generation process for speech and noise, respectively, and are converted into embeddings using a pre-trained T5 model.
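Since the two prompts are fixed, their embeddings can be computed once and cached. A sketch using the Hugging Face `transformers` T5 encoder (the specific checkpoint `t5-base` is an assumption; the paper only says a pre-trained T5 model is used):

```python
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-base")   # checkpoint choice assumed
encoder = T5EncoderModel.from_pretrained("t5-base")

@torch.no_grad()
def embed_prompt(text):
    """Return T5 encoder hidden states used as the conditioning embedding tau."""
    tokens = tokenizer(text, return_tensors="pt")
    return encoder(**tokens).last_hidden_state       # (1, seq_len, hidden_dim)

tau_speech = embed_prompt("Speech enhancement")
tau_noise = embed_prompt("Background noise estimation")
```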

The VAE model consists of an encoder and decoder built with stacked convolutional modules and is retrained on clean speech, noisy speech, and background noise data. BigVGAN is employed as the vocoder to generate speech samples from the enhanced mel-spectrogram and is retrained using only clean speech.
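The paper's exact VAE configuration is not given here; as a structural illustration only, a minimal convolutional VAE over mel-spectrograms might look like the following (channel counts, depth, and activations are assumptions):

```python
import torch
import torch.nn as nn

class MelVAE(nn.Module):
    """Minimal convolutional VAE over mel-spectrograms (structural sketch only)."""
    def __init__(self, latent_channels=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 2 * latent_channels, 3, padding=1),  # mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def encode(self, mel):
        mu, logvar = self.encoder(mel).chunk(2, dim=1)
        return mu + (0.5 * logvar).exp() * torch.randn_like(mu)  # reparameterization

    def decode(self, z):
        return self.decoder(z)
```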

Experiments were conducted using the LibriSpeech corpus for clean speech and the AudioSet corpus for noise data. The training set comprised 360 hours of speech and 250 hours of noise. Five noise types (laughing, gunshot, singing, car engine, and rain) were reserved as unseen noises for testing. Noisy-clean pairs and noisy-noise pairs were generated with varying Signal-to-Noise Ratio (SNR) levels. Performance was evaluated using Perceptual Evaluation of Speech Quality (PESQ), extended short-term objective intelligibility (ESTOI), scale-invariant signal-to-distortion ratio (SI-SDR), Wav2Vec MOS (WV-MOS), and Deep Noise Suppression MOS (DNSMOS). The proposed method, cLDM+DCL, was compared against several baselines, including CDiffuSE, SGMSE+, StoRM, NASE, Conv-TasNet, and MetricGAN+.
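Among these metrics, SI-SDR has a simple closed form, sketched below in NumPy; PESQ, ESTOI, WV-MOS, and DNSMOS require dedicated packages or pre-trained predictors:

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB (1-D signals)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Scaled projection of the estimate onto the reference defines the target.
    s_target = (estimate @ reference) / (reference @ reference) * reference
    e_noise = estimate - s_target
    return 10.0 * np.log10((s_target @ s_target) / (e_noise @ e_noise))
```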

Ablation studies examined the impact of the number of reverse process steps $T$ and the DCL scheme. Results showed that cLDM+DCL benefits from an increased number of reverse diffusion steps while maintaining a low real-time factor (RTF). Performance comparisons on seen-noise and unseen-noise test sets demonstrated that cLDM+DCL outperforms other diffusion-based methods, particularly on unseen noises, highlighting the effectiveness of learning noise distributions.
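The RTF is wall-clock processing time divided by the duration of the processed audio, so values below 1 indicate faster-than-real-time operation. A trivial measurement sketch (the `enhance_fn` interface is assumed):

```python
import time

def real_time_factor(enhance_fn, noisy_audio, sample_rate):
    """RTF = wall-clock processing time / duration of the processed audio."""
    start = time.perf_counter()
    enhance_fn(noisy_audio)
    return (time.perf_counter() - start) / (len(noisy_audio) / sample_rate)
```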

The paper concludes that the cLDM+DCL framework improves speech enhancement by operating in a low-dimensional latent space and effectively handling diverse noise environments through the DCL scheme. Experimental results validate the effectiveness of the approach on both seen and unseen noise conditions, as well as on out-of-domain datasets such as VoiceBank+DEMAND and DNS Challenge 2020.
