ROSE-CD: One-Step Speech Enhancement

Updated 12 July 2025
  • The paper introduces ROSE-CD, a one-step, diffusion-based model that uses a novel consistency distillation strategy to directly recover clean speech and drastically reduce inference latency.
  • It leverages randomized learning trajectories and joint time-domain auxiliary losses (PESQ and SI-SDR) to correct teacher-induced biases and enhance both objective and perceptual quality.
  • Experimental results on datasets like VoiceBank-DEMAND demonstrate state-of-the-art performance with a 54× speedup and robust generalization to out-of-domain and real-world noisy recordings.

ROSE-CD: Robust One-step Speech Enhancement via Consistency Distillation is a one-step, diffusion-based speech enhancement method that uses a novel form of consistency distillation to enable real-time, high-quality denoising with significantly reduced inference latency. Unlike prior diffusion-based models—which typically require dozens or hundreds of sampling steps—ROSE-CD achieves state-of-the-art enhancement in a single step, surpassing its teacher model in both objective and perceptual quality.

1. Background and Motivation

Diffusion models for speech enhancement have demonstrated strong performance by progressively denoising a noisy signal through multiple iterative steps. However, their practical deployment in real-time applications is hindered by the high computational cost of multi-step inference. Consistency distillation has emerged as a promising approach for compressing the multi-step denoising process into a single forward pass by training a “consistency model” to directly map a noisy input (at an arbitrary diffusion time) back to the clean speech.
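
For reference, the defining property such a consistency model $f_\theta$ is trained to satisfy can be written as follows; this uses standard consistency-model notation as a general statement of the idea, not an equation quoted from the paper:

$$f_\theta(x_t, y, t) = f_\theta(x_{t'}, y, t') \approx x_0 \quad \text{for any } t, t' \text{ on the same denoising trajectory}$$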

Despite this advance, previous consistency distillation schemes for speech enhancement suffer from limited robustness and are often constrained by residual biases arising from the sampling trajectory of the teacher diffusion model. These biases can propagate errors and make distilled models susceptible to inaccuracies present in the teacher’s predictions.

2. Methodology and Theoretical Foundations

At the core of ROSE-CD is a novel consistency distillation strategy designed to mitigate the bias and robustness limitations of standard consistency distillation. The approach can be summarized as follows:

  • Teacher Model: The teacher is a multi-step, score-based diffusion model trained with a score matching objective. The reverse stochastic differential equation (SDE) governing the denoising process is:

$$dx_t = \big[f(x_t, y) - g(t)^2 \nabla_{x_t} \log p_t(x_t \mid y)\big]\,dt + g(t)\,d\bar{w}$$

Given the intractability of directly computing the conditional score, the denoising score network is optimized as:

$$\mathcal{L}_{\text{score}} = \lambda(t)\,\big\| s_\theta(x_t, y, t) + z / \sigma(t) \big\|_2^2$$

  • Consistency Distillation: The distillation process is based on matching adjacent states along the ODE trajectory traversed by the teacher. For consecutive time points $t_n$ and $t_{n-1}$ with a small time step $\Delta t = t_n - t_{n-1}$, the teacher's ODE is numerically integrated (with a Heun or Euler solver) to estimate the preceding state:

$$\hat{x}_{t_{n-1}} = x_{t_n} + (t_{n-1} - t_n)\,\Phi(x_{t_n}, y, t_n)$$

The consistency model is then trained to map $x_{t_n}$, together with $(y, t_n)$, directly to the clean target output.

  • Randomized Learning Trajectory: Distilled consistency models are inherently shaped by the teacher's ODE trajectory, which can introduce bias if the teacher contains any systematic error. ROSE-CD therefore injects additional randomness into the target-construction step:

$$\hat{x}^r_{t_{n-1}} = x_{t_n} + (t_{n-1} - t_n)\,\Phi(x_{t_n}, y, t_n) + g(t)\sqrt{\Delta t}\,\epsilon$$

where $\epsilon \sim \mathcal{N}(0, I)$. This randomized perturbation forces the distilled model to generalize across local disturbances of the trajectory, improving robustness to noise and generalization to out-of-distribution data (a sketch of this target construction appears at the end of this section).

  • Joint Optimization with Time-domain Auxiliary Losses: ROSE-CD incorporates two time-domain auxiliary loss functions—PESQ and SI-SDR—into the consistency distillation objective:

$$\mathcal{L} = \mathcal{L}_{\text{RCD}} + \lambda_1\,\mathcal{L}_{\text{PESQ}}\big(\hat{x}_\theta(t_n), x_0\big) + \lambda_2\,\mathcal{L}_{\text{SI-SDR}}\big(\hat{x}_\theta(t_n), x_0\big)$$

where $\hat{x}_\theta(t_n)$ is the waveform reconstructed by applying the inverse STFT to the predicted spectrogram (this composite loss is also sketched at the end of this section).

This composite objective provides explicit, human-relevant feedback (perceptual evaluation and distortion) to the distilled model, enabling it to recover from teacher-induced errors and potentially exceed the teacher’s performance.
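
To make the pieces above concrete, the following minimal PyTorch sketch implements the teacher's score-matching objective and the two target-construction rules (plain Euler and randomized). All function and argument names are illustrative assumptions, and the forward perturbation is simplified to $x_t = x_0 + \sigma(t)z$; this is a sketch of the stated equations, not the paper's implementation.

```python
import math
import torch

def score_matching_loss(score_net, x0, y, t, sigma, lam):
    """Teacher objective: lambda(t) * || s_theta(x_t, y, t) + z / sigma(t) ||_2^2.
    The forward perturbation is simplified to x_t = x0 + sigma(t) * z (assumption)."""
    z = torch.randn_like(x0)
    sig = sigma(t)                                  # noise std at diffusion time t
    x_t = x0 + sig * z
    residual = score_net(x_t, y, t) + z / sig       # s_theta(x_t, y, t) + z / sigma(t)
    return lam(t) * residual.flatten(1).pow(2).sum(dim=1).mean()

def euler_target(x_tn, y, t_n, t_prev, Phi):
    """Deterministic Euler step along the teacher's ODE:
    x_hat_{t_{n-1}} = x_{t_n} + (t_{n-1} - t_n) * Phi(x_{t_n}, y, t_n)."""
    return x_tn + (t_prev - t_n) * Phi(x_tn, y, t_n)

def randomized_target(x_tn, y, t_n, t_prev, Phi, g):
    """ROSE-CD-style randomized target: the Euler estimate plus a Gaussian
    perturbation g(t) * sqrt(dt) * eps, eps ~ N(0, I), so the student must stay
    consistent under local disturbances of the trajectory."""
    eps = torch.randn_like(x_tn)
    dt = t_n - t_prev                               # Delta t = t_n - t_{n-1} > 0
    return euler_target(x_tn, y, t_n, t_prev, Phi) + g(t_n) * math.sqrt(dt) * eps
```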
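
The composite objective can be sketched in the same spirit. SI-SDR is written out from its standard definition; `pesq_loss` is a placeholder for a differentiable PESQ surrogate, and `istft` is assumed to invert the model's spectrogram representation:

```python
import torch

def si_sdr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SDR between estimated and reference waveforms
    of shape (batch, samples); negated so that lower is better."""
    est = est - est.mean(dim=-1, keepdim=True)      # zero-mean both signals
    ref = ref - ref.mean(dim=-1, keepdim=True)
    alpha = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * ref                            # scaled projection onto the reference
    noise = est - target
    ratio = target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps)
    return -10.0 * torch.log10(ratio + eps).mean()

def rose_cd_objective(l_rcd, spec_pred, x0_wave, istft, pesq_loss, lam1, lam2):
    """L = L_RCD + lam1 * L_PESQ + lam2 * L_SI-SDR, with both auxiliary terms
    computed on the iSTFT-reconstructed waveform."""
    wave_pred = istft(spec_pred)                    # spectrogram -> time domain
    return l_rcd + lam1 * pesq_loss(wave_pred, x0_wave) + lam2 * si_sdr_loss(wave_pred, x0_wave)
```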

3. Practical Implementation and Computational Considerations

ROSE-CD’s design translates into a highly efficient inference workflow. Unlike the teacher’s multi-step iterative procedure, the distilled model performs only a single reverse step from a randomly chosen time $t_n$ back to $t_0$ (clean speech). The resulting speedup is 54× compared to the 30-step teacher.
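
In pseudocode, inference then collapses to a single network call. A minimal sketch, assuming the start state is formed by perturbing the noisy input (an assumption based on the conditional setup above, not a detail confirmed by the source):

```python
import torch

@torch.no_grad()
def enhance_one_step(consistency_net, y_spec, t_n, sigma):
    """One-step enhancement: form a start state at diffusion time t_n and map it
    to the clean-speech estimate with a single forward pass."""
    z = torch.randn_like(y_spec)
    x_tn = y_spec + sigma(t_n) * z                  # start near the noisy input (assumption)
    return consistency_net(x_tn, y_spec, t_n)       # direct estimate of x_0
```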

Key implementation elements include:

  • Use of randomized time-step sampling during training to expose the model to a wide variety of initial conditions.
  • Application of stochastic noise augmentation during the construction of distillation targets to encourage robustness.
  • Joint use of differentiable time-domain losses, computed via iSTFT, to ensure accurate waveform reconstruction.
  • Retention of a single forward pass at inference, supporting real-time processing on standard hardware.
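
Schematically, these elements combine into a training step like the one below. The EMA target network and the stop-gradient on the target follow standard consistency-distillation practice and are assumptions here, as is the simplified forward process:

```python
import math
import random
import torch

def train_step(student, ema_student, teacher_Phi, g, sigma, d,
               x0_spec, y_spec, time_grid):
    """One schematic ROSE-CD training step (consistency term only).

    student     : one-step consistency model being distilled
    ema_student : EMA copy of the student that evaluates the target (assumption)
    teacher_Phi : teacher's ODE update direction
    d           : distance between model outputs (e.g., L2)
    time_grid   : discretized diffusion times t_0 < t_1 < ... < t_N
    """
    n = random.randrange(1, len(time_grid))         # randomized time-step sampling
    t_n, t_prev = time_grid[n], time_grid[n - 1]
    z = torch.randn_like(x0_spec)
    x_tn = x0_spec + sigma(t_n) * z                 # simplified forward process
    with torch.no_grad():                           # no gradient through the target
        eps = torch.randn_like(x_tn)                # stochastic noise augmentation
        x_hat = (x_tn + (t_prev - t_n) * teacher_Phi(x_tn, y_spec, t_n)
                 + g(t_n) * math.sqrt(t_n - t_prev) * eps)
        target = ema_student(x_hat, y_spec, t_prev)
    pred = student(x_tn, y_spec, t_n)
    return d(pred, target)                          # consistency term of the loss
```

The time-domain auxiliary losses from Section 2 would be added to this consistency term before backpropagation.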

4. Experimental Results and Performance Analysis

On the VoiceBank-DEMAND dataset, ROSE-CD established new benchmarks for:

  • PESQ (Perceptual Evaluation of Speech Quality): Achieved scores up to 3.99 under specific loss weighting—surpassing the 30-step teacher model.
  • SI-SDR (Scale-Invariant Signal-to-Distortion Ratio): Achieved 17.80 dB.
  • MOS-SSL / WV-MOS (objective speech quality metrics): Achieved superior or state-of-the-art scores relative to prior predictive and diffusion models.
  • Inference speed: 54× faster than the teacher, supporting practical real-time deployment.

The model further demonstrated substantial improvements on out-of-domain datasets (TIMIT+NOISE92) and real-world DNS Challenge 2020 recordings. Visualizations and tabulated results in the original work document the consistency of the improvement across metrics and data domains.

A notable observation from the training process was the trade-off between PESQ and SI-SDR: optimizing exclusively for one can degrade the other. ROSE-CD’s joint optimization was shown to provide a robust balance, yielding overall superior performance.

5. Relation to Consistency Models and Prior Frameworks

ROSE-CD extends the line of consistency modeling approaches (e.g., SE-Bridge (2305.13796)), which compress the denoising trajectory of a diffusion process into a single mapping via consistency constraints. While previous work focused primarily on mapping the deterministic probability-flow ODE (PF-ODE) trajectory and achieving consistency along it, ROSE-CD addresses two lingering practical issues:

  • Bias Correction: By introducing random perturbations in the learning trajectory, ROSE-CD prevents the distilled consistency model from overfitting to the teacher’s specific ODE path, fostering improved generalization and error recovery.
  • Robustness via Auxiliary Losses: Time-domain, perceptually aligned auxiliary losses ensure that the one-step model is optimized for both objective signal fidelity and perceptual speech quality—directly addressing issues of teacher-induced suboptimal solutions.

6. Generalization and Real-world Applicability

Extensive experiments demonstrate that ROSE-CD generalizes well to out-of-domain datasets and real-world noisy recordings. Empirical results confirm its robustness to a wide range of noise conditions and distributions unobserved in training.

The one-step inference architecture, combined with the robustness-oriented training regimen, makes ROSE-CD particularly suitable for real-time applications such as voice communication systems, hearing aids, and edge deployment scenarios where computational latency and reliability are critical considerations.

7. Summary Table of Key Innovations

| Component | Description | Role in ROSE-CD |
| --- | --- | --- |
| Consistency Distillation | Maps the noisy state at an arbitrary time $t_n$ to an estimate of clean speech $x_0$ in a single step by mimicking the teacher's ODE trajectory | Enables fast (one-step) inference |
| Randomized Learning Trajectory | Adds noise to the target state during training | Improves robustness; prevents bias |
| Time-domain Auxiliary Losses | Jointly optimizes PESQ and SI-SDR in the waveform domain | Ensures high perceptual and signal quality |
| Empirical Speedup | 54× faster than the multi-step teacher model | Real-time, practical enhancement |
| Generalization Validation | Tested on out-of-domain and real-world data | Demonstrates practical applicability |

8. Implications and Future Directions

ROSE-CD establishes a new paradigm for robust, efficient, and high-quality speech enhancement using one-step consistency distillation. Its randomized trajectory-based robustness and auxiliary loss framework provide a foundation for further research in consistency-based generative models, real-time audio processing, and cross-domain application to other sequence enhancement tasks. The approach suggests that similar strategies may be applied to other domains where diffusion-based models are hindered by iterative inference bottlenecks (2507.05688).
