ROSE-CD: One-Step Speech Enhancement

Updated 12 July 2025
  • The paper introduces ROSE-CD, a one-step, diffusion-based model that uses a novel consistency distillation strategy to directly recover clean speech and drastically reduce inference latency.
  • It leverages randomized learning trajectories and joint time-domain auxiliary losses (PESQ and SI-SDR) to correct teacher-induced biases and enhance both objective and perceptual quality.
  • Experimental results on datasets like VoiceBank-DEMAND demonstrate state-of-the-art performance with a 54× speedup and robust generalization to out-of-domain and real-world noisy recordings.

ROSE-CD: Robust One-step Speech Enhancement via Consistency Distillation is a one-step, diffusion-based speech enhancement method that uses a novel form of consistency distillation to enable real-time, high-quality denoising with significantly reduced inference latency. Unlike prior diffusion-based models—which typically require dozens or hundreds of sampling steps—ROSE-CD achieves state-of-the-art enhancement in a single step, surpassing its teacher model in both objective and perceptual quality.

1. Background and Motivation

Diffusion models for speech enhancement have demonstrated strong performance by progressively denoising a noisy signal through multiple iterative steps. However, their practical deployment in real-time applications is hindered by the high computational cost of multi-step inference. Consistency distillation has emerged as a promising approach for compressing the multi-step denoising process into a single forward pass by training a “consistency model” to directly map a noisy input (at an arbitrary diffusion time) back to the clean speech.
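
For reference, the defining property such a consistency model $f_\theta$ is trained to satisfy can be written as follows; this uses standard consistency-model notation as a general statement of the idea, not an equation quoted from the paper:

$$f_\theta(x_t, y, t) = f_\theta(x_{t'}, y, t') \approx x_0 \quad \text{for any } t, t' \text{ on the same denoising trajectory}$$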

Despite this advance, previous consistency distillation schemes for speech enhancement suffer from limited robustness and are often constrained by residual biases arising from the sampling trajectory of the teacher diffusion model. These biases can propagate errors and make distilled models susceptible to inaccuracies present in the teacher’s predictions.

2. Methodology and Theoretical Foundations

At the core of ROSE-CD is a novel consistency distillation strategy designed to mitigate the bias and robustness limitations of standard consistency distillation. The approach can be summarized as follows:

  • Teacher Model: The teacher is a multi-step, score-based diffusion model trained with a score matching objective. The reverse stochastic differential equation (SDE) governing the denoising process is:

$$dx_t = \big[f(x_t, y) - g(t)^2 \nabla_{x_t} \log p_t(x_t \mid y)\big]\,dt + g(t)\,d\bar{w}$$

Given the intractability of directly computing the conditional score, the denoising score network is optimized as:

$$\mathcal{L}_{\text{score}} = \lambda(t)\,\big\| s_\theta(x_t, y, t) + z / \sigma(t) \big\|_2^2$$

  • Consistency Distillation: The distillation process is based on matching adjacent states along the ODE trajectory traversed by the teacher. For consecutive time points $t_n$ and $t_{n-1}$ with a small time step $\Delta t = t_n - t_{n-1}$, the teacher's ODE is numerically integrated (with a Heun or Euler solver) to estimate the preceding state:

$$\hat{x}_{t_{n-1}} = x_{t_n} + (t_{n-1} - t_n)\,\Phi(x_{t_n}, y, t_n)$$

The consistency model is then trained to map $x_{t_n}$, together with $(y, t_n)$, directly to the clean target output.

  • Randomized Learning Trajectory: Distilled consistency models are inherently shaped by the teacher's ODE trajectory, which can introduce bias if the teacher contains any systematic error. ROSE-CD therefore injects additional randomness into the target-construction step:

$$\hat{x}^r_{t_{n-1}} = x_{t_n} + (t_{n-1} - t_n)\,\Phi(x_{t_n}, y, t_n) + g(t)\sqrt{\Delta t}\,\epsilon$$

where $\epsilon \sim \mathcal{N}(0, I)$. This randomized perturbation forces the distilled model to generalize across local disturbances of the trajectory, improving robustness to noise and generalization to out-of-distribution data (a sketch of this target construction appears at the end of this section).

  • Joint Optimization with Time-domain Auxiliary Losses: ROSE-CD incorporates two time-domain auxiliary loss functions—PESQ and SI-SDR—into the consistency distillation objective:

$$\mathcal{L} = \mathcal{L}_{\text{RCD}} + \lambda_1\,\mathcal{L}_{\text{PESQ}}\big(\hat{x}_\theta(t_n), x_0\big) + \lambda_2\,\mathcal{L}_{\text{SI-SDR}}\big(\hat{x}_\theta(t_n), x_0\big)$$

where $\hat{x}_\theta(t_n)$ is the waveform reconstructed by applying the inverse STFT to the predicted spectrogram (this composite loss is also sketched at the end of this section).

This composite objective provides explicit, human-relevant feedback (perceptual evaluation and distortion) to the distilled model, enabling it to recover from teacher-induced errors and potentially exceed the teacher’s performance.
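
To make the pieces above concrete, the following minimal PyTorch sketch implements the teacher's score-matching objective and the two target-construction rules (plain Euler and randomized). All function and argument names are illustrative assumptions, and the forward perturbation is simplified to $x_t = x_0 + \sigma(t)z$; this is a sketch of the stated equations, not the paper's implementation.

```python
import math
import torch

def score_matching_loss(score_net, x0, y, t, sigma, lam):
    """Teacher objective: lambda(t) * || s_theta(x_t, y, t) + z / sigma(t) ||_2^2.
    The forward perturbation is simplified to x_t = x0 + sigma(t) * z (assumption)."""
    z = torch.randn_like(x0)
    sig = sigma(t)                                  # noise std at diffusion time t
    x_t = x0 + sig * z
    residual = score_net(x_t, y, t) + z / sig       # s_theta(x_t, y, t) + z / sigma(t)
    return lam(t) * residual.flatten(1).pow(2).sum(dim=1).mean()

def euler_target(x_tn, y, t_n, t_prev, Phi):
    """Deterministic Euler step along the teacher's ODE:
    x_hat_{t_{n-1}} = x_{t_n} + (t_{n-1} - t_n) * Phi(x_{t_n}, y, t_n)."""
    return x_tn + (t_prev - t_n) * Phi(x_tn, y, t_n)

def randomized_target(x_tn, y, t_n, t_prev, Phi, g):
    """ROSE-CD-style randomized target: the Euler estimate plus a Gaussian
    perturbation g(t) * sqrt(dt) * eps, eps ~ N(0, I), so the student must stay
    consistent under local disturbances of the trajectory."""
    eps = torch.randn_like(x_tn)
    dt = t_n - t_prev                               # Delta t = t_n - t_{n-1} > 0
    return euler_target(x_tn, y, t_n, t_prev, Phi) + g(t_n) * math.sqrt(dt) * eps
```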
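
The composite objective can be sketched in the same spirit. SI-SDR is written out from its standard definition; `pesq_loss` is a placeholder for a differentiable PESQ surrogate, and `istft` is assumed to invert the model's spectrogram representation:

```python
import torch

def si_sdr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SDR between estimated and reference waveforms
    of shape (batch, samples); negated so that lower is better."""
    est = est - est.mean(dim=-1, keepdim=True)      # zero-mean both signals
    ref = ref - ref.mean(dim=-1, keepdim=True)
    alpha = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * ref                            # scaled projection onto the reference
    noise = est - target
    ratio = target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps)
    return -10.0 * torch.log10(ratio + eps).mean()

def rose_cd_objective(l_rcd, spec_pred, x0_wave, istft, pesq_loss, lam1, lam2):
    """L = L_RCD + lam1 * L_PESQ + lam2 * L_SI-SDR, with both auxiliary terms
    computed on the iSTFT-reconstructed waveform."""
    wave_pred = istft(spec_pred)                    # spectrogram -> time domain
    return l_rcd + lam1 * pesq_loss(wave_pred, x0_wave) + lam2 * si_sdr_loss(wave_pred, x0_wave)
```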

3. Practical Implementation and Computational Considerations

ROSE-CD’s design translates into a highly efficient inference workflow. Unlike the teacher’s multi-step iterative procedure, the distilled model performs only a single reverse step from a randomly chosen time $t_n$ back to $t_0$ (clean speech). The resulting speedup is 54× compared to the 30-step teacher.
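
In pseudocode, inference then collapses to a single network call. A minimal sketch, assuming the start state is formed by perturbing the noisy input (an assumption based on the conditional setup above, not a detail confirmed by the source):

```python
import torch

@torch.no_grad()
def enhance_one_step(consistency_net, y_spec, t_n, sigma):
    """One-step enhancement: form a start state at diffusion time t_n and map it
    to the clean-speech estimate with a single forward pass."""
    z = torch.randn_like(y_spec)
    x_tn = y_spec + sigma(t_n) * z                  # start near the noisy input (assumption)
    return consistency_net(x_tn, y_spec, t_n)       # direct estimate of x_0
```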

Key implementation elements include:

  • Use of randomized time-step sampling during training to expose the model to a wide variety of initial conditions.
  • Application of stochastic noise augmentation during the construction of distillation targets to encourage robustness.
  • Joint use of differentiable time-domain losses, computed via iSTFT, to ensure accurate waveform reconstruction.
  • Retention of a single forward pass at inference, supporting real-time processing on standard hardware.
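
Schematically, these elements combine into a training step like the one below. The EMA target network and the stop-gradient on the target follow standard consistency-distillation practice and are assumptions here, as is the simplified forward process:

```python
import math
import random
import torch

def train_step(student, ema_student, teacher_Phi, g, sigma, d,
               x0_spec, y_spec, time_grid):
    """One schematic ROSE-CD training step (consistency term only).

    student     : one-step consistency model being distilled
    ema_student : EMA copy of the student that evaluates the target (assumption)
    teacher_Phi : teacher's ODE update direction
    d           : distance between model outputs (e.g., L2)
    time_grid   : discretized diffusion times t_0 < t_1 < ... < t_N
    """
    n = random.randrange(1, len(time_grid))         # randomized time-step sampling
    t_n, t_prev = time_grid[n], time_grid[n - 1]
    z = torch.randn_like(x0_spec)
    x_tn = x0_spec + sigma(t_n) * z                 # simplified forward process
    with torch.no_grad():                           # no gradient through the target
        eps = torch.randn_like(x_tn)                # stochastic noise augmentation
        x_hat = (x_tn + (t_prev - t_n) * teacher_Phi(x_tn, y_spec, t_n)
                 + g(t_n) * math.sqrt(t_n - t_prev) * eps)
        target = ema_student(x_hat, y_spec, t_prev)
    pred = student(x_tn, y_spec, t_n)
    return d(pred, target)                          # consistency term of the loss
```

The time-domain auxiliary losses from Section 2 would be added to this consistency term before backpropagation.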

4. Experimental Results and Performance Analysis

On the VoiceBank-DEMAND dataset, ROSE-CD established new benchmarks for:

  • PESQ (Perceptual Evaluation of Speech Quality): Achieved scores up to 3.99 under specific loss weighting—surpassing the 30-step teacher model.
  • SI-SDR (Scale-Invariant Signal-to-Distortion Ratio): Achieved 17.80 dB.
  • MOS-SSL / WV-MOS (objective speech quality metrics): Achieved superior or state-of-the-art scores relative to prior predictive and diffusion models.
  • Inference speed: 54× faster than the teacher, supporting practical real-time deployment.

The model further demonstrated substantial improvements on out-of-domain datasets (TIMIT+NOISE92) and real-world DNS Challenge 2020 recordings. Visualizations and tabulated results in the original work document the consistency of the improvement across metrics and data domains.

A notable observation from the training process was the trade-off between PESQ and SI-SDR: optimizing exclusively for one can degrade the other. ROSE-CD’s joint optimization was shown to provide a robust balance, yielding overall superior performance.

5. Relation to Consistency Models and Prior Frameworks

ROSE-CD extends the line of consistency modeling approaches (e.g., SE-Bridge (2305.13796)), which compress the denoising trajectory of a diffusion process into a single mapping via consistency constraints. While previous work focused primarily on mapping the deterministic probability-flow ODE (PF-ODE) trajectory and achieving consistency along it, ROSE-CD addresses two lingering practical issues:

  • Bias Correction: By introducing random perturbations in the learning trajectory, ROSE-CD prevents the distilled consistency model from overfitting to the teacher’s specific ODE path, fostering improved generalization and error recovery.
  • Robustness via Auxiliary Losses: Time-domain, perceptually aligned auxiliary losses ensure that the one-step model is optimized for both objective signal fidelity and perceptual speech quality—directly addressing issues of teacher-induced suboptimal solutions.

6. Generalization and Real-world Applicability

Extensive experiments demonstrate that ROSE-CD generalizes well to out-of-domain datasets and real-world noisy recordings. Empirical results confirm its robustness to a wide range of noise conditions and distributions unobserved in training.

The one-step inference architecture, combined with the robustness-oriented training regimen, makes ROSE-CD particularly suitable for real-time applications such as voice communication systems, hearing aids, and edge deployment scenarios where computational latency and reliability are critical considerations.

7. Summary Table of Key Innovations

| Component | Description | Role in ROSE-CD |
| --- | --- | --- |
| Consistency Distillation | Maps the noisy state at an arbitrary time $t_n$ to an estimate of clean speech $x_0$ in a single step by mimicking the teacher's ODE trajectory | Enables fast (one-step) inference |
| Randomized Learning Trajectory | Adds noise to the target state during training | Improves robustness; prevents bias |
| Time-domain Auxiliary Losses | Jointly optimizes PESQ and SI-SDR in the waveform domain | Ensures high perceptual and signal quality |
| Empirical Speedup | 54× faster than the multi-step teacher model | Real-time, practical enhancement |
| Generalization Validation | Tested on out-of-domain and real-world data | Demonstrates practical applicability |

8. Implications and Future Directions

ROSE-CD establishes a new paradigm for robust, efficient, and high-quality speech enhancement using one-step consistency distillation. Its randomized trajectory-based robustness and auxiliary loss framework provide a foundation for further research in consistency-based generative models, real-time audio processing, and cross-domain application to other sequence enhancement tasks. The approach suggests that similar strategies may be applied to other domains where diffusion-based models are hindered by iterative inference bottlenecks (2507.05688).
