Speech Denoising in the Waveform Domain with Self-Attention (2202.07790v3)

Published 15 Feb 2022 in cs.SD, cs.LG, and eess.AS

Abstract: In this work, we present CleanUNet, a causal speech denoising model on the raw waveform. The proposed model is based on an encoder-decoder architecture combined with several self-attention blocks to refine its bottleneck representations, which is crucial to obtain good results. The model is optimized through a set of losses defined over both waveform and multi-resolution spectrograms. The proposed method outperforms the state-of-the-art models in terms of denoised speech quality from various objective and subjective evaluation metrics. We release our code and models at https://github.com/nvidia/cleanunet.

Summary

The paper presents CleanUNet, a novel causal speech denoising model that integrates U-Net architecture with masked self-attention for real-time waveform processing.
It employs dual-loss optimization, combining ℓ1 waveform loss and multi-resolution STFT loss to preserve high-frequency details and improve intelligibility.
Experiments demonstrate that CleanUNet outperforms existing models in PESQ, STOI, and perceptual evaluations on multiple challenging datasets.

An Analysis of CleanUNet: Causal Speech Denoising in the Waveform Domain

This article analyzes CleanUNet, a causal speech denoising model that operates directly on raw waveforms and combines a U-Net architecture with self-attention. The paper targets shortcomings of earlier speech denoising techniques and proposes a model optimized for both objective and subjective quality measures.

Model Architecture and Loss Optimization

CleanUNet builds on the U-Net architecture, an encoder-decoder with skip connections that is well suited to dense prediction tasks. It extends this structure with masked self-attention blocks at the bottleneck, which refine the bottleneck representations. The masking keeps attention causal, so each position attends only to the current and earlier positions, and the authors describe this refinement as crucial to obtaining good results. The architecture also uses only causal convolutions, which makes it suitable for real-time use in applications such as teleconferencing and audio calls.
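The two causal building blocks described above can be illustrated with a minimal sketch in plain Python. This is not the released CleanUNet code: the function names `causal_conv1d` and `causal_self_attention` are hypothetical, the attention is single-head with no learned projections, and lists stand in for tensors purely for readability.

```python
import math


def causal_conv1d(x, kernel):
    """Causal 1-D convolution: output[t] depends only on x[t-K+1 .. t].

    Illustrative helper (not from the paper). kernel[-1] weights the
    current sample and kernel[0] weights the oldest one.
    """
    k = len(kernel)
    # Pad on the left only, so no future sample can leak into output[t].
    padded = [0.0] * (k - 1) + list(x)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(x))
    ]


def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of equal-length feature vectors, one per
    time step. Position t attends to positions 0..t only.
    """
    dim = len(queries[0])
    outputs = []
    for t, q in enumerate(queries):
        # Masking is implemented by simply never scoring positions > t.
        scores = [
            sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[s][i] for s, w in enumerate(weights)) / total
            for i in range(len(values[0]))
        ])
    return outputs
```

A quick way to check causality with either function is to change only the later inputs and confirm that the earlier outputs stay the same. That property is what allows a model built from such blocks to process audio as it arrives.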

CleanUNet is trained with a combined objective: an $\ell_1$ loss on the waveform plus a multi-resolution Short-Time Fourier Transform (STFT) loss. A high-band variant of the STFT loss constrains only the high-frequency components, which preserves their fidelity and addresses the frequency imbalance that a full-band loss can introduce. Together, these losses target both waveform-level reconstruction accuracy and perceptual quality, as measured by objective metrics such as PESQ and STOI.
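The combined objective can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the STFT term uses the common spectral-convergence plus log-magnitude formulation, and the resolutions, equal weighting, choice of the upper half of the frequency bins as the "high band", and all function names are assumptions made for the example. A plain-Python DFT is used so the block has no dependencies. A practical version would use a tensor library's STFT.

```python
import cmath
import math


def stft_magnitude(signal, n_fft, hop):
    """Magnitude spectrogram (frames x bins) from a Hann-windowed DFT."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(n_fft)]
        frames.append([
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                    for n in range(n_fft)))
            for k in range(n_fft // 2 + 1)
        ])
    return frames


def stft_loss(clean, estimate, n_fft, hop, min_bin=0, eps=1e-7):
    """Spectral convergence + mean log-magnitude distance at one resolution."""
    if len(clean) < n_fft or len(estimate) < n_fft:
        raise ValueError("signals must be at least n_fft samples long")
    spec_clean = stft_magnitude(clean, n_fft, hop)
    spec_est = stft_magnitude(estimate, n_fft, hop)
    diff_sq = ref_sq = log_dist = 0.0
    count = 0
    for frame_c, frame_e in zip(spec_clean, spec_est):
        # Bins below min_bin are ignored; min_bin > 0 gives a high-band loss.
        for c, e in zip(frame_c[min_bin:], frame_e[min_bin:]):
            diff_sq += (c - e) ** 2
            ref_sq += c ** 2
            log_dist += abs(math.log(c + eps) - math.log(e + eps))
            count += 1
    return math.sqrt(diff_sq) / (math.sqrt(ref_sq) + eps) + log_dist / count


def denoising_loss(clean, estimate, resolutions=((64, 16), (128, 32)),
                   high_band_only=False):
    """l1 waveform loss plus an STFT loss averaged over several resolutions.

    The resolutions and the equal weighting are illustrative assumptions.
    """
    l1 = sum(abs(c - e) for c, e in zip(clean, estimate)) / len(clean)
    spectral = 0.0
    for n_fft, hop in resolutions:
        # Assumed high band: the upper half of the bins (n_fft/4 .. n_fft/2).
        min_bin = n_fft // 4 if high_band_only else 0
        spectral += stft_loss(clean, estimate, n_fft, hop, min_bin=min_bin)
    return l1 + spectral / len(resolutions)
```

Calling `denoising_loss(clean, estimate)` on two equal-length sample lists returns the full-band objective, and passing `high_band_only=True` restricts the spectral term to the upper bins. Comparing magnitudes at several window sizes trades time resolution against frequency resolution, which is the motivation for using more than one STFT configuration.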

Experimental Evaluation and Results

CleanUNet is evaluated on multiple datasets: DNS, Valentini, and an internal dataset curated for challenging denoising conditions. It outperforms leading models on several objective metrics, including PESQ (both wideband and narrowband) and STOI, indicating gains in the quality and intelligibility of the processed speech. In subjective listening tests, it also achieves higher SIG (speech signal quality) and OVRL (overall quality) scores.

The results are notable for CleanUNet's advantage over prominent existing methods, such as FAIR-denoiser, in both objective measures and listener-based evaluations. The paper also examines its architectural choices, specifically the depth of the self-attention layers and the exclusion of resampling, and reports that both contribute to the model's performance.

Theoretical and Practical Implications

CleanUNet has implications for both speech processing research and real-time application development. Because every operation in its encoder-decoder is causal, the model can be integrated into live audio processing systems where low latency is critical. The work also adds concrete evidence that self-attention improves a conventional convolutional architecture, which may inspire similar adaptations in other domains that model raw waveforms.

In practice, CleanUNet could benefit fields that rely on high-fidelity speech signals, including adaptive hearing devices, voice-controlled systems, and digital communications platforms. The architecture may also apply to audio processing tasks beyond speech, serving as a template for suppressing non-stationary noise.

Conclusion and Future Directions

CleanUNet represents a significant step in waveform-domain speech denoising, combining architectural changes with carefully designed loss functions to achieve state-of-the-art results. Future work could extend its principles to other complex audio scenarios or integrate personalized noise profiles for adaptive applications. Further work could also improve CleanUNet's computational efficiency so that it can run in more constrained environments while maintaining accuracy and perceptual quality.