- The paper introduces noise tokens as neural noise templates co-trained with speech enhancement systems to dynamically adapt to unseen environmental noise.
- It systematically evaluates three architectures (BLSTM, VoiceFilter, and Transformer), demonstrating consistent gains over traditional noise-aware training as measured by PESQ and STOI scores.
- It further improves noise suppression by generating waveforms with a WaveRNN-based neural vocoder instead of the inverse short-time Fourier transform (ISTFT), reducing phase distortion in the enhanced speech.
Noise Tokens: Learning Neural Noise Templates for Environment-Aware Speech Enhancement
The paper "Noise Tokens: Learning Neural Noise Templates for Environment-Aware Speech Enhancement" presents a novel approach to enhancing the generalization ability of speech enhancement (SE) systems in response to unseen environmental noise. Traditional deep neural networks (DNNs), although successful, often face limitations when encountering new noise profiles outside their training data. The researchers propose the concept of "noise tokens" (NTs)—a set of neural noise templates that are co-trained with the SE system, aiming to offer a dynamic solution by adapting to environmental changes.
Key Contributions
- Noise Tokens (NTs) Introduction: The paper adapts the concept of style tokens from expressive speech synthesis to SE. A noise token layer attends over a bank of trainable noise templates to extract a noise embedding from the noisy input, and that embedding conditions the SE model on the current noise environment (see the sketch after this list).
- Evaluation Across Three SE Architectures: The paper systematically evaluates the efficacy of NTs on three state-of-the-art DNN architectures (BLSTM, VoiceFilter, and Transformer-based models), examining how much each gains from the noise embedding.
- Waveform Generation via Neural Vocoder: The paper explores generating the enhanced waveform with a neural vocoder, specifically WaveRNN, rather than the conventional inverse short-time Fourier transform (ISTFT), which reuses the noisy phase. The vocoder is reported to yield stronger noise suppression and less phase distortion in the enhanced speech.
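To make the first contribution concrete, here is a minimal PyTorch sketch of a noise token layer: single-head dot-product attention over a bank of trainable templates, queried by a small reference encoder that summarizes the noisy spectrogram. All hyperparameters (number of tokens, dimensions, GRU reference encoder) are illustrative assumptions, not the paper's exact configuration, which may use a convolutional encoder and multi-head attention as in the style-token literature.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseTokenLayer(nn.Module):
    """Sketch of a noise token layer: attention over trainable noise
    templates yields a per-utterance noise embedding. Sizes are
    illustrative, not taken from the paper."""

    def __init__(self, num_tokens=10, token_dim=128, ref_dim=128, freq_bins=257):
        super().__init__()
        # Bank of trainable noise templates (the "noise tokens").
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim))
        # Reference encoder: summarizes the noisy spectrogram into a query.
        self.ref_encoder = nn.GRU(input_size=freq_bins, hidden_size=ref_dim,
                                  batch_first=True)
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, noisy_spec):
        # noisy_spec: (batch, frames, freq_bins) magnitude spectrogram.
        _, h = self.ref_encoder(noisy_spec)           # h: (1, batch, ref_dim)
        query = self.query_proj(h[-1])                # (batch, token_dim)
        # Scaled dot-product attention weights over the token bank.
        scores = query @ self.tokens.t() / self.tokens.shape[1] ** 0.5
        weights = F.softmax(scores, dim=-1)           # (batch, num_tokens)
        # Noise embedding: weighted sum of the templates.
        return weights @ self.tokens                  # (batch, token_dim)
```

The SE network would then consume this embedding, for instance by concatenating it with every frame of its input features; the exact conditioning mechanism is an implementation choice.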
Experimental Validation
The authors conducted comprehensive experiments to validate the proposed NTs. Across test sets containing diverse, previously unseen noise types, integrating NTs consistently and substantially improved the generalization of the SE systems. Notably, NTs outperformed dynamic noise-aware training (DNAT), which conditions the network on an explicit noise estimate appended to the input features (a minimal sketch of that baseline follows), suggesting that learned noise embeddings adapt better than hand-crafted noise estimates.
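For context, the DNAT baseline can be sketched as follows: a running noise estimate is appended to every frame of the noisy features before they enter the SE network. The recursive-averaging tracker below is a deliberately simple, assumed stand-in for whatever estimator a real DNAT system would use (e.g., minimum statistics or a VAD-gated average); the function name and smoothing factor are hypothetical.

```python
import numpy as np

def dnat_features(noisy_logspec, alpha=0.98):
    """Build DNAT-style inputs: append a running noise estimate to each
    frame of the noisy log-spectrogram. The recursive-averaging tracker
    here is only an illustrative stand-in for a real noise estimator."""
    # noisy_logspec: (frames, freq_bins) log-magnitude spectrogram.
    noise_track = np.empty_like(noisy_logspec)
    estimate = noisy_logspec[0].copy()
    for t, frame in enumerate(noisy_logspec):
        estimate = alpha * estimate + (1.0 - alpha) * frame
        noise_track[t] = estimate
    # The SE network sees [noisy features | noise estimate] per frame.
    return np.concatenate([noisy_logspec, noise_track], axis=1)
```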
- The performance metrics used for evaluation were PESQ (Perceptual Evaluation of Speech Quality) and STOI (Short-Time Objective Intelligibility). Models with NTs consistently scored higher on both metrics than baseline and comparison models without them; both metrics can be computed with standard open-source packages, as sketched below.
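As a usage note, here is how these two metrics are typically computed against a clean reference using the `pesq` and `pystoi` Python packages; the file names are placeholders.

```python
import soundfile as sf
from pesq import pesq      # pip install pesq
from pystoi import stoi    # pip install pystoi

# Score an enhanced utterance against its clean reference.
clean, fs = sf.read("clean.wav")        # placeholder paths
enhanced, _ = sf.read("enhanced.wav")

# PESQ expects 8 or 16 kHz audio: 'wb' (wideband) at 16 kHz, 'nb' at 8 kHz.
pesq_score = pesq(fs, clean, enhanced, "wb" if fs == 16000 else "nb")

# STOI returns an intelligibility score in [0, 1]; higher is better.
stoi_score = stoi(clean, enhanced, fs, extended=False)

print(f"PESQ: {pesq_score:.2f}  STOI: {stoi_score:.3f}")
```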
Implications and Future Directions
The proposed noise token approach has practical implications for building more robust speech enhancement systems. By adapting to dynamic noise environments, it broadens the utility of SE in real-world applications such as hearing aids, speech communication, and front ends for automatic speech recognition. The results also suggest that neural vocoders can play a larger role in SE pipelines, further improving both quality and noise suppression.
Future work could focus on improving the stability of vocoder-generated speech, which remains a limitation in certain edge cases. Integrating the waveform generation module with the noise tokens, to further strengthen noise suppression, is another promising avenue, as is exploring alternative neural vocoders or hybrid configurations to optimize performance.
In conclusion, noise tokens mark a step forward for SE models, suggesting a shift from static systems toward ones that dynamically learn and adapt to their acoustic surroundings. This approach aligns with current trends in adaptive neural systems and lays the groundwork for future work on fully environment-aware speech processing technologies.