- The paper introduces T-GSA, which uses Gaussian-weighted self-attention to overcome the limitations of standard Transformers in speech enhancement.
- The model attenuates attention weights according to the distance between symbols using a Gaussian profile, yielding significant improvements in SDR and PESQ metrics.
- The study employs an end-to-end, multi-task denoising approach to enhance both signal quality and perceptual clarity, improving on prior state-of-the-art speech enhancement results.
T-GSA: Transformer with Gaussian-Weighted Self-Attention for Speech Enhancement
The paper "T-GSA: Transformer with Gaussian-Weighted Self-Attention for Speech Enhancement" addresses the limitations observed in conventional Transformer neural networks (TNN) when applied to speech enhancement tasks. Despite their success in many NLP applications, standard Transformers have struggled with acoustic signal processing due to inappropriate self-attention properties for speech signals. This paper introduces a novel architecture, T-GSA, which adapts the self-attention mechanism of Transformers using Gaussian weighting to improve performance in speech enhancement scenarios.
Evaluation and Results
The T-GSA model introduces Gaussian-weighted self-attention (GSA), in which attention weights are attenuated according to the distance between the target and context symbols, following a Gaussian profile. Experimental evaluation on the QUT-NOISE-TIMIT and VoiceBank-DEMAND corpora demonstrates significant improvements in Signal-to-Distortion Ratio (SDR) and Perceptual Evaluation of Speech Quality (PESQ) over prior models, both Transformer-based and recurrent approaches such as a CNN-LSTM. In particular, T-GSA outperformed earlier Transformer variants, including additive attention-biasing schemes and the original Transformer encoder, delivering superior results across a range of SNR conditions.
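To make the mechanism concrete, the sketch below implements a single-head self-attention layer with a Gaussian distance weight. It is a minimal illustration, assuming the Gaussian weight multiplies the pre-softmax score matrix and that its spread is a trainable parameter; the paper's exact placement of the weight and its per-head parameterization may differ.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianWeightedSelfAttention(nn.Module):
    """Single-head self-attention with a Gaussian distance penalty.

    Illustrative sketch of the T-GSA idea: attention between frames i and j
    is attenuated by exp(-(i - j)^2 / sigma^2). The placement of the weight
    (multiplied into the pre-softmax scores here) and the trainable variance
    are assumptions, not a verbatim reproduction of the paper's equations.
    """

    def __init__(self, d_model: int, init_sigma: float = 10.0):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Trainable spread of the Gaussian window (one per layer here).
        self.log_sigma = nn.Parameter(torch.tensor(math.log(init_sigma)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))

        # Gaussian weight g[i, j] = exp(-(i - j)^2 / sigma^2): nearby frames
        # keep their scores, distant frames are attenuated toward zero.
        t = torch.arange(x.size(1), device=x.device, dtype=x.dtype)
        dist2 = (t[:, None] - t[None, :]) ** 2
        g = torch.exp(-dist2 / self.log_sigma.exp() ** 2)

        attn = F.softmax(g * scores, dim=-1)
        return attn @ v
```

Note that as sigma grows large, g approaches 1 everywhere and the layer degenerates to standard self-attention, so making the window width learnable lets training decide how much locality each layer needs.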
Key Contributions
- Gaussian-weighted Self-Attention (GSA): Attention weights are attenuated according to the distance between the target and context symbols (as in the sketch above), reflecting the distance-dependent correlation structure inherent to acoustic signals.
- Complex Transformer Architecture: The Transformer is extended to process both the real and imaginary components of the STFT, exploiting the correlation between the two paths to improve speech signal reconstruction, albeit with mixed results on PESQ scores (see the first sketch after this list).
- End-to-End Metric Optimization: A multi-task denoising scheme jointly optimizes SDR- and PESQ-oriented objectives, surpassing generative models such as SEGAN and WaveNet in speech quality enhancement (see the second sketch below).
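The complex extension operates on real and imaginary STFT components. The snippet below shows only the output stage, assuming the network predicts a complex ratio mask whose real and imaginary parts recombine with the noisy STFT via complex multiplication; the internal coupling between the two encoder paths described in the paper is not reproduced here.

```python
import torch

def apply_complex_mask(noisy_re: torch.Tensor, noisy_im: torch.Tensor,
                       mask_re: torch.Tensor, mask_im: torch.Tensor):
    """Enhance a noisy STFT with a predicted complex ratio mask.

    The complex product (a + bi)(c + di) mixes the real and imaginary
    paths, so the enhanced signal can correct phase as well as magnitude,
    which is the motivation for modeling both STFT components.
    """
    enhanced_re = noisy_re * mask_re - noisy_im * mask_im
    enhanced_im = noisy_re * mask_im + noisy_im * mask_re
    return enhanced_re, enhanced_im
```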
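Because PESQ itself is not differentiable, joint metric optimization is typically done through differentiable surrogates. The sketch below is an assumed stand-in rather than the paper's exact loss: it pairs a scale-invariant SDR term with a spectral-magnitude term as the perceptual proxy, with an illustrative weighting `alpha`.

```python
import torch

def si_sdr_loss(estimate: torch.Tensor, target: torch.Tensor,
                eps: float = 1e-8) -> torch.Tensor:
    """Negative scale-invariant SDR, a common differentiable SDR surrogate."""
    # Project the estimate onto the target to get the scaled reference.
    dot = torch.sum(estimate * target, dim=-1, keepdim=True)
    energy = torch.sum(target ** 2, dim=-1, keepdim=True) + eps
    s_target = dot / energy * target
    e_noise = estimate - s_target
    ratio = torch.sum(s_target ** 2, dim=-1) / (torch.sum(e_noise ** 2, dim=-1) + eps)
    return -10.0 * torch.log10(ratio + eps).mean()

def multitask_loss(est_wave, ref_wave, est_mag, ref_mag, alpha=0.5):
    """Weighted sum of a time-domain SDR term and a spectral term
    (the latter standing in for the PESQ-oriented objective)."""
    spectral = torch.mean((est_mag - ref_mag) ** 2)
    return alpha * si_sdr_loss(est_wave, ref_wave) + (1 - alpha) * spectral
```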
Implications and Future Directions
This paper's findings not only advance the state of the art in speech enhancement but also provide insights into adapting NLP models to non-textual data. The integration of Gaussian weighting illustrates a promising approach for tasks requiring localized context sensitivity. Future research could explore optimization techniques that stabilize the complex model's performance across different acoustic quality metrics and further refine attention mechanisms to improve robustness under varying noise levels.
The practical application of such models may significantly benefit real-world speech processing, from voice-assisted technologies in noisy environments to improving audio intelligibility in telecommunications and hearing aids. The theoretical framework established here provides groundwork for developing more adaptive sequence models aimed at diverse data contexts beyond text.