Attention is All You Need in Speech Separation (2010.13154v2)

Published 25 Oct 2020 in eess.AS, cs.LG, cs.SD, and eess.SP

Abstract: Recurrent Neural Networks (RNNs) have long been the dominant architecture in sequence-to-sequence learning. RNNs, however, are inherently sequential models that do not allow parallelization of their computations. Transformers are emerging as a natural alternative to standard RNNs, replacing recurrent computations with a multi-head attention mechanism. In this paper, we propose the SepFormer, a novel RNN-free Transformer-based neural network for speech separation. The SepFormer learns short and long-term dependencies with a multi-scale approach that employs transformers. The proposed model achieves state-of-the-art (SOTA) performance on the standard WSJ0-2/3mix datasets. It reaches an SI-SNRi of 22.3 dB on WSJ0-2mix and an SI-SNRi of 19.5 dB on WSJ0-3mix. The SepFormer inherits the parallelization advantages of Transformers and achieves a competitive performance even when downsampling the encoded representation by a factor of 8. It is thus significantly faster and it is less memory-demanding than the latest speech separation systems with comparable performance.

Authors (5)

Cem Subakan
Mirco Ravanelli
Samuele Cornell
Mirko Bronzi
Jianyuan Zhong

Summary

Analysis of "Attention Is All You Need in Speech Separation"

The paper "Attention Is All You Need in Speech Separation" introduces SepFormer, a novel Transformer-based model for speech separation that eliminates the need for recurrent neural networks (RNNs). This approach leverages the self-attention mechanism of Transformers to effectively capture dependencies in the audio signal, both locally and globally, advancing the field of audio source separation.

Model Architecture and Design

SepFormer uses a dual-path processing framework built from two Transformer units: the IntraTransformer, which models short-term dependencies, and the InterTransformer, which models long-term dependencies. Because the design contains no recurrence, it avoids the sequential bottleneck of RNNs and can exploit the parallelism of Transformer models to improve computational efficiency.
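The dual-path idea can be illustrated with a small chunking sketch. This is not the authors' implementation: the function names are hypothetical, and the 50% overlap and zero-padding are assumptions made for the illustration. It only shows which sequences an intra-chunk model and an inter-chunk model would each see.

```python
def chunk_with_overlap(frames, chunk_size):
    """Split a list of frames into chunks with 50% overlap (assumed), zero-padding the tail."""
    if chunk_size < 2:
        raise ValueError("chunk_size must be at least 2")
    hop = chunk_size // 2
    padded = list(frames)
    # Pad with zeros until the sequence is covered by whole chunks.
    while len(padded) < chunk_size or (len(padded) - chunk_size) % hop != 0:
        padded.append(0.0)
    return [padded[s:s + chunk_size]
            for s in range(0, len(padded) - chunk_size + 1, hop)]


def intra_views(chunks):
    """Sequences for the intra-chunk model: one per chunk (short-term context)."""
    return [list(c) for c in chunks]


def inter_views(chunks):
    """Sequences for the inter-chunk model: one per in-chunk position, spanning all chunks."""
    return [list(col) for col in zip(*chunks)]


frames = [float(i) for i in range(10)]
chunks = chunk_with_overlap(frames, chunk_size=4)
print(intra_views(chunks))  # each inner list stays within one chunk
print(inter_views(chunks))  # each inner list strides across all chunks
```

Each intra-chunk sequence is short and local, while each inter-chunk sequence samples the whole signal at a coarse stride, which is how the two units can split short-term and long-term modeling between them.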

The network has three components: the encoder, the masking network, and the decoder. The encoder is a convolutional layer that maps the input speech signal to a learned intermediate representation. The core of the masking network, the SepFormer block, replaces RNN-based modeling with Transformer layers whose multi-head self-attention connects distant segments of the audio directly. The decoder reconstructs the time-domain signals from the masked features.
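The multi-head self-attention mentioned above builds on scaled dot-product attention. The following is a minimal single-head sketch in plain Python, assuming identity query, key, and value projections (a real Transformer layer learns these projections and runs several heads in parallel). It is meant only to show why every frame can attend to every other frame in one step.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention.

    x is a list of T vectors of dimension d. Queries, keys, and values
    are all taken to be x itself (identity projections, an assumption).
    """
    d = len(x[0])
    out = []
    for q in x:
        # Similarity of this frame to every frame, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        # Output is a weighted average of all frames.
        out.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return out


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(features)
```

Because each output row depends only on the input, not on the previous output row, all rows can be computed in parallel, unlike the step-by-step updates of an RNN.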

Results and Performance Evaluation

SepFormer achieves state-of-the-art (SOTA) results on the standard WSJ0-2mix and WSJ0-3mix datasets, with an SI-SNRi of 22.3 dB on WSJ0-2mix and 19.5 dB on WSJ0-3mix. These scores exceed those of earlier models, including DPRNN, DPTNet, and Wavesplit, even though some of those systems rely on additional strategies such as exploiting speaker-identity information.
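For readers unfamiliar with the metric, SI-SNRi is the improvement in scale-invariant signal-to-noise ratio of the separated estimate over the unprocessed mixture. Below is a minimal sketch using the common zero-mean definition; the toy signals and the small `eps` stabilizer are assumptions for illustration and do not reproduce the paper's evaluation code.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference signal."""
    n = len(reference)
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    est = [v - mean_e for v in estimate]
    ref = [v - mean_r for v in reference]
    # Project the estimate onto the reference to obtain the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    target = [dot / ref_energy * r for r in ref]
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target) + eps
    noise_energy = sum(d * d for d in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy)


# Toy signals (hypothetical): one target source plus an interfering source.
reference = [math.sin(0.3 * i) for i in range(200)]
interferer = [math.sin(0.7 * i + 1.0) for i in range(200)]
mixture = [r + q for r, q in zip(reference, interferer)]
estimate = [r + 0.1 * q for r, q in zip(reference, interferer)]

# SI-SNRi: how much better the estimate is than simply using the mixture.
improvement = si_snr(estimate, reference) - si_snr(mixture, reference)
```

The projection step is what makes the metric insensitive to the overall gain of the estimate, so a model is not rewarded or penalized for output loudness.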

Beyond separation quality, SepFormer processes sequences in parallel, which allows faster training and inference with lower memory use. Despite a larger total parameter count (26M), SepFormer runs faster than its counterparts, chiefly because self-attention parallelizes well and because the encoded representation is downsampled; the abstract reports competitive performance even at a downsampling factor of 8.
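A rough count of attention score pairs shows why downsampling and dual-path chunking both help. The numbers here are illustrative assumptions rather than figures from the paper: a 4-second clip at 8 kHz, the factor-of-8 downsampling modeled as an encoder stride of 8, a chunk size of 250 frames, and 50% chunk overlap. The function names are hypothetical.

```python
def frames_after_encoder(num_samples, stride):
    """Approximate number of encoded frames for a given encoder stride."""
    return num_samples // stride


def attention_pairs_full(num_frames):
    """Score pairs if every frame attends to every other frame directly."""
    return num_frames ** 2


def attention_pairs_dual_path(num_frames, chunk_size):
    """Score pairs for intra-chunk plus inter-chunk attention (50% overlap assumed)."""
    hop = chunk_size // 2
    # Ceiling division to count chunks, with at least one chunk.
    num_chunks = max(1, -(-(num_frames - chunk_size) // hop) + 1)
    intra = num_chunks * chunk_size ** 2   # attention inside each chunk
    inter = chunk_size * num_chunks ** 2   # attention across chunks, per position
    return intra + inter


samples = 4 * 8000                        # 4 s at 8 kHz (assumed)
frames = frames_after_encoder(samples, stride=8)
full_cost = attention_pairs_full(frames)
dual_cost = attention_pairs_dual_path(frames, chunk_size=250)
no_downsampling_cost = attention_pairs_full(frames_after_encoder(samples, stride=1))
print(frames, full_cost, dual_cost, no_downsampling_cost)
```

Since full attention grows quadratically with sequence length, shortening the sequence by a factor of 8 cuts that cost by a factor of 64, and splitting the remaining sequence into chunks reduces it further.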

Implications and Future Directions

SepFormer has notable implications for speech processing. By demonstrating that an RNN-free architecture can achieve SOTA results, the paper calls into question the field's reliance on recurrent architectures and suggests that Transformer-based models may be competitive in other sequence-processing tasks beyond speech separation.

In practice, deploying SepFormer could reduce computational costs and make real-time speech separation systems more accessible. From a theoretical perspective, the work underscores the potential of attention mechanisms for modeling and decomposing sequential data.

Future work could further optimize the Transformer architecture, reduce its computational complexity, and improve its adaptability across audio domains. Given SepFormer's promising results, later studies may also incorporate other attention mechanisms or hybrid models that combine the strengths of RNNs and Transformers.

In conclusion, this paper marks a significant step forward in speech separation, making the case for a shift from recurrent models toward attention-centric approaches that are both powerful and computationally efficient.
