
Attention is All You Need in Speech Separation

Published 25 Oct 2020 in eess.AS, cs.LG, cs.SD, and eess.SP | (2010.13154v2)

Abstract: Recurrent Neural Networks (RNNs) have long been the dominant architecture in sequence-to-sequence learning. RNNs, however, are inherently sequential models that do not allow parallelization of their computations. Transformers are emerging as a natural alternative to standard RNNs, replacing recurrent computations with a multi-head attention mechanism. In this paper, we propose the SepFormer, a novel RNN-free Transformer-based neural network for speech separation. The SepFormer learns short and long-term dependencies with a multi-scale approach that employs transformers. The proposed model achieves state-of-the-art (SOTA) performance on the standard WSJ0-2/3mix datasets. It reaches an SI-SNRi of 22.3 dB on WSJ0-2mix and an SI-SNRi of 19.5 dB on WSJ0-3mix. The SepFormer inherits the parallelization advantages of Transformers and achieves a competitive performance even when downsampling the encoded representation by a factor of 8. It is thus significantly faster and it is less memory-demanding than the latest speech separation systems with comparable performance.

Citations (474)

Summary

  • The paper introduces SepFormer, a non-recurrent, attention-based model that uses Transformer blocks to replace RNNs for efficient speech separation.
  • It employs dual-path processing with IntraTransformer and InterTransformer blocks to capture both short and long-term dependencies in audio signals.
  • Experiments on WSJ0-2/3mix show SI-SNR improvements of 22.3 dB and 19.5 dB respectively, demonstrating significant computational efficiency gains.

Introduction

The paper "Attention is All You Need in Speech Separation" introduces the SepFormer, an innovative RNN-free architecture utilizing Transformers for speech separation tasks. RNNs have traditionally dominated sequence-to-sequence learning in speech separation, but their inherently sequential nature hinders parallel computation. The Transformer architecture, with its parallelizable multi-head attention mechanism, presents a promising alternative that circumvents these limitations. The SepFormer achieves state-of-the-art (SOTA) results on benchmarks like the WSJ0-2/3mix datasets and showcases the advantages of employing Transformers in audio signal processing.

Model Architecture

The SepFormer leverages a dual-path processing framework, where RNNs commonly used in systems like Dual-Path RNN (DPRNN) are replaced with Transformer blocks.

  1. Encoder: The model begins with a convolutional encoder that transforms the input time-domain signal x ∈ ℝ^T into an STFT-like representation h ∈ ℝ^{F × T'}.
  2. Masking Network: Detailed in Figure 1, the Masking Network processes the encoded representation to obtain masks for each speaker. It implements the dual-path strategy with an IntraTransformer (short-term dependency modeling) and an InterTransformer (long-term dependency modeling).

Figure 1: The high-level architecture of the SepFormer including the Encoder, Masking Network, and Decoder.

  3. SepFormer Block: This block models temporal dynamics by repeatedly alternating IntraTransformer and InterTransformer stages over overlapping chunks of the signal, which mitigates the quadratic scaling of self-attention with sequence length.

Figure 2: Detailed architecture of the SepFormer including IntraTransformer and InterTransformer blocks, which are pivotal for exploiting short- and long-term dependencies.

  4. Decoder: Reconstructs the separated time-domain signals using a transposed convolution operation, applying the masks generated by the Masking Network to the encoded input.
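The dual-path chunking at the heart of the Masking Network can be sketched in a few lines. The snippet below (an illustrative reconstruction, not the paper's code; channel counts and chunk length are made-up values) splits an encoded representation into half-overlapping chunks, producing the tensor over which the IntraTransformer attends within each chunk (axis of length K) and the InterTransformer attends across chunks (axis of length S):

```python
import numpy as np

def chunk(h, K):
    """Split a [F, T] encoded representation into half-overlapping
    chunks of length K, returning a [F, K, S] tensor, where S is the
    number of chunks. The time axis is zero-padded so every chunk is full."""
    F, T = h.shape
    hop = K // 2
    S = int(np.ceil(max(T - K, 0) / hop)) + 1
    T_pad = K + (S - 1) * hop
    h = np.pad(h, ((0, 0), (0, T_pad - T)))
    # IntraTransformer attends along axis 1 (length K, within a chunk);
    # InterTransformer attends along axis 2 (length S, across chunks).
    return np.stack([h[:, s * hop: s * hop + K] for s in range(S)], axis=-1)

h = np.random.randn(256, 1000)   # toy mixture: 256 channels, 1000 frames
c = chunk(h, K=250)
print(c.shape)                   # (256, 250, 7)
```

Because each attention operation now sees a sequence of length K or S rather than T', neither Transformer ever attends over the full (potentially very long) encoded signal.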

Experimentation and Results

The SepFormer was evaluated on the WSJ0-2mix and WSJ0-3mix datasets, where it achieved SI-SNR improvements of 22.3 dB and 19.5 dB, respectively, when trained with dynamic mixing. These results surpass many RNN-based systems, including competitors that also use data augmentation (e.g., Wavesplit with dynamic mixing).
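The reported metric, SI-SNR improvement (SI-SNRi), is the gain in scale-invariant SNR of the separated signal over the unprocessed mixture. A minimal sketch of the standard SI-SNR formulation (variable names and the toy signals are illustrative, not from the paper):

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB: project the estimate onto the
    reference, treat the residual as noise, and take the power ratio."""
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(e_noise, e_noise) + eps))

rng = np.random.default_rng(0)
s = rng.standard_normal(16000)                 # 1 s of toy "speech" at 16 kHz
noisy = s + 0.1 * rng.standard_normal(16000)   # 10% additive noise
print(si_snr(noisy, s))                        # about 20 dB
```

SI-SNRi for a separation system is then `si_snr(estimate, source) - si_snr(mixture, source)`; the scale invariance means rescaling the estimate leaves the score unchanged.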

Ablation Study

An analysis was performed to understand the effects of key hyperparameters, confirming that both the IntraTransformer and InterTransformer blocks contribute significantly to performance. The contributions of dynamic mixing, dual-path processing, and the number of attention heads were also examined to identify the best configuration.

Resource Efficiency

Figure 3: The training curves of SepFormer, DPRNN, and DPTNet on the WSJ0-2mix dataset, illustrating the superior speed of the SepFormer.

SepFormer benefits from the parallelism inherent in Transformer architectures, leading to faster training and inference while using less memory than traditional RNN-based frameworks such as DPRNN and DPTNet.
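The abstract also notes that performance stays competitive even when the encoded representation is downsampled by a factor of 8. Part of the efficiency story is that self-attention cost grows quadratically with sequence length, so both chunking and downsampling shrink the attention matrices dramatically. A back-of-the-envelope cost model (illustrative frame counts and chunk size, not the paper's exact configuration):

```python
def attn_entries(T, K):
    """Rough count of attention-score entries per dual-path layer:
    S chunks of length K give S*K^2 intra-chunk scores plus K*S^2
    inter-chunk scores, versus T^2 for full attention over T frames."""
    S = max(T // (K // 2), 1)      # half-overlapping chunks
    return S * K * K + K * S * S

T_full = 32000                     # frames from a fine-stride encoder (toy number)
T_down = T_full // 8               # same signal after 8x downsampling
for T in (T_full, T_down):
    print(T, attn_entries(T, K=250), T * T)
```

Even before downsampling, dual-path attention touches orders of magnitude fewer score entries than full attention over the whole encoded sequence; downsampling by 8 shrinks both terms further.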

Conclusion

The SepFormer model, by capitalizing on the strengths of Transformer architectures, mitigates the constraints associated with RNN-based models. It demonstrates that SOTA speech separation can be achieved with entirely non-recurrent, attention-focused methods, offering considerable computational advantages. Future research directions may explore refined Transformer architectures to unlock further performance improvements and operational efficiencies.
