- The paper introduces TF-Locoformer, which integrates convolutional layers with Transformers to enhance TF-domain speech separation while addressing RNN limitations.
- The model achieves impressive gains, reaching an SI-SNRi of up to 24.2 dB on datasets such as WSJ0-2mix and outperforming leading benchmarks.
- Innovations such as the ConvSwiGLU module and RMSGroupNorm provide efficient local feature extraction, while the Transformer backbone retains the parallelizable training that RNN-based models lack.
The paper "TF-Locoformer: Transformer with Local Modeling by Convolution for Speech Separation and Enhancement" introduces a novel model, TF-Locoformer, designed for time-frequency (TF) domain speech separation. The model leverages the capabilities of Transformer architectures combined with local modeling via convolution, addressing the limitations of recurrent neural networks (RNNs) in such applications. This essay provides a comprehensive overview of the contributions and findings detailed in the paper, contextualized within the field of speech processing.
Introduction and Background
Recent advancements in neural networks have significantly improved speech separation outcomes, particularly with the advent of time-domain audio separation networks (TasNets) and dual-path modeling frameworks. TF-domain dual-path models, which sequentially handle temporal and frequency aspects, have shown marked advantages, especially in reverberant environments where longer Fast Fourier Transform (FFT) windows are beneficial.
Despite these advancements, state-of-the-art (SoTA) TF-domain models have predominantly relied on RNN architectures, which, while effective in capturing local information, suffer from scalability and parallelization limitations during training. In contrast, Transformer-based models offer inherent parallelizability and scalability. However, they traditionally lack intrinsic local modeling capabilities, a gap which TF-Locoformer aims to bridge by incorporating convolutional layers for local feature extraction within a Transformer framework.
Model Architecture
TF-Locoformer introduces key innovations to the standard Transformer architecture to enhance its suitability for TF-domain speech separation. The primary components of this model are described as follows:
- ConvSwiGLU Module: This module replaces the standard feed-forward network (FFN) with a convolutional block that strengthens local modeling. It combines 1D convolution and deconvolution layers with a Swish-gated linear unit (SwiGLU) activation to capture local spectro-temporal patterns more effectively.
- Macaron-style Architecture: Inspired by the success of the Conformer model, this structure places an FFN both before and after the self-attention mechanism, increasing the capacity devoted to local feature extraction.
- RMSGroupNorm: This novel normalization layer divides the feature dimension into groups and normalizes each group by its root mean square. Normalizing groups independently helps disentangle information from different sources in the mixture, which benefits separation.
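To make the RMSGroupNorm idea concrete, here is a minimal NumPy sketch: the feature axis is split into groups and each group is divided by its root mean square, with no mean subtraction (RMS-style). The function name, argument layout, and epsilon value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rms_group_norm(x, num_groups, gamma=None, eps=1e-5):
    """Sketch of RMSGroupNorm: per-group RMS normalization.

    x: array whose last axis is the feature dimension.
    num_groups: number of groups to split the feature axis into.
    gamma: optional learnable per-feature scale (assumed shape (C,)).
    """
    *lead, C = x.shape
    assert C % num_groups == 0, "features must divide evenly into groups"
    # Split features into groups: (..., num_groups, C // num_groups)
    g = x.reshape(*lead, num_groups, C // num_groups)
    # Root mean square of each group (no mean subtraction, unlike GroupNorm)
    rms = np.sqrt(np.mean(g ** 2, axis=-1, keepdims=True) + eps)
    out = (g / rms).reshape(x.shape)
    if gamma is not None:
        out = out * gamma
    return out
```

Intuitively, with a single group this reduces to RMSNorm over the whole feature vector; using several groups lets features belonging to different sources be scaled independently.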
The model architecture alternates between temporal and frequency modeling using Transformer blocks enhanced with the ConvSwiGLU module.
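The components above can be sketched together in NumPy under stated assumptions: weight shapes, kernel sizes, and the pre-norm residual wiring are hypothetical choices for illustration, and the projection back to the input width uses a stride-1 convolution standing in for the paper's deconvolution (at stride 1 the two coincide up to kernel orientation).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv1d(x, w):
    """1-D convolution, stride 1, 'same' zero padding.
    x: (C_in, T), w: (C_out, C_in, K) -> (C_out, T)."""
    K = w.shape[-1]
    pad = K // 2
    xp = np.pad(x, ((0, 0), (pad, K - 1 - pad)))
    windows = sliding_window_view(xp, K, axis=-1)  # (C_in, T, K)
    return np.einsum('oik,itk->ot', w, windows)

def swish(z):
    return z / (1.0 + np.exp(-z))

def conv_swiglu_ffn(x, w_gate, w_val, w_out):
    """ConvSwiGLU sketch: a 1-D conv expands the channels, SwiGLU gating
    (swish(gate) * value) selects local patterns, and a second 1-D conv
    projects back to the input width."""
    h = swish(conv1d(x, w_gate)) * conv1d(x, w_val)
    return conv1d(h, w_out)

def macaron_block(x, ffn1, attn, ffn2, norm):
    """Macaron-style ordering: FFN -> self-attention -> FFN, each applied
    in a pre-norm residual branch. ffn1, ffn2, attn, norm are placeholder
    callables mapping (C, T) -> (C, T)."""
    x = x + ffn1(norm(x))
    x = x + attn(norm(x))
    x = x + ffn2(norm(x))
    return x
```

In the full model, such blocks are applied alternately along the time axis and the frequency axis of the TF representation, so local convolutional modeling complements the global receptive field of self-attention in both directions.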
Experimental Results
TF-Locoformer was rigorously tested on several datasets, including WSJ0-2mix, Libri2Mix, WHAMR!, and the DNS2020 dataset, covering both speech separation and enhancement tasks.
On the WSJ0-2mix dataset, the proposed model achieved superior results compared to various SoTA methods:
- TF-Locoformer (Large) attained an SI-SNRi of 24.2 dB and an SDRi of 24.3 dB, outperforming SepTDA2 and other leading models.
Further evaluations on the Libri2Mix and DNS2020 datasets affirmed the model's robustness and scalability:
- On Libri2Mix, TF-Locoformer (Medium) recorded an SI-SNRi of 22.1 dB, surpassing MossFormer2.
- For the DNS2020 task, TF-Locoformer (Medium) yielded a notable SI-SNR improvement of 23.3 dB, alongside leading STOI and PESQ-WB scores.
The model's efficacy in reverberant conditions was particularly highlighted by its performance on the WHAMR! dataset:
- TF-Locoformer (Medium) achieved an SI-SNRi of 18.5 dB, indicating its superior handling of reverberation compared to both RNN-based and other Transformer-based models.
Ablation Studies
The paper reports multiple ablation experiments validating its design choices. ConvSwiGLU and the macaron architecture were shown to significantly enhance model performance, underscoring the critical role of effective local modeling. Additionally, the novel RMSGroupNorm consistently outperformed traditional normalization techniques across various model configurations.
Implications and Future Directions
TF-Locoformer sets a new benchmark in TF-domain speech separation by combining the strengths of convolutional and Transformer-based architectures. The model's demonstrated capabilities suggest significant potential for application in real-world scenarios where TF-domain characteristics are prominent.
Moving forward, future research could explore:
- Further scaling of TF-Locoformer to leverage larger datasets and more complex separation tasks.
- Extending the model's application to domains such as music and general sound separation.
- Investigating the model’s potential in low-resource environments and its adaptability to different audio processing contexts.
In summary, TF-Locoformer's introduction represents a significant advancement in the domain of speech separation and enhancement, offering a scalable, parallelizable, and high-performing alternative to traditional RNN-based models.