Conformer: Convolution-augmented Transformer for Speech Recognition
The paper "Conformer: Convolution-augmented Transformer for Speech Recognition," authored by Anmol Gulati et al., presents an architecture designed to improve the accuracy and efficiency of automatic speech recognition (ASR). The Conformer model combines the strengths of Convolutional Neural Networks (CNNs) and Transformers, capturing both local and global dependencies in audio sequences.
Introduction
The paper situates its contribution within the marked improvements in ASR attributed to neural networks, specifically Recurrent Neural Networks (RNNs), CNNs, and Transformers. While RNNs have traditionally been effective at modeling temporal dependencies, Transformers have gained popularity due to their ability to capture long-range dependencies and their training efficiency. CNNs, in turn, excel at learning local features progressively through local receptive fields.
However, each architecture in isolation has limitations: Transformers can struggle to extract fine-grained local feature patterns, whereas CNNs may require many layers or parameters to model global context effectively. This paper proposes the Conformer model to combine the benefits of both architectures, thereby achieving parameter-efficient modeling of audio sequences.
Model Architecture
The Conformer is a convolution-augmented Transformer integrating convolutions with self-attention in a novel configuration:
- Multi-headed Self-Attention (MHSA): The MHSA module uses the relative sinusoidal positional encoding scheme from Transformer-XL, which helps the model generalize to inputs of varying length.
- Convolution Module: Placed after the MHSA, this module consists of a pointwise convolution with a GLU activation, a 1-D depthwise convolution, batch normalization, a Swish activation, and a final pointwise convolution (sketched after this list).
- Feed Forward Modules: The block sandwiches the MHSA and convolution modules between two macaron-like feed forward modules connected through half-step residual connections.
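To make the convolution module concrete, below is a minimal PyTorch sketch of the layer ordering just described. The class name ConvolutionModule, the model width, and the dropout rate are illustrative assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn


class ConvolutionModule(nn.Module):
    """Sketch of the Conformer convolution module: pointwise conv + GLU,
    1-D depthwise conv, BatchNorm, Swish, and a final pointwise projection.
    Layer sizes are illustrative, not the paper's exact configuration."""

    def __init__(self, d_model: int = 256, kernel_size: int = 32, dropout: float = 0.1):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)
        # Pointwise convolution doubles the channels so GLU can halve them again.
        self.pointwise_in = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)
        # Depthwise convolution mixes information along the time axis only.
        self.depthwise = nn.Conv1d(
            d_model, d_model, kernel_size, padding=kernel_size // 2, groups=d_model
        )
        self.batch_norm = nn.BatchNorm1d(d_model)
        self.swish = nn.SiLU()  # Swish activation
        self.pointwise_out = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        residual = x
        x = self.layer_norm(x).transpose(1, 2)          # -> (batch, d_model, time)
        x = self.glu(self.pointwise_in(x))
        x = self.swish(self.batch_norm(self.depthwise(x)))
        x = self.dropout(self.pointwise_out(x)).transpose(1, 2)
        # Trim any extra frame introduced by an even kernel size so shapes match.
        x = x[:, : residual.size(1), :]
        return residual + x
```

Because the depthwise convolution operates on each channel independently and only along time, it provides cheap local-context modeling, which is the role the convolution module plays inside the block.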
This combination of modules enables efficient learning of both local and global features, as illustrated by the architecture diagram in the paper; a simplified block-level sketch follows below.
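Building on the ConvolutionModule sketch above, a simplified Conformer block might be assembled as follows. Two assumptions are worth stating: a plain nn.MultiheadAttention stands in for the paper's relative-positional-encoding attention, and all widths, head counts, and dropout rates are illustrative:

```python
import torch
import torch.nn as nn


class FeedForwardModule(nn.Module):
    """Macaron-style feed-forward module with Swish activation (illustrative sizes)."""

    def __init__(self, d_model: int = 256, expansion: int = 4, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, expansion * d_model),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(expansion * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class ConformerBlock(nn.Module):
    """One Conformer block: FFN (half-step) -> MHSA -> Conv module -> FFN (half-step) -> LayerNorm.
    NOTE: plain nn.MultiheadAttention is used for brevity; the paper uses MHSA with
    Transformer-XL-style relative positional encoding."""

    def __init__(self, d_model: int = 256, n_heads: int = 4,
                 kernel_size: int = 32, dropout: float = 0.1):
        super().__init__()
        self.ffn1 = FeedForwardModule(d_model, dropout=dropout)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.conv = ConvolutionModule(d_model, kernel_size, dropout)
        self.ffn2 = FeedForwardModule(d_model, dropout=dropout)
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Half-step residual around the first feed-forward module.
        x = x + 0.5 * self.ffn1(x)
        # Multi-headed self-attention with a pre-norm residual connection.
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        # Convolution module (its residual connection lives inside the module).
        x = self.conv(x)
        # Half-step residual around the second feed-forward module.
        x = x + 0.5 * self.ffn2(x)
        return self.final_norm(x)


if __name__ == "__main__":
    block = ConformerBlock(d_model=256, n_heads=4, kernel_size=32)
    feats = torch.randn(8, 200, 256)   # (batch, frames, d_model)
    print(block(feats).shape)          # torch.Size([8, 200, 256])
```

In the paper, a stack of such blocks, preceded by a convolutional subsampling front-end, forms the Conformer encoder.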
Experimental Results
The Conformer model is evaluated on the LibriSpeech dataset, demonstrating substantial improvements over existing models:
- State-of-the-art Performance: The large Conformer model (118.8M parameters) achieves WERs of 2.1%/4.3% without a language model and 1.9%/3.9% with an external language model on the test-clean/test-other splits, surpassing the previous best Transformer Transducer results.
- Parameter Efficiency: Even the small Conformer model (10.3M parameters) outperforms similarly sized models such as ContextNet(S), achieving WERs of 2.7%/6.3% on test-clean/test-other without a language model.
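For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The short function below is an illustrative reference implementation, not code from the paper:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


# Example: word_error_rate("the cat sat", "the cat sit") == 1/3
```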
Ablation Studies
The authors conducted comprehensive ablation studies to disentangle the importance of individual components within the Conformer:
- Convolution Modules: When the convolution block is removed or replaced with lightweight convolutions, performance notably degrades, underscoring its critical role.
- Macaron-style FFNs: Unlike a single FFN, the macaron-like FFN pair with half-step residuals provides a significant performance boost.
- Attention Heads and Kernel Sizes: Increasing the number of attention heads up to 16 and tuning the depthwise convolution kernel size (32 performed best) both improve accuracy; an illustrative kernel-size sweep is shown after this list.
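The kernel-size ablation can be pictured as a sweep over otherwise-identical blocks. The snippet below reuses the ConformerBlock sketch from earlier and only illustrates how the configuration would vary; the training and evaluation loop needed to reproduce the reported WERs is omitted:

```python
# Hypothetical sweep over depthwise-convolution kernel sizes, mirroring the ablation
# idea: build otherwise-identical Conformer blocks and compare them after training.
for kernel_size in (7, 17, 32, 65):
    block = ConformerBlock(d_model=256, n_heads=4, kernel_size=kernel_size)
    params = sum(p.numel() for p in block.parameters())
    print(f"kernel_size={kernel_size}: {params / 1e6:.2f}M parameters per block")
```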
Practical and Theoretical Implications
The integration of convolution modules within the Transformer framework offers a balanced trade-off between capturing local and global dependencies, thus fostering more efficient and accurate ASR systems. Practically, this means enhanced speech recognition capabilities in various real-world applications—from virtual assistants to transcription services.
Theoretically, the Conformer architecture paves the way for further explorations into hybrid models combining different neural network components. Future research may involve scaling and refining this model, optimizing components further for even more parameter-efficient designs.
In summary, the Conformer model represents a significant advancement in ASR development, amalgamating the strengths of CNNs and Transformers in a singular, robust architecture. It effectively sets new performance benchmarks on standard datasets, underscoring the viability of hybrid approaches in neural network design.