E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition
The development of E-Branchformer marks a significant contribution to the field of automatic speech recognition (ASR) by enhancing the merging techniques of the existing Branchformer architecture. The paper presents E-Branchformer as a successor to the Branchformer, integrating an effective merging method and additional point-wise modules, achieving state-of-the-art performance on the LibriSpeech dataset without relying on external training data.
Background and Motivation
The Conformer model, which sequentially combines convolution and self-attention to capture both local and global dependencies, has been recognized for its superior performance in ASR tasks, and its sequential design has proven difficult for other architectures to match. Branchformer takes a different approach: it computes convolution (via a convolutional gating MLP) and self-attention in parallel branches and merges their outputs. E-Branchformer advances this approach by optimizing the merging step itself to improve ASR performance.
E-Branchformer Architecture
The architecture of E-Branchformer is built upon the strengths of combining local and global information. The key advancements include:
- Enhanced Merging Module: Instead of merging by concatenation and a linear projection alone, E-Branchformer applies a depth-wise convolution to the concatenated local and global branch outputs before the projection, so that adjacent time steps are incorporated into the merge.
- Experimental Study of Merge Techniques: Several merging variants are explored, including depth-wise convolutions with different kernel sizes and squeeze-and-excitation (SE) blocks. The findings suggest that a depth-wise convolution with a well-chosen kernel size yields the largest accuracy gains.
- Integration with Point-wise Feed-Forward Networks (FFNs): The model revisits the use of FFNs, placing them around the two-branch module in each block to improve capacity and performance. This is especially effective when employing a macaron-style pair of FFNs with a reduced intermediate dimension.
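To make the enhanced merging step concrete, here is a minimal NumPy sketch of the idea described above: concatenate the two branch outputs, apply a depth-wise (per-channel) convolution along time, and project back to the model dimension. The shapes, the residual connection around the convolution, and all parameter names are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def depthwise_conv1d(x, kernels):
    """Per-channel 1-D convolution along time with same padding (odd K assumed).

    x:       (T, C) sequence of features
    kernels: (C, K) one kernel per channel (depth-wise, no channel mixing)
    """
    T, C = x.shape
    K = kernels.shape[1]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty_like(x)
    for t in range(T):
        # out[t, c] = sum_k xp[t + k, c] * kernels[c, k]
        out[t] = np.einsum('kc,ck->c', xp[t:t + K], kernels)
    return out

def merge_branches(local_out, global_out, dw_kernels, w_proj):
    """Enhanced merge: concat -> depth-wise conv (residual assumed) -> projection.

    local_out, global_out: (T, d) outputs of the local (cgMLP) and global
                           (self-attention) branches
    dw_kernels:            (2d, K) depth-wise conv weights (illustrative)
    w_proj:                (2d, d) linear projection back to model dimension
    """
    z = np.concatenate([local_out, global_out], axis=-1)  # (T, 2d)
    z = z + depthwise_conv1d(z, dw_kernels)               # residual is an assumption
    return z @ w_proj                                     # (T, d)
```

In the original Branchformer, the merge is just the concatenation followed by `w_proj`; the depth-wise convolution is the added ingredient that lets the merge see neighboring frames.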
Experimental Evaluation
E-Branchformer has been rigorously evaluated on the LibriSpeech dataset. In large configurations without an external language model (LM), it surpasses previous benchmarks with word error rates (WERs) of 1.85% on the test-clean set and 3.71% on the test-other set. The architecture also leverages internal language model estimation (ILME) to improve shallow fusion with an external LM, culminating in a new state-of-the-art performance on both test sets.
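The ILME-based shallow fusion mentioned above can be sketched as a per-hypothesis scoring rule during beam search: the estimated internal LM score of the end-to-end model is subtracted before the external LM score is added, reducing double-counting of language-model information. The weight values below are illustrative assumptions; in practice they are tuned on development data.

```python
def ilme_fused_score(asr_logp, ext_lm_logp, ilm_logp,
                     lm_weight=0.6, ilm_weight=0.4):
    """Shallow fusion score with internal LM estimation (ILME).

    asr_logp:    log-probability of the hypothesis under the end-to-end model
    ext_lm_logp: log-probability under the external LM
    ilm_logp:    estimated internal LM log-probability of the end-to-end model
    Weights are illustrative, not the paper's tuned values.
    """
    return asr_logp + lm_weight * ext_lm_logp - ilm_weight * ilm_logp
```

With `ilm_weight=0`, this reduces to ordinary shallow fusion.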
Implications and Future Directions
The superior performance of E-Branchformer highlights its potential for broader applications in ASR systems, suggesting possible adaptation in Transducer models and self-supervised learning frameworks. Future research may explore the application of E-Branchformer to other speech domains, including speech enhancement and spoken language understanding. This work opens avenues for optimizing ASR models with deeper integration of convolutional and attention mechanisms in hybrid configurations.
Conclusion
E-Branchformer advances the field of ASR by demonstrating an effective merging mechanism that capitalizes on both local and global contexts through convolution and self-attention integration. This research not only presents clear empirical advancements but also offers insightful directions for developing future ASR architectures.