
A Comparative Study on Transformer vs RNN in Speech Applications (1909.06317v2)

Published 13 Sep 2019 in cs.CL, cs.SD, and eess.AS

Abstract: Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). This paper focuses on an emergent sequence-to-sequence model called Transformer, which achieves state-of-the-art performance in neural machine translation and other natural language processing applications. We undertook intensive studies in which we experimentally compared and analyzed Transformer and conventional recurrent neural networks (RNN) in a total of 15 ASR, one multilingual ASR, one ST, and two TTS benchmarks. Our experiments revealed various training tips and significant performance benefits obtained with Transformer for each task including the surprising superiority of Transformer in 13/15 ASR benchmarks in comparison with RNN. We are preparing to release Kaldi-style reproducible recipes using open source and publicly available datasets for all the ASR, ST, and TTS tasks for the community to succeed our exciting outcomes.

A Comparative Study on Transformer vs RNN in Speech Applications

This paper, authored by Shigeki Karita et al., systematically examines the performance of Transformer architectures relative to recurrent neural networks (RNNs) across various speech processing tasks, namely automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). The authors conduct rigorous experimental investigations over multiple datasets to assess how Transformer models compare with traditional RNN-based approaches.
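To make the architectural contrast concrete, the following is a minimal PyTorch sketch of the two encoder families being compared, applied to a batch of acoustic feature frames. This is not the paper's ESPnet implementation; the feature dimension, layer counts, and other hyperparameters are illustrative assumptions, and positional encoding is omitted for brevity.

# Minimal sketch (not the paper's ESPnet code): an RNN-style encoder vs a
# Transformer-style encoder over log-mel-like features. Shapes are assumptions.
import torch
import torch.nn as nn

batch, frames, feat_dim, d_model = 4, 200, 80, 256
x = torch.randn(batch, frames, feat_dim)   # (B, T, feat_dim) acoustic features

# RNN-style encoder: projection followed by a bidirectional LSTM.
rnn_proj = nn.Linear(feat_dim, d_model)
rnn_enc = nn.LSTM(d_model, d_model // 2, num_layers=4,
                  bidirectional=True, batch_first=True)
rnn_out, _ = rnn_enc(rnn_proj(x))          # (B, T, d_model)

# Transformer-style encoder: the same projection followed by self-attention layers.
trf_proj = nn.Linear(feat_dim, d_model)
trf_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                        dim_feedforward=1024, batch_first=True)
trf_enc = nn.TransformerEncoder(trf_layer, num_layers=6)
trf_out = trf_enc(trf_proj(x))             # (B, T, d_model)

print(rnn_out.shape, trf_out.shape)

The practical difference explored in the paper is not the output shape but how each encoder models context: the LSTM processes frames sequentially, while self-attention relates all frame pairs in parallel.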

Methodology and Datasets

The paper evaluates both architectures on distinct benchmarks: 15 ASR datasets covering diverse languages and complexities, one multilingual ASR setup, one ST dataset, and two TTS datasets. This comprehensive comparison is executed using Kaldi-style reproducible recipes, ensuring transparency and replicability. The experiments are performed using public datasets, further enhancing their utility for community-wide validation and exploration.

Key Observations and Numerical Results

The empirical findings demonstrate a clear performance advantage for Transformers in speech-related tasks, particularly ASR: Transformers outperform RNNs on 13 of the 15 ASR benchmark datasets. For instance, on LibriSpeech the Transformer model achieves a word error rate (WER) of 2.2% on the dev-clean set, surpassing previously reported RNN-based and HMM-DNN-based results.
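For reference, WER is the word-level edit distance between a hypothesis and its reference, normalized by the reference length. A minimal sketch of the metric follows; the example strings are made up, not drawn from the paper's LibriSpeech outputs.

# Minimal word error rate (WER) sketch: Levenshtein distance over words,
# normalized by the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167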

Similarly, Transformers exhibit superior performance in multilingual ASR settings, improving the character error rates across multiple languages when compared to both baseline and advanced RNN configurations. In the ST task evaluated on the Fisher-CALLHOME corpus, Transformers achieve a BLEU score improvement over RNNs, reflecting their potential in translation applications. Although TTS results show comparable performance between the two architectures, Transformers still display advantages in certain configurations.
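BLEU, the translation metric cited above, is a corpus-level n-gram precision score. The following hedged sketch shows one common way to compute it, assuming the third-party sacrebleu package; it is not the paper's evaluation pipeline, and the sentences are invented for illustration.

# Hedged BLEU scoring sketch using the sacrebleu package (assumption: installed via pip).
import sacrebleu

hypotheses = ["the meeting starts at nine"]
references = [["the meeting starts at nine o'clock"]]  # one reference stream aligned with the hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")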

Technical Insights and Practical Implications

The paper distills several training practices critical for making Transformer models work well on speech tasks. Chief among them is using large minibatches, which both mitigates underfitting and accelerates multi-GPU training without sacrificing accuracy, together with tuned dropout rates to prevent overfitting. These tips improve the practical viability of training Transformers in resource-intensive environments.
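The sketch below illustrates these tips in PyTorch: a large effective minibatch emulated via gradient accumulation, explicit dropout, and a warmup-then-decay (Noam-style) learning rate schedule of the kind commonly used for Transformer ASR training. The model, loss, warmup length, and accumulation factor are illustrative placeholders, not the paper's exact ESPnet settings.

# Hedged training-loop sketch: gradient accumulation, dropout, and Noam-style warmup.
import torch
import torch.nn as nn

d_model, warmup, accum_steps = 256, 25000, 8

model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, dropout=0.1,
                               batch_first=True),
    num_layers=6)
optimizer = torch.optim.Adam(model.parameters(), lr=1.0,
                             betas=(0.9, 0.98), eps=1e-9)

def noam_lr(step: int) -> float:
    # Noam schedule: linear warmup, then inverse-square-root decay.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, noam_lr)

def train_step(small_batches):
    # Accumulate gradients over several small batches to emulate the large
    # effective minibatch that stabilizes and speeds up Transformer training.
    optimizer.zero_grad()
    for x in small_batches:
        out = model(x)                           # (B, T, d_model)
        loss = out.pow(2).mean() / accum_steps   # placeholder loss for illustration only
        loss.backward()
    optimizer.step()
    scheduler.step()

train_step([torch.randn(4, 100, d_model) for _ in range(accum_steps)])

Gradient accumulation is one of several ways to realize the large-minibatch recommendation when GPU memory is limited; multi-GPU data parallelism achieves the same effective batch size more directly.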

One limitation noted is the slower decoding time associated with Transformers, attributed to their computational complexity. This aspect necessitates the development of more efficient decoding algorithms to match the runtime performance of RNNs and HMM-based systems.
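The asymmetry comes from autoregressive decoding: an RNN advances a fixed-size state per emitted token, whereas a Transformer decoder must attend over all previously emitted tokens at every step, even with key/value caching. The toy sketch below counts these operations schematically; the numbers are not measured decoding times from the paper.

# Schematic cost comparison for decoding a length-T hypothesis.
def rnn_decode_cost(T: int) -> int:
    # One fixed-size state update per emitted token: O(T) total.
    return sum(1 for _ in range(T))

def transformer_decode_cost(T: int) -> int:
    # At step t, self-attention looks at the t cached previous positions: O(T^2) total.
    return sum(t for t in range(1, T + 1))

for T in (10, 100, 1000):
    print(T, rnn_decode_cost(T), transformer_decode_cost(T))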

Future Directions

This research invites further exploration into optimizing Transformers for TTS, where training speed and output quality warrant closer attention. Additionally, integrating robust ASR techniques such as data augmentation and speech enhancement into TTS workflows could further narrow the remaining gaps.

Conclusion

The results and insights presented in this paper underscore the potential of Transformers to advance speech technology. The detailed comparative analysis and the shared reproducible recipes provide a foundation for future research on neural architectures across diverse speech applications, and the authors invite the community to build on these findings for large-scale, multilingual, and versatile speech processing systems.

Authors (13)
  1. Shigeki Karita (15 papers)
  2. Nanxin Chen (30 papers)
  3. Tomoki Hayashi (42 papers)
  4. Takaaki Hori (41 papers)
  5. Hirofumi Inaguma (42 papers)
  6. Ziyan Jiang (16 papers)
  7. Masao Someki (7 papers)
  8. Nelson Enrique Yalta Soplin (3 papers)
  9. Ryuichi Yamamoto (34 papers)
  10. Xiaofei Wang (138 papers)
  11. Shinji Watanabe (416 papers)
  12. Takenori Yoshimura (6 papers)
  13. Wangyou Zhang (35 papers)
Citations (689)