
U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF (2404.16407v2)

Published 25 Apr 2024 in cs.CL and eess.AS

Abstract: Scale has opened new frontiers in natural language processing, but at a high cost. In response, Mixture-of-Experts (MoE) models, which learn to activate only a subset of parameters during training and inference, have been proposed as an energy-efficient path to even larger and more capable LLMs, and this shift towards a new generation of foundation models is gaining momentum, particularly within the field of Automatic Speech Recognition (ASR). Recent works that incorporate MoE into ASR models have complex designs, such as routing frames via a supplementary embedding network, improving multilingual ability for the experts, and utilizing dedicated auxiliary losses for either expert load balancing or specific language handling. We found that such delicate designs are not necessary; an embarrassingly simple substitution of MoE layers for all Feed-Forward Network (FFN) layers is sufficient for the ASR task. More specifically, we benchmark the proposed model on a large-scale in-house dataset (160k hours); the results show that we can scale our baseline Conformer (Dense-225M) to its MoE counterpart (MoE-1B) and achieve Dense-1B-level Word Error Rate (WER) while maintaining a Dense-225M-level Real Time Factor (RTF). Furthermore, by applying the Unified 2-pass framework with bidirectional attention decoders (U2++), we achieve streaming and non-streaming decoding modes in a single MoE-based model, which we call U2++ MoE. We hope that our study can facilitate research on scaling speech foundation models without sacrificing deployment efficiency.

Simplified Integration of Mixture-of-Experts in ASR Models Achieves High Efficiency with Scaled Performance

Introduction

The evolution of neural network architectures for Automatic Speech Recognition (ASR) has consistently aimed at enhancing performance while addressing computational and efficiency challenges. Recent innovations have incorporated Mixture-of-Experts (MoE) to manage the computational demands of scaling models. This research explores a straightforward approach: replacing the traditional Feed-Forward Network (FFN) layers in both the encoder and decoder of a Conformer-based ASR model with MoE layers. Experiments on a substantial benchmark dataset totaling 160,000 hours demonstrate that this integration not only simplifies the model architecture relative to prior MoE designs but also maintains high efficiency without compromising accuracy.

Model Architecture and Methodology

The encoder is built on the Conformer architecture, while the decoder uses the Transformer architecture. Each conventional FFN within these structures is replaced by an MoE layer consisting of multiple expert FFNs governed by a routing mechanism, which leverages sparsity for computational savings while keeping model capacity high. The U2++ framework, known for handling both streaming and non-streaming operation, underpins the proposed model and permits the training strategy to be adjusted dynamically to align with either mode.
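
To make the substitution concrete, below is a minimal, illustrative MoE layer in PyTorch: a linear router scores each frame and dispatches it to its top-k expert FFNs, whose outputs are combined with renormalized routing weights. The hidden sizes, number of experts, and k are placeholders rather than the paper's configuration, and a production implementation would gather only the frames routed to each expert instead of evaluating every expert densely as done here for clarity.

```python
import torch
import torch.nn as nn


class MoELayer(nn.Module):
    """Illustrative drop-in replacement for an FFN block in a Conformer/Transformer layer."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x):                                    # x: (batch, time, d_model)
        scores = torch.softmax(self.router(x), dim=-1)       # routing probabilities (B, T, E)
        topk_w, topk_idx = scores.topk(self.k, dim=-1)       # keep the k best experts per frame
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)   # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            selected = (topk_idx == e)                        # frames routed to expert e (B, T, k)
            if selected.any():
                weight = (topk_w * selected).sum(-1, keepdim=True)  # per-frame weight, 0 if unrouted
                out = out + weight * expert(x)
        return out


# Quick shape check: a batch of 2 utterances, 50 frames, 512-dim features.
y = MoELayer()(torch.randn(2, 50, 512))    # y has shape (2, 50, 512)
```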

  • Encoder and Decoder Modification: All FFN layers are substituted with MoE layers, each consisting of a routing network and several expert FFNs.
  • Training Losses: Training uses a combined loss comprising the Connectionist Temporal Classification (CTC) loss and the Attention-based Encoder-Decoder (AED) loss, without any auxiliary losses for load balancing or expert routing.
  • Dynamic Chunk Masking: For streaming capability, a dynamic chunk masking strategy lets the model handle variable chunk sizes, enabling both streaming and non-streaming operation seamlessly (a minimal sketch follows this list).
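
The last two bullets can be made concrete. The training objective is the usual hybrid form L = λ·L_CTC + (1 − λ)·L_AED, where in U2++-style models the AED term itself combines a left-to-right and a right-to-left attention decoder, and notably there is no load-balancing or routing loss. The snippet below sketches one plausible form of dynamic chunk masking: a chunk-based attention mask in which each frame attends to its own chunk and everything to its left, with the chunk size drawn per utterance so that a full-context draw reproduces non-streaming training. The sampling range and probability are illustrative, not taken from the paper.

```python
import torch


def dynamic_chunk_mask(num_frames, max_chunk=25, full_context_prob=0.5):
    """Chunk-based attention mask: mask[i, j] is True if frame i may attend to frame j.

    With probability `full_context_prob` the whole utterance is a single chunk,
    reproducing full-context (non-streaming) training; otherwise a random chunk
    size is drawn so the same weights also learn to decode with limited right
    context (streaming). All constants here are placeholders.
    """
    if torch.rand(1).item() < full_context_prob:
        chunk = num_frames                                     # one chunk = full context
    else:
        chunk = int(torch.randint(1, max_chunk + 1, (1,)).item())

    chunk_id = torch.arange(num_frames) // chunk               # chunk index of each frame
    # Frame i sees every frame whose chunk index is not to its right.
    return chunk_id.unsqueeze(1) >= chunk_id.unsqueeze(0)


mask = dynamic_chunk_mask(100)    # (100, 100) boolean attention mask
```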

Experimental Setup and Results

The experiments were conducted on a large-scale dataset, predominantly Mandarin with a small portion of English, and the results were benchmarked against Dense-225M and Dense-1B models. The MoE-1B model achieved a Word Error Rate (WER) comparable to the Dense-1B model while preserving the real-time efficiency of the Dense-225M setup under similar computational conditions.

  • WER and Model Efficiency: The MoE-1B model achieves a WER comparable to the Dense-1B model while being significantly more computationally efficient, combining the benefits of scaled performance with practical deployability.
  • Inference Efficiency: In terms of Real-Time Factor (RTF), the MoE-1B model essentially matches the Dense-225M model despite having a parameter count close to that of the Dense-1B model, highlighting the efficiency of the MoE integration (a brief RTF example follows this list).
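
For reference, RTF is simply the time spent decoding divided by the duration of the audio being decoded, so an RTF below 1 means faster than real time. The helper and numbers below are hypothetical and only illustrate the arithmetic.

```python
def real_time_factor(decode_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; lower is faster."""
    return decode_seconds / audio_seconds


# Hypothetical example: decoding one hour of audio in six minutes.
print(real_time_factor(360.0, 3600.0))    # 0.1, i.e. 10x faster than real time
```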

Discussion on Streaming Abilities

A noteworthy aspect of this work is extending MoE integration to support streaming capabilities, a challenge often encountered with large-scale models. By employing a two-stage training approach that first establishes a robust non-streaming base before transitioning to a streaming-compatible configuration, the U2++ MoE successfully supports real-time ASR processing demands without degrading performance.
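
One way to picture this recipe is a training schedule that first uses full-context attention only and later mixes in randomly sampled chunk sizes, so that a single set of weights serves both decoding modes. The outline below reuses the hypothetical dynamic_chunk_mask helper sketched earlier; the model, batch, and optimizer objects, the step counts, and the mixing probability are all assumptions for illustration rather than the paper's settings.

```python
def train_two_stage(model, batches, optimizer,
                    stage1_steps=100_000, stage2_steps=50_000):
    """Illustrative two-stage schedule: non-streaming base first, then streaming-compatible."""
    for step, batch in enumerate(batches):
        if step < stage1_steps:
            # Stage 1: full-context batches only, building a strong non-streaming base.
            mask = dynamic_chunk_mask(batch["num_frames"], full_context_prob=1.0)
        elif step < stage1_steps + stage2_steps:
            # Stage 2: mix full-context and random-chunk batches so the same
            # weights also support streaming decoding.
            mask = dynamic_chunk_mask(batch["num_frames"], full_context_prob=0.5)
        else:
            break
        loss = model(batch, attn_mask=mask)    # assumed to return the combined CTC/AED loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```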

Future Implications and Developments

This research lays foundational work for further exploration into simple yet effective scaling strategies for ASR systems, particularly in how MoE layers can be utilized across different neural network architectures beyond Conformers. The findings encourage the pursuit of MoE models that prioritize not just performance but also operational efficiency and flexibility across different deployment scenarios, possibly extending beyond speech recognition into other domains of AI that require large-scale modeling capabilities.

Authors (8)
  1. Xingchen Song
  2. Di Wu
  3. Binbin Zhang
  4. Dinghao Zhou
  5. Zhendong Peng
  6. Bo Dang
  7. Fuping Pan
  8. Chao Yang