Enhancing Code-Switching Speech Recognition with LID-Based Collaborative Mixture of Experts Model (2409.02050v2)

Published 3 Sep 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Code-switching speech recognition is a formidable challenge because of the inherent difficulty of modeling phonetic similarities across languages. This study proposes Collaborative-MoE, a Mixture of Experts (MoE) model that leverages a collaborative mechanism among expert groups. First, a preceding routing network explicitly learns a Language Identification (LID) task and selects experts based on the acquired LID weights. This provides robust routing information to the MoE layer and mitigates interference from diverse language domains on expert network parameter updates. The LID weights are also employed to facilitate inter-group collaboration, enabling the integration of language-specific representations. Furthermore, within each language expert group, a gating network operates unsupervised to foster collaboration on attributes beyond language. Extensive experiments demonstrate the efficacy of the approach, achieving significant performance gains over alternative methods. Importantly, the method preserves the efficient inference characteristic of MoE models without requiring additional pre-training.
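The routing scheme the abstract describes can be sketched as a single forward pass: an LID routing network assigns language-group weights to each frame, an unsupervised gate mixes experts within each language group, and the LID weights then combine the group outputs. The sketch below is a minimal NumPy illustration under assumed shapes and simple linear experts; all names (`W_lid`, `W_gate`, `W_exp`, the two-group Mandarin/English setup) are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                  # feature dimension (assumed)
n_groups = 2           # language expert groups, e.g. Mandarin / English
experts_per_group = 2  # experts inside each language group (assumed)

# Hypothetical parameters: linear experts plus LID-routing and gating weights
W_lid = rng.normal(size=(d, n_groups))                       # LID routing network
W_gate = rng.normal(size=(n_groups, d, experts_per_group))   # per-group gating network
W_exp = rng.normal(size=(n_groups, experts_per_group, d, d)) # expert projections

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def collaborative_moe(h):
    """One-frame forward pass: LID-weighted combination of language-group outputs."""
    lid_w = softmax(h @ W_lid)            # LID weights route the frame to groups
    group_outs = []
    for g in range(n_groups):
        gate = softmax(h @ W_gate[g])     # unsupervised gating within the group
        out = sum(gate[e] * (h @ W_exp[g, e]) for e in range(experts_per_group))
        group_outs.append(out)
    # Inter-group collaboration: integrate language-specific representations
    y = sum(lid_w[g] * group_outs[g] for g in range(n_groups))
    return y, lid_w

h = rng.normal(size=d)        # one acoustic frame (stand-in for encoder features)
y, lid_w = collaborative_moe(h)
```

In training, the LID weights would additionally be supervised with frame-level language labels (the explicit LID task the abstract mentions), while the inner gates remain unsupervised.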

Authors (7)
  1. Hukai Huang
  2. Jiayan Lin
  3. Kaidi Wang
  4. Yishuang Li
  5. Wenhao Guan
  6. Qingyang Hong
  7. Lin Li