LAE-ST-MoE: Boosted Language-Aware Encoder Using Speech Translation Auxiliary Task for E2E Code-switching ASR
Abstract: To mitigate confusion between languages in code-switching (CS) automatic speech recognition (ASR), conditionally factorized models such as the language-aware encoder (LAE) explicitly discard the contextual information shared between languages. However, this information may benefit ASR modeling. To address this, we propose the LAE-ST-MoE framework, which incorporates a speech translation (ST) auxiliary task into the LAE and uses ST to learn the contextual information between languages. It also introduces a task-based mixture-of-experts (MoE) module that employs separate feed-forward networks for the ASR and ST tasks. Experimental results on the ASRU 2019 Mandarin-English CS challenge dataset demonstrate that, compared to the LAE-based CTC baseline, the LAE-ST-MoE model achieves a 9.26% mix error rate reduction on the CS test set with the same decoding parameters. Moreover, the trained LAE-ST-MoE model can also perform ST from CS speech to Mandarin or English text.
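The task-based MoE idea in the abstract can be pictured as a single routing decision: the shared encoder output is sent to one of several task-specific feed-forward networks (experts), selected by the current task (ASR or ST) rather than by a learned gate. The sketch below is a hypothetical, simplified illustration of that routing, not the authors' implementation; all names, dimensions, and the use of plain NumPy are assumptions for clarity.

```python
# Hypothetical sketch of a task-based mixture-of-experts layer (NOT the
# paper's implementation): encoder hidden states are routed to a
# task-specific feed-forward network chosen by a hard task label.
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_FFN = 8, 16  # toy dimensions, chosen for illustration only

def make_ffn(d_model, d_ffn, rng):
    """Return (W1, b1, W2, b2) for a two-layer position-wise FFN."""
    return (rng.standard_normal((d_model, d_ffn)) * 0.1,
            np.zeros(d_ffn),
            rng.standard_normal((d_ffn, d_model)) * 0.1,
            np.zeros(d_model))

# One expert FFN per task; the tasks share everything upstream.
experts = {"asr": make_ffn(D_MODEL, D_FFN, rng),
           "st": make_ffn(D_MODEL, D_FFN, rng)}

def task_moe(x, task):
    """Route hidden states x of shape (T, d_model) through the expert for `task`."""
    W1, b1, W2, b2 = experts[task]
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU activation
    return h @ W2 + b2

# Usage: the same encoder output yields task-specific representations.
x = rng.standard_normal((4, D_MODEL))  # 4 frames of shared encoder output
y_asr = task_moe(x, "asr")
y_st = task_moe(x, "st")
assert y_asr.shape == y_st.shape == (4, D_MODEL)
assert not np.allclose(y_asr, y_st)  # different experts, different outputs
```

Hard task routing keeps the two objectives from interfering in the FFN parameters while still sharing the attention layers, which is the intuition behind using separate feed-forward networks for ASR and ST.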