Investigating Zero-Shot Generalizability on Mandarin-English Code-Switched ASR and Speech-to-text Translation of Recent Foundation Models with Self-Supervision and Weak Supervision (2401.00273v1)
Abstract: This work evaluated several cutting-edge large-scale foundation models based on self-supervision or weak supervision, including SeamlessM4T, SeamlessM4T v2, and Whisper-large-v3, on three Mandarin-English code-switched corpora. We found that the self-supervised models can achieve performance close to that of the supervised model, indicating the effectiveness of multilingual self-supervised pre-training. We also observed that these models still have room for improvement, as they kept making similar mistakes and performed poorly on intra-sentential code-switching. In addition, we explored the validity of several Whisper variants and concluded that they remain effective in the code-switching scenario; similar techniques for self-supervised models are worth studying to boost performance on code-switched tasks.
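As a concrete illustration of the zero-shot setup evaluated here, the sketch below runs Whisper-large-v3 on a code-switched recording via the Hugging Face transformers pipeline. This is a minimal sketch, not the paper's evaluation code: the audio filename is a placeholder, and pinning decoding to Mandarin is one assumed way of handling mixed-language input, since Whisper conditions on a single language token per utterance.

```python
# Minimal sketch (not the paper's evaluation code): zero-shot
# Mandarin-English code-switched ASR with Whisper-large-v3.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",  # switch to "cpu" if no GPU is available
)

# Whisper decodes under a single language token, so a code-switched
# utterance has to be pinned to one language; here we pick Mandarin
# ("zh") and rely on the model to still emit the embedded English.
out = asr(
    "cs_utterance.wav",  # hypothetical example file
    generate_kwargs={"language": "zh", "task": "transcribe"},
)
print(out["text"])
```

The single-language prompt is one plausible reason such models stumble on intra-sentential switches: decoding is biased toward the prompted language, so embedded spans in the other language are where errors tend to concentrate.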
- Jie Chi and Peter Bell, “Improving code-switched ASR with linguistic information,” in Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, Oct. 2022, pp. 7171–7176, International Committee on Computational Linguistics.
- Hexin Liu et al., “Reducing language confusion for code-switching speech recognition with token-level language diarization,” in ICASSP 2023.
- Orion Weller et al., “End-to-end speech translation for code switched speech,” in Findings of ACL, Dublin, Ireland, May 2022, pp. 1435–1448, Association for Computational Linguistics.
- Alexei Baevski et al., “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020, pp. 12449–12460.
- Wei-Ning Hsu et al., “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” TASLP, vol. 29, pp. 3451–3460, 2021.
- Sanyuan Chen et al., “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
- Guoguo Chen et al., “GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio,” in Proc. Interspeech 2021, 2021, pp. 3670–3674.
- “Weakly Supervised Construction of ASR Systems from Massive Video Data,” in Proc. Interspeech 2021, 2021, pp. 4533–4537.
- “SpeechNet: Weakly supervised, end-to-end speech recognition at industrial scale,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, Yunyao Li and Angeliki Lazaridou, Eds., Abu Dhabi, UAE, Dec. 2022, pp. 285–293, Association for Computational Linguistics.
- Seamless Communication et al., “SeamlessM4T: Massively multilingual & multimodal machine translation,” 2023.
- Seamless Communication et al., “Seamless: Multilingual expressive and streaming speech translation,” 2023.
- Vineel Pratap et al., “Scaling speech technology to 1,000+ languages,” arXiv, 2023.
- Alec Radford et al., “Robust speech recognition via large-scale weak supervision,” in Proceedings of the 40th International Conference on Machine Learning, Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, Eds. 23–29 Jul 2023, vol. 202 of Proceedings of Machine Learning Research, pp. 28492–28518, PMLR.
- “Zero-shot domain-sensitive speech recognition with prompt-conditioning fine-tuning,” 2023.
- “Can Whisper perform speech-based in-context learning?,” 2023.
- “Adapting the adapters for code-switching in multilingual ASR,” 2023.
- Yu-An Chung et al., “w2v-BERT: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 244–250.
- Chung-Cheng Chiu et al., “Self-supervised learning with random-projection quantizer for speech recognition,” in Proceedings of the 39th International Conference on Machine Learning, Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, Eds. 17–23 Jul 2022, vol. 162 of Proceedings of Machine Learning Research, pp. 3915–3924, PMLR.
- Puyuan Peng et al., “Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization,” in Proc. INTERSPEECH 2023, 2023, pp. 396–400.
- Holy Lovenia et al., “ASCEND: A spontaneous Chinese-English dataset for code-switching in multi-turn conversation,” in Proceedings of the 13th Language Resources and Evaluation Conference (LREC), 2022.
- “Zero resource code-switched speech benchmark using speech utterance pairs for multiple spoken languages,” arXiv preprint arXiv:2310.03018, 2023.
- “An exploration of in-context learning for speech language model,” 2023.
- “Dynamic-SUPERB: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech,” 2023.