ML-SUPERB: Multilingual Speech Universal PERformance Benchmark (2305.10615v2)
Abstract: Speech processing Universal PERformance Benchmark (SUPERB) is a leaderboard to benchmark the performance of Self-Supervised Learning (SSL) models on various speech processing tasks. However, SUPERB largely considers English speech in its evaluation. This paper presents multilingual SUPERB (ML-SUPERB), covering 143 languages (ranging from high-resource to endangered), and considering both automatic speech recognition and language identification. Following the concept of SUPERB, ML-SUPERB utilizes frozen SSL features and employs a simple framework for multilingual tasks by learning a shallow downstream model. Similar to the SUPERB benchmark, we find speech SSL models can significantly improve performance compared to FBANK features. Furthermore, we find that multilingual models do not always perform better than their monolingual counterparts. We will release ML-SUPERB as a challenge with organized datasets and reproducible training scripts for future multilingual representation research.
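The probing recipe the abstract describes — frozen SSL features combined across layers and fed to a shallow downstream model — can be sketched as below. This is a minimal illustration, not the exact ML-SUPERB configuration: the layer count, feature dimension, and single linear head are assumptions, and the learnable layer weights follow the SUPERB-style weighted-sum convention.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes (assumed, roughly HuBERT-Base-like).
num_layers, time_steps, feat_dim = 12, 50, 768
num_classes = 32  # e.g. a small character vocabulary

# Frozen upstream: per-layer hidden states; these are never updated
# during downstream training.
layer_feats = rng.standard_normal((num_layers, time_steps, feat_dim))

# Learnable scalar weights over layers, normalized with a softmax
# (initialized uniform here).
layer_logits = np.zeros(num_layers)
weights = np.exp(layer_logits) / np.exp(layer_logits).sum()

# Weighted sum across layers -> (time_steps, feat_dim).
combined = np.tensordot(weights, layer_feats, axes=1)

# Shallow downstream model: a single linear projection to class logits.
W = rng.standard_normal((feat_dim, num_classes)) * 0.01
b = np.zeros(num_classes)
logits = combined @ W + b  # (time_steps, num_classes)

print(combined.shape, logits.shape)
```

Only `weights`, `W`, and `b` would be trained; the upstream features stay fixed, which is what keeps the benchmark cheap and makes it measure representation quality rather than fine-tuning capacity.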
Authors: Jiatong Shi, Dan Berrebbi, William Chen, Ho-Lam Chung, En-Pei Hu, Wei Ping Huang, Xuankai Chang, Shang-Wen Li, Abdelrahman Mohamed, Hung-yi Lee, Shinji Watanabe