Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context (2404.02000v3)
Published 2 Apr 2024 in cs.CL, cs.LG, cs.SD, and eess.AS
Abstract: We present the first self-supervised multilingual speech model trained exclusively on African speech. The model learned from nearly 60 000 hours of unlabeled speech segments in 21 languages and dialects spoken in sub-Saharan Africa. On the SSA subset of the FLEURS-102 dataset, our approach, based on a HuBERT$_{base}$ (0.09B) architecture, shows competitive results on the ASR downstream task compared to the w2v-bert-51 (0.6B) pre-trained model proposed in the FLEURS benchmark, while being more efficient by using 7x less data and 6x fewer parameters. Furthermore, on the LID downstream task, our approach outperforms the FLEURS baselines' accuracy by over 22%.
Authors: Antoine Caubrière, Elodie Gauthier