AfriHuBERT: A Self-Supervised Speech Model for African Languages
Overview
The paper presents AfriHuBERT, an extension of the mHuBERT model aimed at improving speech representations for African languages. The authors expand coverage from 16 to 39 African languages, drawing on over 6,500 hours of speech from diverse sources. Evaluations on language identification (LID) and automatic speech recognition (ASR) show notable gains over existing models: a 4% increase in average F1 score for LID and a 1.2% reduction in average word error rate (WER) for ASR.
Background
Self-supervised learning (SSL) models such as HuBERT, XLSR, and WavLabLM have become integral to speech applications. They are pretrained on large amounts of unlabeled audio to learn general-purpose speech representations. While existing multilingual variants cover many languages, African languages remain underrepresented. This gap motivated AfriHuBERT, which builds on mHuBERT's foundations by incorporating more African languages through continued pretraining.
Methodology
Data Collection and Preprocessing
The authors assembled 6,551 hours of speech data from eight primary sources, covering languages from families including Afro-Asiatic and Niger-Congo. The datasets span domains such as religious and general content, with speech types ranging from read to spontaneous. Preprocessing standardized the audio to a 16 kHz sampling rate and segmented it into manageable chunks. A notable challenge was ensuring data quality for the less-resourced languages.
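The preprocessing steps above can be sketched in a few lines. The paper does not specify its exact pipeline; this is a minimal illustration (resampling plus fixed-length chunking) using scipy's polyphase resampler, with the 10-second chunk length being an assumed, not stated, value.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly


def to_16k_chunks(wav, sr, target_sr=16000, chunk_s=10.0):
    """Resample a mono waveform to 16 kHz and split it into fixed-length chunks.

    wav: 1-D numpy array of samples; sr: its original sampling rate.
    chunk_s is an illustrative default, not a value from the paper.
    """
    if sr != target_sr:
        # Polyphase resampling by the reduced rational factor target_sr/sr.
        g = gcd(target_sr, sr)
        wav = resample_poly(wav, target_sr // g, sr // g)
    n = int(chunk_s * target_sr)
    return [wav[i:i + n] for i in range(0, len(wav), n)]
```

A 48 kHz recording, for instance, is resampled by the factor 1/3 before chunking; the last chunk may be shorter than `chunk_s` seconds.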
Model Training and Evaluation
AfriHuBERT was trained by continuing mHuBERT's pretraining on the additional African-language data, using Faiss-based clustering to generate pseudo-labels for the HuBERT objective. The model was evaluated on the FLEURS dataset for LID and ASR. Results showed substantial improvements over mHuBERT in both language coverage and task performance.
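The pseudo-labeling step follows the standard HuBERT recipe: frame-level features are clustered with k-means, and each frame's cluster ID becomes its discrete prediction target. The paper uses Faiss for this; the sketch below substitutes scikit-learn's KMeans as an illustrative stand-in, and the cluster count is a placeholder, not the paper's setting.

```python
import numpy as np
from sklearn.cluster import KMeans


def pseudo_labels(features, n_clusters=100, seed=0):
    """Assign a discrete cluster ID to each frame of speech features.

    features: (n_frames, dim) array of intermediate-layer representations.
    The paper clusters with Faiss; sklearn's KMeans stands in here.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(features)  # one pseudo-label per frame
```

In continued pretraining, these IDs replace transcriptions as targets, which is what lets the model learn from unlabeled African-language speech.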
Results
AfriHuBERT significantly improved LID and ASR performance compared to its predecessors and contemporary models:
- LID: Achieved an average F1 score of 92%, outperforming comparable models, with marked gains on languages that previously had little or no coverage.
- ASR: Achieved a 1.2% reduction in average WER across languages, with further improvements when non-African languages are excluded from the evaluation.
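For concreteness, the two evaluation metrics can be computed as below. Note the assumptions: macro-averaged F1 for LID (the paper reports an average F1 without specifying the averaging scheme), and WER as word-level Levenshtein distance over the reference length.

```python
from sklearn.metrics import f1_score


def wer(ref, hyp):
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    # Dynamic-programming table for Levenshtein distance over word sequences.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)


def lid_f1(y_true, y_pred):
    """Macro-averaged F1 over predicted language labels (assumed averaging)."""
    return f1_score(y_true, y_pred, average="macro")
```

Lower WER and higher F1 are better; the paper's reported deltas (4% F1, 1.2% WER) are averages over these per-language scores.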
Implications and Future Directions
The enhanced coverage and performance of AfriHuBERT suggest practical benefits for building inclusive speech technologies in African contexts. Further gains are likely from addressing the identified data quality issues, especially in transcriptions for low-resource languages. Future work could extend coverage to more languages and develop more robust evaluation benchmarks that accurately represent African languages in AI systems.
Conclusion
The authors demonstrate the scalability and adaptability of the self-supervised approach for multilingual speech representation, tailored here to African languages. This work not only helps fill linguistic gaps in current models but also underscores the importance of diverse pretraining data if SSL models are to serve the global linguistic landscape.