AfriHuBERT: A Self-Supervised Speech Model for African Languages
Overview
The paper presents AfriHuBERT, an extension of the mHuBERT model aimed at improving speech representations for African languages. The authors expand coverage from 16 to 39 African languages, drawing on over 6,500 hours of speech from diverse sources. Evaluations on language identification (LID) and automatic speech recognition (ASR) show notable gains over existing models: a 4% increase in average F1 score for LID and a 1.2% reduction in average word error rate (WER) for ASR.
Background
Self-supervised learning (SSL) models such as HuBERT, XLSR, and WavLabLM have become integral to speech applications. They are pretrained on large amounts of unlabeled audio to learn general-purpose speech representations. While existing multilingual variants cover many languages, African languages remain underrepresented. This gap motivated AfriHuBERT, which builds on mHuBERT's foundations by incorporating more African languages through continued pretraining.
Methodology
Data Collection and Preprocessing
The authors assembled 6,551 hours of speech data from eight primary sources, covering languages from families including Afro-Asiatic and Niger-Congo. The datasets span domains such as religious and general content, with speech types ranging from read to spontaneous. Preprocessing standardized the audio to a 16 kHz sampling rate and segmented it into manageable chunks. A notable challenge was ensuring data quality for the less-resourced languages.
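The preprocessing steps above can be sketched in a few lines. The paper does not specify its exact pipeline; this is a minimal illustration (resampling plus fixed-length chunking) using scipy's polyphase resampler, with the 10-second chunk length being an assumed, not stated, value.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly


def to_16k_chunks(wav, sr, target_sr=16000, chunk_s=10.0):
    """Resample a mono waveform to 16 kHz and split it into fixed-length chunks.

    wav: 1-D numpy array of samples; sr: its original sampling rate.
    chunk_s is an illustrative default, not a value from the paper.
    """
    if sr != target_sr:
        # Polyphase resampling by the reduced rational factor target_sr/sr.
        g = gcd(target_sr, sr)
        wav = resample_poly(wav, target_sr // g, sr // g)
    n = int(chunk_s * target_sr)
    return [wav[i:i + n] for i in range(0, len(wav), n)]
```

A 48 kHz recording, for instance, is resampled by the factor 1/3 before chunking; the last chunk may be shorter than `chunk_s` seconds.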
Model Training and Evaluation
AfriHuBERT was trained by continuing mHuBERT's pretraining on the additional African-language data, using Faiss-based clustering to generate pseudo-labels for the HuBERT objective. The model was evaluated on the FLEURS dataset for LID and ASR. Results showed substantial improvements over mHuBERT in both language coverage and task performance.
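The pseudo-labeling step follows the standard HuBERT recipe: frame-level features are clustered with k-means, and each frame's cluster ID becomes its discrete prediction target. The paper uses Faiss for this; the sketch below substitutes scikit-learn's KMeans as an illustrative stand-in, and the cluster count is a placeholder, not the paper's setting.

```python
import numpy as np
from sklearn.cluster import KMeans


def pseudo_labels(features, n_clusters=100, seed=0):
    """Assign a discrete cluster ID to each frame of speech features.

    features: (n_frames, dim) array of intermediate-layer representations.
    The paper clusters with Faiss; sklearn's KMeans stands in here.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(features)  # one pseudo-label per frame
```

In continued pretraining, these IDs replace transcriptions as targets, which is what lets the model learn from unlabeled African-language speech.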
Results
AfriHuBERT significantly improved LID and ASR performance compared to its predecessors and contemporary models:
- LID: Achieved an average F1 score of 92%, outperforming comparable models, with marked gains on languages that previously had little or no coverage.
- ASR: Achieved a 1.2% reduction in average WER across languages, with further improvements when non-African languages are excluded from the evaluation.
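For concreteness, the two evaluation metrics can be computed as below. Note the assumptions: macro-averaged F1 for LID (the paper reports an average F1 without specifying the averaging scheme), and WER as word-level Levenshtein distance over the reference length.

```python
from sklearn.metrics import f1_score


def wer(ref, hyp):
    """Word error rate: word-level edit distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    # Dynamic-programming table for Levenshtein distance over word sequences.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)


def lid_f1(y_true, y_pred):
    """Macro-averaged F1 over predicted language labels (assumed averaging)."""
    return f1_score(y_true, y_pred, average="macro")
```

Lower WER and higher F1 are better; the paper's reported deltas (4% F1, 1.2% WER) are averages over these per-language scores.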
Implications and Future Directions
The enhanced coverage and performance of AfriHuBERT suggest practical benefits for building inclusive speech technologies in African contexts. Further gains are likely from addressing the identified data quality issues, especially in transcriptions for low-resource languages. Future work could extend coverage to more languages and develop more robust evaluation benchmarks that accurately represent African languages in AI systems.
Conclusion
The authors demonstrate the scalability and adaptability of the self-supervised approach for multilingual speech representation, tailored here to African languages. This work not only helps fill linguistic gaps in current models but also underscores the importance of diverse pretraining data if SSL models are to serve the global linguistic landscape.