mHuBERT-147: A Compact Multilingual HuBERT Model
The paper "mHuBERT-147: A Compact Multilingual HuBERT Model" presents a novel speech representation model, mHuBERT-147, trained on 90,430 hours of clean, open-license data across 147 languages. mHuBERT-147 is a multilingual extension of the HuBERT model, which leverages self-supervised learning (SSL) for speech representation. The model achieves significant results with only 95 million parameters, making it a highly efficient alternative to larger models while maintaining competitive performance.
Key Contributions and Methodology
Efficient Clustering with FAISS
One major innovation in this paper is the use of FAISS-based clustering in place of the standard k-means pipeline used to generate HuBERT pseudo-labels. By assigning labels through FAISS's Inverted File Index (IVF) rather than exhaustive nearest-centroid search, the authors obtain a 5.2x speed-up in label assignment, substantially reducing the hardware and time costs of training multilingual HuBERT models.
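Conceptually, the assignment step looks like the following FAISS sketch: train k-means centroids, then wrap them in an IVF index so that each frame is compared against only a few cells rather than all centroids. The feature dimension, cluster count, and index parameters below are illustrative, not the paper's exact configuration.

```python
import numpy as np
import faiss

d, k = 768, 1000          # feature dimension and cluster count (illustrative)
feats = np.random.randn(100_000, d).astype("float32")  # stand-in for SSL features

# Train k-means centroids with FAISS
kmeans = faiss.Kmeans(d, k, niter=20, seed=0)
kmeans.train(feats)

# Wrap the centroids in an IVF index: label assignment then probes a few
# cells instead of exhaustively comparing each frame to all k centroids.
nlist = 16                              # number of IVF cells (illustrative)
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(kmeans.centroids)
index.add(kmeans.centroids)
index.nprobe = 4                        # cells probed per query: speed/accuracy trade-off

# Pseudo-labels = index of the nearest centroid for each frame-level feature
_, labels = index.search(feats, 1)
labels = labels.ravel()
```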
Multilingual Up-sampling Strategy
To ensure balanced training across languages, the authors introduced a two-level multilingual up-sampling strategy that accounts for both linguistic diversity and dataset heterogeneity: sampling probabilities are adjusted at the language level and then at the dataset level within each language, increasing the presence of low-resource languages during training.
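A minimal sketch of such a two-level scheme is below: language weights are temperature-scaled from per-language hours, then dataset weights are scaled within each language. The exponents and the exact weighting formula are illustrative assumptions, not the paper's published recipe.

```python
def sampling_probs(hours_by_lang_dataset, alpha=0.5, beta=0.5):
    """Two-level up-sampling sketch. Exponents < 1 flatten the distribution,
    boosting low-resource languages and datasets. Formula is illustrative."""
    # Level 1: language probabilities from total hours per language
    lang_hours = {lang: sum(ds.values()) for lang, ds in hours_by_lang_dataset.items()}
    z = sum(h ** alpha for h in lang_hours.values())
    lang_p = {lang: h ** alpha / z for lang, h in lang_hours.items()}

    # Level 2: dataset probabilities within each language
    probs = {}
    for lang, datasets in hours_by_lang_dataset.items():
        zz = sum(h ** beta for h in datasets.values())
        for name, h in datasets.items():
            probs[(lang, name)] = lang_p[lang] * (h ** beta) / zz
    return probs

# Toy example: one high-resource and one low-resource language
hours = {
    "en": {"voxpopuli": 500.0, "commonvoice": 300.0},
    "sw": {"commonvoice": 20.0},
}
print(sampling_probs(hours))
```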
Training Efficiency and Data Handling
The mHuBERT-147 model demonstrates remarkable training efficiency in multilingual settings. After three training iterations, it outperforms the larger XLS-R (300M parameters; 436K hours) and remains competitive with the much larger MMS (1B parameters; 491K hours). With only 95M parameters, mHuBERT-147 ranks second on the ML-SUPERB 10min leaderboard and first on the 1h leaderboard, achieving state-of-the-art (SOTA) scores for three LID tasks.
Numerical Results and Evaluation
On the ML-SUPERB benchmark, mHuBERT-147 is highly competitive: it ranks second on the 10min leaderboard and first on the 1h leaderboard, with SOTA scores for the LID tasks. In few-shot ASR evaluations on the FLEURS-102 dataset, it likewise exhibits robust performance despite its compact size.
Practical and Theoretical Implications
Practical Implications
The mHuBERT-147 model's efficient training and competitive performance make it attractive for various speech processing applications, especially where resources are constrained. Its compact size ensures faster fine-tuning and inference times, making it highly suitable for deployment in real-time applications and low-resource environments. The open-source release of the model weights, scripts, and training data further democratizes access to advanced speech models for the research community.
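As a usage sketch, the released checkpoint can be loaded like any HuBERT encoder via Hugging Face transformers. The repository id below is an assumption about where the weights are hosted; the frame-level outputs would feed a downstream head (ASR, LID, etc.).

```python
# Minimal inference sketch; the Hub repository id is an assumption.
import numpy as np
import torch
from transformers import AutoFeatureExtractor, HubertModel

model_id = "utter-project/mHuBERT-147"   # assumed checkpoint location
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
model = HubertModel.from_pretrained(model_id)
model.eval()

# One second of silent 16 kHz audio; replace with real speech.
waveform = np.zeros(16_000, dtype=np.float32)
inputs = feature_extractor(waveform, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Frame-level SSL representations for a downstream head (ASR, LID, ...)
print(outputs.last_hidden_state.shape)   # (1, num_frames, hidden_size)
```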
Theoretical Implications
From a theoretical perspective, mHuBERT-147's success challenges the conventional emphasis on large models and extensive datasets for multilingual tasks. The model's robust performance with fewer parameters opens avenues for exploring more compact architectures in multilingual SSL. Additionally, the effectiveness of the FAISS-based clustering in this context may inspire future research into optimized clustering methods for other domains.
Future Directions
The promising results from mHuBERT-147 suggest several directions for future research. First, increasing model capacity could address the performance saturation observed at the third training iteration. Second, exploring other efficient clustering techniques might yield even better resource utilization. Finally, extending the training data to cover more languages and dialects could further improve the model's adaptability and robustness across diverse linguistic environments.
Conclusion
The development of mHuBERT-147 marks an important step in multilingual speech representation modeling. By balancing strong performance with parameter efficiency, the model sets new benchmarks on multiple leaderboards while remaining compact and using far less training data than its competitors. The methodologies and insights presented in this paper pave the way for future innovations in efficient multilingual SSL, with broad practical and theoretical implications.