- The paper presents a 950,000-hour open-source speech dataset covering 24 EU languages for training robust speech foundation models.
- Pseudo-labels are generated for 441,000 hours of unlabeled data with Whisper large v3, enabling substantial WER reductions in low-resource settings.
- The study underscores the importance of open-source principles in AI, addressing licensing challenges and promoting equitable, transparent speech recognition technologies.
MOSEL: 950,000 Hours of Speech Data for EU Languages
This paper presents an extensive effort toward creating open-source compliant Speech Foundation Models (SFMs) for the 24 official languages of the European Union (EU). Its central contribution is the assembly of a large speech corpus, dubbed MOSEL, aimed at enabling the development of truly open-source SFMs.
Overview
MOSEL encompasses approximately 950,000 hours of speech data, sourced under open-source compliant licenses. This effort directly addresses a gap in existing speech foundation models, most of which fall short of open-source principles because model weights, code, or training data are not freely accessible.
The paper highlights the importance of open science principles in AI, emphasizing in particular that access to training data under an open-source license is crucial. Current models such as Whisper and OWSM fall short due to restrictions like non-commercial-use clauses or prohibitions on derivative works.
Methodology
The authors undertook a comprehensive survey of existing corpora, collating those whose licenses permit open-source use. The resulting collection comprises 18 datasets totaling 950,000 hours of both labeled and unlabeled speech, including CommonVoice, CoVoST2, and the European Parliament corpus. While linguistically diverse, the collection is heavily skewed towards high-resource languages such as English, French, and German.
To expand the labeled portion, an automatic transcription process was applied to 441,000 hours of unlabeled data using the Whisper large v3 model. The paper acknowledges potential issues such as transcription inaccuracies and data imbalance, and applies mitigation strategies including language identification and hallucination filtering.
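The hallucination filtering step can be illustrated with a simple heuristic. The paper does not specify its exact criteria, so the two signals below (n-gram repetition and compression ratio, both common proxies for the looping output typical of ASR hallucinations) and their thresholds are illustrative assumptions, not the authors' method:

```python
import zlib


def repetition_ratio(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that are duplicates; high values suggest
    the repetitive looping typical of ASR hallucinations."""
    words = text.split()
    if len(words) < n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)


def compression_ratio(text: str) -> float:
    """zlib compressed size over raw size; very repetitive text
    compresses far below 1."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / max(len(raw), 1)


def is_hallucination(text: str,
                     max_repetition: float = 0.3,   # illustrative threshold
                     min_compression: float = 0.35  # illustrative threshold
                     ) -> bool:
    """Flag a pseudo-label as a likely hallucination."""
    return (repetition_ratio(text) > max_repetition
            or compression_ratio(text) < min_compression)
```

A looping transcript like `"thank you " * 50` is flagged by both signals, while ordinary sentences pass; in practice such a filter would be tuned per language and combined with the language-identification check the paper mentions.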
Results
A proof-of-concept experiment on Maltese, a low-resource language, demonstrated MOSEL's utility. Training on both labeled and pseudo-labeled data, with filtering, reduced the Word Error Rate (WER) from over 80% for existing models to approximately 24%.
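For reference, WER is the word-level edit distance (substitutions, insertions, deletions) between hypothesis and reference transcripts, divided by the reference length. A minimal sketch of the standard dynamic-programming computation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `word_error_rate("a b c d", "a x c")` is 0.5 (one substitution plus one deletion over four reference words); note WER can exceed 1.0 when the hypothesis contains many insertions, which is why raw Whisper outputs on unseen low-resource languages can score above 80%.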
Implications and Future Directions
The creation of MOSEL has practical implications for the development of speech recognition technologies in the EU context, potentially leading to more equitable resources across languages. From a theoretical standpoint, the work contributes to broader discussions on open-source standards and data governance in AI.
Suggested future work includes extending the dataset collection to additional spoken languages and improving data curation and filtering techniques. The lack of open-source data for Irish highlights the need for continued data-acquisition efforts, especially for less-resourced languages.
Conclusion
This paper offers a foundational step towards an open-source Speech Foundation Model for the EU's official languages. By capturing a substantial amount of open-source compliant data in MOSEL, it paves the way for more inclusive and transparent speech recognition systems. The work underscores the importance of open-access models in the AI community, echoing broader calls for transparency and reproducibility in the development and deployment of AI technologies.