MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages (2410.01036v1)

Published 1 Oct 2024 in cs.CL, cs.AI, cs.SD, and eess.AS

Abstract: The rise of foundation models (FMs), coupled with regulatory efforts addressing their risks and impacts, has sparked significant interest in open-source models. However, existing speech FMs (SFMs) fall short of full compliance with the open-source principles, even if claimed otherwise, as no existing SFM has model weights, code, and training data publicly available under open-source terms. In this work, we take the first step toward filling this gap by focusing on the 24 official languages of the European Union (EU). We collect suitable training data by surveying automatic speech recognition datasets and unlabeled speech corpora under open-source compliant licenses, for a total of 950k hours. Additionally, we release automatic transcripts for 441k hours of unlabeled data under the permissive CC-BY license, thereby facilitating the creation of open-source SFMs for the EU languages.

Summary

  • The paper presents a 950,000-hour open-source speech dataset covering 24 EU languages for training robust speech foundation models.
  • It utilizes an automatic transcription process with Whisper large v3 on 441,000 hours of unlabeled data, significantly reducing WER in low-resource settings.
  • The study underscores the importance of open-source principles in AI, addressing licensing challenges and promoting equitable, transparent speech recognition technologies.

MOSEL: 950,000 Hours of Speech Data for EU Languages

This paper presents an extensive effort toward creating open-source-compliant Speech Foundation Models (SFMs) for the 24 official languages of the European Union (EU). It centers on assembling a large corpus of speech data, dubbed MOSEL, with the aim of facilitating the development of truly open-source SFMs.

Overview

MOSEL encompasses approximately 950,000 hours of speech data, sourced under open-source compliant licenses. This effort directly addresses the gap in existing speech foundation models, which lack full compliance with open-source principles due to restricted access to model weights, code, and training data.

The paper highlights the importance of open science principles in AI, emphasizing in particular that access to training data under an open-source license is crucial. Current models such as Whisper and OWSM fall short because of restrictions like non-commercial-use clauses or prohibitions on derivative works.

Methodology

The authors undertook a comprehensive survey of existing datasets, collating those that permit open-source use. The paper details the inclusion of 18 datasets, totaling 950,000 hours, covering both labeled and unlabeled speech. The collection includes datasets such as Common Voice, CoVoST2, and the EU Parliament corpus. Notably, it spans a diverse linguistic range, although it is heavily skewed toward high-resource languages such as English, French, and German.

An automatic transcription process was applied to 441,000 hours of unlabeled data using the Whisper large v3 model. The paper acknowledges potential issues such as transcription inaccuracies and data imbalance, and describes mitigation strategies including language identification and hallucination filtering.
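The paper does not spell out its hallucination filter here, but a common heuristic for Whisper-style pseudo-labels flags transcripts dominated by repeated phrases, since hallucinated output often loops the same n-gram. The sketch below is illustrative only; the function name and threshold are assumptions, not the authors' implementation:

```python
def looks_hallucinated(transcript: str, n: int = 3, threshold: float = 0.5) -> bool:
    """Flag a transcript whose repeated n-grams dominate the text.

    Heuristic (assumed, not from the paper): if more than `threshold` of all
    word n-grams are duplicates, the transcript likely loops a hallucinated
    phrase and should be filtered out of the pseudo-labeled training set.
    """
    words = transcript.split()
    if len(words) < n:
        return False  # too short to judge
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    repeated_fraction = 1 - len(set(ngrams)) / len(ngrams)
    return repeated_fraction > threshold
```

A filter like this would run over each automatic transcript before it is admitted as pseudo-labeled training data.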

Results

A notable proof-of-concept experiment on Maltese, a low-resource language, demonstrated the utility of MOSEL. The results showed a substantial improvement over existing models, with the Word Error Rate (WER) dropping from over 80% to approximately 24% after filtering and training on both labeled and pseudo-labeled data.
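For reference, WER is the word-level edit distance between a hypothesis and its reference transcript, normalized by the reference length. A minimal implementation (a standard textbook formulation, not the paper's evaluation code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is how scores "over 80" arise for severely mismatched models.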

Implications and Future Directions

The creation of MOSEL has practical implications for the development of speech recognition technologies in the EU context, potentially leading to more equitable resources across languages. From a theoretical standpoint, the work contributes to broader discussions on open-source standards and data governance in AI.

Future work suggested includes extending the dataset collection to additional spoken languages and improving data curation and filtering techniques. The lack of open-source data for Irish highlights the need for continued efforts in data acquisition, especially for less-resourced languages.

Conclusion

This paper offers a foundational step towards the establishment of an EU-compliant Open Source Speech Foundation Model. With a significant amount of open-source compliant data captured in MOSEL, it paves the way for more inclusive and transparent speech recognition systems. The research underscores the importance of open-access models in the AI community, echoing broader calls for transparency and reproducibility in the development and deployment of AI technologies.
