Common Voice: A Massively-Multilingual Speech Corpus
The paper "Common Voice: A Massively-Multilingual Speech Corpus" introduces an extensive and publicly available dataset aimed at enhancing speech technology research and development, specifically designed for Automatic Speech Recognition (ASR) applications. The dataset is noteworthy due to its broad language coverage, including 29 released languages and ongoing data collection efforts for a total of 38 languages as of November 2019.
The Common Voice project, initiated by Mozilla, leverages crowdsourcing for both data collection and validation, sharply reducing the cost and licensing restrictions typically associated with obtaining training data. This decentralized approach to data collection aligns with Mozilla's objective of making speech technology open and accessible.
To keep data acquisition and validation standardized, contributors record their speech via the Common Voice website or mobile application, reading predefined text sentences. The validation mechanism then asks other contributors to up-vote or down-vote each recording, ensuring the accuracy and reliability of the dataset.
Key Contributions
- Corpus Scale and Accessibility:
- At the time of the report, the Common Voice corpus comprised roughly 2,500 hours of collected audio contributed by more than 50,000 individuals. The dataset is released under a Creative Commons CC0 (public domain) license, making it one of the largest publicly accessible speech corpora.
- Broad Language Variety:
- The corpus covers both high-resource and low-resource languages. Major languages like German and French had substantial amounts of data, while smaller languages like Kabyle and Breton were also included, promoting inclusivity and diversity in ASR research.
- Crowdsourcing and Community Engagement:
- The interactive, participatory nature of the project enabled widespread community involvement, which not only increased the scale of the dataset but also supports its continued growth and diversification.
Methodology and Validation
The paper details the methods for recording and validating audio clips. The recording interface guides contributors through reading prompted sentences in a controlled, systematic way, and the validation process uses a voting mechanism: an audio clip is marked as valid once it receives two up-votes, which keeps the validated portion of the corpus accurate.
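To make the rule concrete, here is a minimal sketch of that two-vote threshold. The data model, the symmetric two-down-vote rejection path, and the function name are illustrative assumptions for this example, not Mozilla's actual implementation.

```python
# Minimal sketch of the two-up-vote validation rule described above.
# The Clip fields, the two-down-vote rejection path, and the function name are
# illustrative assumptions, not Mozilla's production logic.
from dataclasses import dataclass

@dataclass
class Clip:
    clip_id: str
    up_votes: int = 0
    down_votes: int = 0

def validation_status(clip: Clip) -> str:
    """'valid' after two up-votes, 'invalid' after two down-votes, else 'pending'."""
    if clip.up_votes >= 2:
        return "valid"
    if clip.down_votes >= 2:
        return "invalid"
    return "pending"  # the clip is shown to further reviewers

print(validation_status(Clip("clip-001", up_votes=2)))                # -> valid
print(validation_status(Clip("clip-002", up_votes=1, down_votes=1)))  # -> pending
```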
The data was divided into training, development, and test sets, with the splits constructed to keep each set independent and unbiased. This careful segmentation is pivotal for fair model training and evaluation; one simple way to enforce such independence is sketched below.
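A common way to realise this kind of independence is to assign splits at the speaker level, so that no contributor's recordings end up in more than one set. The sketch below illustrates that idea with a deterministic hash of a speaker identifier; the split fractions are arbitrary, the field name client_id mirrors the released Common Voice metadata, and this is not the paper's exact partitioning procedure.

```python
# Simplified speaker-disjoint split: every clip from a given speaker lands in
# exactly one of train/dev/test. Illustrative only; not the paper's exact method.
import hashlib

def assign_split(speaker_id: str, dev_frac: float = 0.05, test_frac: float = 0.05) -> str:
    """Hash the speaker id to a stable number in [0, 1] and bucket it."""
    digest = hashlib.sha256(speaker_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + dev_frac:
        return "dev"
    return "train"

# e.g. a list of clip records, each carrying a speaker identifier
clips = [
    {"client_id": "speaker-a", "path": "clip_0001.mp3"},
    {"client_id": "speaker-a", "path": "clip_0002.mp3"},  # same speaker -> same split
    {"client_id": "speaker-b", "path": "clip_0003.mp3"},
]
splits = {clip["path"]: assign_split(clip["client_id"]) for clip in clips}
```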
ASR Experiments and Results
The paper also describes ASR experiments on the Common Voice corpus using Mozilla's DeepSpeech toolkit. The experiments apply end-to-end transfer learning from a model pretrained on English, reporting substantial improvements in Character Error Rate (CER) across multiple target languages.
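As an illustration of what such cross-lingual transfer looks like in code, the sketch below reuses a pretrained encoder and re-initialises the character output layer for a new alphabet. It is a simplified PyTorch stand-in, not Mozilla's DeepSpeech implementation; the layer sizes, class names, and the option to freeze the encoder are assumptions made for the example.

```python
# Simplified PyTorch stand-in for cross-lingual transfer learning in ASR:
# keep the pretrained encoder, swap the character output layer for the new
# alphabet. This is NOT Mozilla's DeepSpeech code; sizes and names are illustrative.
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    """Toy end-to-end acoustic model: feature encoder + per-frame character logits."""
    def __init__(self, n_mels: int = 80, n_chars: int = 28, hidden: int = 256):
        super().__init__()
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden, n_chars + 1)   # +1 for the CTC blank

    def forward(self, features):                 # features: (batch, time, n_mels)
        hidden_states, _ = self.encoder(features)
        return self.classifier(hidden_states)    # (batch, time, n_chars + 1)

def transfer_to_new_language(source: TinyAcousticModel, n_target_chars: int,
                             freeze_encoder: bool = False) -> TinyAcousticModel:
    """Copy the pretrained encoder weights; re-initialise the output layer,
    since the target language uses a different alphabet."""
    target = TinyAcousticModel(n_chars=n_target_chars)
    target.encoder.load_state_dict(source.encoder.state_dict())
    if freeze_encoder:
        for param in target.encoder.parameters():
            param.requires_grad = False
    return target

# e.g. an "English" source model with 28 output characters, re-targeted to a
# hypothetical 35-character alphabet before fine-tuning on Common Voice clips.
english_model = TinyAcousticModel(n_chars=28)
target_model = transfer_to_new_language(english_model, n_target_chars=35)
```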
Numerical Results:
- The transfer learning experiments demonstrated an average CER improvement of 5.99 ± 5.48 across the twelve target languages, underscoring the value of the corpus even in low-resource settings. These results were particularly significant for languages such as Slovenian, Irish, and Tatar, where prior ASR research had been limited or non-existent.
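For reference, CER is the character-level edit distance between a system's hypothesis and the reference transcript, normalised by the reference length, so the reported figures are reductions in that value. Below is a minimal, self-contained sketch of the metric, independent of DeepSpeech's own evaluation code.

```python
# Character Error Rate: Levenshtein (edit) distance between hypothesis and
# reference character sequences, normalised by the reference length.
def edit_distance(ref: str, hyp: str) -> int:
    """Standard dynamic-programming Levenshtein distance over characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution (free if equal)
        prev = cur
    return prev[len(hyp)]

def cer(ref: str, hyp: str) -> float:
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(f"{cer('hello world', 'helo wurld'):.3f}")  # 2 edits / 11 chars ~= 0.182
```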
Implications and Future Directions
The practical implications of the Common Voice corpus are substantial. By providing a diverse and scalable dataset, the project facilitates advancements in ASR for a wide array of languages, including those with limited digital resources. The open, crowdsourced nature of the project empowers communities worldwide to contribute to, and benefit from, speech technology.
From a theoretical perspective, the results highlight the potential of transfer learning in multilingual ASR scenarios. As the corpus grows, there will likely be further improvements in ASR performance for both high- and low-resource languages.
Future developments could include:
- Expanding the corpus to include more languages and dialects.
- Enhancing the quality and variety of recorded speech to cover more accents and demographic variations.
- Exploring the implications of the dataset for other speech technology fields such as speaker identification and language translation.
In conclusion, the Common Voice corpus represents a significant step towards democratizing speech technology. By leveraging a collaborative, open-source approach, this initiative provides the resources necessary to advance ASR research globally, benefiting both the academic community and the broader public.