
Common Voice: A Massively-Multilingual Speech Corpus (1912.06670v2)

Published 13 Dec 2019 in cs.CL and cs.LG

Abstract: The Common Voice corpus is a massively-multilingual collection of transcribed speech intended for speech technology research and development. Common Voice is designed for Automatic Speech Recognition purposes but can be useful in other domains (e.g. language identification). To achieve scale and sustainability, the Common Voice project employs crowdsourcing for both data collection and data validation. The most recent release includes 29 languages, and as of November 2019 there are a total of 38 languages collecting data. Over 50,000 individuals have participated so far, resulting in 2,500 hours of collected audio. To our knowledge this is the largest audio corpus in the public domain for speech recognition, both in terms of number of hours and number of languages. As an example use case for Common Voice, we present speech recognition experiments using Mozilla's DeepSpeech Speech-to-Text toolkit. By applying transfer learning from a source English model, we find an average Character Error Rate improvement of 5.99 +/- 5.48 for twelve target languages (German, French, Italian, Turkish, Catalan, Slovenian, Welsh, Irish, Breton, Tatar, Chuvash, and Kabyle). For most of these languages, these are the first ever published results on end-to-end Automatic Speech Recognition.

Authors (10)
  1. Rosana Ardila (1 paper)
  2. Megan Branson (1 paper)
  3. Kelly Davis (2 papers)
  4. Michael Henretty (1 paper)
  5. Michael Kohler (23 papers)
  6. Josh Meyer (6 papers)
  7. Reuben Morais (2 papers)
  8. Lindsay Saunders (1 paper)
  9. Francis M. Tyers (7 papers)
  10. Gregor Weber (1 paper)
Citations (1,373)

Summary

Common Voice: A Massively-Multilingual Speech Corpus

The paper "Common Voice: A Massively-Multilingual Speech Corpus" introduces an extensive and publicly available dataset aimed at enhancing speech technology research and development, specifically designed for Automatic Speech Recognition (ASR) applications. The dataset is noteworthy due to its broad language coverage, including 29 released languages and ongoing data collection efforts for a total of 38 languages as of November 2019.

The Common Voice project, initiated by Mozilla, leverages crowdsourcing for both data collection and validation, substantially reducing the costs and restrictions conventionally associated with obtaining training data. This decentralized approach to data collection aligns with Mozilla's objective of making speech technology open and accessible.

To keep data acquisition and validation standardized, contributors record their speech via the Common Voice website or mobile application by reading predefined text sentences. Other contributors then listen to the recordings and up-vote or down-vote them, which keeps the dataset accurate and reliable.

Key Contributions

  1. Corpus Scale and Accessibility:
    • As of the report, more than 50,000 contributors had supplied roughly 2,500 hours of collected audio. The dataset is released under a Creative Commons CC0 (public-domain) license, making it one of the largest publicly accessible speech corpora.
  2. Broad Language Variety:
    • The corpus supports both well-resourced and low-resource languages. For example, major languages like German and French had substantial data, while minority languages like Kabyle and Breton were also included, promoting inclusivity and diversity in ASR research.
  3. Crowdsourcing and Community Engagement:
    • The interactive and participatory nature of the project enabled widespread community involvement, which not only enhanced the scale of the dataset but also ensured its continuous growth and adaptability.

Methodology and Validation

The paper detailed the methods for recording and validating audio clips. The recording interface guided contributors through reading prompted sentences, so speech was collected in a controlled and consistent manner. The validation process used a voting mechanism in which an audio clip was accepted as valid once it received two up-votes, keeping the released transcriptions reliable.
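
The corpus release does not include the voting logic itself, but the rule summarized above is simple to express. The following is a minimal sketch under the stated assumption that two up-votes mark a clip valid and two down-votes mark it invalid; the function and bucket names are illustrative, not taken from the Common Voice codebase.

```python
# Minimal sketch of the clip-validation rule described above (hypothetical
# helper, not code from the Common Voice project). A clip becomes "valid"
# once it collects two up-votes and "invalid" once it collects two down-votes
# (the down-vote threshold is an assumption); otherwise it awaits more votes.

def clip_status(up_votes: int, down_votes: int) -> str:
    if up_votes >= 2:
        return "valid"
    if down_votes >= 2:
        return "invalid"
    return "other"  # still awaiting enough votes

# Example: one up-vote and one down-vote leaves the clip undecided.
assert clip_status(2, 0) == "valid"
assert clip_status(0, 2) == "invalid"
assert clip_status(1, 1) == "other"
```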

Statistical analyses were conducted to determine an appropriate division of the data into training, validation, and test sets, in particular so that the same speaker or repeated sentence does not leak across sets. This careful segmentation is pivotal for unbiased model training and fair evaluation.
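
The paper's actual splitting code is not reproduced here, but the guiding constraint, namely that a speaker's recordings never span more than one partition, can be sketched as follows. The TSV column names (client_id, path, sentence) are an assumption borrowed from the layout of Common Voice data releases.

```python
# Sketch of a speaker-disjoint train/dev/test split (assumed TSV columns:
# client_id, path, sentence). This illustrates the constraint described in
# the paper, not the project's actual splitting code.
import csv
import random
from collections import defaultdict

def split_by_speaker(tsv_path, dev_frac=0.1, test_frac=0.1, seed=0):
    # Group clips by speaker so one speaker never spans two partitions.
    by_speaker = defaultdict(list)
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            by_speaker[row["client_id"]].append(row)

    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    n = len(speakers)
    n_test = max(1, int(n * test_frac))
    n_dev = max(1, int(n * dev_frac))
    test_ids = set(speakers[:n_test])
    dev_ids = set(speakers[n_test:n_test + n_dev])

    splits = {"train": [], "dev": [], "test": []}
    for speaker, rows in by_speaker.items():
        bucket = "test" if speaker in test_ids else "dev" if speaker in dev_ids else "train"
        splits[bucket].extend(rows)
    return splits
```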

ASR Experiments and Results

The paper also described ASR experiments using the Common Voice corpus, employing Mozilla's DeepSpeech Speech-to-Text toolkit. The experiments applied end-to-end transfer learning from a source English model, reporting significant improvements in Character Error Rate (CER) across multiple target languages.
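
DeepSpeech is a TensorFlow-based toolkit, and the sketch below is not its actual architecture or training code; it is a small PyTorch-style illustration of the idea behind the experiments, namely reusing encoder layers learned on English and reinitializing the output layer to match the target language's alphabet before fine-tuning on target-language data. All layer names and sizes are illustrative assumptions.

```python
# Illustrative transfer-learning sketch (PyTorch-style; not DeepSpeech's
# architecture). Keep the lower layers learned on English and reinitialize
# the output projection so it matches the target language's alphabet size.
import torch
import torch.nn as nn

class TinyAcousticModel(nn.Module):
    def __init__(self, n_features: int, n_hidden: int, alphabet_size: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        # Output size depends on the target alphabet (plus a CTC blank symbol).
        self.output = nn.Linear(n_hidden, alphabet_size + 1)

    def forward(self, x):
        return self.output(self.encoder(x))

# Source model trained on English; target model for a larger target alphabet.
source = TinyAcousticModel(n_features=26, n_hidden=128, alphabet_size=28)
target = TinyAcousticModel(n_features=26, n_hidden=128, alphabet_size=42)

# Transfer: copy the encoder weights; the output layer stays freshly initialized.
target.encoder.load_state_dict(source.encoder.state_dict())

# Fine-tune the whole model (or only the output layer) on target-language data.
optimizer = torch.optim.Adam(target.parameters(), lr=1e-4)
```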

Numerical Results:

  • The transfer learning experiments demonstrated an average CER improvement of 5.99 ± 5.48 across the twelve target languages, underscoring the efficacy of the corpus even in low-resource settings (see the CER sketch below). These results were particularly significant for languages like Slovenian, Irish, and Tatar, where ASR research had previously been limited or non-existent.
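
For reference, Character Error Rate is the character-level Levenshtein (edit) distance between the reference transcript and the hypothesis, normalized by the reference length; this is the standard definition rather than anything specific to the paper. A minimal sketch:

```python
# Standard character error rate: Levenshtein distance between the reference
# and the hypothesis transcript, normalized by the reference length.
def cer(reference: str, hypothesis: str) -> float:
    r, h = list(reference), list(hypothesis)
    # Dynamic-programming edit-distance table.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / max(1, len(r))

# Example: one substituted character in an eleven-character reference
# gives CER = 1/11, about 0.09.
print(cer("hello world", "hallo world"))
```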

Implications and Future Directions

The practical implications of the Common Voice corpus are substantial. By providing a diverse and scalable dataset, the project facilitates advancements in ASR for a wide array of languages, including those with limited digital resources. The open and crowdsourced nature of this project empowers global communities to contribute to and benefit from speech technology advancements.

From a theoretical perspective, the results highlight the potential of transfer learning in multilingual ASR scenarios. As the corpus grows, there will likely be further improvements in ASR performance for both high- and low-resource languages.

Future developments could include:

  • Expanding the corpus to include more languages and dialects.
  • Enhancing the quality and variety of recorded speech to cover more accents and demographic variations.
  • Exploring the implications of the dataset for other speech technology fields such as speaker identification and language translation.

In conclusion, the Common Voice corpus represents a significant step towards democratizing speech technology. By leveraging a collaborative, open-source approach, this initiative provides the resources necessary to advance ASR research globally, benefiting both the academic community and the broader public.
