IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages

Published 4 Mar 2024 in cs.CL | (2403.01926v1)

Abstract: We present INDICVOICES, a dataset of natural and spontaneous speech containing a total of 7348 hours of read (9%), extempore (74%) and conversational (17%) audio from 16237 speakers covering 145 Indian districts and 22 languages. Of these 7348 hours, 1639 hours have already been transcribed, with a median of 73 hours per language. Through this paper, we share our journey of capturing the cultural, linguistic and demographic diversity of India to create a one-of-its-kind inclusive and representative dataset. More specifically, we share an open-source blueprint for data collection at scale comprising standardised protocols, centralised tools, a repository of engaging questions, prompts and conversation scenarios spanning multiple domains and topics of interest, quality control mechanisms, comprehensive transcription guidelines and transcription tools. We hope that this open-source blueprint will serve as a comprehensive starter kit for data collection efforts in other multilingual regions of the world. Using INDICVOICES, we build IndicASR, the first ASR model to support all the 22 languages listed in the 8th schedule of the Constitution of India. All the data, tools, guidelines, models and other materials developed as a part of this work will be made publicly available.

Summary

The paper introduces IndicVoices, a large-scale multilingual speech dataset capturing 7348 hours of diverse audio from 16,237 speakers across 22 Indian languages.
It details a systematic data collection process emphasizing demographic diversity and real-world conditions to enhance ASR model accuracy.
IndicASR, built using this dataset, is the first ASR model to support all 22 languages listed in the 8th schedule of the Constitution of India, and it reportedly improves speech recognition performance for underrepresented languages.

IndicVoices: Towards an Inclusive Multilingual Speech Dataset for Indian Languages

Introduction

The paper introduces IndicVoices, a dataset designed to capture the linguistic, cultural, and demographic diversity of India, spanning 22 languages across 145 districts with contributions from 16,237 speakers. This initiative addresses the shortage of labeled data for Indian languages, which has historically limited the performance of Automatic Speech Recognition (ASR) technologies in non-English languages. The dataset totals 7348 hours of audio and consists mostly of extempore (74%) and conversational (17%) speech, with read speech making up the remaining 9%, offering a rich resource for developing inclusive language technologies.
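The headline figures can be turned into approximate hours per speech type with a few lines of arithmetic. The sketch below uses only the numbers quoted in the abstract (7348 total hours, the 9% / 74% / 17% split, 1639 transcribed hours, 16237 speakers); the variable and function names are illustrative, and because the published percentages are rounded, the resulting hours are estimates rather than official counts.

```python
# Figures quoted in the abstract; percentages are rounded in the source,
# so every derived value below is an approximation.
TOTAL_HOURS = 7348
TRANSCRIBED_HOURS = 1639
SPEAKERS = 16237
SHARES = {"read": 0.09, "extempore": 0.74, "conversational": 0.17}


def hours_by_type(total_hours, shares):
    """Split a total number of hours according to fractional shares."""
    return {kind: total_hours * share for kind, share in shares.items()}


breakdown = hours_by_type(TOTAL_HOURS, SHARES)
transcribed_fraction = TRANSCRIBED_HOURS / TOTAL_HOURS
avg_minutes_per_speaker = TOTAL_HOURS * 60 / SPEAKERS

for kind, hours in breakdown.items():
    print(f"{kind:>15}: ~{hours:.0f} hours")
print(f"transcribed so far: {transcribed_fraction:.1%}")
print(f"average audio per speaker: ~{avg_minutes_per_speaker:.0f} minutes")
```

On these figures, roughly a fifth of the audio has been transcribed so far, and the average contribution works out to a little under half an hour per speaker.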

The Dataset's Composition and Collection Process

The paper describes the dataset creation process in detail, emphasizing the commitment to capturing India's diversity. The authors built a dataset reflecting varied demographics (age, gender, educational background), types of speech (read, extempore, conversational), and recording conditions (diverse environments, wide/narrow-band recordings). A central component of their methodology is an open-source blueprint for data collection at scale, comprising standardised protocols, centralised tools, a repository of questions, prompts and conversation scenarios across multiple domains, quality control mechanisms, and transcription guidelines and tools. This framework supports the structured collection of spontaneous speech that reflects real-world usage, which makes the dataset better suited to practical ASR applications.

Comparison with Existing Datasets

IndicVoices distinguishes itself by its scale and scope: it covers 22 languages, and 1639 of its 7348 hours have already been transcribed, with a median of 73 hours per language. According to the summary, this surpasses existing datasets in linguistic and demographic diversity. This breadth gives a more complete representation of India's linguistic landscape and makes the dataset a strong resource for training robust, inclusive ASR models.

ASR Model Development and Benchmarking

Using IndicVoices, the authors developed IndicASR, the first ASR model to support all 22 languages listed in the 8th schedule of the Constitution of India. Initial benchmarking is reported to show that IndicASR significantly outperforms existing models, underscoring the dataset's effectiveness in improving ASR performance for Indian languages. The result illustrates how well-curated, diverse datasets can advance both the accuracy and the inclusivity of language technologies.
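This summary does not say which metric the benchmarking uses. ASR systems are commonly compared by word error rate (WER), so the sketch below assumes that metric purely for illustration: it computes WER as the word-level edit distance between a reference transcript and a model hypothesis, divided by the number of reference words. The function name and the placeholder sentences are hypothetical and are not taken from the paper or its tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Placeholder transcripts: one substituted word and one dropped word.
wer = word_error_rate("a b c d", "a x c")
print(f"WER = {wer:.2f}")
```

Whitespace tokenization is a simplification; a real evaluation across 22 languages and multiple scripts would need the text normalization rules defined by the benchmark itself.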

Practical and Theoretical Implications

Beyond ASR, the dataset's structure and breadth make it useful for several other speech and language processing tasks, such as speaker diarization, language identification, and query-by-example search. The planned public release of IndicVoices, together with the accompanying tools and guidelines, is expected to catalyze further research and to support digital inclusivity through speech technologies that cater to India's linguistic diversity.

Future Directions

The authors acknowledge certain limitations, such as the coverage of districts and the representation of conversational speech. Addressing these aspects in future iterations could further enhance the dataset's utility. Moreover, the ongoing collection and transcription efforts aim to expand the dataset, and subsequent work could focus on a more detailed evaluation across varied demographics and use cases. The development of IndicVoices is a stepping stone towards realizing the vision of truly inclusive speech technologies, opening avenues for multilingual research and applications.

Concluding Remarks

IndicVoices represents a significant contribution to the field of speech technology, particularly for the underrepresented languages of India. By facilitating the development of more accurate and inclusive ASR models, this work paves the way for greater digital accessibility and equity. Future research and innovations leveraging this dataset have the potential to transform the landscape of speech technology, making digital services more accessible to the linguistically diverse population of India.
