Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR
The paper "Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR" by Kaushal Santosh Bhogale et al. addresses the development of Automatic Speech Recognition (ASR) systems for Indian languages. Given that a significant portion of the Indian population is print-illiterate and the country is linguistically diverse, accurate ASR systems can have a substantial impact. The paper proposes Vistaar, a collection of diverse benchmarks, and introduces IndicWhisper, a family of ASR models fine-tuned for Indian languages.
Key Contributions
- Vistaar Benchmark Compilation: The authors curate a set of 59 benchmarks across 12 Indian languages, spanning multiple domains and data types, to evaluate ASR systems. The benchmarks draw on datasets such as Kathbath, CommonVoice, and FLEURS, and capture diversity in speakers, recording environments, and domains, encompassing both studio-quality and crowd-sourced audio.
- IndicWhisper ASR Models: By fine-tuning OpenAI's Whisper models on a comprehensive set of training data termed Vistaar-Train, the paper introduces IndicWhisper. This training set aggregates over 10,000 hours of audio across 12 languages, from sources like Shrutilipi, NPTEL, and IndicTTS, among others.
- Evaluation Results: IndicWhisper models outperform publicly available models such as IndicWav2Vec as well as commercial ASR systems from Google and Azure, achieving the lowest Word Error Rate (WER) on 39 of the 59 benchmarks.
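For readers unfamiliar with the metric used above: WER is the word-level edit distance between a reference transcript and an ASR hypothesis, normalized by the reference length. The helper below is an illustrative sketch (not code from the paper) showing how the metric is typically computed with dynamic programming:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + deletions + insertions) / reference word count,
    computed as a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a three-word reference gives WER = 1/3
print(wer("the cat sat", "the bat sat"))
```

In practice, evaluations like the one in this paper use established implementations (e.g. the `jiwer` package) together with consistent text normalization, since punctuation and casing rules can shift WER noticeably.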
Detailed Analysis
The paper presents a meticulous evaluation of existing ASR systems on the Vistaar benchmarks. The results show notable discrepancies among ASR models, with IndicWhisper outperforming others by significant margins, especially in challenging acoustic environments. They also show that ASR performance varies heavily with the evaluation dataset, indicating that reliance on a single benchmark can misrepresent a model's effectiveness across conditions and languages.
Implications and Future Directions
This research has far-reaching implications in enhancing accessibility and user interaction with technology for non-English speakers through robust ASR systems. The successful development and deployment of such systems could lead to substantial societal impacts, potentially transforming how information and services are accessed in linguistically diverse regions.
Future work should explore:
- Expanding training datasets to cover even greater linguistic variety;
- Developing strategies to balance ASR accuracy across languages with varied amounts of training data;
- Integrating domain-specific language models that complement generic acoustic models, improving performance in specialized applications.
In conclusion, the paper makes a substantial contribution to ASR for low-resource languages by establishing a well-rounded benchmark suite and demonstrating the benefits of diverse training data through the IndicWhisper models. The full open-sourcing of the datasets, code, and models enhances reproducibility and enables further research in this domain.