Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR
The paper "Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR" by Kaushal Santosh Bhogale et al. addresses the development of Automatic Speech Recognition (ASR) systems for Indian languages. Given that a significant portion of the Indian population is print-illiterate and the country is linguistically diverse, accurate ASR systems can have a substantial impact. The paper proposes Vistaar, a collection of diverse benchmarks, and introduces IndicWhisper, a family of ASR models fine-tuned for Indian languages.
Key Contributions
- Vistaar Benchmark Compilation: The authors curate a set of 59 benchmarks across 12 Indian languages, spanning multiple domains and data types, to evaluate ASR systems. The benchmarks draw on datasets such as Kathbath, CommonVoice, and FLEURS, and capture diversity in speakers, recording environments, and domains, encompassing both studio-quality and crowd-sourced audio.
- IndicWhisper ASR Models: By fine-tuning OpenAI's Whisper models on a comprehensive set of training data termed Vistaar-Train, the paper introduces IndicWhisper. This training set aggregates over 10,000 hours of audio across 12 languages, from sources like Shrutilipi, NPTEL, and IndicTTS, among others.
- Evaluation Results: IndicWhisper models outperform publicly available models such as IndicWav2Vec as well as commercial ASR systems from Google and Azure, achieving the lowest Word Error Rate (WER) on 39 of the 59 benchmarks.
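For readers unfamiliar with the metric used above: WER is the word-level edit distance between a reference transcript and an ASR hypothesis, normalized by the reference length. The helper below is an illustrative sketch (not code from the paper) showing how the metric is typically computed with dynamic programming:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (substitutions + deletions + insertions) / reference word count,
    computed as a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a three-word reference gives WER = 1/3
print(wer("the cat sat", "the bat sat"))
```

In practice, evaluations like the one in this paper use established implementations (e.g. the `jiwer` package) together with consistent text normalization, since punctuation and casing rules can shift WER noticeably.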
Detailed Analysis
The paper presents a meticulous evaluation of existing ASR systems on the Vistaar benchmarks. The results show notable discrepancies among ASR models, with IndicWhisper outperforming others by significant margins, especially in challenging acoustic environments. They also show that ASR performance varies heavily with the evaluation dataset, indicating that reliance on a single benchmark can misrepresent a model's effectiveness across conditions and languages.
Implications and Future Directions
This research has far-reaching implications in enhancing accessibility and user interaction with technology for non-English speakers through robust ASR systems. The successful development and deployment of such systems could lead to substantial societal impacts, potentially transforming how information and services are accessed in linguistically diverse regions.
Future work should explore:
- Expanding training datasets to cover even greater linguistic variety;
- Developing strategies to balance ASR accuracy across languages with varied amounts of training data;
- Integrating domain-specific language models that complement generic acoustic models, improving performance in specialized applications.
In conclusion, the paper makes a substantial contribution to ASR for low-resource languages by establishing a well-rounded benchmark suite and demonstrating the benefits of diverse training data through the IndicWhisper models. The full open-sourcing of the datasets, code, and models enhances reproducibility and enables further research in this domain.