IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages (2403.01926v1)
Abstract: We present INDICVOICES, a dataset of natural and spontaneous speech containing a total of 7348 hours of read (9%), extempore (74%) and conversational (17%) audio from 16237 speakers covering 145 Indian districts and 22 languages. Of these 7348 hours, 1639 hours have already been transcribed, with a median of 73 hours per language. Through this paper, we share our journey of capturing the cultural, linguistic and demographic diversity of India to create a one-of-its-kind inclusive and representative dataset. More specifically, we share an open-source blueprint for data collection at scale, comprising standardised protocols, centralised tools, a repository of engaging questions, prompts and conversation scenarios spanning multiple domains and topics of interest, quality control mechanisms, comprehensive transcription guidelines and transcription tools. We hope that this open-source blueprint will serve as a comprehensive starter kit for data collection efforts in other multilingual regions of the world. Using INDICVOICES, we build IndicASR, the first ASR model to support all 22 languages listed in the 8th Schedule of the Constitution of India. All the data, tools, guidelines, models and other materials developed as part of this work will be made publicly available.
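As a quick sanity check on the composition reported above, the per-type hour counts can be recovered from the stated total and percentages. The sketch below is purely illustrative arithmetic over the abstract's figures (the percentages are rounded, so the derived counts are approximate, not exact dataset statistics):

```python
# Approximate breakdown of the 7348 INDICVOICES hours by speech type,
# derived from the rounded percentages quoted in the abstract.
TOTAL_HOURS = 7348
SPLIT = {"read": 0.09, "extempore": 0.74, "conversational": 0.17}

for speech_type, fraction in SPLIT.items():
    print(f"{speech_type:>14}: ~{TOTAL_HOURS * fraction:,.0f} h")

# Transcribed subset: 1639 of 7348 hours, i.e. roughly 22% of the corpus,
# with a reported median of 73 transcribed hours per language.
print(f"   transcribed: {1639 / TOTAL_HOURS:.0%} of total")
```

Running this gives roughly 661 read, 5438 extempore and 1249 conversational hours, consistent with the abstract's emphasis on spontaneous speech.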
Authors: Tahir Javed, Janki Atul Nawale, Eldho Ittan George, Sakshi Joshi, Kaushal Santosh Bhogale, Deovrat Mehendale, Ishvinder Virender Sethi, Aparna Ananthanarayanan, Hafsah Faquih, Pratiti Palit, Sneha Ravishankar, Saranya Sukumaran, Tripura Panchagnula, Sunjay Murali, Kunal Sharad Gandhi, Ambujavalli R, Manickam K M, C Venkata Vaijayanthi, Krishnan Srinivasa Raghavan Karunganni, Pratyush Kumar