Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi (2206.12931v1)
Abstract: In this paper we discuss in-progress work on the development of a speech corpus for four low-resource Indo-Aryan languages (Awadhi, Bhojpuri, Braj and Magahi) using the field methods of linguistic data collection. The corpus currently totals approximately 18 hours (approx. 4-5 hours for each language) and is transcribed and annotated with grammatical information such as part-of-speech tags, morphological features and Universal Dependencies relations. We discuss our methodology for data collection in these languages, most of which was carried out in the middle of the COVID-19 pandemic, with one of the aims being to generate some additional income for low-income groups speaking these languages. We also discuss the results of baseline experiments for automatic speech recognition systems in these languages.
- Ritesh Kumar (42 papers)
- Siddharth Singh (42 papers)
- Shyam Ratan (5 papers)
- Mohit Raj (3 papers)
- Sonal Sinha (2 papers)
- Bornini Lahiri (5 papers)
- Vivek Seshadri (25 papers)
- Kalika Bali (27 papers)
- Atul Kr. Ojha (19 papers)