
Breaking Language Barriers: A Question Answering Dataset for Hindi and Marathi (2308.09862v3)

Published 19 Aug 2023 in cs.CL

Abstract: Recent advances in deep learning have produced highly sophisticated systems with an unquenchable appetite for data. Building good deep-learning models for low-resource languages, however, remains challenging. This paper focuses on developing a Question Answering dataset for two such languages: Hindi and Marathi. Although Hindi is the 3rd most spoken language worldwide, with 345 million speakers, and Marathi is the 11th most spoken language globally, with 83.2 million speakers, both languages lack the resources needed to build effective Question Answering systems. To tackle this data scarcity, we develop a novel approach for translating the SQuAD 2.0 dataset into Hindi and Marathi, and we release the largest Question Answering datasets available for these languages, each containing 28,000 samples. We evaluate the datasets on various architectures and release the best-performing models for both Hindi and Marathi to facilitate further research in these languages. By leveraging similarity tools, our method has the potential to create datasets in diverse languages, thereby enhancing the understanding of natural language across varied linguistic contexts. Our fine-tuned models, code, and dataset will be made publicly available.
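The abstract mentions "leveraging similarity tools" to translate SQuAD 2.0: after machine-translating a context and its answer, the answer span must be re-located inside the translated context, since translation rarely preserves exact character offsets. A minimal sketch of such similarity-based answer alignment is shown below. This is a hypothetical illustration, not the authors' actual pipeline; the sliding-window widths, the character-level `SequenceMatcher` score, and the 0.6 threshold are all illustrative assumptions.

```python
# Hypothetical similarity-based answer alignment (not the paper's exact method).
# Given a machine-translated context and answer, find the span in the context
# most similar to the answer, so it can serve as the gold span in the new dataset.
from difflib import SequenceMatcher


def align_answer(context: str, answer: str, threshold: float = 0.6):
    """Return (start_char, span_text) for the best-matching candidate span,
    or None if no span clears the similarity threshold (illustrative value)."""
    tokens = context.split()
    # Character offset where each token starts, for reporting span positions.
    offsets, pos = [], 0
    for tok in tokens:
        pos = context.find(tok, pos)
        offsets.append(pos)
        pos += len(tok)

    n = max(len(answer.split()), 1)
    best_score, best = 0.0, None
    # Slide windows of roughly the answer's token length over the context;
    # translation can merge or split words, so try nearby widths too.
    for width in (max(n - 1, 1), n, n + 1):
        for i in range(len(tokens) - width + 1):
            span = " ".join(tokens[i:i + width])
            score = SequenceMatcher(None, span, answer).ratio()
            if score > best_score:
                best_score, best = score, (offsets[i], span)
    return best if best_score >= threshold else None
```

The same routine works regardless of script, so it could be applied to Devanagari text after translation; a real pipeline would likely add normalization and a tuned threshold.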

