IndicIRSuite: Multilingual Dataset and Neural Information Models for Indian Languages (2312.09508v1)

Published 15 Dec 2023 in cs.IR and cs.CL

Abstract: In this paper, we introduce Neural Information Retrieval resources for 11 widely spoken Indian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu) from two major Indian language families (Indo-Aryan and Dravidian). These resources include (a) INDIC-MARCO, a multilingual version of the MS MARCO dataset in 11 Indian languages created using machine translation, and (b) Indic-ColBERT, a collection of 11 distinct monolingual Neural Information Retrieval models, each trained on one of the 11 languages in the INDIC-MARCO dataset. To the best of our knowledge, IndicIRSuite is the first attempt at building large-scale Neural Information Retrieval resources for a large number of Indian languages, and we hope that it will help accelerate research in Neural IR for Indian languages. Experiments demonstrate that Indic-ColBERT achieves a 47.47% improvement in MRR@10 averaged over the INDIC-MARCO baselines for 10 of the 11 Indian languages (all except Oriya), a 12.26% improvement in NDCG@10 averaged over the MIRACL Bengali and Hindi baselines, and a 20% improvement in MRR@100 over the Mr. TyDi Bengali baseline. IndicIRSuite is available at https://github.com/saifulhaq95/IndicIRSuite
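
The abstract reports gains in MRR@10, NDCG@10, and MRR@100. For readers unfamiliar with these ranking metrics, below is a minimal Python sketch of MRR@10 and NDCG@10; the function names and toy data are illustrative assumptions, not the paper's evaluation code, which likely relies on standard IR tooling.

```python
import math

def mrr_at_10(ranked_doc_ids, relevant_ids):
    """Reciprocal rank of the first relevant document in the top 10 (0 if none)."""
    for rank, doc_id in enumerate(ranked_doc_ids[:10], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_10(ranked_doc_ids, relevance):
    """NDCG over the top 10 results; `relevance` maps doc_id -> graded label."""
    dcg = sum(relevance.get(d, 0) / math.log2(rank + 1)
              for rank, d in enumerate(ranked_doc_ids[:10], start=1))
    ideal = sorted(relevance.values(), reverse=True)[:10]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: one query whose single relevant passage is ranked 4th.
ranking = ["d7", "d2", "d9", "d4", "d1"]
print(mrr_at_10(ranking, {"d4"}))      # 0.25
print(ndcg_at_10(ranking, {"d4": 1}))  # ~0.431
```

MRR@100 is the same reciprocal-rank computation with a cutoff of 100 instead of 10.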

References (22)
  1. mMARCO: A multilingual version of the MS MARCO passage ranking dataset. arXiv preprint arXiv:2108.13897 (2021).
  2. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672 (2022).
  3. IndicBART: A pre-trained model for Indic natural language generation. arXiv preprint arXiv:2109.02903 (2021).
  4. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  5. Cross-lingual knowledge transfer via distillation for multilingual information retrieval, 2023.
  6. Improving cross-lingual information retrieval on low-resource languages via optimal transport distillation. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining (feb 2023), ACM.
  7. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906 (2020).
  8. The OpenNMT neural machine translation toolkit: 2020 edition. In Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track) (2020), pp. 102–109.
  9. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics 7 (2019), 452–466.
  10. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291 (2019).
  11. Neural approaches to multilingual information retrieval. In European Conference on Information Retrieval (2023), Springer, pp. 521–536.
  12. Simple yet effective neural ranking and reranking baselines for cross-lingual information retrieval. arXiv preprint arXiv:2304.01019 (2023).
  13. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268 (2016).
  14. Overview of FIRE 2011. In Multilingual Information Access in South Asian Languages: Second International Workshop, FIRE 2010, Gandhinagar, India, February 19-21, 2010 and Third International Workshop, FIRE 2011, Bombay, India, December 2-4, 2011, Revised Selected Papers (2013), Springer, pp. 1–12.
  15. Samanantar: The largest publicly available parallel corpora collection for 11 Indic languages. Transactions of the Association for Computational Linguistics 10 (2022), 145–162.
  16. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 3, 4 (2009), 333–389.
  17. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. arXiv preprint arXiv:2112.01488 (2021).
  18. CLIRMatrix: A massively large collection of bilingual and multilingual datasets for cross-lingual information retrieval. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2020), pp. 4160–4170.
  19. BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663 (2021).
  20. Mr. TyDi: A multi-lingual benchmark for dense retrieval. arXiv preprint arXiv:2108.08787 (2021).
  21. Towards best practices for training multilingual dense retrieval models, 2022.
  22. Making a MIRACL: Multilingual information retrieval across a continuum of languages. arXiv preprint arXiv:2210.09984 (2022).
Authors (3)
  1. Saiful Haq
  2. Ashutosh Sharma
  3. Pushpak Bhattacharyya