Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers (2404.06976v1)
Abstract: Despite Portuguese being one of the most spoken languages in the world, there is a lack of high-quality information retrieval datasets in that language. We present Quati, a dataset specifically designed for Brazilian Portuguese. It comprises a collection of queries formulated by native speakers and a curated set of documents sourced from a selection of high-quality Brazilian Portuguese websites. These websites are more likely to be visited by real users than randomly scraped ones, yielding a more representative and relevant corpus. To label the query-document pairs, we use a state-of-the-art LLM, which in our assessments achieves inter-annotator agreement comparable to that of human annotators. We provide a detailed description of our annotation methodology so that others can create similar datasets for other languages, offering a cost-effective way to build high-quality IR datasets with an arbitrary number of labeled documents per query. Finally, we evaluate a diverse range of open-source and commercial retrievers to serve as baseline systems. Quati is publicly available at https://huggingface.co/datasets/unicamp-dl/quati and all scripts at https://github.com/unicamp-dl/quati .
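Since the dataset is hosted on the Hugging Face Hub, a minimal loading sketch may be useful. Only the repository id `unicamp-dl/quati` comes from the abstract; the configuration name, split, and field names below are assumptions, and the dataset card lists the exact ones.

```python
# Minimal sketch: loading Quati with the Hugging Face `datasets` library.
# The configuration name and split are assumptions; consult
# https://huggingface.co/datasets/unicamp-dl/quati for the exact names.
from datasets import load_dataset

passages = load_dataset("unicamp-dl/quati", "quati_1M_passages", split="train")

# Field names are assumptions as well; print one record to inspect the schema.
print(passages[0])
```

The abstract also describes labeling query-document pairs with an LLM. The snippet below is a generic, hypothetical illustration of that idea, not the authors' actual prompt, grading scale, or model choice; their real annotation scripts are at https://github.com/unicamp-dl/quati.

```python
# Hypothetical sketch of LLM-based relevance judgment: build a graded-relevance
# prompt for a query-passage pair and parse an integer score from the reply.
# The prompt wording and the 0-3 scale are illustrative assumptions.
import re
from typing import Optional

PROMPT_TEMPLATE = (
    "Rate how relevant the passage is to the query on a 0-3 scale\n"
    "(0 = irrelevant, 3 = perfectly relevant). Answer with a single digit.\n\n"
    "Query: {query}\nPassage: {passage}\nScore:"
)

def build_judgment_prompt(query: str, passage: str) -> str:
    """Format one query-passage pair into a relevance-judgment prompt."""
    return PROMPT_TEMPLATE.format(query=query, passage=passage)

def parse_score(llm_reply: str) -> Optional[int]:
    """Extract the first digit 0-3 from the model's reply, if any."""
    match = re.search(r"[0-3]", llm_reply)
    return int(match.group()) if match else None

# Example with a stand-in reply (no API call is made here):
prompt = build_judgment_prompt("capital do Brasil", "Brasília é a capital do Brasil.")
print(parse_score("3"))  # -> 3
```

One appeal of this setup, as the abstract notes, is that the labeling cost scales with the number of LLM calls rather than with human annotator time, so the number of judged documents per query can be chosen freely.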