
Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers (2404.06976v1)

Published 10 Apr 2024 in cs.IR

Abstract: Despite Portuguese being one of the most spoken languages in the world, there is a lack of high-quality information retrieval datasets in that language. We present Quati, a dataset specifically designed for the Brazilian Portuguese language. It comprises a collection of queries formulated by native speakers and a curated set of documents sourced from a selection of high-quality Brazilian Portuguese websites. These websites are more likely to be frequented by real users than randomly scraped ones, ensuring a more representative and relevant corpus. To label the query-document pairs, we use a state-of-the-art LLM, which shows inter-annotator agreement levels comparable to human performance in our assessments. We provide a detailed description of our annotation methodology to enable others to create similar datasets for other languages, providing a cost-effective way of creating high-quality IR datasets with an arbitrary number of labeled documents per query. Finally, we evaluate a diverse range of open-source and commercial retrievers to serve as baseline systems. Quati is publicly available at https://huggingface.co/datasets/unicamp-dl/quati and all scripts at https://github.com/unicamp-dl/quati.
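Since the dataset is hosted on the Hugging Face Hub, it should be loadable with the standard datasets library. The snippet below is a minimal sketch under that assumption; it discovers the published configuration names at runtime rather than hard-coding one, since they are not listed here.

```python
# Minimal sketch for loading Quati from the Hugging Face Hub.
# Assumes the `datasets` library; configuration names are discovered
# at runtime because they are not specified in this summary.
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("unicamp-dl/quati")
print(configs)  # inspect which configurations are published

# Load the first configuration as an example; pick the one you need.
ds = load_dataset("unicamp-dl/quati", configs[0])
print(ds)
```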


Summary

  • The paper introduces Quati, a dataset that strengthens Brazilian Portuguese IR through a semi-automated pipeline combining human and LLM input.
  • It builds on the ClueWeb22 corpus and uses two complementary query-generation strategies to ensure relevance and linguistic authenticity.
  • The novel annotation process using LLMs demonstrates a scalable, cost-effective strategy for high-quality dataset creation.

Quati: Unveiling a Brazilian Portuguese Dataset for Information Retrieval

Introduction

The creation of Quati, a dataset aimed at enhancing information retrieval (IR) systems for the Brazilian Portuguese language, addresses a significant gap in the domain. Quati distinguishes itself by focusing on queries and documents that are authentically Brazilian, in contrast to translated datasets or English-centric ones. This focus preserves the linguistic nuances and cultural relevance inherent to Brazilian Portuguese, yielding a more accurate and representative resource for IR system development.

Methodological Overview

The methodology employed in the creation of Quati is notable for its semi-automated approach, leveraging both human input and LLMs to curate a high-quality dataset. The dataset construction involved:

  • Data Collection and Preparation: Utilizing the ClueWeb22 corpus as a starting point, documents were filtered and segmented into passages. This process ensured that the dataset comprised content that was both contemporary and representative of a wide array of domains.
  • Query Generation: Two hundred queries were developed to reflect a broad spectrum of real-world information needs. These queries were split between those generated independently of the corpus to ensure variety and those derived from the corpus, guaranteeing at least one relevant document per query in the dataset.
  • Passage Retrieval: A diverse set of IR systems was used so that the passages retrieved for each query span a broad relevance spectrum. Pooling candidates from many retrievers both enriches the dataset and makes it a fairer testbed for evaluating retrieval systems.
  • Annotation through LLMs: The use of an LLM to annotate query-passage relevance offers a cost-effective route to dataset creation. Although agreement between the LLM and human annotators was slightly lower than human-human benchmarks, the approach scales to an arbitrary number of labeled documents per query at a fraction of the cost; a hypothetical sketch of such an annotation loop follows this list.
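
As a concrete illustration of the annotation step, the sketch below shows a hypothetical LLM grading loop. The prompt wording, the 0-3 grading scale, and the model name are all illustrative assumptions rather than the paper's actual setup, and the OpenAI Python client stands in for whichever LLM API is used.

```python
# Hypothetical sketch of LLM-based relevance annotation. The prompt,
# the 0-3 grading scale, and the model name are illustrative
# assumptions, not the paper's actual configuration.
from openai import OpenAI

client = OpenAI()

PROMPT = """You will judge how relevant a passage is to a query; both are in
Brazilian Portuguese. Answer with a single integer:
0 = irrelevant, 1 = on-topic but does not answer,
2 = partially answers, 3 = fully answers.

Query: {query}
Passage: {passage}
Grade:"""

def judge(query: str, passage: str, model: str = "gpt-4o") -> int:
    """Return an LLM-assigned relevance grade for one query-passage pair."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(query=query, passage=passage)}],
        temperature=0,  # deterministic output keeps annotation runs reproducible
    )
    return int(response.choices[0].message.content.strip())
```

Grades produced this way can be spot-checked against a human-annotated subset (e.g., with Cohen's kappa) to confirm that LLM-human agreement stays close to human-human levels, as the paper reports.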

Implications and Prospects

The Quati dataset stands as a significant contribution to the development of IR systems tailored to the Brazilian Portuguese language. The careful curatorial approach ensures the dataset's relevance and utility in both theoretical and practical realms of IR research. Specifically, Quati facilitates:

  • Cultural and Linguistic Specificity in IR: By focusing on native content and queries, Quati enables the development of IR systems that are finely tuned to the linguistic and cultural nuances of Brazilian Portuguese, a feature often lost in translation-based datasets.
  • Benchmarking and Evaluation: Offering a robust framework for evaluating the efficacy of various IR systems, Quati serves as a critical tool for benchmarking and advancing IR technologies in the context of Brazilian Portuguese (a minimal evaluation sketch follows this list).
  • Future Dataset Creation: The semi-automated methodology highlighted in Quati's development offers a blueprint for creating similar high-quality IR datasets for other languages, potentially revolutionizing the landscape for non-English IR research.
  • LLM Enhancement and Utilization: The use of LLMs for dataset annotation not only underscores the models' evolving capabilities in understanding and processing language but also points to future improvements and applications of LLMs in dataset creation and beyond.
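
To make the benchmarking point concrete, here is a minimal sketch that scores a retrieval run against Quati-style graded qrels with the pytrec_eval package. The qrels and run contents below are toy placeholders, not actual Quati data.

```python
# Minimal evaluation sketch using pytrec_eval; the qrels and run shown
# here are toy placeholders, not actual Quati data.
import pytrec_eval

# qrels: {query_id: {passage_id: graded_relevance}}
qrels = {"q1": {"p1": 3, "p2": 0, "p3": 2}}
# run: {query_id: {passage_id: retrieval_score}}
run = {"q1": {"p1": 11.2, "p2": 9.7, "p3": 4.1}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut"})
per_query = evaluator.evaluate(run)

# Average nDCG@10 over queries (trec_eval reports several cutoffs; 10 is one).
ndcg10 = sum(q["ndcg_cut_10"] for q in per_query.values()) / len(per_query)
print(f"nDCG@10: {ndcg10:.3f}")
```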

Conclusion

Quati emerges as a pivotal resource for the evolution of information retrieval systems within the context of Brazilian Portuguese. Its creation methodology, balancing human insight with the efficiency of LLMs, sets a precedent for future endeavors in dataset development across various languages. As IR systems continue to advance, the need for diverse, culturally specific datasets will only grow, making contributions like Quati both valuable and necessary. The potential for refining LLM utilization in dataset annotation and expanding the Quati dataset further underscores the ongoing relevance and utility of this work in the field of IR research.
