
Quati: A Brazilian Portuguese Information Retrieval Dataset from Native Speakers (2404.06976v1)

Published 10 Apr 2024 in cs.IR

Abstract: Despite Portuguese being one of the most spoken languages in the world, there is a lack of high-quality information retrieval datasets in that language. We present Quati, a dataset specifically designed for the Brazilian Portuguese language. It comprises a collection of queries formulated by native speakers and a curated set of documents sourced from a selection of high-quality Brazilian Portuguese websites. These websites are more likely to be frequented by real users than randomly scraped ones, ensuring a more representative and relevant corpus. To label the query-document pairs, we use a state-of-the-art LLM, which shows inter-annotator agreement levels comparable to human performance in our assessments. We provide a detailed description of our annotation methodology to enable others to create similar datasets for other languages, providing a cost-effective way of creating high-quality IR datasets with an arbitrary number of labeled documents per query. Finally, we evaluate a diverse range of open-source and commercial retrievers to serve as baseline systems. Quati is publicly available at https://huggingface.co/datasets/unicamp-dl/quati and all scripts at https://github.com/unicamp-dl/quati.
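Since the dataset is hosted on the Hugging Face Hub, it should be loadable with the standard datasets library. The snippet below is a minimal sketch under that assumption; it discovers the published configuration names at runtime rather than hard-coding one, since they are not listed here.

```python
# Minimal sketch for loading Quati from the Hugging Face Hub.
# Assumes the `datasets` library; configuration names are discovered
# at runtime because they are not specified in this summary.
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("unicamp-dl/quati")
print(configs)  # inspect which configurations are published

# Load the first configuration as an example; pick the one you need.
ds = load_dataset("unicamp-dl/quati", configs[0])
print(ds)
```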


Summary

  • The paper introduces Quati, a dataset that strengthens Brazilian Portuguese IR through a semi-automated pipeline combining human and LLM input.
  • It builds on the ClueWeb22 corpus and uses two complementary query-generation strategies to ensure relevance and linguistic authenticity.
  • The novel annotation process using LLMs demonstrates a scalable, cost-effective strategy for high-quality dataset creation.

Quati: Unveiling a Brazilian Portuguese Dataset for Information Retrieval

Introduction

The creation of Quati, a dataset aimed at enhancing information retrieval (IR) systems for the Brazilian Portuguese language, addresses a significant gap in the domain. Quati distinguishes itself by focusing on queries and documents that are authentically Brazilian, in contrast to translated datasets or English-centric ones. This focus preserves the linguistic nuances and cultural relevance inherent to Brazilian Portuguese, yielding a more accurate and representative resource for IR system development.

Methodological Overview

The methodology employed in the creation of Quati is notable for its semi-automated approach, leveraging both human input and LLMs to curate a high-quality dataset. The dataset construction involved:

  • Data Collection and Preparation: Utilizing the ClueWeb22 corpus as a starting point, documents were filtered and segmented into passages. This process ensured that the dataset comprised content that was both contemporary and representative of a wide array of domains.
  • Query Generation: Two hundred queries were developed to reflect a broad spectrum of real-world information needs. These queries were split between those generated independently of the corpus to ensure variety and those derived from the corpus, guaranteeing at least one relevant document per query in the dataset.
  • Passage Retrieval: A diverse set of IR systems was used so that the passages retrieved for each query span a broad relevance spectrum. Pooling candidates from many retrievers both enriches the dataset and makes it a fairer testbed for evaluating retrieval systems.
  • Annotation through LLMs: The use of an LLM to annotate query-passage relevance offers a cost-effective route to dataset creation. Although agreement between the LLM and human annotators was slightly lower than human-human benchmarks, the approach scales to an arbitrary number of labeled documents per query at a fraction of the cost; a hypothetical sketch of such an annotation loop follows this list.
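
As a concrete illustration of the annotation step, the sketch below shows a hypothetical LLM grading loop. The prompt wording, the 0-3 grading scale, and the model name are all illustrative assumptions rather than the paper's actual setup, and the OpenAI Python client stands in for whichever LLM API is used.

```python
# Hypothetical sketch of LLM-based relevance annotation. The prompt,
# the 0-3 grading scale, and the model name are illustrative
# assumptions, not the paper's actual configuration.
from openai import OpenAI

client = OpenAI()

PROMPT = """You will judge how relevant a passage is to a query; both are in
Brazilian Portuguese. Answer with a single integer:
0 = irrelevant, 1 = on-topic but does not answer,
2 = partially answers, 3 = fully answers.

Query: {query}
Passage: {passage}
Grade:"""

def judge(query: str, passage: str, model: str = "gpt-4o") -> int:
    """Return an LLM-assigned relevance grade for one query-passage pair."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(query=query, passage=passage)}],
        temperature=0,  # deterministic output keeps annotation runs reproducible
    )
    return int(response.choices[0].message.content.strip())
```

Grades produced this way can be spot-checked against a human-annotated subset (e.g., with Cohen's kappa) to confirm that LLM-human agreement stays close to human-human levels, as the paper reports.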

Implications and Prospects

The Quati dataset stands as a significant contribution to the development of IR systems tailored to the Brazilian Portuguese language. The careful curatorial approach ensures the dataset's relevance and utility in both theoretical and practical realms of IR research. Specifically, Quati facilitates:

  • Cultural and Linguistic Specificity in IR: By focusing on native content and queries, Quati enables the development of IR systems that are finely tuned to the linguistic and cultural nuances of Brazilian Portuguese, a feature often lost in translation-based datasets.
  • Benchmarking and Evaluation: Offering a robust framework for evaluating the efficacy of various IR systems, Quati serves as a critical tool for benchmarking and advancing IR technologies in the context of Brazilian Portuguese (a minimal evaluation sketch follows this list).
  • Future Dataset Creation: The semi-automated methodology highlighted in Quati's development offers a blueprint for creating similar high-quality IR datasets for other languages, potentially revolutionizing the landscape for non-English IR research.
  • LLM Enhancement and Utilization: The use of LLMs for dataset annotation not only underscores the models' evolving capabilities in understanding and processing language but also points to future improvements and applications of LLMs in dataset creation and beyond.
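
To make the benchmarking point concrete, here is a minimal sketch that scores a retrieval run against Quati-style graded qrels with the pytrec_eval package. The qrels and run contents below are toy placeholders, not actual Quati data.

```python
# Minimal evaluation sketch using pytrec_eval; the qrels and run shown
# here are toy placeholders, not actual Quati data.
import pytrec_eval

# qrels: {query_id: {passage_id: graded_relevance}}
qrels = {"q1": {"p1": 3, "p2": 0, "p3": 2}}
# run: {query_id: {passage_id: retrieval_score}}
run = {"q1": {"p1": 11.2, "p2": 9.7, "p3": 4.1}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut"})
per_query = evaluator.evaluate(run)

# Average nDCG@10 over queries (trec_eval reports several cutoffs; 10 is one).
ndcg10 = sum(q["ndcg_cut_10"] for q in per_query.values()) / len(per_query)
print(f"nDCG@10: {ndcg10:.3f}")
```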

Conclusion

Quati emerges as a pivotal resource for the evolution of information retrieval systems within the context of Brazilian Portuguese. Its creation methodology, balancing human insight with the efficiency of LLMs, sets a precedent for future endeavors in dataset development across various languages. As IR systems continue to advance, the need for diverse, culturally specific datasets will only grow, making contributions like Quati both valuable and necessary. The potential for refining LLM utilization in dataset annotation and expanding the Quati dataset further underscores the ongoing relevance and utility of this work in the field of IR research.
