Synthetic Test Collections for Retrieval Evaluation
Introduction
The paper examines whether synthetic test collections built with large language models (LLMs) can be used to evaluate Information Retrieval (IR) systems. Traditionally, constructing a test collection requires considerable human effort: queries must be written and relevance judgments collected manually, which limits the scalability and feasibility of the process, especially outside large organizations with access to rich user query logs.
The authors set out to determine whether LLMs can be used to automatically generate both the queries and their corresponding relevance judgments. If successful, this would make it possible to build large, diverse test collections with minimal manual effort.
Utilizing LLMs for Query and Judgment Generation
Synthetic Query Creation:
The paper outlines a method for generating queries synthetically, a crucial step in building reliable test collections. The process breaks down as follows (a minimal sketch of the pipeline appears after the list):
- Passage Sampling: The process begins by sampling passages from a corpus; each sampled passage acts as a seed for generating a search query.
- Query Generation: LLMs such as GPT-4 and T5 generate queries from the sampled seed passages. The two models produce queries of different average lengths; GPT-4 was observed to generate notably longer queries.
- Expert Review: After generation, experts filter the queries so that only the most reasonable ones are retained, combining automated generation with human oversight.
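As a concrete illustration of this pipeline, the following Python sketch wires the three steps together. It is a minimal sketch: the prompt text, the `corpus` dictionary, the `llm_generate` callable (standing in for a GPT-4 or T5 call), and the `keep` predicate (standing in for expert review) are illustrative placeholders, not the paper's actual prompts or tooling.

```python
import random
from typing import Callable, Dict, List

# Hypothetical prompt; the paper's exact prompt wording may differ.
QUERY_PROMPT = (
    "Generate a search query for which the following passage would be a "
    "relevant answer.\n\nPassage: {passage}\n\nQuery:"
)

def sample_seed_passages(corpus: Dict[str, str], k: int, seed: int = 0) -> List[str]:
    """Randomly sample passage ids to act as seeds for query generation."""
    rng = random.Random(seed)
    return rng.sample(list(corpus.keys()), k)

def generate_queries(
    corpus: Dict[str, str],
    passage_ids: List[str],
    llm_generate: Callable[[str], str],  # stand-in for the LLM API call
) -> Dict[str, str]:
    """Ask the LLM for one query per seed passage; returns passage_id -> query."""
    queries = {}
    for pid in passage_ids:
        prompt = QUERY_PROMPT.format(passage=corpus[pid])
        queries[pid] = llm_generate(prompt).strip()
    return queries

def filter_queries(queries: Dict[str, str], keep: Callable[[str], bool]) -> Dict[str, str]:
    """Placeholder for the expert-review step: keep only queries judged reasonable."""
    return {pid: q for pid, q in queries.items() if keep(q)}
```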
Synthetic Relevance Judgment:
Besides creating queries, the paper investigates synthetic generation of relevance judgments:
- Sparse Judgments Method: The paper first tests a straightforward approach in which only the passage used to generate each query is assumed to be relevant, yielding sparse judgments.
- LLMs in Relevance Judging: More importantly, the authors employ GPT-4 to generate relevance judgments over a large pool of passages and compare these labels against human judgments. Although GPT-4 does not always assign the same relevance grade as human assessors, the overall agreement suggests real potential for LLMs in this role (a sketch of both judgment strategies follows this list).
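The two judgment strategies can be sketched as follows. The sparse variant simply marks each seed passage as the only relevant passage for its query; the LLM variant asks a judge model for a graded label per query-passage pair. The `llm_judge` callable, the 0-3 grading scale, and the prompt wording are illustrative assumptions, not the paper's exact setup.

```python
from typing import Callable, Dict

def sparse_qrels(seed_passages: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Sparse judgments: only the passage that seeded each query is marked relevant.
    `seed_passages` maps query_id -> passage_id."""
    return {qid: {pid: 1} for qid, pid in seed_passages.items()}

# Hypothetical judging prompt; not the paper's exact prompt.
JUDGE_PROMPT = (
    "Given a query and a passage, rate how relevant the passage is to the query "
    "on a scale from 0 (not relevant) to 3 (perfectly relevant).\n\n"
    "Query: {query}\nPassage: {passage}\nRelevance (0-3):"
)

def llm_qrels(
    queries: Dict[str, str],
    candidates: Dict[str, Dict[str, str]],    # query_id -> {passage_id: passage_text}
    llm_judge: Callable[[str], str],          # stand-in for a GPT-4 call
) -> Dict[str, Dict[str, int]]:
    """LLM-based judgments over a pooled candidate set for each query."""
    qrels: Dict[str, Dict[str, int]] = {}
    for qid, query in queries.items():
        qrels[qid] = {}
        for pid, passage in candidates.get(qid, {}).items():
            raw = llm_judge(JUDGE_PROMPT.format(query=query, passage=passage))
            # Take the first digit in the response as the grade, defaulting to 0.
            digits = [c for c in raw if c.isdigit()]
            qrels[qid][pid] = int(digits[0]) if digits else 0
    return qrels
```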
Experimental Insights and Results
The paper reports experimental results on two fronts:
- Comparison with Traditional Methods: The synthetic test collections, when applied in actual IR system evaluations, delivered results comparable to those obtained from traditionally crafted test collections, indicating their potential reliability.
- System Performance Analysis: The relative effectiveness of IR systems evaluated on the real and synthetic test collections agreed closely, i.e., the two collections ranked systems in largely the same order. This supports the feasibility of using synthetic collections for realistic IR system evaluation (a sketch of such a rank-agreement check follows this list).
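A standard way to quantify this kind of agreement is to correlate the system rankings produced by the two collections, for example with Kendall's tau. The sketch below does this for a handful of hypothetical systems; the scores are invented for illustration and are not results from the paper.

```python
from scipy.stats import kendalltau

# Hypothetical per-system effectiveness scores (e.g. NDCG@10) measured once with
# human judgments and once with the synthetic collection; the numbers are made up.
human_scores = {"systemA": 0.52, "systemB": 0.48, "systemC": 0.61, "systemD": 0.44}
synthetic_scores = {"systemA": 0.50, "systemB": 0.47, "systemC": 0.59, "systemD": 0.45}

systems = sorted(human_scores)
tau, p_value = kendalltau(
    [human_scores[s] for s in systems],
    [synthetic_scores[s] for s in systems],
)
print(f"Kendall's tau between system rankings: {tau:.3f} (p={p_value:.3f})")
```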
Addressing Bias Concerns
A vital question explored in the research is whether synthetic test collections are biased towards specific models, particularly retrieval systems built on the same LLM used to generate the collection. The analysis showed minimal to no preferential treatment of systems based on the same LLM, easing concerns that synthetic test collections might unfairly advantage certain retrieval models.
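One simple way to probe for such bias (not necessarily the exact analysis the authors performed) is to compare how much each system's score shifts when moving from human to synthetic judgments, grouped by whether the system is built on the same LLM family as the collection generator. The function below sketches that comparison; the score dictionaries and the `gpt4_based_systems` grouping are hypothetical.

```python
from statistics import mean
from typing import Dict, Set

def bias_check(
    human_scores: Dict[str, float],
    synthetic_scores: Dict[str, float],
    gpt4_based_systems: Set[str],
) -> Dict[str, float]:
    """Compare the average score shift (synthetic minus human) for systems built on
    the same LLM family as the collection generator versus all other systems.
    A large positive gap for the first group would hint at preferential bias."""
    deltas = {s: synthetic_scores[s] - human_scores[s] for s in human_scores}
    same = [d for s, d in deltas.items() if s in gpt4_based_systems]
    other = [d for s, d in deltas.items() if s not in gpt4_based_systems]
    return {
        "mean_delta_same_llm": mean(same) if same else float("nan"),
        "mean_delta_other": mean(other) if other else float("nan"),
    }
```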
Future Directions and Conclusion
Given these promising results, the paper suggests that further refinements and experiments could improve the reliability and applicability of synthetic test collections. Future research could explore different LLM configurations and more advanced prompting strategies to optimize both the generation process and the quality of the resulting test collections.
Additionally, an ongoing evaluation of potential biases and the development of mitigation strategies are recommended to ensure the fairness and utility of this innovative approach in broader IR system evaluation contexts.
In summary, the paper opens up a new avenue in IR research, providing a scalable alternative to manual test collection construction with a noteworthy degree of effectiveness and efficiency.