
LLMs Can Patch Up Missing Relevance Judgments in Evaluation (2405.04727v1)

Published 8 May 2024 in cs.IR

Abstract: Unjudged documents or holes in information retrieval benchmarks are considered non-relevant in evaluation, yielding no gains in measuring effectiveness. However, these missing judgments may inadvertently introduce biases into the evaluation as their prevalence for a retrieval model is heavily contingent on the pooling process. Thus, filling holes becomes crucial in ensuring reliable and accurate evaluation. Collecting human judgment for all documents is cumbersome and impractical. In this paper, we aim at leveraging LLMs to automatically label unjudged documents. Our goal is to instruct an LLM using detailed instructions to assign fine-grained relevance judgments to holes. To this end, we systematically simulate scenarios with varying degrees of holes by randomly dropping relevant documents from the relevance judgment in TREC DL tracks. Our experiments reveal a strong correlation between our LLM-based method and ground-truth relevance judgments. Based on our simulation experiments conducted on three TREC DL datasets, in the extreme scenario of retaining only 10% of judgments, our method achieves a Kendall tau correlation of 0.87 and 0.92 on an average for Vicuña-7B and GPT-3.5 Turbo respectively.

Exploring the Use of LLMs for Filling "Holes" in Information Retrieval Benchmarks

Background on Information Retrieval Evaluation

Information Retrieval (IR) focuses on retrieving relevant information from large document collections. Benchmarks typically evaluate a retrieval model by running a set of queries against a text corpus and measuring how well the model surfaces relevant documents. Because corpora have grown far too large for assessors to judge every document, benchmarks inevitably contain unjudged documents, or "holes". Standard evaluation treats these holes as non-relevant, which can bias scores, particularly against models whose results were not part of the original judging pool; a small illustration of this default-to-non-relevant behavior follows.
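
To make the default concrete, here is a minimal, illustrative Python sketch of an nDCG-style gain computation in which an unjudged document contributes nothing. The grades, document IDs, and the `dcg` helper are made up for this example and are not taken from the paper:

```python
import math

# Illustrative only: grades and document IDs are made up.
qrels = {"d1": 3, "d2": 1}      # judged documents for one query
ranking = ["d1", "d7", "d2"]    # d7 was retrieved but never judged (a "hole")

def dcg(ranked_ids, judgments, k=10):
    """Discounted cumulative gain; unjudged documents default to grade 0."""
    return sum(
        (2 ** judgments.get(doc, 0) - 1) / math.log2(rank + 2)
        for rank, doc in enumerate(ranked_ids[:k])
    )

print(round(dcg(ranking, qrels), 3))  # d7 adds no gain, even if it is actually relevant
```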

The Problem of Holes

The core problem is that traditional evaluation cannot feasibly collect judgments at the scale of modern collections. Recent strategies therefore turn to automated methods to plug these gaps and preserve fairness and accuracy in system evaluations. Such methods matter because holes can misrepresent the capabilities of newer or architecturally different retrieval models, skewing results towards familiar, well-pooled systems and potentially stifling innovation and progress in IR.

The Paper's Approach: Using LLMs to Fill Holes

The paper presents a methodology for using LLMs to automatically assign relevance judgments to unjudged documents in IR benchmarks. The approach relies on the instruction-following abilities of models such as Vicuña-7B and GPT-3.5 Turbo to interpret a passage and grade its relevance with respect to a specific query.

Core Methodology:

  • Scenario Simulation: Varying degrees of holes are simulated by randomly dropping relevant documents from the TREC DL relevance judgments, creating a controlled and stringent test environment.
  • Instructing LLMs: The LLMs are prompted with detailed guidance and examples for assigning fine-grained relevance levels (from completely irrelevant to perfectly relevant); a minimal sketch of this setup follows the list.
  • Validation: The LLM-generated judgments are compared against the original "ground-truth" judgments to measure agreement.
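
As a rough illustration of the first two steps, the following Python sketch simulates holes by dropping a fraction of the relevant judgments and then asks an LLM to grade each dropped query-document pair. The prompt wording and the `call_llm` helper are placeholders, not the paper's exact prompt or API:

```python
import random

def drop_judgments(qrels, keep_fraction, seed=42):
    """Keep only `keep_fraction` of the relevant judgments per query.

    `qrels` maps query_id -> {doc_id: grade}; dropped relevant documents
    become "holes" to be re-graded by the LLM.
    """
    rng = random.Random(seed)
    reduced, holes = {}, []
    for qid, docs in qrels.items():
        relevant = [d for d, grade in docs.items() if grade > 0]
        k = max(1, int(len(relevant) * keep_fraction)) if relevant else 0
        kept = set(rng.sample(relevant, k))
        reduced[qid] = {d: g for d, g in docs.items() if g == 0 or d in kept}
        holes.extend((qid, d) for d in relevant if d not in kept)
    return reduced, holes

def grade_hole(query_text, doc_text, call_llm):
    """Ask an LLM for a graded judgment on a 0-3 scale (TREC DL style).

    `call_llm` stands in for whatever client sends the prompt to
    Vicuña-7B, GPT-3.5 Turbo, etc., and returns its text reply.
    """
    prompt = (
        "Judge the relevance of the passage to the query on a scale from "
        "0 (completely irrelevant) to 3 (perfectly relevant). "
        "Answer with a single digit.\n"
        f"Query: {query_text}\nPassage: {doc_text}\nGrade:"
    )
    reply = call_llm(prompt)
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 0  # default to non-relevant if unparsable
```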

Experimental Insights and Results

The experiments demonstrated not only the feasibility of this approach but also its robustness. In particular:

  • In the most extreme simulation, where only 10% of human judgments were retained, the method achieved average Kendall τ correlations of 0.87 with Vicuña-7B and 0.92 with GPT-3.5 Turbo, reflecting strong agreement with rankings derived from full human assessments.
  • These results indicate that both open-source and proprietary LLMs can extrapolate from limited judgments to produce reliable relevance labels; a sketch of how such a rank correlation is computed follows this list.
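
The Kendall τ here measures how closely the ranking of retrieval systems under LLM-patched judgments matches the ranking under the original judgments. A minimal sketch using SciPy, with made-up system names and effectiveness scores, looks like this:

```python
from scipy.stats import kendalltau

# Hypothetical scores (e.g., nDCG@10) per system under each qrels variant;
# the system names and numbers are invented for illustration.
scores_full = {"bm25": 0.48, "dense_a": 0.61, "dense_b": 0.57, "rerank": 0.66}
scores_patched = {"bm25": 0.47, "dense_a": 0.60, "dense_b": 0.58, "rerank": 0.65}

systems = sorted(scores_full)
tau, p_value = kendalltau(
    [scores_full[s] for s in systems],
    [scores_patched[s] for s in systems],
)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```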

Implications and Future Directions

Practical Implications:

This method could dramatically reduce the human labor and time required in generating IR benchmarks, leading to more frequent updates and potentially more accurate and fair evaluations of retrieval systems.

Theoretical Implications:

The success of LLMs in this arena supports the hypothesis that they can comprehend and process complex, abstract instructions in specialized domains, adapting their general linguistic capabilities to specific tasks.

Speculation on Future AI Developments:

  • Automated Evaluation Systems: Fully automated systems could become standard for initial evaluations in IR benchmarks.
  • Expansion Across Fields: This methodology has potential applications in other areas requiring content evaluation, like content moderation or recommendation systems.
  • Improving LLM Training: Future research might focus on refining LLM training methodologies to enhance their sensitivity to the nuances of relevance judgment without direct human input.

Conclusion

The use of LLMs to fill evaluation holes in IR benchmarks represents a significant step towards more scalable, accurate, and unbiased evaluation. By effectively harnessing the capabilities of LLMs, researchers can help ensure that reported advances reflect a model's genuine capacity to retrieve relevant information, not just its ability to exploit the quirks of a particular benchmark.

Authors (3)
  1. Shivani Upadhyay
  2. Ehsan Kamalloo
  3. Jimmy Lin