Overview of One-Shot Labeling for Automatic Relevance Estimation
The paper "One-Shot Labeling for Automatic Relevance Estimation" by MacAvaney and Soldaini addresses a significant challenge in the evaluation of information retrieval systems—dealing with unjudged documents, commonly referred to as "holes" in relevance assessments. This challenge arises especially in offline experiments where the costs associated with fully judged test collections are often infeasible. The authors investigate whether LLMs can effectively fill these gaps, focusing on an extreme evaluation setting where only a single known relevant document per query is available.
Problem Context
In traditional information retrieval experiments, test collections are created by judging selected documents for relevance with respect to a set of queries. These judgments are often incomplete because of the sheer volume of documents that would need to be assessed, leading to biases and inaccuracies in evaluation results. Common strategies such as shallow pooling leave many retrieved documents unjudged, which skews evaluation and limits the reusability of the collections for new systems that retrieve outside the judged pool. The authors propose an alternative: use models to predict relevance for the unjudged documents, aiming to improve both the reliability and the efficiency of offline evaluation.
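To make the "holes" problem concrete, consider how standard evaluation scores a ranking that contains unjudged documents: anything absent from the qrels is silently treated as non-relevant. The toy example below (made-up document IDs and judgments) shows how this penalizes a system that retrieves outside the judged pool.

```python
# Toy illustration (fabricated data) of the "holes" problem: standard
# evaluation treats unjudged documents as non-relevant, so a system that
# retrieves documents outside the judged pool is penalized by default.
qrels = {"d1": 1, "d2": 0, "d3": 1}        # judged pool for one query
ranking = ["d1", "d9", "d3", "d7", "d2"]   # d9 and d7 were never judged

def precision_at_k(ranking, qrels, k):
    # Unjudged documents fall through qrels.get(..., 0) and count as non-relevant.
    return sum(qrels.get(doc, 0) for doc in ranking[:k]) / k

print(precision_at_k(ranking, qrels, 5))   # 0.4, even if d9 and d7 are relevant
```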
Proposed Methods
The paper explores several "one-shot labeler" (1SL) methods for predicting relevance given a single known relevant document (a sketch of the prompting approach follows the list):
- MaxRep: Identifies the k nearest neighbors of the known relevant document under both lexical (BM25) and semantic (TCT-ColBERT) similarity and treats them as relevant, with gain degrading linearly by rank.
- DuoT5: A sequence-to-sequence model trained to judge the relative relevance of two documents to a query, adapted here for one-shot relevance estimation by comparing each unjudged document against the known relevant one.
- DuoPrompt: Prompts an instruction-tuned model such as Flan-T5 with a formulated instruction to perform the same pairwise relevance estimation directly, without task-specific fine-tuning.
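As a concrete illustration of the prompting approach, the sketch below scores a candidate passage against the single known relevant passage with Flan-T5. The prompt wording and the yes/no scoring scheme here are assumptions for illustration; the paper's exact template and aggregation may differ.

```python
# A minimal sketch of a DuoPrompt-style one-shot labeler using Flan-T5.
# The prompt text below is hypothetical; the paper's template may differ.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base").eval()

def one_shot_relevance(query: str, known_relevant: str, candidate: str) -> float:
    """Estimate P(candidate is relevant) given one known relevant passage."""
    prompt = (
        f"Query: {query}\n"
        f"Passage A (relevant): {known_relevant}\n"
        f"Passage B: {candidate}\n"
        "Is Passage B also relevant to the query? Answer yes or no."
    )
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    # T5 decoding starts from the decoder start token; score the first output step.
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
    yes_id = tokenizer("yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("no", add_special_tokens=False).input_ids[0]
    # Softmax over the two answer tokens yields a soft relevance estimate.
    probs = torch.softmax(logits[[yes_id, no_id]], dim=0)
    return probs[0].item()
```

The soft score can either be thresholded into a binary label or used directly as a graded gain value, depending on the evaluation measure.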
Results
Empirical evaluations on the TREC Deep Learning Track datasets from 2019 to 2021 showed that the proposed methods produce system rankings that correlate highly with those derived from the full human judgments. Notably, DuoPrompt performed robustly across recall-agnostic measures such as Precision and RBP, with system ranking correlations regularly surpassing 0.86. The methods were less reliable for recall-oriented measures, however, which would require exhaustive relevance estimation over the corpus. The one-shot labels also yielded more reliable statistical significance tests by mitigating the biases that arise from incomplete judgments.
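The meta-evaluation behind these correlation numbers can be reproduced in a few lines: score every system twice, once with the full qrels and once with one-shot labels filling the holes, then compare the two orderings. The per-system scores below are fabricated for illustration.

```python
# Sketch of the rank-correlation meta-evaluation: compare system orderings
# under full human judgments vs. one-shot labels using Kendall's tau.
from scipy.stats import kendalltau

# Hypothetical per-system scores (e.g., P@10) under each labeling condition.
full_qrels_scores = {"sysA": 0.61, "sysB": 0.55, "sysC": 0.48, "sysD": 0.40}
one_shot_scores   = {"sysA": 0.58, "sysB": 0.56, "sysC": 0.45, "sysD": 0.41}

systems = sorted(full_qrels_scores)
tau, p_value = kendalltau(
    [full_qrels_scores[s] for s in systems],
    [one_shot_scores[s] for s in systems],
)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```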
Implications and Future Directions
The introduction of one-shot labeling techniques presents an opportunity to reduce dependence on expensive manual labeling by leveraging the predictive power of LLMs. The approach is likely to be particularly beneficial for precision-oriented evaluation measures, offering a practical alternative in settings where human-assessed relevance labels are scarce.
Future work could address several open questions identified in the paper: first, extending these methods from passage ranking to document retrieval tasks, where longer texts strain model context windows; second, handling multi-grade relevance assessments and improving aggregation for queries with multiple known relevant documents; and third, extending the approach to recall-sensitive measures without introducing new biases.
In conclusion, the paper by MacAvaney and Soldaini demonstrates the potential of one-shot labeling to fill relevance assessment holes and thereby strengthen the evaluation of information retrieval systems. The work is a step toward managing evaluation costs while keeping retrieval system assessments reliable.