Overview of "UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor"
Introduction
The paper "UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor" introduces an open-source toolkit designed to automate the relevance assessment process traditionally conducted by human assessors. Leveraging the capabilities of LLMs, specifically OpenAI's GPT-4o, the paper not only replicates the experimentation by Thomas et al. from Microsoft Bing but also provides comprehensive validation and enhanced features. This research highlights the potential of LLMs in understanding search intent and labeling relevance, ultimately offering a cost-efficient alternative for manual relevance assessment in retrieval systems.
Methodology
The authors closely follow the zero-shot DNA (Descriptive, Narrative, and Aspects) prompting technique described by Thomas et al. The toolkit, named UMBRELA (a recursive acronym: UMbrela is the Bing RELevance Assessor), takes a query and a set of passages as input and outputs relevance labels. It integrates easily into existing retrieval and evaluation pipelines, streamlining relevance assessment tasks. To validate the efficacy of the LLM-derived relevance judgments, the paper leverages qrels from the TREC Deep Learning Tracks from 2019 to 2023.
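To make the workflow concrete, here is a minimal Python sketch of zero-shot, DNA-style relevance prompting. It is an illustrative approximation only: the prompt wording, the judge_relevance helper, and the use of the OpenAI chat-completions client are assumptions for this example, not UMBRELA's actual prompt or API.

```python
# Minimal sketch of zero-shot, DNA-style relevance labeling with an LLM.
# Hypothetical example; NOT UMBRELA's actual prompt or API.
import re

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = """Given a query and a passage, judge how well the passage
answers the query on a 0-3 scale:
0 = the passage has nothing to do with the query
1 = the passage is related to the query but does not answer it
2 = the passage partially answers the query, mixed with extraneous content
3 = the passage is dedicated to the query and contains the exact answer

Query: {query}
Passage: {passage}

Reply with a single integer (0, 1, 2, or 3)."""


def judge_relevance(query: str, passage: str, model: str = "gpt-4o") -> int:
    """Return an LLM-assigned relevance grade for one (query, passage) pair."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep labeling as deterministic as the API allows
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(query=query, passage=passage),
        }],
    )
    # Extract the first digit in case the model adds extra text.
    match = re.search(r"[0-3]", response.choices[0].message.content)
    return int(match.group()) if match else 0
```

Because the function operates on a single (query, passage) pair and returns an integer grade, it can be dropped into an existing evaluation pipeline wherever human qrels would otherwise be consulted.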
Experimental Setup
To ensure robustness, the authors ran experiments with the GPT-4o model via Microsoft Azure, using a consistent set of parameters that conform to the original Bing paper's guidelines, including temperature settings and penalty configurations. The experiments covered both four-scale and binary relevance labels, re-assessing the human qrels with UMBRELA across all datasets.
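As a rough illustration of how four-scale labels relate to binary ones, the sketch below applies the common TREC DL convention of treating grades of 2 or higher as relevant. The binarize_qrels helper, the data layout, and the default threshold are assumptions for this example; the paper's exact mapping is described in its experimental setup.

```python
# Sketch: deriving binary labels from four-point graded qrels (grades 0-3).
# The >= 2 threshold follows the common TREC DL binarization convention;
# helper name and data layout are illustrative, not taken from the paper.
from typing import Dict, Tuple

QrelKey = Tuple[str, str]  # (query_id, passage_id)


def binarize_qrels(graded: Dict[QrelKey, int], threshold: int = 2) -> Dict[QrelKey, int]:
    """Map graded relevance labels (0-3) to binary labels (0/1)."""
    return {key: int(grade >= threshold) for key, grade in graded.items()}


graded = {("q1", "d1"): 3, ("q1", "d2"): 1, ("q2", "d3"): 2}
print(binarize_qrels(graded))  # {('q1', 'd1'): 1, ('q1', 'd2'): 0, ('q2', 'd3'): 1}
```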
Results
The experimental findings reveal that LLMs can indeed serve as reliable relevance assessors. Cohen's κ scores indicated fair agreement between human and LLM-generated labels, with higher consistency observed for binary assessments. Confusion matrices showed that the LLM assigned non-relevant labels correctly about 75% of the time and attained varying degrees of precision for the other relevance levels. The paper also reported high Kendall's τ and Spearman's ρ correlations between retrieval system rankings produced with human and with LLM judgments, underscoring the practical applicability of LLM assessments for ranking tasks.
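The sketch below shows how such an agreement and rank-correlation analysis can be computed with standard libraries (scikit-learn and SciPy). All label and score values are made-up placeholders, not results from the paper.

```python
# Illustrative agreement and rank-correlation analysis; all numbers below are
# placeholders, not data or results from the paper.
from scipy.stats import kendalltau, spearmanr
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Per (query, passage) relevance grades on the 0-3 scale.
human_labels = [0, 2, 3, 1, 0, 2]
llm_labels = [0, 2, 2, 0, 0, 3]

kappa = cohen_kappa_score(human_labels, llm_labels)
cm = confusion_matrix(human_labels, llm_labels, labels=[0, 1, 2, 3])
print(f"Cohen's kappa: {kappa:.3f}")
print(cm)

# Per-system effectiveness scores (e.g., nDCG@10), computed once against human
# qrels and once against LLM qrels; the correlation compares the two rankings.
scores_human = [0.71, 0.65, 0.58, 0.52]
scores_llm = [0.69, 0.66, 0.55, 0.53]
tau, _ = kendalltau(scores_human, scores_llm)
rho, _ = spearmanr(scores_human, scores_llm)
print(f"Kendall tau: {tau:.3f}, Spearman rho: {rho:.3f}")
```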
Case Study and Analysis
In a case study focusing on the TREC DL 2019 dataset, the authors highlighted instances where the LLM-based judgments provided more precise relevance labels than the human assessors. These discrepancies often arose for queries with vague or ambiguous information needs. For instance, some labels that human assessors had marked as highly relevant, apparently in error, were classified as non-relevant by the LLM, prompting reflection on the potential and limitations of both human and AI-driven relevance judgments.
Implications and Future Directions
This research underscores the implications of integrating LLMs into retrieval evaluation frameworks. Practically, it offers a scalable, cost-effective alternative for relevance assessment, reducing dependence on manual labeling. Theoretically, it opens avenues for further refining how AI models understand nuanced search intents and improving labeling precision. Future work may focus on enhancing the interpretability of LLM judgments, exploring queries with high linguistic variance, and integrating multi-modal assessment capabilities.
Conclusion
The development and validation of UMBRELA provide pivotal insights into the potential of LLMs for relevance assessment. The toolkit's deployment in the upcoming TREC 2024 RAG Track will further demonstrate its utility, potentially setting a standard for future retrieval evaluation methodologies. By open-sourcing UMBRELA, the authors contribute substantially to the research community, facilitating advancements and innovations in automated relevance assessment.
The findings and toolkit presented in this paper mark a meaningful step toward leveraging artificial intelligence to automate and enhance relevance assessment processes traditionally dominated by human judgment. The high correlation with human assessments positions LLMs as a promising technology for future retrieval systems.