LLM-based relevance assessment still can't replace human relevance assessment

Published 22 Dec 2024 in cs.IR | (2412.17156v2)

Abstract: The use of LLMs for relevance assessment in information retrieval has gained significant attention, with recent studies suggesting that LLM-based judgments provide comparable evaluations to human judgments. Notably, based on TREC 2024 data, Upadhyay et al. make a bold claim that LLM-based relevance assessments, such as those generated by the UMBRELA system, can fully replace traditional human relevance assessments in TREC-style evaluations. This paper critically examines this claim, highlighting practical and theoretical limitations that undermine the validity of this conclusion. First, we question whether the evidence provided by Upadhyay et al. really supports their claim, particularly if a test collection is used as a benchmark for future improvements. Second, through a submission deliberately intended to do so, we demonstrate the ease with which automatic evaluation metrics can be subverted, showing that systems designed to exploit these evaluations can achieve artificially high scores. Theoretical challenges -- such as the inherent narcissism of LLMs, the risk of overfitting to LLM-based metrics, and the potential degradation of future LLM performance -- must be addressed before LLM-based relevance assessments can be considered a viable replacement for human judgments.

Summary

The paper demonstrates that despite strong empirical correlations (e.g., Kendall’s τ of 0.89), LLM assessments have inherent theoretical and practical limitations compared to human judgment.
It reveals that LLM-based evaluations are vulnerable to manipulation, questioning their objectivity and consistency in ranking top-performing systems.
The study warns of risks like bias and Goodhart's Law effects, emphasizing the need to supplement automated metrics with human insight in information retrieval.

Analyzing the Capability of LLMs in Relevance Assessment

The paper "LLM-based relevance assessment still can't replace human relevance assessment" authored by Charles L. A. Clarke and Laura Dietz rigorously evaluates the claims regarding the potential of LLMs to substitute human relevance assessments completely, a topic of notable discussion within the field of information retrieval. The authors challenge significant assertions made in contemporary studies that advocate for LLM-based relevance assessments, such as those generated using the UMBRELA system, to replace human assessments in TREC-style evaluations. This paper takes a critical stance and identifies both practical and theoretical limitations that question the validity of such claims.

Main Arguments and Findings

Empirical Correlation vs. Theoretical Concerns: The authors highlight a gap between the empirical results of existing studies and the conclusions drawn from them, specifically referencing the work of Upadhyay et al., which reports a strong Kendall's τ correlation of 0.89 between system rankings under LLM-based and manual assessments. Yet this evidence does not go beyond that of an earlier ICTIR 2023 study, which also reported strong correlations but advised against abandoning human assessments because of numerous unresolved concerns.
Vulnerability of Automatic Evaluations: The paper discusses how existing LLM-based evaluations can be easily manipulated. It shows that systems can artificially inflate their scores by exploiting automatic evaluation metrics. The demonstration is a submission from Clarke's group that was deliberately designed to subvert the automatic evaluation and thereby expose these vulnerabilities.
Mismatch in Evaluating Top-performing Systems: The study examines discrepancies between manual and automatic assessments among the top-ranking systems. Disagreement over which systems perform best signals a fundamental limitation: LLM assessments alone may not reliably characterize improvements in information retrieval methods.
Bias and Narcissism in LLM-based Assessments: The authors address the tendency of LLMs to favor LLM-generated content, along with the other biases they introduce, and use this to argue against the claim that LLMs operate as objective evaluators. The inability of LLM-based assessments to fully replicate or replace human judgment is discussed with reference to studies highlighting the narcissistic tendencies of LLM evaluations.
Long-term Implications and Goodhart's Law: The authors caution against a future dominated by LLM-based evaluations. As systems become more sophisticated and are optimized directly against these automated metrics, the metrics may stop reflecting realistic, human-centered utility. This is an instance of Goodhart's Law: when a measure becomes a target, it ceases to be a good measure.
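
To make the first and third points concrete, the following is a minimal sketch of how a high Kendall's τ can coexist with disagreement about the best system. All scores are hypothetical and are not taken from the paper or from TREC data. The function `kendall_tau` is an illustrative helper that computes the tau-a variant and assumes no tied scores; the source does not state which variant Upadhyay et al. used.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists for the same systems.

    Illustrative helper (an assumption, not the paper's code).
    Assumes no tied scores within either list.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        # A pair is concordant when both lists order systems i and j the same way.
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for 20 systems under human judgments (system 0 is best).
human = list(range(20, 0, -1))

# Hypothetical LLM-based scores: identical, except the top three systems
# appear in reverse order.
llm = human.copy()
llm[0], llm[2] = llm[2], llm[0]

tau = kendall_tau(human, llm)
best_human = max(range(len(human)), key=lambda k: human[k])
best_llm = max(range(len(llm)), key=lambda k: llm[k])

print(f"tau = {tau:.3f}")
print(f"best system (human) = {best_human}, best system (LLM) = {best_llm}")
```

In this toy setting only 3 of the 190 system pairs are ordered differently, so τ stays near 0.97, yet the two sets of judgments name different winners. A leaderboard-wide correlation therefore says little about whether the top of the ranking, where new improvements are claimed, is trustworthy.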

Implications for the Future

The implications of this research are considerable in both practice and theory. Practically, the findings call for continued reliance on human relevance assessments, both as a safeguard against manipulation and as a way to ensure that evaluation metrics reflect genuine advances. Theoretically, the paper calls for deeper inquiry into how LLMs can be used reliably alongside human assessors to complement, rather than replace, human judgment in information retrieval tasks.

The critique advanced by Clarke and Dietz emphasizes the fundamental need to balance technological capabilities with human-centric approaches, ensuring that LLMs contribute to, rather than overshadow, the nuanced understanding that human insights bring to the field of relevance assessment in information retrieval. While LLMs hold promise in augmenting human efforts by providing scalable assessments, this paper decisively argues that the outright replacement of human input by LLMs remains untenable given current evidence.
