Large language models can accurately predict searcher preferences (2309.10621v3)
Abstract: Relevance labels, which indicate whether a search result is valuable to a searcher, are key to evaluating and optimising search systems. The best way to capture the true preferences of users is to ask them for their careful feedback on which results would be useful, but this approach does not scale to produce a large number of labels. Getting relevance labels at scale is usually done with third-party labellers, who judge on behalf of the user, but there is a risk of low-quality data if the labeller doesn't understand user needs. To improve quality, one standard approach is to study real users through interviews, user studies and direct feedback, find areas where labels systematically disagree with users, then educate labellers about user needs through judging guidelines, training and monitoring. This paper introduces an alternate approach for improving label quality. It takes careful feedback from real users, which by definition is the highest-quality first-party gold data that can be derived, and develops an LLM prompt that agrees with that data. We present ideas and observations from deploying LLMs for large-scale relevance labelling at Bing, and illustrate with data from TREC. We have found LLMs can be effective, with accuracy as good as human labellers and similar capability to pick the hardest queries, best runs, and best groups. Systematic changes to the prompts make a difference in accuracy, but so too do simple paraphrases. Measuring agreement with real searchers requires high-quality "gold" labels, but with these we find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers.
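As a rough illustration of the approach the abstract describes, the sketch below shows how an LLM might be prompted to grade a query–result pair and how agreement with first-party gold labels could be measured. The prompt wording, the `call_llm` helper, and the 0–3 label scale are hypothetical placeholders for illustration only, not the prompt or pipeline used at Bing.

```python
# Minimal sketch of LLM-based relevance labelling and agreement measurement.
# `call_llm` is a hypothetical helper standing in for any chat-completion API;
# the prompt text and the 0-3 grade scale are illustrative assumptions.
import re
from sklearn.metrics import cohen_kappa_score

PROMPT_TEMPLATE = """You are a search quality rater.
Query: {query}
Result: {passage}
On a scale of 0 (irrelevant) to 3 (perfectly relevant), how well does the
result satisfy the query? Answer with a single digit."""


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (plug in your own client here)."""
    raise NotImplementedError("connect to an LLM endpoint")


def label_with_llm(query: str, passage: str) -> int:
    """Ask the LLM for a relevance grade and parse the first digit it returns."""
    response = call_llm(PROMPT_TEMPLATE.format(query=query, passage=passage))
    match = re.search(r"[0-3]", response)
    return int(match.group()) if match else 0  # fall back to 'irrelevant' if unparseable


def agreement_with_gold(pairs, gold_labels):
    """Compare LLM labels against first-party gold labels using Cohen's kappa."""
    llm_labels = [label_with_llm(query, passage) for query, passage in pairs]
    return cohen_kappa_score(gold_labels, llm_labels)
```

Under this framing, alternative prompt wordings (including simple paraphrases) can be compared by their agreement with the gold set, which is how prompt changes can be shown to shift labelling accuracy.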
Authors: Paul Thomas, Seth Spielman, Nick Craswell, Bhaskar Mitra