Large language models can accurately predict searcher preferences (2309.10621v3)

Published 19 Sep 2023 in cs.IR, cs.AI, cs.CL, and cs.LG

Abstract: Relevance labels, which indicate whether a search result is valuable to a searcher, are key to evaluating and optimising search systems. The best way to capture the true preferences of users is to ask them for their careful feedback on which results would be useful, but this approach does not scale to produce a large number of labels. Getting relevance labels at scale is usually done with third-party labellers, who judge on behalf of the user, but there is a risk of low-quality data if the labeller doesn't understand user needs. To improve quality, one standard approach is to study real users through interviews, user studies and direct feedback, find areas where labels are systematically disagreeing with users, then educate labellers about user needs through judging guidelines, training and monitoring. This paper introduces an alternate approach for improving label quality. It takes careful feedback from real users, which by definition is the highest-quality first-party gold data that can be derived, and develops an LLM prompt that agrees with that data. We present ideas and observations from deploying LLMs for large-scale relevance labelling at Bing, and illustrate with data from TREC. We have found LLMs can be effective, with accuracy as good as human labellers and similar capability to pick the hardest queries, best runs, and best groups. Systematic changes to the prompts make a difference in accuracy, but so too do simple paraphrases. To measure agreement with real searchers needs high-quality "gold" labels, but with these we find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers.

Authors (4)
  1. Paul Thomas (14 papers)
  2. Seth Spielman (3 papers)
  3. Nick Craswell (51 papers)
  4. Bhaskar Mitra (78 papers)
Citations (102)

Summary

  • The paper demonstrates that LLMs, when guided by detailed prompts, achieve human-level accuracy in relevance labelling tasks.
  • The paper uses GPT-4 and the TREC-Robust 2004 dataset to achieve Cohen’s κ scores comparable to inter-human agreement (a toy κ computation is sketched below this list).
  • The paper shows that LLMs offer a scalable, cost-effective, and rapid alternative to traditional third-party human labelling for evaluating and optimising search systems.
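
For readers unfamiliar with the agreement statistic the paper reports, the following is a minimal sketch of Cohen’s κ computed on invented toy labels; it only illustrates the metric and is not the paper’s evaluation code.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labellers (Cohen's kappa)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both labellers chose categories independently,
    # each with their observed marginal frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented toy labels on a 0-2 relevance scale, for illustration only.
human = [2, 0, 1, 2, 0, 1, 2, 2, 0, 1]
llm   = [2, 0, 1, 1, 0, 1, 2, 2, 0, 0]
print(round(cohens_kappa(human, llm), 3))
```

With these toy arrays, observed agreement is 0.8 against an expected chance agreement of 0.33, giving κ ≈ 0.70; values in that range are commonly read as substantial agreement.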

LLMs in Searcher Preference Prediction: An Analysis

The paper "LLMs Can Accurately Predict Searcher Preferences" explores the application of LLMs for relevance labelling in search systems, particularly focusing on the context of Microsoft’s Bing search engine. This research seeks to address common challenges associated with acquiring relevance labels necessary for evaluating and enhancing information retrieval systems. Traditionally, such labels have been sourced from real users, expert assessors, or crowd workers. However, this approach can suffer from scalability issues and potential biases due to misunderstandings of user intent by third-party labelers. This paper presents an innovative methodology utilizing LLMs to generate relevance labels, aiming for accuracy and efficiency improvements over conventional human-based labelling.

Main Contributions and Findings

A pivotal contribution of this research is the demonstration that LLMs, specifically GPT-4 with tailored prompts, can match careful human labellers and surpass third-party workers in relevance labelling accuracy. The authors ran experiments on the TREC-Robust 2004 dataset, testing a range of prompt configurations to find setups that best align LLM output with gold-standard labels. Certain prompt designs raised Cohen’s κ between LLM and human labels to a level comparable to inter-human agreement. Careful prompt engineering, such as including detailed task instructions and aspect-based labelling, significantly influences the accuracy of LLM-generated labels, but so do simple paraphrases of the same prompt. Across experiments, the LLM labels closely reproduced preferences observed in first-party gold labels collected directly from searchers.
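
As a rough illustration of what "detailed task instructions and aspect-based labelling" can look like, the sketch below assembles a hypothetical labelling prompt. The wording, the aspect names, and the commented-out `call_llm` helper are assumptions for illustration, not the authors’ production prompt.

```python
def build_relevance_prompt(query: str, intent: str, passage: str) -> str:
    """Assemble a detailed, aspect-based relevance-labelling prompt.

    The text below is a hypothetical example of the ingredients the paper
    discusses (role, task description, per-aspect scores, overall label).
    """
    return f"""You are a search-quality rater evaluating web results.

Given a query and a result, first score each aspect from 0 to 2,
then give an overall relevance label from 0 (irrelevant) to 2 (highly relevant).

Query: {query}
Intent description: {intent}
Result: {passage}

Answer in JSON: {{"topicality": _, "trustworthiness": _, "overall": _}}"""

# `call_llm` is a placeholder for whatever model endpoint is available:
# prompt = build_relevance_prompt("best running shoes", "...", "...")
# label = call_llm(prompt)
```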

The research underscores the practical advantages of employing LLMs: high throughput, reduced cost, and rapid label generation. In Bing’s operations, LLM labelling outpaced traditional human labelling in both speed and scale while also improving accuracy against first-party gold labels. The paper also highlights the potential of LLM labelling where human labelling is impractical because of resource constraints or the need for immediate feedback. At Bing, integrating LLM labels has enabled finer-grained and more extensive evaluation, changing how search experiments are designed and run.
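
To make the evaluation use-case concrete, the sketch below shows one conventional way LLM-produced labels could feed an offline ranking metric such as DCG@k and thereby compare runs. The data structures and the cutoff of 10 are assumptions for illustration, not details from the paper.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain for a ranked list of relevance labels."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def mean_dcg(run, labels, k=10):
    """Average DCG@k over queries.

    `run` maps query -> ranked list of doc ids;
    `labels` maps (query, doc) -> relevance label (e.g. produced by an LLM).
    """
    scores = [dcg_at_k([labels.get((q, d), 0) for d in docs], k)
              for q, docs in run.items()]
    return sum(scores) / len(scores)
```

In an offline comparison, whichever run attains the higher mean DCG@10 under the LLM-derived labels would be preferred.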

Implications for Future Research and Practice

The implementation of LLMs in the field of relevance labelling introduces several implications for future research in AI and information retrieval:

  1. Refinement of Prompt Engineering: The sensitivity of LLMs to prompt variations suggests further work on automated or semi-automated prompt optimisation, for example using reinforcement learning to adjust prompts based on feedback (a toy greedy-search sketch follows this list).
  2. Integration Across Languages and Domains: While this paper focuses on English-language and general web search, extending LLM applications to diverse domains and languages remains an open research avenue, likely requiring adaptation of current models to understand cultural and domain-specific nuances.
  3. Addressing Bias and Ethical Considerations: Because LLMs can encode societal biases, their use for labelling requires continuous scrutiny and robust evaluation frameworks that ensure fairness and mitigate adverse effects.
  4. Impact on Search Engine Dynamics: As LLMs become integral to search engine evaluation, understanding their impact on content ranking and retrieval strategies will be crucial. This includes studying potential feedback loops arising from optimizing searches based on LLM-produced labels.
  5. Ecological and Resource Considerations: Given the computational demands of LLMs, ongoing studies must address their environmental impact, exploring avenues for reducing energy consumption through model optimization and efficient infrastructure utilization.
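
As a toy illustration of point 1 above, the sketch below runs a greedy search over prompt paraphrases, keeping whichever variant best agrees with gold labels. The `paraphrase` and `score` helpers are hypothetical placeholders, and this is not a method from the paper.

```python
def optimise_prompt(base_prompt, paraphrase, score, variants_per_round=4, rounds=3):
    """Greedy search over prompt paraphrases.

    `paraphrase(p)` would ask an LLM to reword prompt `p`, and `score(p)`
    would measure agreement (e.g. Cohen's kappa) between the labels that
    prompt produces and gold labels on a held-out set. Both are assumed
    helpers, supplied by the caller.
    """
    best, best_score = base_prompt, score(base_prompt)
    for _ in range(rounds):
        for cand in (paraphrase(best) for _ in range(variants_per_round)):
            s = score(cand)
            if s > best_score:
                best, best_score = cand, s
    return best, best_score
```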

Conclusion

This paper illustrates how LLMs, through careful application and prompt design, can transform the process of acquiring relevance labels, offering a scalable and cost-effective alternative to traditional human-centred approaches. The shift towards AI-assisted evaluation marks a significant change in search engine development, backed by evidence that these models match the accuracy of careful human labellers and exceed that of typical third-party workers. Researchers and practitioners must now navigate this evolving landscape, addressing the challenges it raises while harnessing the potential LLMs offer within information retrieval and beyond.
