Large language models can accurately predict searcher preferences (2309.10621v3)
Abstract: Relevance labels, which indicate whether a search result is valuable to a searcher, are key to evaluating and optimising search systems. The best way to capture the true preferences of users is to ask them for their careful feedback on which results would be useful, but this approach does not scale to produce a large number of labels. Getting relevance labels at scale is usually done with third-party labellers, who judge on behalf of the user, but there is a risk of low-quality data if the labeller doesn't understand user needs. To improve quality, one standard approach is to study real users through interviews, user studies and direct feedback, find areas where labels systematically disagree with users, then educate labellers about user needs through judging guidelines, training and monitoring. This paper introduces an alternate approach for improving label quality. It takes careful feedback from real users, which by definition is the highest-quality first-party gold data that can be derived, and develops an LLM prompt that agrees with that data. We present ideas and observations from deploying LLMs for large-scale relevance labelling at Bing, and illustrate with data from TREC. We have found LLMs can be effective, with accuracy as good as human labellers and similar capability to pick the hardest queries, best runs, and best groups. Systematic changes to the prompts make a difference in accuracy, but so too do simple paraphrases. Measuring agreement with real searchers requires high-quality "gold" labels, but with these we find that models produce better labels than third-party workers, for a fraction of the cost, and these labels let us train notably better rankers.
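As a rough illustration of the approach the abstract describes, the sketch below shows how an LLM might be prompted to grade a query–result pair and how agreement with first-party gold labels could be measured. The prompt wording, the `call_llm` helper, and the 0–3 label scale are hypothetical placeholders for illustration only, not the prompt or pipeline used at Bing.

```python
# Minimal sketch of LLM-based relevance labelling and agreement measurement.
# `call_llm` is a hypothetical helper standing in for any chat-completion API;
# the prompt text and the 0-3 grade scale are illustrative assumptions.
import re
from sklearn.metrics import cohen_kappa_score

PROMPT_TEMPLATE = """You are a search quality rater.
Query: {query}
Result: {passage}
On a scale of 0 (irrelevant) to 3 (perfectly relevant), how well does the
result satisfy the query? Answer with a single digit."""


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call (plug in your own client here)."""
    raise NotImplementedError("connect to an LLM endpoint")


def label_with_llm(query: str, passage: str) -> int:
    """Ask the LLM for a relevance grade and parse the first digit it returns."""
    response = call_llm(PROMPT_TEMPLATE.format(query=query, passage=passage))
    match = re.search(r"[0-3]", response)
    return int(match.group()) if match else 0  # fall back to 'irrelevant' if unparseable


def agreement_with_gold(pairs, gold_labels):
    """Compare LLM labels against first-party gold labels using Cohen's kappa."""
    llm_labels = [label_with_llm(query, passage) for query, passage in pairs]
    return cohen_kappa_score(gold_labels, llm_labels)
```

Under this framing, alternative prompt wordings (including simple paraphrases) can be compared by their agreement with the gold set, which is how prompt changes can be shown to shift labelling accuracy.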
Authors: Paul Thomas, Seth Spielman, Nick Craswell, Bhaskar Mitra