
Reliable Confidence Intervals for Information Retrieval Evaluation Using Generative A.I. (2407.02464v1)

Published 2 Jul 2024 in cs.IR and stat.ML

Abstract: The traditional evaluation of information retrieval (IR) systems is generally very costly as it requires manual relevance annotation from human experts. Recent advancements in generative artificial intelligence -- specifically LLMs -- can generate relevance annotations at an enormous scale with relatively small computational costs. Potentially, this could alleviate the costs traditionally associated with IR evaluation and make it applicable to numerous low-resource applications. However, generated relevance annotations are not immune to (systematic) errors, and as a result, directly using them for evaluation produces unreliable results. In this work, we propose two methods based on prediction-powered inference and conformal risk control that utilize computer-generated relevance annotations to place reliable confidence intervals (CIs) around IR evaluation metrics. Our proposed methods require a small number of reliable annotations from which the methods can statistically analyze the errors in the generated annotations. Using this information, we can place CIs around evaluation metrics with strong theoretical guarantees. Unlike existing approaches, our conformal risk control method is specifically designed for ranking metrics and can vary its CIs per query and document. Our experimental results show that our CIs accurately capture both the variance and bias in evaluation based on LLM annotations, better than the typical empirical bootstrapping estimates. We hope our contributions bring reliable evaluation to the many IR applications where this was traditionally infeasible.
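The first method named in the abstract, prediction-powered inference (PPI), lends itself to a compact illustration. The sketch below assumes a simple mean-style metric (e.g., average relevance of a retrieved set) and a standard normal-approximation interval: a large pool of cheap LLM-generated relevance labels supplies the main estimate, and a small set of reliable human labels estimates a "rectifier" that corrects the LLM's systematic error before the confidence interval is formed. All names here (ppi_mean_ci, llm_scores_pool, etc.) and the exact estimator are illustrative assumptions rather than the paper's implementation, and the paper's conformal risk control variant, which adapts intervals per query and document for ranking metrics, is not shown.

```python
import numpy as np
from scipy.stats import norm

def ppi_mean_ci(llm_scores_pool, llm_scores_gold, human_labels_gold, alpha=0.05):
    """Hedged sketch of a prediction-powered confidence interval for a
    mean-style relevance metric. Not the paper's exact estimator.

    llm_scores_pool:   LLM relevance labels on the large, human-unlabeled pool
    llm_scores_gold:   LLM relevance labels on the small human-annotated subset
    human_labels_gold: reliable human labels on that same subset
    """
    llm_scores_pool = np.asarray(llm_scores_pool, dtype=float)
    llm_scores_gold = np.asarray(llm_scores_gold, dtype=float)
    human_labels_gold = np.asarray(human_labels_gold, dtype=float)

    N, n = len(llm_scores_pool), len(human_labels_gold)

    # Point estimate: mean of LLM labels on the pool, corrected by the
    # average LLM error ("rectifier") measured on the gold subset.
    rectifier = human_labels_gold - llm_scores_gold
    theta_hat = llm_scores_pool.mean() + rectifier.mean()

    # Variance combines the sampling noise of both terms.
    var = llm_scores_pool.var(ddof=1) / N + rectifier.var(ddof=1) / n

    # Normal-approximation interval at level 1 - alpha.
    z = norm.ppf(1 - alpha / 2)
    half_width = z * np.sqrt(var)
    return theta_hat - half_width, theta_hat + half_width

# Toy usage: 10,000 LLM-labeled documents, 200 of them with human labels.
rng = np.random.default_rng(0)
true_rel = rng.binomial(1, 0.3, size=10_000).astype(float)          # latent relevance
llm_pred = np.clip(true_rel + rng.normal(0.1, 0.25, 10_000), 0, 1)  # biased LLM labels
gold = rng.choice(10_000, size=200, replace=False)
pool = np.setdiff1d(np.arange(10_000), gold)
lo, hi = ppi_mean_ci(llm_pred[pool], llm_pred[gold], true_rel[gold])
print(f"95% CI for mean relevance: [{lo:.3f}, {hi:.3f}]")
```

Consistent with the abstract, the interval width shrinks with both the size of the cheap LLM-labeled pool and the size of the small gold set, while the rectifier term absorbs the systematic bias that would make a naive average of LLM labels unreliable.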

Authors (5)
  1. Harrie Oosterhuis (44 papers)
  2. Rolf Jagerman (18 papers)
  3. Zhen Qin (105 papers)
  4. Xuanhui Wang (36 papers)
  5. Michael Bendersky (63 papers)