Query Performance Prediction using Relevance Judgments Generated by Large Language Models (2404.01012v2)

Published 1 Apr 2024 in cs.IR, cs.AI, cs.CL, and cs.LG

Abstract: Query performance prediction (QPP) aims to estimate the retrieval quality of a search system for a query without human relevance judgments. Previous QPP methods typically return a single scalar value and do not require the predicted values to approximate a specific information retrieval (IR) evaluation measure, leading to certain drawbacks: (i) a single scalar is insufficient to accurately represent different IR evaluation measures, especially when metrics do not highly correlate, and (ii) a single scalar limits the interpretability of QPP methods because solely using a scalar is insufficient to explain QPP results. To address these issues, we propose a QPP framework using automatically generated relevance judgments (QPP-GenRE), which decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query. This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels. This also allows us to interpret predicted IR evaluation measures, and identify, track and rectify errors in generated relevance judgments to improve QPP quality. We predict an item's relevance by using open-source LLMs to ensure scientific reproducibility. We face two main challenges: (i) excessive computational costs of judging an entire corpus for predicting a metric considering recall, and (ii) limited performance in prompting open-source LLMs in a zero-/few-shot manner. To solve the challenges, we devise an approximation strategy to predict an IR measure considering recall and propose to fine-tune open-source LLMs using human-labeled relevance judgments. Experiments on the TREC 2019-2022 deep learning tracks show that QPP-GenRE achieves state-of-the-art QPP quality for both lexical and neural rankers.
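The core idea above — judge each item in the ranked list with an LLM, then treat the generated judgments as pseudo-labels from which any IR evaluation measure can be computed — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the example binary judgments are hypothetical, and the LLM judging step is assumed to have already produced the per-item labels.

```python
# Minimal sketch (not the authors' code): given LLM-generated relevance
# judgments for one query's ranked list, compute IR measures as QPP
# pseudo-labels, e.g. reciprocal rank and nDCG at cutoff k.
from math import log2

def predicted_rr(judgments, k=10):
    """Reciprocal rank at k from generated (pseudo) relevance labels."""
    for rank, rel in enumerate(judgments[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def predicted_ndcg(judgments, k=10):
    """nDCG@k, using the generated labels both as gains and to form
    the ideal ordering (since no human judgments are available)."""
    dcg = sum(rel / log2(rank + 1)
              for rank, rel in enumerate(judgments[:k], start=1))
    ideal = sorted(judgments, reverse=True)[:k]
    idcg = sum(rel / log2(rank + 1)
               for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical binary judgments an LLM might emit for a 10-item list:
judged = [0, 1, 0, 1, 1, 0, 0, 0, 0, 0]
print(predicted_rr(judged))    # → 0.5 (first relevant item at rank 2)
print(predicted_ndcg(judged))
```

Because the per-item judgments are retained rather than collapsed into one scalar, a predicted metric can be traced back to the individual judgments that produced it, which is the interpretability property the abstract describes.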

Authors (5)
  1. Chuan Meng (11 papers)
  2. Negar Arabzadeh (28 papers)
  3. Arian Askari (19 papers)
  4. Mohammad Aliannejadi (86 papers)
  5. Maarten de Rijke (263 papers)
Citations (5)
