Explain then Rank: Scale Calibration of Neural Rankers Using Natural Language Explanations from LLMs (2402.12276v2)

Published 19 Feb 2024 in cs.IR

Abstract: In search settings, calibrating the scores during the ranking process to quantities such as click-through rates or relevance levels enhances a system's usefulness and trustworthiness for downstream users. While previous research has improved this notion of calibration for low-complexity learning-to-rank models, the larger data demands and parameter counts of modern neural text rankers produce unique obstacles that hamper the efficacy of methods intended for the learning-to-rank setting. This paper proposes exploiting LLMs to provide relevance and uncertainty signals for these neural text rankers, producing scale-calibrated scores through Monte Carlo sampling of natural language explanations (NLEs). Our approach transforms the neural ranking task from ranking textual query-document pairs to ranking the corresponding synthesized NLEs. Comprehensive experiments on two popular document ranking datasets show that the NLE-based calibration approach consistently outperforms past calibration methods and LLM-based methods on ranking, calibration, and query performance prediction tasks.
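The core mechanism sketched in the abstract, sampling several natural language explanations per query-document pair and aggregating them into a calibrated relevance score plus an uncertainty signal, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the use of sample mean and standard deviation as the aggregates, and the toy stand-ins for the LLM and the NLE scorer are all assumptions made for illustration.

```python
# Minimal sketch (illustrative, not the paper's code) of Monte Carlo sampling
# of natural language explanations (NLEs): for one query-document pair, sample
# several NLEs from an LLM, score each NLE for relevance, then use the sample
# mean as a calibrated relevance estimate and the sample spread as uncertainty.
from statistics import mean, pstdev
from typing import Callable, List, Tuple


def mc_nle_relevance(
    query: str,
    document: str,
    generate_nle: Callable[[str, str], str],  # assumed: LLM call sampled with temperature > 0
    score_nle: Callable[[str], float],        # assumed: ranker that scores the NLE, not the raw pair
    n_samples: int = 8,
) -> Tuple[float, float]:
    """Return (mean relevance over sampled NLEs, spread across samples)."""
    scores: List[float] = [
        score_nle(generate_nle(query, document)) for _ in range(n_samples)
    ]
    return mean(scores), pstdev(scores)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without an LLM or a trained ranker.
    import random

    def toy_generate_nle(query: str, document: str) -> str:
        hedge = random.choice(["clearly", "partially", "barely"])
        return f"The document {hedge} addresses the query '{query}'."

    def toy_score_nle(nle: str) -> float:
        # Map the hedge word in the toy NLE to a relevance score.
        return {"clearly": 0.9, "partially": 0.5, "barely": 0.1}[nle.split()[2]]

    rel, unc = mc_nle_relevance(
        "what is scale calibration", "example document text",
        toy_generate_nle, toy_score_nle,
    )
    print(f"calibrated relevance ~ {rel:.2f}, uncertainty ~ {unc:.2f}")
```

In the paper's framing the neural ranker itself is trained to score the synthesized NLEs rather than the raw query-document text; here that ranker is abstracted as a black-box callable, and the sample standard deviation stands in for whichever uncertainty statistic one prefers.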
