Explain then Rank: Scale Calibration of Neural Rankers Using Natural Language Explanations from LLMs (2402.12276v2)
Abstract: In search settings, calibrating scores during the ranking process to quantities such as click-through rates or relevance levels enhances a system's usefulness and trustworthiness for downstream users. While previous research has improved this notion of calibration for low-complexity learning-to-rank models, the larger data demands and parameter counts of modern neural text rankers pose unique obstacles that hamper the efficacy of methods intended for the learning-to-rank setting. This paper proposes exploiting LLMs to provide relevance and uncertainty signals for these neural text rankers, producing scale-calibrated scores through Monte Carlo sampling of natural language explanations (NLEs). Our approach transforms the neural ranking task from ranking textual query-document pairs to ranking the corresponding synthesized NLEs. Comprehensive experiments on two popular document ranking datasets show that the NLE-based calibration approach consistently outperforms past calibration methods and LLM-based methods on ranking, calibration, and query performance prediction tasks.
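To make the pipeline described in the abstract concrete, below is a minimal, illustrative Python sketch of the Monte Carlo NLE idea: sample several explanations per query-document pair, score each explanation, and aggregate. The `sample_nle` and `score_nle` helpers and their heuristics are hypothetical stand-ins (in the paper these roles are played by an LLM prompt and a fine-tuned neural ranker), not the authors' implementation.

```python
# Illustrative sketch only: for each query-document pair, sample several natural
# language explanations (NLEs), score each NLE, and average the scores so that
# the spread across samples reflects uncertainty. All helpers are stand-ins.

import random
from statistics import mean, pstdev


def sample_nle(query: str, document: str) -> str:
    """Stand-in for an LLM call that explains whether `document` answers `query`.
    In practice this would prompt an LLM with temperature > 0 so repeated calls
    yield different explanations."""
    verdicts = ["clearly answers", "partially addresses", "does not address"]
    return f"The document {random.choice(verdicts)} the query '{query}'."


def score_nle(nle: str) -> float:
    """Stand-in for a neural ranker that maps a synthesized NLE to a relevance score."""
    if "clearly" in nle:
        return 0.9
    if "partially" in nle:
        return 0.5
    return 0.1


def calibrated_score(query: str, document: str, n_samples: int = 8) -> tuple[float, float]:
    """Monte Carlo estimate: mean score over sampled NLEs, with spread as uncertainty."""
    scores = [score_nle(sample_nle(query, document)) for _ in range(n_samples)]
    return mean(scores), pstdev(scores)


if __name__ == "__main__":
    mu, sigma = calibrated_score("what causes tides", "Tides are caused by the moon's gravity.")
    print(f"calibrated relevance ≈ {mu:.2f} (uncertainty ≈ {sigma:.2f})")
```

Averaging over sampled NLEs is what yields a scale-calibrated score here: a single sampled explanation can be overconfident, while the Monte Carlo mean and spread smooth out that variability.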