
Consolidating Ranking and Relevance Predictions of Large Language Models through Post-Processing (2404.11791v1)

Published 17 Apr 2024 in cs.IR

Abstract: The powerful generative abilities of LLMs show potential in generating relevance labels for search applications. Previous work has found that directly asking about relevancy, such as "How relevant is document A to query Q?", results in sub-optimal ranking. Instead, the pairwise ranking prompting (PRP) approach produces promising ranking performance by asking about pairwise comparisons, e.g., "Is document A more relevant than document B to query Q?". Thus, while LLMs are effective rankers, this ability is not reflected in the relevance labels they generate. In this work, we propose a post-processing method that consolidates the relevance labels generated by an LLM with its powerful ranking abilities. Our method takes both the LLM-generated relevance labels and its pairwise preferences as input. The labels are then altered to satisfy the pairwise preferences of the LLM while staying as close to the original values as possible. Our experimental results indicate that our approach effectively balances label accuracy and ranking performance, showing that it is possible to combine the ranking and labeling abilities of LLMs through post-processing.
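
To make the idea concrete, the following is a minimal sketch of such a post-processing step, not the paper's exact formulation: the adjusted labels minimize the squared distance to the original LLM labels, subject to one inequality constraint per LLM pairwise preference. The function name, the margin parameter, and the SLSQP solver choice are illustrative assumptions; the paper defines its own objective and solving procedure.

import numpy as np
from scipy.optimize import minimize

def consolidate(labels, preferences, margin=0.0):
    """Adjust LLM relevance labels to respect LLM pairwise preferences.

    labels: per-document relevance scores generated by the LLM.
    preferences: (i, j) pairs meaning the LLM judged document i more
        relevant than document j.
    Returns labels minimizing squared distance to the originals,
    subject to adjusted[i] >= adjusted[j] + margin for every pair.
    """
    labels = np.asarray(labels, dtype=float)
    objective = lambda x: np.sum((x - labels) ** 2)
    constraints = [
        {"type": "ineq", "fun": lambda x, i=i, j=j: x[i] - x[j] - margin}
        for i, j in preferences
    ]
    return minimize(objective, labels, constraints=constraints,
                    method="SLSQP").x

# The original labels rank document 2 above document 1, but the LLM's
# pairwise preference says document 1 is more relevant; the solver
# pulls both labels to 0.45, the nearest point satisfying the order.
print(consolidate([0.9, 0.4, 0.5], [(1, 2)]))  # -> [0.9, 0.45, 0.45]

In this toy run, only the two labels involved in a violated preference move, and they move symmetrically toward each other, which is the closest feasible point under a squared-error objective; labels already consistent with the preferences stay unchanged.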

Authors (7)
  1. Le Yan (28 papers)
  2. Zhen Qin (105 papers)
  3. Honglei Zhuang (31 papers)
  4. Rolf Jagerman (18 papers)
  5. Xuanhui Wang (36 papers)
  6. Michael Bendersky (63 papers)
  7. Harrie Oosterhuis (44 papers)
Citations (3)
