A Thorough Comparison of Cross-Encoders and LLMs for Reranking SPLADE (2403.10407v1)

Published 15 Mar 2024 in cs.IR

Abstract: We present a comparative study between cross-encoder and LLM rerankers in the context of re-ranking effective SPLADE retrievers. We conduct a large evaluation on TREC Deep Learning datasets and out-of-domain datasets such as BEIR and LoTTE. In the first set of experiments, we show how cross-encoder rerankers are hard to distinguish when it comes to re-ranking SPLADE on MS MARCO. Observations shift in the out-of-domain scenario, where both the type of model and the number of documents to re-rank have an impact on effectiveness. Then, we focus on listwise rerankers based on LLMs -- especially GPT-4. While GPT-4 demonstrates impressive (zero-shot) performance, we show that traditional cross-encoders remain very competitive. Overall, our findings aim to provide a more nuanced perspective on the recent excitement surrounding LLM-based re-rankers -- by positioning them as another factor to consider in balancing effectiveness and efficiency in search systems.

Evaluating the Efficiency and Effectiveness of Cross-Encoders and LLM-Based Rerankers in Information Retrieval

Introduction

The landscape of Information Retrieval (IR) has been dramatically reshaped with the introduction of neural reranking methods, particularly with the advent of LLMs for task-specific applications. This paper provides a comprehensive comparison between two dominant paradigms within the domain of neural reranking: cross-encoders and LLM-based rerankers, using SPLADE models as effective first-stage retrievers. Through extensive evaluation on in-domain (TREC Deep Learning datasets) and out-of-domain datasets (BEIR and LoTTE), this research illuminates the nuanced advantages and limitations of employing cross-encoders versus LLM-based methods for reranking, providing key insights into their operational efficiency and effectiveness across varied IR contexts.

Cross-Encoders and LLM-Based Rerankers: A Comparative Analysis

The Efficacy of Cross-Encoders

Cross-encoders, exemplified by models such as DeBERTa-v3 and ELECTRA, have been the cornerstone of reranking in IR systems because they model the interaction between a query and a document directly, scoring each pair jointly. When coupled with effective first-stage retrievers like SPLADE-v3, they deliver substantial improvements in retrieval quality across both in-domain and out-of-domain datasets. However, their effectiveness depends heavily on the number of top documents reranked (top_k), and the cost of scoring every query-document pair can make large-scale or real-time applications challenging.
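To make the reranking step concrete, here is a minimal sketch of a cross-encoder second stage over candidates already returned by a first-stage retriever such as SPLADE. The checkpoint name is a placeholder (any DeBERTa-v3- or ELECTRA-based cross-encoder could be substituted), not the specific model evaluated in the paper.

```python
# Minimal sketch: rerank the top_k candidates from a first-stage retriever
# (e.g. SPLADE) with a cross-encoder. The checkpoint below is a placeholder.
from sentence_transformers import CrossEncoder

def rerank(query: str, candidates: list[str], top_k: int = 50) -> list[tuple[str, float]]:
    """Score each (query, passage) pair jointly and sort by relevance."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # placeholder checkpoint
    pairs = [(query, passage) for passage in candidates[:top_k]]
    scores = model.predict(pairs)  # one forward pass per pair: the main cost driver
    return sorted(zip(candidates[:top_k], scores), key=lambda x: x[1], reverse=True)
```

The choice of top_k is exactly the knob discussed above: every additional candidate adds another forward pass, which is why the number of reranked documents drives both cost and, out of domain, effectiveness.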

LLMs as Rerankers: The GPT-3.5 Turbo and GPT-4 Phenomenon

LLMs, especially GPT-4, have shown a surprising capability in reranking tasks even in a zero-shot setting. The paper indicates that GPT-4's performance is competitive with, and in certain scenarios superior to, traditional cross-encoders. Nonetheless, two significant caveats accompany the use of GPT models for reranking: the prohibitive operational cost of models like GPT-4, and the inefficiency of handling large candidate sets within the model's limited context. These factors pose substantial barriers to the practical deployment of LLM rerankers in real-world IR systems.
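A listwise LLM reranker can be sketched roughly as follows, in the spirit of RankGPT-style prompting. The prompt wording, naive parsing, and model name are illustrative assumptions rather than the paper's exact setup; they mainly show why large candidate sets are awkward to fit into a single prompt.

```python
# Hedged sketch of listwise LLM reranking: the passages are numbered in the
# prompt and the model is asked to return a ranked permutation of the indices.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_rerank(query: str, passages: list[str]) -> list[str]:
    """Ask a chat LLM to emit a ranked permutation of passage indices."""
    listing = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Rank the passages below by relevance to the query.\n"
        f"Query: {query}\n{listing}\n"
        "Answer only with the indices in decreasing relevance, e.g. 2 > 0 > 1."
    )
    reply = client.chat.completions.create(
        model="gpt-4",  # placeholder; any capable chat model can be substituted
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    # Naive parsing: keep any integers the model returned, in order.
    order = [int(tok) for tok in reply.replace(">", " ").split() if tok.isdigit()]
    return [passages[i] for i in order if i < len(passages)]
```

Because every candidate passage must fit into the prompt, reranking deep candidate lists requires chunking or sliding windows, which multiplies the number of (already expensive) LLM calls.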

The Implications and Future Directions

The nuanced analysis provides several critical insights for the deployment of neural rerankers in IR systems:

  • Effectiveness and Efficiency Balance: While LLMs (particularly GPT-4) offer competitive or superior performance metrics, cross-encoders like DeBERTa-v3 provide a more balanced trade-off between effectiveness and operational efficiency.
  • Resilience to Varied IR Contexts: The comparative efficacy of cross-encoders and LLM-based rerankers is context-dependent, with each exhibiting strengths in different IR scenarios—cross-encoders being more versatile across domains and LLMs showing exceptional prowess in specific contexts.
  • Future of Reranking Pipelines: The analysis suggests the potential of combining cross-encoders and LLMs in cascading reranking pipelines to leverage the unique strengths of both approaches, pointing towards a hybrid future in neural reranking methodologies (see the sketch after this list).
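As a rough illustration of such a cascade, and not the paper's actual pipeline, the sketch below reuses the hypothetical rerank and llm_rerank helpers from the earlier examples: a comparatively cheap cross-encoder prunes the SPLADE candidates, and the costlier listwise LLM only reorders the short head of the list.

```python
# Illustrative cascade (assumes the rerank() and llm_rerank() sketches above):
# stage 1 narrows the candidate pool cheaply, stage 2 refines only the head.
def cascade_rerank(query: str, candidates: list[str],
                   ce_top_k: int = 100, llm_top_k: int = 20) -> list[str]:
    # Stage 1: cross-encoder reranks the first-stage candidates.
    ce_ranked = [passage for passage, _ in rerank(query, candidates, top_k=ce_top_k)]
    # Stage 2: the listwise LLM reranker reorders only the top few passages.
    head = llm_rerank(query, ce_ranked[:llm_top_k])
    return head + ce_ranked[llm_top_k:]
```

The design choice here mirrors the efficiency argument: the LLM's cost is bounded by a small llm_top_k, while overall recall still benefits from the deeper cross-encoder pass.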

Conclusion

This paper offers a granular investigation into the comparative merits of cross-encoders and LLM-based rerankers, framed by their application with the SPLADE models as first-stage retrievers. It presents a nuanced perspective that neither class of models universally outperforms the other across all IR tasks and settings. Instead, their deployment should be informed by a judicious assessment of the specific requirements and constraints of the application context, balancing the trade-offs between computational efficiency and retrieval effectiveness.

Authors (3)
  1. Stéphane Clinchant (39 papers)
  2. Thibault Formal (17 papers)
  3. Hervé Déjean (16 papers)