
LLM4Ranking: An Easy-to-use Framework of Utilizing Large Language Models for Document Reranking (2504.07439v1)

Published 10 Apr 2025 in cs.IR and cs.CL

Abstract: Utilizing LLMs for document reranking has been a popular and promising research direction in recent years, and many studies are dedicated to improving the performance and efficiency of LLM-based reranking. It can also be applied in many real-world applications, such as search engines or retrieval-augmented generation. In response to the growing demand for research and application in practice, we introduce a unified framework, **LLM4Ranking**, which enables users to adopt different ranking methods using open-source or closed-source API-based LLMs. Our framework provides a simple and extensible interface for document reranking with LLMs, as well as easy-to-use evaluation and fine-tuning scripts for this task. We conducted experiments based on this framework, evaluating various models and methods on several widely used datasets and providing reproducible results on utilizing LLMs for document reranking. Our code is publicly available at https://github.com/liuqi6777/LLM4ranking.

This paper introduces LLM4Ranking, a unified and extensible framework designed to simplify the use of LLMs for document reranking tasks (Liu et al., 10 Apr 2025). The authors note that while using LLMs for reranking is a promising research direction with applications in search engines and retrieval-augmented generation (RAG), existing tools lack the flexibility to support a wide range of LLMs (both open-source and proprietary APIs) and reranking paradigms (pointwise, pairwise, listwise, etc.).

LLM4Ranking addresses this gap by providing:

  1. A Unified and Extensible Interface: It allows users to easily integrate and switch between different LLMs, including open-source models via HuggingFace Transformers (with support for quantization like bitsandbytes/GPTQ and acceleration via vLLM) and proprietary models via OpenAI SDK-compatible APIs.
  2. Support for Diverse Reranking Methods: The framework implements popular reranking paradigms (pointwise, pairwise, listwise) and specific models like RankGPT and TourRank. It decouples the abstract ranking logic (e.g., pointwise scoring) from the concrete model implementation (e.g., relevance generation vs. query generation), making it easy to customize and add new methods. It supports models based on generation, log-likelihood computation, and direct logit usage.
  3. Integrated Training and Evaluation: LLM4Ranking includes ready-to-use scripts for training models. For generation-based and log-likelihood-based models, it provides standard Supervised Fine-Tuning (SFT) scripts compatible with HuggingFace Transformers and PEFT methods like LoRA. For logits-based models, it offers separate training code inspired by cross-encoders, supporting various loss functions (Cross-Entropy, Margin-MSE, LTR losses). It also provides standardized evaluation scripts for multiple benchmark datasets (TREC DL, BEIR, MAIR, NevIR, BRIGHT), calculating standard IR metrics (MAP, NDCG, Recall) and logging detailed results like latency and token usage.
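To make the pointwise paradigm concrete, here is a minimal, self-contained sketch of "relevance generation"-style reranking. The function names and the mock scorer are illustrative only: a real deployment would replace `score_relevance` with an LLM call that judges each query-document pair (e.g., by the probability of generating "yes"), which is the part LLM4Ranking abstracts away.

```python
def score_relevance(query: str, document: str) -> float:
    """Stand-in for an LLM relevance judgment; here we mock it
    with simple token overlap between query and document."""
    q_tokens = set(query.lower().split())
    d_tokens = set(document.lower().split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)

def pointwise_rerank(query: str, documents: list[str]) -> list[str]:
    """Score each document independently, then sort descending.
    Python's sort is stable, so ties keep the original order."""
    return sorted(documents,
                  key=lambda d: score_relevance(query, d),
                  reverse=True)

docs = [
    "cooking recipes for pasta",
    "document reranking with large language models",
    "reranking documents using language models improves search",
]
print(pointwise_rerank("reranking documents using language models", docs))
```

Because each document is scored independently, pointwise methods are trivially parallelizable, while pairwise and listwise methods (like PRP-Heapsort or RankGPT's sliding window) trade that simplicity for richer cross-document comparisons.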

The framework's architecture consists of three core components: the LLM Interface (handling interactions with different LLMs), Ranking Logic Abstraction (defining the reranking paradigm), and Model (implementing specific ranking algorithms).
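The decoupling described above can be sketched with abstract base classes. Note that the class and method names below are hypothetical, chosen only to illustrate the separation of concerns; they are not the framework's actual API.

```python
from abc import ABC, abstractmethod

class LLMInterface(ABC):
    """Component 1: wraps a concrete backend (a local HF model
    or an OpenAI-compatible API client)."""
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class RankingLogic(ABC):
    """Component 2: defines the reranking paradigm
    (pointwise, pairwise, listwise), independent of the backend."""
    @abstractmethod
    def rerank(self, query: str, docs: list[str]) -> list[str]: ...

class KeywordLLM(LLMInterface):
    """Trivial mock backend, used only to show the wiring."""
    def generate(self, prompt: str) -> str:
        return "yes" if "rerank" in prompt else "no"

class PointwiseLogic(RankingLogic):
    """Component 3: a concrete model combining logic + backend."""
    def __init__(self, llm: LLMInterface):
        self.llm = llm

    def rerank(self, query: str, docs: list[str]) -> list[str]:
        judged = [(self.llm.generate(f"Is '{d}' relevant to '{query}'?"), d)
                  for d in docs]
        # Documents judged relevant first, original order preserved.
        return ([d for ans, d in judged if ans == "yes"]
                + [d for ans, d in judged if ans != "yes"])

ranker = PointwiseLogic(KeywordLLM())
print(ranker.rerank("q", ["about reranking", "about cats"]))
```

The benefit of this structure is that swapping GPT-4o for a local Qwen model, or pointwise for listwise logic, changes only one component while the others stay untouched.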

To demonstrate its capabilities, the authors conducted experiments using LLM4Ranking. They evaluated several zero-shot reranking methods (Relevance Generation, PRP-Heapsort, RankGPT, TourRank) with various open-source (Llama 3.1, Qwen 2.5) and API-based LLMs (GPT-4o, Claude 3.7 Sonnet, DeepSeek-V3) on TREC DL datasets. Results showed that API-based models generally performed better, and RankGPT was consistently effective. They also trained and evaluated supervised models (pointwise Relevance Generation and listwise RankGPT distillation) using smaller Qwen 2.5 models, demonstrating that fine-tuned smaller models can achieve performance comparable to larger zero-shot models.

The paper concludes that LLM4Ranking serves as a valuable and easy-to-use toolkit for both researchers and practitioners, facilitating reproducible experiments and the development of LLM-based reranking applications. The code is made publicly available.

Authors (6)
  1. Qi Liu (485 papers)
  2. Haozhe Duan (1 paper)
  3. Yiqun Chen (20 papers)
  4. Quanfeng Lu (10 papers)
  5. Weiwei Sun (93 papers)
  6. Jiaxin Mao (47 papers)