This paper introduces LLM4Ranking, a unified and extensible framework designed to simplify the use of LLMs for document reranking tasks (Liu et al., 10 Apr 2025). The authors note that while using LLMs for reranking is a promising research direction with applications in search engines and retrieval-augmented generation (RAG), existing tools lack the flexibility to support a wide range of LLMs (both open-source models and proprietary APIs) and reranking paradigms (pointwise, pairwise, listwise, etc.).
LLM4Ranking addresses this gap by providing:
- A Unified and Extensible Interface: It allows users to easily integrate and switch between different LLMs, including open-source models via HuggingFace Transformers (with support for quantization via bitsandbytes/GPTQ and acceleration via vLLM) and proprietary models via OpenAI SDK-compatible APIs (a minimal interface sketch follows this list).
- Support for Diverse Reranking Methods: The framework implements the popular reranking paradigms (pointwise, pairwise, listwise) as well as specific methods such as RankGPT and TourRank. It decouples the abstract ranking logic (e.g., pointwise scoring) from the concrete model implementation (e.g., relevance generation vs. query generation), making it easy to customize existing methods and add new ones; a ranking-logic sketch follows this list. It supports models based on text generation, log-likelihood computation, and direct logit usage.
- Integrated Training and Evaluation: LLM4Ranking includes ready-to-use training scripts. For generation-based and log-likelihood-based models, it provides standard supervised fine-tuning (SFT) scripts compatible with HuggingFace Transformers and PEFT methods such as LoRA. For logits-based models, it offers separate training code inspired by cross-encoders, supporting several loss functions (cross-entropy, Margin-MSE, learning-to-rank losses); a Margin-MSE sketch follows this list. It also provides standardized evaluation scripts for multiple benchmark datasets (TREC DL, BEIR, MAIR, NevIR, BRIGHT), computing standard IR metrics (MAP, NDCG, Recall) and logging run details such as latency and token usage.
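To make the interface idea concrete, here is a minimal sketch of what such a unified abstraction could look like. The class and method names (`LLMInterface`, `HFModel`, `APIModel`, `generate`) are illustrative assumptions, not LLM4Ranking's actual API:

```python
from abc import ABC, abstractmethod


class LLMInterface(ABC):
    """Shared entry point so ranking logic never touches backend details."""

    @abstractmethod
    def generate(self, prompt: str) -> str:
        ...


class HFModel(LLMInterface):
    """Open-source model served locally via HuggingFace Transformers."""

    def __init__(self, model_name: str):
        from transformers import AutoModelForCausalLM, AutoTokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

    def generate(self, prompt: str) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        output_ids = self.model.generate(**inputs, max_new_tokens=64)
        # Strip the prompt tokens and decode only the newly generated text.
        new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)


class APIModel(LLMInterface):
    """Proprietary model reached through an OpenAI-SDK-compatible endpoint."""

    def __init__(self, model_name: str, base_url: str, api_key: str):
        from openai import OpenAI
        self.client = OpenAI(base_url=base_url, api_key=api_key)
        self.model_name = model_name

    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
```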
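The decoupling of ranking logic from model implementation can be sketched in the same spirit: the pointwise logic below scores and sorts documents without knowing how scores are produced, while a relevance-generation scorer is one interchangeable concrete model. All names are hypothetical, building on the `LLMInterface` sketch above:

```python
from typing import Callable


def pointwise_rerank(query: str, docs: list[str],
                     score_fn: Callable[[str, str], float]) -> list[str]:
    """Abstract pointwise logic: score each (query, document) pair
    independently, then sort by descending score. How scores are produced
    is left entirely to the injected score_fn."""
    scored = [(score_fn(query, doc), doc) for doc in docs]
    return [doc for _, doc in sorted(scored, key=lambda pair: pair[0], reverse=True)]


def relevance_generation_score(llm: LLMInterface) -> Callable[[str, str], float]:
    """One concrete model: ask the LLM a yes/no relevance question and map
    the answer to a score. A query-generation or logit-based scorer would
    be a drop-in replacement."""
    def score(query: str, doc: str) -> float:
        prompt = (f"Document: {doc}\nQuery: {query}\n"
                  "Is this document relevant to the query? Answer yes or no.")
        answer = llm.generate(prompt).strip().lower()
        return 1.0 if answer.startswith("yes") else 0.0
    return score
```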
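For the logits-based training path, Margin-MSE is a distillation-style loss that regresses the student's positive-minus-negative score margin onto a teacher's margin. A minimal sketch, assuming a batch of scored (query, doc+, doc-) triples; the function name and tensor layout are illustrative:

```python
import torch
import torch.nn.functional as F


def margin_mse_loss(student_pos: torch.Tensor, student_neg: torch.Tensor,
                    teacher_pos: torch.Tensor, teacher_neg: torch.Tensor) -> torch.Tensor:
    """Margin-MSE: match the student's score margin to the teacher's margin
    for the same (query, doc+, doc-) triples."""
    return F.mse_loss(student_pos - student_neg, teacher_pos - teacher_neg)


# Dummy batch of 4 triples: scores from a student logits model and a teacher.
s_pos, s_neg = torch.randn(4), torch.randn(4)
t_pos, t_neg = torch.randn(4), torch.randn(4)
loss = margin_mse_loss(s_pos, s_neg, t_pos, t_neg)
```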
The framework's architecture consists of three core components: the LLM Interface (handling interactions with different LLMs), the Ranking Logic Abstraction (defining the reranking paradigm), and the Model (implementing specific ranking algorithms); the sketch below shows how such components could compose.
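Reusing the hypothetical names from the sketches above (with a placeholder API key), an end-to-end call might look like this, with one piece per component:

```python
# LLM Interface: an OpenAI-SDK-compatible backend.
llm = APIModel("gpt-4o", base_url="https://api.openai.com/v1", api_key="sk-...")

# Ranking Logic Abstraction (pointwise_rerank) + Model (relevance generation).
ranking = pointwise_rerank(
    query="what is dense retrieval?",
    docs=["BM25 ranks documents by term statistics.",
          "Dense retrieval embeds queries and documents into one vector space."],
    score_fn=relevance_generation_score(llm),
)
```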
To demonstrate its capabilities, the authors conducted experiments using LLM4Ranking. They evaluated several zero-shot reranking methods (Relevance Generation, PRP-Heapsort, RankGPT, TourRank) with open-source LLMs (Llama 3.1, Qwen 2.5) and API-based LLMs (GPT-4o, Claude 3.7 Sonnet, DeepSeek-V3) on the TREC DL datasets. Results showed that the API-based models generally outperformed the open-source ones, and that RankGPT was consistently effective. The authors also trained and evaluated supervised models (pointwise Relevance Generation and listwise RankGPT distillation) using smaller Qwen 2.5 models, demonstrating that fine-tuned small models can achieve performance comparable to that of larger zero-shot models.
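As an aside on PRP-Heapsort: pairwise prompting can drive a heapsort so that ranking n documents needs only O(n log n) LLM comparisons rather than all n(n-1)/2 pairs. The sketch below is an illustrative reconstruction, not the paper's code; the prompt wording and helper names are assumptions, and it reuses the `LLMInterface` sketch from earlier:

```python
from typing import Callable


def llm_prefers(llm: LLMInterface, query: str) -> Callable[[str, str], bool]:
    """Wrap the LLM as a pairwise comparator (prompt wording is a guess)."""
    def prefers(doc_a: str, doc_b: str) -> bool:
        prompt = (f"Query: {query}\nPassage A: {doc_a}\nPassage B: {doc_b}\n"
                  "Which passage is more relevant to the query? Answer A or B.")
        return llm.generate(prompt).strip().upper().startswith("A")
    return prefers


def heapsort_rerank(docs: list[str], prefers: Callable[[str, str], bool]) -> list[str]:
    """Heapsort driven by the pairwise comparator: O(n log n) LLM calls."""
    docs = list(docs)
    n = len(docs)

    def sift_down(root: int, end: int) -> None:
        # Restore the max-heap property ("more relevant" floats up).
        while 2 * root + 1 <= end:
            child = 2 * root + 1
            if child + 1 <= end and prefers(docs[child + 1], docs[child]):
                child += 1
            if prefers(docs[child], docs[root]):
                docs[root], docs[child] = docs[child], docs[root]
                root = child
            else:
                return

    for start in range(n // 2 - 1, -1, -1):   # heapify
        sift_down(start, n - 1)
    for end in range(n - 1, 0, -1):           # repeatedly extract the max
        docs[0], docs[end] = docs[end], docs[0]
        sift_down(0, end - 1)
    return docs[::-1]  # ascending order reversed -> most relevant first
```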
The paper concludes that LLM4Ranking serves as a valuable and easy-to-use toolkit for both researchers and practitioners, facilitating reproducible experiments and the development of LLM-based reranking applications. The code is made publicly available.