- The paper introduces FLOPs-normalized metrics, RPP and QPP, which quantify ranking quality and query throughput per unit compute for LLM-based rerankers.
- The paper develops a closed-form estimator that reliably predicts FLOPs across different LLM architectures, aligning with empirical measurements.
- The paper demonstrates that pointwise reranking methods offer the most favorable efficiency-effectiveness trade-offs, achieving notable NDCG improvements at a fraction of the computational cost of pairwise and listwise methods.
Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers
This paper addresses a critical gap in the evaluation of LLM-based rerankers for information retrieval: the lack of hardware-agnostic, interpretable metrics for quantifying the efficiency-effectiveness trade-off. While LLM-based rerankers have demonstrated strong performance on standard ranking metrics, their substantial computational demands pose significant challenges for real-world deployment. Existing efficiency metrics—such as latency, number of LLM calls, and token counts—are confounded by hardware, batch size, and model size, making cross-method and cross-system comparisons unreliable.
Proposed Metrics and FLOPs Estimator
The authors introduce two FLOPs-normalized metrics:
- Ranking metrics per PetaFLOP (RPP): Measures ranking quality (e.g., NDCG, MRR, MAP) per petaFLOP of computation, enabling direct comparison of effectiveness per unit compute.
- Queries per PetaFLOP (QPP): Quantifies throughput as the number of queries processed per petaFLOP, providing a hardware-agnostic measure of system scalability.
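Both metrics reduce to simple ratios of a quality or throughput quantity over compute expressed in petaFLOPs. A minimal sketch (the function names and the example numbers are ours, chosen only for illustration):

```python
PETA = 1e15  # FLOPs per petaFLOP

def rpp(ranking_metric: float, total_flops: float) -> float:
    """Ranking metric (e.g., NDCG@10) achieved per petaFLOP of compute."""
    return ranking_metric / (total_flops / PETA)

def qpp(num_queries: int, total_flops: float) -> float:
    """Number of queries processed per petaFLOP of compute."""
    return num_queries / (total_flops / PETA)

# Illustrative only: an NDCG of 0.65 at 9e12 FLOPs per query
print(rpp(0.65, 9e12))  # ≈ 72.22
```

Because both metrics divide by the same compute figure, a method that halves its FLOPs at constant quality doubles both its RPP and its QPP.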
To support these metrics, the paper derives a closed-form, interpretable estimator for the total FLOPs required by an LLM-based reranker. The estimator accounts for model architecture (decoder-only and encoder-decoder), attention mechanism (multi-head and grouped-query), prompt and output lengths, and the number of reranked documents. The estimator is validated empirically against profiler-based FLOPs measurements, demonstrating strong linear correlation and robustness across model families and sizes.
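The paper does not reproduce its exact formula here, but the shape of such an estimator can be sketched with a generic forward-pass FLOPs approximation for a decoder-only transformer. Everything below (function name, the specific term grouping, the example hyperparameters) is our assumption, not the authors' estimator; it illustrates how architecture, grouped-query attention, and prompt/output lengths enter the count:

```python
def estimate_flops(n_layers, d_model, d_ff, vocab_size,
                   n_heads, n_kv_heads, prompt_len, output_len):
    """Rough forward-pass FLOPs for a decoder-only transformer.

    Per layer: Q/K/V/output projections (grouped-query attention
    shrinks the K/V projections by n_kv_heads / n_heads), the
    attention score/value matmuls (which grow with context length),
    and the feed-forward network. A generic approximation, not the
    paper's closed-form estimator.
    """
    kv_ratio = n_kv_heads / n_heads
    total = 0.0
    for step in range(output_len + 1):
        # step 0 = prefill over the whole prompt; later steps decode 1 token
        tokens = prompt_len if step == 0 else 1
        ctx = prompt_len + step
        proj = 2 * tokens * d_model * d_model * (2 + 2 * kv_ratio)  # Q,O + K,V
        attn = 2 * 2 * tokens * ctx * d_model                        # scores + values
        ffn  = 2 * 2 * tokens * d_model * d_ff                       # up + down proj
        total += n_layers * (proj + attn + ffn)
    # language-model head over the vocabulary
    total += 2 * (prompt_len + output_len) * d_model * vocab_size
    return total
```

Summing such a per-call estimate over the number of LLM calls a reranking strategy issues yields the total compute that RPP and QPP normalize by.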
Experimental Analysis
Comprehensive experiments are conducted on TREC-DL19 and DL20, evaluating a diverse set of reranking strategies (pointwise, pairwise, setwise, listwise) and LLM backbones (Flan-T5-large/xl/xxl, Llama-3.1-8B-Instruct, Qwen2.5). The results reveal several key findings:
- Pointwise methods (e.g., pointwise.yes_no) consistently achieve the highest RPP and QPP across all LLMs, delivering substantial NDCG improvements over BM25 with minimal computational cost. For instance, pointwise.yes_no with Flan-T5-large attains an RPP of 72.67 and QPP of 111.1, with a 10–30% NDCG gain over BM25.
- Scaling model size yields diminishing returns in effectiveness but severe degradation in efficiency. For setwise.heapsort, NDCG increases marginally from 0.670 (large) to 0.706 (xxl), while RPP drops from 26.8 to 1.84 and QPP from 40.0 to 2.61.
- Pairwise and listwise methods are computationally prohibitive. The pairwise.allpair method, despite achieving the highest NDCG on Flan-T5-xl (0.713), requires 9,900 LLM calls per query, yielding an RPP of 0.10 and QPP of 0.15, orders of magnitude less efficient than pointwise approaches.
- TourRank with Llama-3.1-8B-Instruct achieves the highest absolute NDCG (0.757 and 0.777 on the two benchmarks) but at the lowest RPP and QPP, highlighting the cost of maximizing effectiveness without regard to compute.
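The efficiency gap between strategies is driven largely by how many LLM calls each issues per query. The formulas below are the standard call counts for these strategies (the pairwise figure matches the 9,900 calls reported above for a 100-document candidate list); the function names are ours:

```python
def calls_pointwise(n: int) -> int:
    """One relevance-scoring call per candidate document."""
    return n

def calls_pairwise_allpair(n: int) -> int:
    """One comparison call per ordered pair of candidates."""
    return n * (n - 1)

n = 100  # typical first-stage candidate pool
print(calls_pointwise(n))          # 100
print(calls_pairwise_allpair(n))   # 9900
```

Since per-call FLOPs are comparable across strategies for a fixed backbone, the roughly 100x call-count gap translates directly into the orders-of-magnitude RPP and QPP differences reported above.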
The FLOPs estimator is further validated by demonstrating its linear relationship with both measured FLOPs and observed latency, as well as its sensitivity to prompt length. This confirms its utility as a reliable, hardware-agnostic proxy for computational cost.
Practical Implications
The introduction of RPP and QPP enables principled, hardware-independent benchmarking of LLM-based rerankers, facilitating fair comparison and informed system design. The FLOPs estimator allows practitioners to anticipate computational requirements and efficiency trade-offs without executing the models, which is particularly valuable for early-stage architecture selection and scaling studies.
From a deployment perspective, the results strongly suggest that pointwise reranking methods are preferable for production systems where compute is a bottleneck. The severe efficiency penalties incurred by pairwise and listwise methods, especially as model size increases, render them impractical for large-scale or latency-sensitive applications unless further algorithmic or architectural optimizations are introduced.
Theoretical and Future Directions
The work formalizes the efficiency-effectiveness frontier for LLM-based reranking, providing a foundation for future research on compute-efficient ranking algorithms. The closed-form FLOPs estimator can be extended to more complex architectures, such as mixture-of-experts or retrieval-augmented models, though the authors note potential limitations in accuracy for such cases.
Future research directions include:
- Refining the estimator via regression against real measurements for advanced architectures.
- Incorporating additional system-level constraints (e.g., memory bandwidth, energy consumption) into the efficiency metrics.
- Exploring algorithmic innovations that close the efficiency-effectiveness gap, such as hybrid reranking pipelines or adaptive model selection based on query complexity.
Limitations
The estimator assumes consistent implementation across frameworks and may not capture library-level optimizations or hardware-specific kernel differences. While FLOPs provide a stable proxy for compute, they do not account for all real-world constraints, such as memory and energy bottlenecks or dynamic system loads.
Conclusion
This work establishes a rigorous, interpretable framework for evaluating the efficiency-effectiveness trade-off in LLM-based reranking, grounded in FLOPs-normalized metrics and a validated estimator. The findings underscore the necessity of considering computational cost in reranker design and provide actionable guidance for both researchers and practitioners seeking to deploy LLM-based retrieval systems at scale.