- The paper introduces FLOPs-normalized metrics, RPP and QPP, which quantify ranking quality and query throughput per unit compute for LLM-based rerankers.
- The paper develops a closed-form estimator that reliably predicts FLOPs across different LLM architectures, aligning with empirical measurements.
- The paper demonstrates that pointwise reranking methods offer the most favorable efficiency-effectiveness trade-offs, achieving notable NDCG improvements at a fraction of the computational cost of pairwise and listwise methods.
Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers
This paper addresses a critical gap in the evaluation of LLM-based rerankers for information retrieval: the lack of hardware-agnostic, interpretable metrics for quantifying the efficiency-effectiveness trade-off. While LLM-based rerankers have demonstrated strong performance on standard ranking metrics, their substantial computational demands pose significant challenges for real-world deployment. Existing efficiency metrics—such as latency, number of LLM calls, and token counts—are confounded by hardware, batch size, and model size, making cross-method and cross-system comparisons unreliable.
Proposed Metrics and FLOPs Estimator
The authors introduce two FLOPs-normalized metrics:
- Ranking metrics per PetaFLOP (RPP): Measures ranking quality (e.g., NDCG, MRR, MAP) per petaFLOP of computation, enabling direct comparison of effectiveness per unit compute.
- Queries per PetaFLOP (QPP): Quantifies throughput as the number of queries processed per petaFLOP, providing a hardware-agnostic measure of system scalability.
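Both metrics reduce to simple ratios of a quality or throughput quantity over compute expressed in petaFLOPs. A minimal sketch (the function names and the example numbers are ours, chosen only for illustration):

```python
PETA = 1e15  # FLOPs per petaFLOP

def rpp(ranking_metric: float, total_flops: float) -> float:
    """Ranking metric (e.g., NDCG@10) achieved per petaFLOP of compute."""
    return ranking_metric / (total_flops / PETA)

def qpp(num_queries: int, total_flops: float) -> float:
    """Number of queries processed per petaFLOP of compute."""
    return num_queries / (total_flops / PETA)

# Illustrative only: an NDCG of 0.65 at 9e12 FLOPs per query
print(rpp(0.65, 9e12))  # ≈ 72.22
```

Because both metrics divide by the same compute figure, a method that halves its FLOPs at constant quality doubles both its RPP and its QPP.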
To support these metrics, the paper derives a closed-form, interpretable estimator for the total FLOPs required by an LLM-based reranker. The estimator accounts for model architecture (decoder-only and encoder-decoder), attention mechanism (multi-head and grouped-query), prompt and output lengths, and the number of reranked documents. The estimator is validated empirically against profiler-based FLOPs measurements, demonstrating strong linear correlation and robustness across model families and sizes.
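The paper does not reproduce its exact formula here, but the shape of such an estimator can be sketched with a generic forward-pass FLOPs approximation for a decoder-only transformer. Everything below (function name, the specific term grouping, the example hyperparameters) is our assumption, not the authors' estimator; it illustrates how architecture, grouped-query attention, and prompt/output lengths enter the count:

```python
def estimate_flops(n_layers, d_model, d_ff, vocab_size,
                   n_heads, n_kv_heads, prompt_len, output_len):
    """Rough forward-pass FLOPs for a decoder-only transformer.

    Per layer: Q/K/V/output projections (grouped-query attention
    shrinks the K/V projections by n_kv_heads / n_heads), the
    attention score/value matmuls (which grow with context length),
    and the feed-forward network. A generic approximation, not the
    paper's closed-form estimator.
    """
    kv_ratio = n_kv_heads / n_heads
    total = 0.0
    for step in range(output_len + 1):
        # step 0 = prefill over the whole prompt; later steps decode 1 token
        tokens = prompt_len if step == 0 else 1
        ctx = prompt_len + step
        proj = 2 * tokens * d_model * d_model * (2 + 2 * kv_ratio)  # Q,O + K,V
        attn = 2 * 2 * tokens * ctx * d_model                        # scores + values
        ffn  = 2 * 2 * tokens * d_model * d_ff                       # up + down proj
        total += n_layers * (proj + attn + ffn)
    # language-model head over the vocabulary
    total += 2 * (prompt_len + output_len) * d_model * vocab_size
    return total
```

Summing such a per-call estimate over the number of LLM calls a reranking strategy issues yields the total compute that RPP and QPP normalize by.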
Experimental Analysis
Comprehensive experiments are conducted on TREC-DL19 and DL20, evaluating a diverse set of reranking strategies (pointwise, pairwise, setwise, listwise) and LLM backbones (Flan-T5-large/xl/xxl, Llama-3.1-8B-Instruct, Qwen2.5). The results reveal several key findings:
- Pointwise methods (e.g., pointwise.yes_no) consistently achieve the highest RPP and QPP across all LLMs, delivering substantial NDCG improvements over BM25 with minimal computational cost. For instance, pointwise.yes_no with Flan-T5-large attains an RPP of 72.67 and QPP of 111.1, with a 10–30% NDCG gain over BM25.
- Scaling model size yields diminishing returns in effectiveness but severe degradation in efficiency. For setwise.heapsort, NDCG increases marginally from 0.670 (large) to 0.706 (xxl), while RPP drops from 26.8 to 1.84 and QPP from 40.0 to 2.61.
- Pairwise and listwise methods are computationally prohibitive. The pairwise.allpair method, despite achieving the highest NDCG on Flan-T5-xl (0.713), requires 9,900 LLM calls per query, yielding an RPP of 0.10 and QPP of 0.15, orders of magnitude less efficient than pointwise approaches.
- TourRank with Llama-3.1-8B-Instruct achieves the highest absolute NDCG (0.757 and 0.777 on the two benchmarks) but at the lowest RPP and QPP, highlighting the cost of maximizing effectiveness without regard to compute.
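The efficiency gap between strategies is driven largely by how many LLM calls each issues per query. The formulas below are the standard call counts for these strategies (the pairwise figure matches the 9,900 calls reported above for a 100-document candidate list); the function names are ours:

```python
def calls_pointwise(n: int) -> int:
    """One relevance-scoring call per candidate document."""
    return n

def calls_pairwise_allpair(n: int) -> int:
    """One comparison call per ordered pair of candidates."""
    return n * (n - 1)

n = 100  # typical first-stage candidate pool
print(calls_pointwise(n))          # 100
print(calls_pairwise_allpair(n))   # 9900
```

Since per-call FLOPs are comparable across strategies for a fixed backbone, the roughly 100x call-count gap translates directly into the orders-of-magnitude RPP and QPP differences reported above.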
The FLOPs estimator is further validated by demonstrating its linear relationship with both measured FLOPs and observed latency, as well as its sensitivity to prompt length. This confirms its utility as a reliable, hardware-agnostic proxy for computational cost.
Practical Implications
The introduction of RPP and QPP enables principled, hardware-independent benchmarking of LLM-based rerankers, facilitating fair comparison and informed system design. The FLOPs estimator allows practitioners to anticipate computational requirements and efficiency trade-offs without executing the models, which is particularly valuable for early-stage architecture selection and scaling studies.
From a deployment perspective, the results strongly suggest that pointwise reranking methods are preferable for production systems where compute is a bottleneck. The severe efficiency penalties incurred by pairwise and listwise methods, especially as model size increases, render them impractical for large-scale or latency-sensitive applications unless further algorithmic or architectural optimizations are introduced.
Theoretical and Future Directions
The work formalizes the efficiency-effectiveness frontier for LLM-based reranking, providing a foundation for future research on compute-efficient ranking algorithms. The closed-form FLOPs estimator can be extended to more complex architectures, such as mixture-of-experts or retrieval-augmented models, though the authors note potential limitations in accuracy for such cases.
Future research directions include:
- Refining the estimator via regression against real measurements for advanced architectures.
- Incorporating additional system-level constraints (e.g., memory bandwidth, energy consumption) into the efficiency metrics.
- Exploring algorithmic innovations that close the efficiency-effectiveness gap, such as hybrid reranking pipelines or adaptive model selection based on query complexity.
Limitations
The estimator assumes consistent implementation across frameworks and may not capture library-level optimizations or hardware-specific kernel differences. While FLOPs provide a stable proxy for compute, they do not account for all real-world constraints, such as memory and energy bottlenecks or dynamic system loads.
Conclusion
This work establishes a rigorous, interpretable framework for evaluating the efficiency-effectiveness trade-off in LLM-based reranking, grounded in FLOPs-normalized metrics and a validated estimator. The findings underscore the necessity of considering computational cost in reranker design and provide actionable guidance for both researchers and practitioners seeking to deploy LLM-based retrieval systems at scale.