E²R-FLOPs: LLM Reranking Efficiency Metrics
- E²R-FLOPs is a computational framework that measures LLM reranking efficiency using normalized FLOPs metrics.
- It introduces RPP and QPP to separately quantify reranking effectiveness and query throughput in a device-independent manner.
- The methodology details closed-form FLOPs estimation for both decoder-only and encoder-decoder architectures, including quadratic attention costs.
E²R-FLOPs (Efficiency–Effectiveness Reranking FLOPs) is a standardized computational framework and set of metrics for quantifying and comparing the compute efficiency of LLM–based reranking systems in information retrieval. Unlike traditional proxy metrics such as latency or token counts—which confound comparisons due to hardware and implementation variability—E²R-FLOPs measures computational cost strictly in floating-point operations (FLOPs), normalized at the scale of PetaFLOPs, and yields direct, interpretable metrics for the efficiency–effectiveness tradeoff in reranking (Peng et al., 8 Jul 2025).
1. Formal Metrics: RPP and QPP
E²R-FLOPs introduces two normalized metrics:
- Ranking metrics per PetaFLOP (RPP):

$$\mathrm{RPP} = \frac{1}{|Q|} \sum_{q \in Q} \frac{M_q}{F_q}$$

where $M_q$ is an effectiveness measure for query $q$ (such as NDCG@10), and $F_q$ is the total floating-point operations needed for reranking that query—including both prompt encoding and output decoding—expressed in PetaFLOPs. The resulting units are effectiveness points per PetaFLOP.
- Queries per PetaFLOP (QPP):

$$\mathrm{QPP} = \frac{1}{\bar{F}}$$

where $\bar{F}$ is the average FLOPs per query over a query set, in PetaFLOPs. This measures hardware-agnostic throughput, or how many queries can be processed per PetaFLOP of compute.
These definitions decouple the evaluation of reranker efficiency from hardware, implementation, or batching effects and provide a device-independent basis for comparison (Peng et al., 8 Jul 2025).
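As a concrete illustration, both metrics can be computed from per-query effectiveness scores and raw FLOPs counts. This is a minimal sketch; the helper names are hypothetical, not from the paper.

```python
PETA = 1e15  # raw FLOPs per PetaFLOP

def rpp(metrics, flops):
    """Ranking metrics per PetaFLOP: average of per-query metric / PetaFLOPs."""
    assert len(metrics) == len(flops) and metrics
    return sum(m / (f / PETA) for m, f in zip(metrics, flops)) / len(metrics)

def qpp(flops):
    """Queries per PetaFLOP: reciprocal of the average PetaFLOPs per query."""
    return 1.0 / (sum(flops) / len(flops) / PETA)
```

For example, a reranker spending 9×10¹² FLOPs (0.009 PetaFLOPs) per query at NDCG@10 = 0.654 yields RPP ≈ 72.7 and QPP ≈ 111.1.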
2. Analytical FLOPs Estimation for LLM Rerankers
E²R-FLOPs pairs its efficiency metrics with a closed-form, self-contained FLOPs estimator for LLM rerankers, supporting both decoder-only and encoder–decoder transformer architectures.
Decoder-Only Estimation
Given:
- $L$: number of layers,
- $d$: hidden-state dimension,
- $d_{\mathrm{ff}}$: feed-forward dimension,
- $d_{\mathrm{attn}}$: attention projection dimension,
- $l_{\mathrm{prefix}}$: prompt prefix length,
- $l_q$: query length,
- $n$: number of candidate documents,
- $l_{\mathrm{doc}}$: average document length,
- $l_{\mathrm{out}}$: output length (generation steps).
The total per-query FLOPs is assembled as follows:
- Parameter count: $N = 2 d L (2 d_{\mathrm{attn}} + d_{\mathrm{ff}})$
- Prompt (context) length: $l_{\mathrm{prompt}} = l_{\mathrm{prefix}} + l_q + n \cdot l_{\mathrm{doc}}$
- Prompt processing cost: $C_{\mathrm{prompt}} = 2 N l_{\mathrm{prompt}} + 2 L d_{\mathrm{attn}} l_{\mathrm{prompt}}^2$
- Output generation cost: $C_{\mathrm{out}} = l_{\mathrm{out}} \left( 2N + 2 L d_{\mathrm{attn}} (l_{\mathrm{prompt}} + l_{\mathrm{out}}) \right)$
- Total: $F = C_{\mathrm{prompt}} + C_{\mathrm{out}}$
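The decoder-only accounting can be sketched as a short function. This is an illustrative implementation of standard FLOPs accounting with variable names matching the symbol definitions above; any deviation from the paper's exact constants is my assumption.

```python
def decoder_flops(L, d, d_ff, d_attn, l_prefix, l_q, n_docs, l_doc, l_out):
    """Per-query FLOPs estimate for a decoder-only reranker (full attention)."""
    N = 2 * d * L * (2 * d_attn + d_ff)          # parameter count
    l_prompt = l_prefix + l_q + n_docs * l_doc   # total prompt (context) length
    # Prompt processing: linear term in parameters plus quadratic attention term.
    c_prompt = 2 * N * l_prompt + 2 * L * d_attn * l_prompt ** 2
    # Generation: each output token attends over the prompt and prior outputs.
    c_out = l_out * (2 * N + 2 * L * d_attn * (l_prompt + l_out))
    return c_prompt + c_out
```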
Encoder–Decoder Extension
The encoder cost $C_{\mathrm{enc}}$ mirrors $C_{\mathrm{prompt}}$ with the encoder's own parameters. The decoder processes output tokens with additional cross-attention (KV) cost over the encoder outputs:

$$C_{\mathrm{dec}} = l_{\mathrm{out}} \left( 2 N_{\mathrm{dec}} + 2 L d_{\mathrm{attn}} (l_{\mathrm{out}} + l_{\mathrm{prompt}}) \right)$$

with the total:

$$F = C_{\mathrm{enc}} + C_{\mathrm{dec}}$$
The estimator assumes full attention (no sparse optimizations), uses published LLM hyperparameters, and discounts kernel-specific or hardware optimizations, yielding a hardware-agnostic FLOPs count (Peng et al., 8 Jul 2025).
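A matching sketch for the encoder–decoder case, under the simplifying assumption that encoder and decoder stacks are the same size (the estimator described above uses each stack's own parameter count):

```python
def enc_dec_flops(L, d, d_ff, d_attn, l_prompt, l_out):
    """Per-query FLOPs estimate for an encoder-decoder reranker."""
    # Per-stack parameter count; equal-sized encoder and decoder assumed here.
    N = 2 * d * L * (2 * d_attn + d_ff)
    # Encoder: full (quadratic) attention over the entire prompt.
    c_enc = 2 * N * l_prompt + 2 * L * d_attn * l_prompt ** 2
    # Decoder: self-attention over generated tokens plus cross-attention (KV)
    # over encoder outputs of length l_prompt.
    c_dec = l_out * (2 * N + 2 * L * d_attn * (l_out + l_prompt))
    return c_enc + c_dec
```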
3. Practical Recipe for Computing E²R-FLOPs
To apply E²R-FLOPs metrics in practice:
- Step 1: Gather or specify reranking task statistics ($l_{\mathrm{prefix}}, l_q, n, l_{\mathrm{doc}}, l_{\mathrm{out}}$) and model hyperparameters ($L, d, d_{\mathrm{ff}}, d_{\mathrm{attn}}$).
- Step 2: Compute the parameter count $N = 2 d L (2 d_{\mathrm{attn}} + d_{\mathrm{ff}})$.
- Step 3: Calculate $C_{\mathrm{prompt}}$ and $C_{\mathrm{out}}$ (plus $C_{\mathrm{enc}}$ and $C_{\mathrm{dec}}$ for encoder–decoder).
- Step 4: Obtain the total FLOPs $F$ (summed over queries as needed); normalize to PetaFLOPs: $F_{\mathrm{PF}} = F / 10^{15}$.
- Step 5: Report effectiveness (e.g., NDCG@10) and compute RPP and QPP as defined above.
This methodology enables direct, interpretable, device-independent comparisons across different LLM-based rerankers (Peng et al., 8 Jul 2025).
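The five steps can be chained end to end. The sketch below assumes a decoder-only reranker; the 7B-scale hyperparameters and task statistics are illustrative placeholders, not values from the paper.

```python
PETA = 1e15  # raw FLOPs per PetaFLOP

def decoder_flops(L, d, d_ff, d_attn, l_prefix, l_q, n_docs, l_doc, l_out):
    """Per-query FLOPs estimate for a decoder-only reranker (full attention)."""
    N = 2 * d * L * (2 * d_attn + d_ff)          # Step 2: parameter count
    l_prompt = l_prefix + l_q + n_docs * l_doc   # prompt length
    c_prompt = 2 * N * l_prompt + 2 * L * d_attn * l_prompt ** 2
    c_out = l_out * (2 * N + 2 * L * d_attn * (l_prompt + l_out))
    return c_prompt + c_out                      # Step 3

# Step 1: illustrative 7B-scale hyperparameters, 100 candidates per query
f = decoder_flops(L=32, d=4096, d_ff=11008, d_attn=4096,
                  l_prefix=50, l_q=16, n_docs=100, l_doc=128, l_out=1)
f_pf = f / PETA          # Step 4: normalize to PetaFLOPs
ndcg = 0.65              # Step 5: measured effectiveness (placeholder value)
rpp, qpp = ndcg / f_pf, 1.0 / f_pf
```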
4. Advantages over Traditional Proxy Metrics
E²R-FLOPs addresses fundamental deficiencies in prior efficiency metrics:
- Hardware-agnosticism: FLOPs represent theoretical compute work, immune to device variations.
- Model-size awareness: The FLOPs estimator captures the cost difference between models of dramatically different scales (e.g., a 70B LLM incurs roughly 23× the FLOPs of a 3B model per forward pass), a detail hidden by token- or call-count proxies.
- Comprehensive accounting: Both encoding and decoding costs, including quadratic attention terms, are explicitly included.
- Direct effectiveness/efficiency tradeoff: RPP quantifies standardized effectiveness-per-compute; QPP measures throughput-per-compute; together, they define an efficiency–effectiveness frontier unbiased by batching or parallelization.
- Enables fair comparison: Efficiency frontiers (RPP vs QPP plots) allow principled comparison of reranking methodologies, model configurations, and resource tradeoffs (Peng et al., 8 Jul 2025).
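One way to realize such a frontier comparison is to extract the Pareto-optimal methods from (RPP, QPP) pairs. This is a generic sketch, not an artifact of the paper; method names are placeholders.

```python
def pareto_frontier(points):
    """Keep methods not dominated in both RPP and QPP by any other method.

    points: list of (name, rpp, qpp) tuples.
    """
    frontier = []
    for name, rpp, qpp in points:
        dominated = any(r >= rpp and q >= qpp and (r, q) != (rpp, qpp)
                        for _, r, q in points)
        if not dominated:
            frontier.append((name, rpp, qpp))
    return frontier
```

Methods on the returned frontier represent the best available effectiveness-per-compute at their throughput level.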
5. Illustrative Evaluation on DL19
In a comparative evaluation on TREC DL19:
| Method | NDCG@10 | FLOPs (PetaFLOPs) | RPP | QPP |
|---|---|---|---|---|
| Pointwise “yes_no” | 0.654 | 0.009 | 72.67 | 111.1 |
| Pairwise “allpair” | 0.666 | 1.865 | 0.36 | 0.54 |
- Interpretation: The pointwise method, at 0.654 NDCG@10, achieves 72.7 NDCG points per PetaFLOP and can process 111 queries per PetaFLOP—representing high efficiency. In contrast, the pairwise “allpair” method improves NDCG@10 to 0.666 but its FLOPs cost increases by roughly 200× (0.009 → 1.865 PetaFLOPs), reducing both RPP and QPP below 1: each PetaFLOP yields less than one processed query and less than one effectiveness point.
Plotting all methods in (RPP, QPP) space reveals an efficiency–effectiveness frontier: pointwise methods cluster in the upper-right (high RPP/high QPP), while pairwise/listwise formulations and larger LLMs move the frontier lower, demonstrating that quality gains may entail superlinear compute costs (Peng et al., 8 Jul 2025).
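As a sanity check, the RPP and QPP columns of the table above follow directly from its NDCG@10 and PetaFLOPs columns:

```python
def rpp_qpp(ndcg, petaflops):
    """Single-method aggregates: effectiveness and queries per PetaFLOP."""
    return ndcg / petaflops, 1.0 / petaflops

pointwise = rpp_qpp(0.654, 0.009)  # pointwise "yes_no" row
allpair = rpp_qpp(0.666, 1.865)    # pairwise "allpair" row
```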
6. Significance and Impact
E²R-FLOPs establishes a rigorous, interpretable, and hardware-agnostic foundation for evaluating LLM-based reranking in information retrieval. The introduction of RPP and QPP, together with a transparent FLOPs estimation protocol, enables researchers to precisely characterize and compare reranking architectures, control for model scale, and navigate the inherent tradeoff between reranking effectiveness and computational efficiency. These metrics promote principled resource allocation and highlight approaches along the efficiency frontier, fostering clearer reporting and more meaningful progress in LLM-based reranking research (Peng et al., 8 Jul 2025).