
Ranking Metrics per PetaFLOP (RPP)

Updated 13 July 2025
  • Ranking Metrics per PetaFLOP (RPP) are quantitative measures that benchmark the efficiency and effectiveness of computational systems by normalizing performance per petaFLOP (10^15 FLOPs) of compute.
  • RPP integrates traditional metrics like NDCG, HPL, and MAP with FLOP estimates to enable hardware-agnostic comparisons across supercomputing and AI workloads.
  • RPP guides system design and optimization by illustrating trade-offs between effectiveness and compute cost in high-performance computing, scientific AI, and LLM-based reranking.

Ranking Metrics per PetaFLOP (RPP) are quantitative measures designed to evaluate the efficiency, effectiveness, and competitiveness of computational and information retrieval systems at petaflop scales. The term encompasses a class of metrics and evaluation methodologies that normalize system or algorithmic performance by the number of floating-point operations (FLOPs) in the petaflop range (10¹⁵ FLOPs), allowing comparisons across hardware architectures, application domains, and algorithmic strategies. RPP serves as a foundational tool for benchmarking high-performance computing (HPC) systems, large-scale AI training, information retrieval ranking systems, and their hybrid applications in scientific and industrial contexts.

1. Principles and Purpose of RPP

The overarching goal of RPP is to provide a normalized framework for comparing the relative performance and efficiency of systems or algorithms operating at petaflop scale. Traditional metrics such as latency, throughput, or application-level scores (e.g., NDCG or HPL) alone do not account for the compute cost in a hardware- and scale-agnostic manner. RPP bridges this gap by relating measured performance—be it end-to-end application throughput, ranking effectiveness, or benchmark scores—directly to petaflop-scale computational expenditure.

In high-performance computing, RPP captures the ratio of sustained performance (typically in FLOP/s) realized under practical workload conditions to theoretical or measured peak performance, often considering energy efficiency as an additional dimension (Ponce et al., 2019, Banchelli et al., 13 Mar 2025, Konishi, 2 Jul 2025). In LLM–based reranking and retrieval tasks, RPP quantifies ranking metric gains per petaFLOP consumed, highlighting the trade-off between effectiveness and computational cost (Peng et al., 8 Jul 2025).

2. Mathematical Formulation and Benchmark Integration

The mathematical backbone for RPP is grounded in FLOPs-based normalization of performance metrics:

  • General RPP Formula (Editor's term):

\text{RPP} = \frac{m(q)}{C_q / 10^{15}}

where m(q) is the application-specific metric (e.g., NDCG, HPL, MAP) for a query or workload q, and C_q is the FLOP count per query or task (Peng et al., 8 Jul 2025).
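
The general formula can be sketched directly in code (the function name and example values below are illustrative, not taken from the cited work):

```python
PETA = 1e15  # one petaFLOP expressed in FLOPs

def rpp(metric_value: float, flops: float) -> float:
    """Ranking metric per petaFLOP: m(q) divided by C_q expressed in petaFLOPs."""
    return metric_value / (flops / PETA)

# e.g., an NDCG@10 of 0.72 for a query whose processing consumed 2.4e14 FLOPs:
print(rpp(0.72, 2.4e14))  # ~3.0 metric units per petaFLOP
```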

  • Supercomputing Practice:

For cluster or supercomputer evaluation, compute efficiency is often computed as:

\eta = \frac{R_{\mathrm{max}}}{R_{\mathrm{peak}}}

where R_max is the measured sustained performance in PFLOP/s, and R_peak is the theoretical peak (Ponce et al., 2019, Banchelli et al., 13 Mar 2025, Konishi, 2 Jul 2025).
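
A minimal sketch of this ratio (the numbers are illustrative, not from any cited system):

```python
def compute_efficiency(r_max_pflops: float, r_peak_pflops: float) -> float:
    """Sustained-to-peak efficiency ratio (eta) for an HPC system."""
    return r_max_pflops / r_peak_pflops

# e.g., a machine sustaining 3.0 PFLOP/s on HPL against a 4.0 PFLOP/s theoretical peak:
print(compute_efficiency(3.0, 4.0))  # 0.75
```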

  • Energy-Related RPP:

Energy efficiency may be incorporated as:

E_{PF} = \frac{P_{\mathrm{total}}}{R_{\mathrm{max}}}

where P_total is the total power consumed (Ponce et al., 2019, Banchelli et al., 13 Mar 2025).
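
Expressed as code (hypothetical values; note that lower is better for this ratio):

```python
def energy_per_sustained_pflops(total_power_watts: float, r_max_pflops: float) -> float:
    """Power drawn per unit of sustained performance (W per PFLOP/s); lower is better."""
    return total_power_watts / r_max_pflops

# e.g., a 2 MW system sustaining 10 PFLOP/s:
print(energy_per_sustained_pflops(2e6, 10.0))  # 200000.0 W per PFLOP/s
```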

  • Information Retrieval Context:

In LLM-based reranking, RPP is explicitly defined using a hardware-agnostic FLOPs estimator, with queries per petaFLOP (QPP) introduced as a throughput counterpart:

\text{QPP} = \frac{1}{\mathrm{AVG}(C_q / 10^{15})}

(Peng et al., 8 Jul 2025).
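
The throughput counterpart can be sketched as follows (the per-query FLOP counts are illustrative):

```python
PETA = 1e15  # one petaFLOP expressed in FLOPs

def qpp(per_query_flops: list[float]) -> float:
    """Queries per petaFLOP: reciprocal of the mean per-query cost in petaFLOPs."""
    mean_pflops = sum(c / PETA for c in per_query_flops) / len(per_query_flops)
    return 1.0 / mean_pflops

# Three queries costing 0.2, 0.3, and 0.5 petaFLOPs each (mean 1/3 petaFLOP):
print(qpp([2e14, 3e14, 5e14]))  # ~3.0 queries per petaFLOP
```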

RPP thus flexibly adapts across domains by pairing a field-relevant effectiveness metric with the measured or estimated petaFLOPs consumed.

3. Applications in Supercomputing and AI Workloads

Supercomputing Systems

Prominent HPC systems such as Niagara (Ponce et al., 2019), MareNostrum5 (Banchelli et al., 13 Mar 2025), and SAKURAONE (Konishi, 2 Jul 2025) employ RPP analysis to assess both their absolute and relative performance. RPP is pivotal in:

  • TOP500 Rankings: Where sustained HPL performance determines a system's global standing, and efficiency ratios provide context for competitive differentiation.
  • Real-World Applications: Application studies (e.g., Alya, OpenFOAM, IFS) on MareNostrum5 demonstrate that near-theoretical floating-point performance, memory bandwidth optimization, and scalable interconnects underpin high RPP scores.
  • AI Training: Benchmarks such as HPL-MxP (used for SAKURAONE) assess AI-relevant, low-precision workloads. The approximate 10x speedup in FP8 over FP64 for HPL-MxP underlines the impact of architectural advances on the RPP for deep learning applications.

Scientific AI and Containerized Workloads

Petaflop-scale deployment of scientific neural networks, such as the 3DGAN for high-energy physics on secure HPC clusters, quantifies efficiency per petaflop by comparing the measured and theoretical FLOP rates for hyperoptimized kernels (Brayford et al., 2020). RPP in this context guides the closing of the gap between measured peak and actual performance, with containerization and MPI library compatibility as critical enablers for scaling efficiency.

LLM-based Rerankers and Information Retrieval

In the domain of LLM-based reranking, E²R-FLOPs (Peng et al., 8 Jul 2025) formalizes RPP to assess the efficiency-effectiveness trade-off. The paper’s FLOPs estimator allows researchers to gauge model efficiency per petaFLOP before running large-scale experiments. Empirical results confirm that simple rerankers can achieve high RPP, whereas sophisticated pairwise or listwise rerankers may exhibit diminishing marginal improvements in effectiveness at dramatically increased FLOPs cost.
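
The paper's exact estimator is not reproduced here, but a common hardware-agnostic approximation for dense-transformer inference (roughly 2 × parameters × tokens FLOPs per forward pass) is enough to sketch how a reranker's per-query cost, and hence its RPP, might be estimated before running experiments. All numbers below are hypothetical:

```python
PETA = 1e15  # one petaFLOP expressed in FLOPs

def estimate_pointwise_reranker_flops(num_params: float, tokens_per_doc: float,
                                      num_docs: int) -> float:
    """Rough inference cost for a dense transformer scoring each candidate document
    independently: ~2 * params * tokens FLOPs per forward pass, times num_docs."""
    return 2.0 * num_params * tokens_per_doc * num_docs

# A hypothetical 7B-parameter pointwise reranker scoring 100 candidates of ~256 tokens:
cost = estimate_pointwise_reranker_flops(7e9, 256.0, 100)
print(cost / PETA)  # ~0.358 petaFLOPs per query
```

Dividing the achieved ranking metric by this per-query cost in petaFLOPs yields the RPP estimate; pairwise or listwise rerankers process far more tokens per query and thus lower RPP unless effectiveness rises commensurately.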

4. Evaluation, Trade-offs, and Limitations

Across domains, the utility of RPP lies in its ability to benchmark not only performance but also efficiency and practical deployability:

  • Comparative Analysis: RPP enables fair comparison across architectures and workloads by decoupling effectiveness from hardware specifics such as batch size or parallelism (Peng et al., 8 Jul 2025).
  • Trade-off Surface: Analysis reveals that increasing model capacity or system complexity often leads to sublinear effectiveness gains per petaFLOP, evident in both large-scale AI and traditional HPC benchmarks (Peng et al., 8 Jul 2025, Banchelli et al., 13 Mar 2025).
  • Limitations: The accuracy of RPP can be bounded by the fidelity of the FLOPs estimator, non-uniformity in workload characteristics, and exclusion of factors like energy consumption, memory usage, or non-standard architectures (e.g., mixture-of-experts) (Peng et al., 8 Jul 2025, Banchelli et al., 13 Mar 2025). Real-world conditions such as memory bandwidth or interconnect saturation may also influence realized RPP.

5. Benchmarking Methodologies and Performance Metrics

RPP evaluation incorporates a range of methodologies depending on domain:

  • HPC Benchmarks: HPL (LINPACK), HPCG, and HPL-MxP form the canonical basis for RPP assessment in supercomputing. Performance is measured on both double-precision and low-precision (AI-relevant) tasks, with efficiency evaluated as a ratio to peak theoretical or observed potential (Ponce et al., 2019, Banchelli et al., 13 Mar 2025, Konishi, 2 Jul 2025).
  • Microbenchmarks: Floating-point unit testing, memory bandwidth (e.g., STREAM, MEM), and interconnect latency/bandwidth measurements provide fine-grained RPP component analysis (Banchelli et al., 13 Mar 2025).
  • Application Studies: Full-scale scientific applications illustrate RPP under practical, heterogeneous workloads—serving as a proxy for real-user efficiency.
  • Information Retrieval: Ranking effectiveness metrics per compute (e.g., NDCG@k / petaFLOP) and throughput (queries per petaFLOP) become central in AI search and LLM-based reranking (Peng et al., 8 Jul 2025).
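
A minimal sketch of such an effectiveness-per-compute calculation, pairing a standard NDCG@k implementation with an assumed per-query FLOP count (all values illustrative):

```python
import math

def ndcg_at_k(relevances: list[float], k: int) -> float:
    """NDCG@k for one ranked list of graded relevance labels."""
    def dcg(rels: list[float]) -> float:
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def ndcg_per_petaflop(relevances: list[float], k: int, flops: float) -> float:
    """NDCG@k normalized by the query's compute cost in petaFLOPs."""
    return ndcg_at_k(relevances, k) / (flops / 1e15)

# A ranked list with graded labels, scored at an assumed cost of 5e14 FLOPs:
print(ndcg_per_petaflop([3, 2, 3, 0, 1], 5, 5e14))  # ~1.94
```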

Tables of benchmark results, such as per-node energy efficiency in GFLOP/(s×W), scaling efficiency, and energy-delay product, provide detailed RPP-related insights.

6. Implications for System Design, Optimization, and Future Work

RPP-driven analysis underpins architectural and algorithmic decisions in both industry and research:

  • System Optimization: RPP motivates hardware-software co-design, such as leveraging high-bandwidth memory (HBM) for memory-bound workloads (Banchelli et al., 13 Mar 2025), adopting open networking stacks for parallel efficiency (Konishi, 2 Jul 2025), or tuning parameter choices in LLM-based rerankers (Peng et al., 8 Jul 2025).
  • Model Selection: Practitioners can prioritize models or configurations delivering higher effectiveness per petaFLOP, particularly in settings where compute budget and energy use are critical constraints (Peng et al., 8 Jul 2025).
  • Benchmark Evolution: As new workloads emerge (e.g., foundation models, dense-sparse expert mixtures), the concept of RPP is adaptable to future benchmarking methodologies, though extensions may be needed to consistently capture advances in mixed-precision, heterogeneous compute, or distributed systems.

RPP is situated among a broader ecosystem of efficiency metrics:

  • Efficiency Ratios: Computational efficiency (η), energy efficiency (GFLOP/(s×W)), and energy-delay product (EDP) (Banchelli et al., 13 Mar 2025).
  • Effectiveness-agnostic Metrics: Queries per petaFLOP (QPP) provide a task throughput normalization complementary to RPP (Peng et al., 8 Jul 2025).
  • Metric-free Evaluation: In information retrieval contexts, recall-paired preference (RPP), distinct from ranking metrics per petaFLOP, offers a metric-free, user-centered approach to ranking comparison that remains computationally scalable at petaFLOP scale (Diaz et al., 2022). This underscores the terminological breadth of "RPP," with domain-dependent instantiations.
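
Of these, the energy-delay product is the simplest to compute; a small illustration (hypothetical runs) shows how it rewards speed as well as frugality:

```python
def energy_delay_product(energy_joules: float, runtime_seconds: float) -> float:
    """Energy-delay product (EDP): lower values indicate a better energy/time trade-off."""
    return energy_joules * runtime_seconds

# A frugal run (5 MJ over 120 s) vs. a faster but hungrier run (6 MJ over 80 s):
print(energy_delay_product(5e6, 120.0))  # 6.0e8
print(energy_delay_product(6e6, 80.0))   # 4.8e8 -> the faster run wins on EDP
```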

In summary, Ranking Metrics per PetaFLOP (RPP) form a foundational layer for assessing and advancing the efficiency, effectiveness, and competitiveness of computational systems and algorithms operating at petaflop scale. By normalizing performance to fundamental compute cost, RPP guides optimization, benchmarking, and deployment decisions across HPC, AI, and information retrieval domains.