
Queries per PetaFLOP (QPP) Efficiency Metric

Updated 13 July 2025
  • Queries per PetaFLOP (QPP) is a hardware-agnostic efficiency metric that measures the number of distinct computational tasks processed per 10¹⁵ FLOPs.
  • It enables fair comparisons across diverse algorithms and hardware by normalizing workload to a constant computational budget.
  • Applications span from optimizing LLM-based rerankers in information retrieval to enhancing throughput in large-scale scientific simulations.

Queries per PetaFLOP (QPP) is a hardware-agnostic efficiency metric that measures the number of queries or distinct computational tasks a system can process per one petaFLOP (10¹⁵ floating-point operations) of compute. QPP provides a normalized basis for evaluating throughput and efficiency—across varying algorithms, model architectures, and hardware environments—by relating computational workload to the fundamental resource of floating-point operations, rather than wall time or device-specific throughput. This metric has gained prominence in the context of large-scale scientific computing and, more recently, in the evaluation of LLM–based rerankers in information retrieval systems.

1. Formal Definition and Mathematical Formulation

QPP is formally defined as the ratio of a fixed computational budget (one petaFLOP) to the average FLOP cost per query. Let $\operatorname{AVG}(C_{(q)})$ denote the average number of floating-point operations required to process a query by a given method. Then, the metric is given by:

\mathrm{QPP} = \frac{10^{15}}{\operatorname{AVG}(C_{(q)})}

Alternatively, it can be expressed as:

\mathrm{QPP} = 1 \big/ \left(\operatorname{AVG}(C_{(q)}) / 10^{15}\right)

A higher QPP indicates that more queries can be processed for a given petaFLOP, reflecting greater computational efficiency. The metric is directly comparable across method classes, model sizes, and hardware, as it abstracts away from device-dependent measures such as latency or energy usage (Peng et al., 8 Jul 2025).
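
As a concrete illustration, here is a minimal Python sketch of the computation (the per-query FLOP figure is a placeholder, chosen only to land near the ~111 queries/petaFLOP reported below):

```python
def qpp(avg_flops_per_query: float) -> float:
    """Queries per PetaFLOP: a fixed budget of 10**15 FLOPs divided by
    the average FLOP cost of processing a single query."""
    return 1e15 / avg_flops_per_query

# Placeholder cost: a method averaging ~9e12 FLOPs per query yields
# roughly 111 queries per petaFLOP.
print(f"{qpp(9e12):.1f}")  # ~111.1
```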

2. Motivation and Distinction from Traditional Efficiency Metrics

Traditional benchmarks for efficiency, such as latency (seconds/query), throughput (queries/sec), forward pass counts, or input/output token counts, are often confounded by factors like hardware parallelism, batch size, memory bottlenecks, and inference modes. These proxies can obscure genuine computational efficiency, particularly for architectures that scale differently or are run in heterogeneous environments. QPP, in contrast, is “floating-point operation normalized”—providing an intrinsic measure not tied to physical wall time or particular hardware accelerators.

For LLM-based rerankers, QPP allows fair comparisons between, for example, pointwise, pairwise, and listwise inference algorithms, as well as across different LLM sizes and architectures (e.g., encoder–decoder vs. decoder-only), regardless of device-specific optimizations (Peng et al., 8 Jul 2025).

3. Application in Large-Scale Scientific and AI Workflows

QPP has been implicitly or explicitly considered in large-scale scientific workflows that require extreme computational throughput. For instance, in petascale physics simulations such as Lattice QCD, sky cosmology, and quantum transport calculations, QPP is closely tied to the system’s ability to execute vast numbers of discrete simulation or analytic tasks per petaFLOP (1012.0253, 1211.4864, Villalonga et al., 2019, Ziogas et al., 2019, 1501.03345).

Examples include:

  • APEnet+ interconnects: System-level QPP can be conceptualized as the number of remote queries (data transactions or communication tasks) the interconnect supports per petascale unit of compute. Achieving low latency and high bandwidth in the networking stack maximizes QPP, as formalized in the estimate $\mathrm{QPP} \sim \frac{B}{L \cdot P_{\mathrm{scale}}}$ (with $B$ the bandwidth, $L$ the latency, and $P_{\mathrm{scale}}$ the petascale compute factor) (1012.0253); see the sketch after this list.
  • Extreme-scale simulations (HACC, PAMOP): Algorithmic optimizations that promote concurrency and efficient parallelism increase the number of independent simulation tasks (“queries”) completed per petaFLOP of compute, driven by high throughput in matrix algebra and particle updates (1211.4864, 1501.03345).
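
A rough numeric sketch of this proportionality (the function name and all figures are illustrative assumptions; units are left abstract because only the scaling behaviour is meaningful):

```python
def interconnect_qpp_estimate(bandwidth: float, latency: float, p_scale: float) -> float:
    """Proportional estimate QPP ~ B / (L * P_scale): the number of remote
    queries an interconnect can sustain per petascale unit of compute rises
    with bandwidth and falls with latency and with the compute scale served."""
    return bandwidth / (latency * p_scale)

# Doubling link bandwidth doubles the estimate; doubling latency halves it.
base = interconnect_qpp_estimate(34.0, 4.0, 1.0)  # placeholder figures
print(interconnect_qpp_estimate(68.0, 4.0, 1.0) / base)  # 2.0
print(base / interconnect_qpp_estimate(34.0, 8.0, 1.0))  # 2.0
```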

4. Advances in QPP for LLM-Based Rerankers

With the emergence of powerful LLM-based rerankers for information retrieval, QPP has been explicitly adopted as a metric for comparative evaluation (Peng et al., 8 Jul 2025). In this context, a “query” typically denotes a complete rerank pass over a candidate set. For each method, QPP is computed by dividing 10¹⁵ FLOPs by the estimated average FLOPs required per query (using a closed-form FLOPs estimator that accounts for model configuration, layer count, hidden size, attention windows, and input/output lengths).
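
The exact estimator from (Peng et al., 8 Jul 2025) is not reproduced here; the sketch below applies a standard back-of-envelope transformer FLOP count (roughly 2 FLOPs per parameter per token, plus an attention term) under assumed model and length parameters:

```python
def transformer_flops_per_pass(n_layers: int, d_model: int, seq_len: int) -> float:
    """Back-of-envelope forward-pass FLOPs for a dense transformer: per token,
    each layer costs ~24 * d_model**2 FLOPs for its projections and MLP
    (~12 * d_model**2 parameters at ~2 FLOPs each), plus ~4 * seq_len * d_model
    for attention scores and mixing. A standard approximation, not the paper's
    exact estimator."""
    per_token = n_layers * (24 * d_model ** 2 + 4 * seq_len * d_model)
    return per_token * seq_len

# Hypothetical pointwise setup: 100 candidates per query, each scored in a
# separate 512-token pass through an assumed 24-layer, d_model=1024 model.
flops_per_query = 100 * transformer_flops_per_pass(24, 1024, 512)
print(f"QPP ~ {1e15 / flops_per_query:.1f}")  # ~30 under these assumptions
```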

Empirical studies show:

  • Pointwise reranking: Methods that independently score each candidate per query tend to yield the highest QPP (e.g., ~111 queries/petaFLOP using Flan-T5-large), indicating that such designs are highly efficient for computational throughput.
  • Pairwise/Listwise methods: Strategies involving all-pairs comparisons or full permutation scoring are significantly less efficient, yielding QPP values as low as 0.15, due to massive duplicate computation (see the pass-count sketch after this list).
  • Model scaling trade-off: Larger models (e.g., moving from Flan-T5-large to Flan-T5-xxl) may improve result quality (as measured by ranking metrics), but their FLOP-per-query cost grows superlinearly, sharply reducing QPP. This reveals the efficiency–effectiveness trade-off central to high-throughput search deployments.
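
The gap between these strategies follows directly from how the number of LLM forward passes scales with the candidate-set size n; a hedged sketch with a placeholder per-pass cost (real per-pass FLOPs also vary with prompt length, which is why the reported pairwise figure is lower still):

```python
def rerank_passes(n_candidates: int, strategy: str) -> int:
    """LLM forward passes needed to rerank n candidates for one query,
    using the standard complexity of each inference strategy."""
    if strategy == "pointwise":   # score each candidate independently
        return n_candidates
    if strategy == "pairwise":    # compare all ordered candidate pairs
        return n_candidates * (n_candidates - 1)
    raise ValueError(strategy)

FLOPS_PER_PASS = 9e10  # placeholder; depends on model size and prompt length
for strategy in ("pointwise", "pairwise"):
    passes = rerank_passes(100, strategy)
    print(strategy, f"QPP ~ {1e15 / (passes * FLOPS_PER_PASS):.2f}")
# pointwise QPP ~ 111.11, pairwise QPP ~ 1.12 under this uniform-cost assumption
```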

5. Comparison Tables: QPP Across Architectures and Methods

| Method Class | LLM Size | QPP (queries/petaFLOP) | Notes |
|---|---|---|---|
| Pointwise Reranker | Flan-T5-large | ~111 | Highest efficiency observed (Peng et al., 8 Jul 2025) |
| Pairwise Reranker | Flan-T5-large | ~0.15 | Expensive all-pairs computation |
| Listwise Reranker | Flan-T5-xxl | < 1 | Large model, deep computation |

These QPP estimates refer to the specific configurations examined in (Peng et al., 8 Jul 2025).

6. Implications for System Design and Practical Deployment

QPP enables researchers and practitioners to make informed decisions about adopting and deploying computationally expensive models and algorithms. In search and retrieval systems, high-QPP methods support scalable real-time reranking, while low-QPP architectures may be unsuitable for production environments that require rapid throughput.

Furthermore, QPP can guide model selection, architectural refinement, and optimization strategies, such as:

  • Choosing model sizes or inference algorithms that meet target QPP thresholds for given compute budgets,
  • Balancing ranking effectiveness against computational cost via the joint use of QPP and relevance-per-petaFLOP (RPP) metrics (a sketch follows this list),
  • Comparing novel methods with established baselines on a hardware-agnostic basis, facilitating reproducible and interpretable research (Peng et al., 8 Jul 2025).
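
As one hedged illustration of such budget-driven selection (all figures are placeholders, and RPP is approximated here as relevance × QPP, an assumption rather than the paper's definition):

```python
# Candidate rerankers as (name, QPP, relevance) tuples; placeholder values.
methods = [
    ("pointwise/flan-t5-large", 111.0, 0.70),
    ("pairwise/flan-t5-large", 0.15, 0.74),
    ("listwise/flan-t5-xxl", 0.80, 0.76),
]

queries_per_day = 2_000        # assumed traffic
budget_pflops_per_day = 50     # assumed daily compute budget
min_qpp = queries_per_day / budget_pflops_per_day  # need >= 40 queries/PFLOP

# Keep methods that meet the throughput target, then compare by assumed RPP.
feasible = [(name, q, rel) for name, q, rel in methods if q >= min_qpp]
for name, q, rel in feasible:
    print(name, f"RPP ~ {rel * q:.1f}")  # relevance-per-petaFLOP, as assumed
```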

7. Outlook and Metric Evolution

The continued increase in model scale and resource constraints underscores the importance of FLOPs-normalized metrics such as QPP. Future developments may include:

  • Adapting QPP to multi-modal queries and composite tasks,
  • Extending estimators for QPP to capture additional layers of system heterogeneity (e.g., memory, data movement, hardware acceleration),
  • Integrating QPP evaluation into broader efficiency–effectiveness frontiers, potentially using composite or multi-objective evaluation schemes.

The standardization of QPP promotes transparent and fair benchmarking of both traditional and deep learning-powered retrieval and simulation systems. Its interpretability and hardware-independence position it as a central metric in the methodological toolkit for extreme-scale computing research and deployment.