QUEST Benchmark Overview
- QUEST Benchmark is a set of standardized evaluation suites for quantum simulation, multireference quantum chemistry, and retrieval QA, defined by precise datasets and protocols.
- The QuEST suite demonstrates scalable exact simulation on classical hardware, combining hybrid OpenMP/MPI parallelism and GPU acceleration to simulate circuits of up to 38 qubits.
- QUEST #4X and QUEST-LOFT apply comparably rigorous protocols to open-shell molecular excitations and multi-hop retrieval QA, respectively, providing robust calibration for method development and interpretability.
The QUEST benchmark refers to several rigorously defined evaluation suites used in quantum many-body methods, multireference quantum chemistry, and large-scale question answering. Each instantiation focuses on a well-scoped suite of challenging cases designed to probe the practical limits of state-of-the-art algorithms in its area. This article surveys the three notable QUEST benchmarks: (1) exact simulation of quantum circuits on classical hardware using QuEST, (2) multireference excited-state quantum chemistry using QUEST #4X, and (3) structured information retrieval with the QUEST-LOFT QA benchmark. Each benchmark is characterized by precisely defined datasets, evaluation methodology, reference results, and implications for algorithmic development.
1. Overview and Historical Context
The QUEST (Quantum Exact Simulation Toolkit) benchmark (Jones et al., 2018) is rooted in the simulation of universal quantum circuits on classical hardware, providing a standard for performance and scalability comparisons among quantum circuit simulators. In quantum chemistry, the QUEST #4X benchmark (Song et al., 2024) extends the established QUEST series to open-shell radicals, calibrating multireference methods for vertical excitation energies (VEEs) using near-exact selected configuration interaction protocols. In large-scale LLM evaluation, QUEST (Malaviya et al., 2023) and its repackaging within the LOFT framework (Scales et al., 8 Nov 2025) assess retrieval-augmented and long-context architectures on compositional multi-entity questions.
Cumulatively, these benchmarks shape the methodology for cross-comparison of quantum circuit simulators, quantum chemistry solvers, and retrieval-based QA systems, drawing clear boundaries around computational tasks, system size, and correctness criteria.
2. The QuEST Benchmark: Exact Quantum Circuit Simulation
The QuEST benchmark evaluates the QuEST simulator’s capabilities for classical simulation of generic circuits of n qubits, using an architecture built on C99 with OpenMP, MPI, and CUDA (NVIDIA GPUs). The toolkit is designed for seamless portability from single-laptop CPUs to multi-node supercomputers (demonstrated on up to 2048 nodes).
Key features:
- Hybrid parallelism: QuEST uses OpenMP multi-threading on a single node and, in a distributed setting, partitions the 2^n state-vector amplitudes equally among MPI ranks. Gates acting on high-index qubits, whose amplitude pairs straddle node boundaries, trigger paired MPI exchanges that minimize message count at the cost of a local memory overhead for the exchange buffer.
- Resource scaling: For pure states of n qubits, memory usage is 16 × 2^n bytes (one double-precision complex amplitude per basis state). For mixed states, QuEST uses a vectorized density matrix under the Choi–Jamiolkowski isomorphism, requiring 16 × 2^(2n) bytes (a resource-model sketch follows this list).
- Arithmetic cost: Each single-qubit gate requires O(2^n) floating-point operations; distributed communication time for high-index-qubit gates follows a latency-bandwidth model, roughly α + m/β for a local slice of m bytes, latency α, and bandwidth β.
- GPU acceleration: On an NVIDIA Tesla K40m, a single GPU yields a speedup over the full 24-thread CPU configuration.
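The resource model above can be made concrete with a short Python sketch. The 16-bytes-per-amplitude figure (double-precision complex) and the assumption of an equal-sized per-rank exchange buffer are illustrative defaults, not constants fixed by the benchmark:

```python
BYTES_PER_AMP = 16  # assumption: one double-precision complex amplitude

def pure_state_bytes(n_qubits: int) -> int:
    """State-vector memory: 2^n amplitudes at BYTES_PER_AMP bytes each."""
    return BYTES_PER_AMP * (1 << n_qubits)

def mixed_state_bytes(n_qubits: int) -> int:
    """Vectorized density matrix (Choi-Jamiolkowski): 2^(2n) amplitudes."""
    return BYTES_PER_AMP * (1 << (2 * n_qubits))

def per_node_bytes(n_qubits: int, n_nodes: int, with_buffer: bool = True) -> int:
    """Local slice per MPI rank; an equal-sized exchange buffer (assumed) doubles the footprint."""
    slice_bytes = pure_state_bytes(n_qubits) // n_nodes
    return 2 * slice_bytes if with_buffer else slice_bytes

def exchange_time_s(slice_bytes: int, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Latency-bandwidth model for one paired exchange of a full local slice."""
    return latency_s + slice_bytes / bandwidth_bytes_per_s

if __name__ == "__main__":
    n, nodes = 38, 2048
    print(f"{n}-qubit state vector: {pure_state_bytes(n) / 2**40:.1f} TiB")
    print(f"per-node footprint:    {per_node_bytes(n, nodes) / 2**30:.1f} GiB (slice + buffer)")
    # Illustrative network parameters: 1 microsecond latency, 10 GB/s bandwidth.
    t = exchange_time_s(per_node_bytes(n, nodes, with_buffer=False), 1e-6, 1e10)
    print(f"one full-slice exchange: {t:.3f} s")
```

For 38 qubits this gives a state vector of about 4 TiB, or roughly 8 TiB of aggregate memory once exchange buffers are included, consistent with the distributed runs described below.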
Benchmark protocol:
- Random “universal” circuits (quantum chaos circuits) are assembled from layers of random single-qubit gates interleaved with two-qubit entangling gates, giving a total gate count that scales as O(n·d) for n qubits and depth d (a minimal state-vector sketch follows this list).
- Tests are performed on platforms ranging from single-node (24-core CPU or GPU) to distributed architectures (up to 2048 nodes, simulating up to 38 qubits and 8 TiB of state vector).
- Performance is quantified in terms of per-gate execution time and distributed strong/weak scaling, up to practical memory and latency limits.
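The cost statements above (O(2^n) work per single-qubit gate, O(n·d) gates per circuit) can be illustrated with a plain NumPy state-vector sketch. The gate set below (QR-based random single-qubit unitaries plus nearest-neighbour controlled-Z) is an illustrative stand-in, not QuEST’s exact benchmark circuit:

```python
import numpy as np

def apply_single_qubit_gate(state: np.ndarray, gate: np.ndarray, target: int) -> np.ndarray:
    """Apply a 2x2 unitary to qubit `target` of an n-qubit state vector: O(2^n) work."""
    n = state.size.bit_length() - 1          # state.size == 2**n
    psi = state.reshape([2] * n)             # qubit 0 is the least-significant bit -> last axis
    psi = np.moveaxis(psi, n - 1 - target, 0)
    psi = np.tensordot(gate, psi, axes=([1], [0]))
    psi = np.moveaxis(psi, 0, n - 1 - target)
    return psi.reshape(-1)

def apply_cz(state: np.ndarray, q1: int, q2: int) -> np.ndarray:
    """Controlled-Z: negate amplitudes of basis states where both qubits are 1."""
    idx = np.arange(state.size)
    both_one = ((idx >> q1) & 1) & ((idx >> q2) & 1)
    out = state.copy()
    out[both_one == 1] *= -1
    return out

def random_chaos_circuit(n: int, depth: int, rng: np.random.Generator) -> np.ndarray:
    """d layers of random single-qubit unitaries plus nearest-neighbour CZs: O(n*d) gates."""
    state = np.zeros(2**n, dtype=complex)
    state[0] = 1.0
    for _ in range(depth):
        for q in range(n):
            z = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
            gate, _ = np.linalg.qr(z)        # random unitary (approximately Haar; sufficient here)
            state = apply_single_qubit_gate(state, gate, q)
        for q in range(n - 1):
            state = apply_cz(state, q, q + 1)
    return state

state = random_chaos_circuit(n=10, depth=5, rng=np.random.default_rng(0))
print(abs(np.vdot(state, state)))            # ~1.0: unitarity preserved
```

Production simulators implement the same amplitude update with cache-aware parallel kernels and, when distributed, with the paired-exchange strategy described above.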
Comparative results:
- Single node: At smaller qubit counts, the Python-based ProjectQ is 2–10× faster in single-threaded runs due to cache-blocking optimizations, but QuEST overtakes ProjectQ at 16–24 threads; at larger qubit counts, both simulators become memory bound.
- GPU: A single GPU meets or exceeds the throughput of the best 24-threaded CPU configuration.
- Distributed: QuEST’s low-message-count, full-slice cloning matches qHipster in large-node regimes and outperforms fine-grained MPI strategies (Quantum++) by several orders of magnitude.
A summary table of strong-scaling results is provided below:
| Node Count (N) | Max Qubits Simulated | Scaling Exponent |
|---|---|---|
| 512 | 30 | — |
| 2048 | 38 | — |
This architecture enables practical simulation guidance: GPU hardware suffices while the full state vector fits in device memory, single CPU nodes remain viable up to their RAM capacity, and distributed MPI is essential beyond these limits.
3. QUEST #4X: Multireference Quantum Chemistry Benchmark
QUEST #4X (Song et al., 2024) is designed as an open-shell extension of the QUEST #4 datasets, forming a reference for benchmarking low-lying excited states in 24 organic radicals. The dataset specifies 110 doublet and 39 quartet excitations, each calculated with near-exact configuration interaction and perturbative corrections.
Key aspects:
- Dataset composition: 24 radicals, 149 vertical excitation energies (VEEs), constructed by exhaustive iCIPT₂/AVTZ calculations in every symmetry and energy window.
- Reference protocol: Energies are extrapolated linearly in the second-order perturbative correction E_PT2 toward the E_PT2 → 0 limit, with the residual extrapolation uncertainty quantified in eV (an extrapolation sketch follows this list).
- Method calibration: The minimal multireference methods SDSCI (static–dynamic–static CI) and SDSPT₂ (static–dynamic–static second-order PT) are benchmarked; SDSCI is a one-iteration surrogate for ic-MRCISD, while SDSPT₂ provides MS-NEVPT2-matched perturbative corrections.
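A minimal sketch of the linear extrapolation step, with entirely made-up energies; in the actual protocol each state is converged at several selection thresholds, and the intercept at vanishing E_PT2 serves as the near-exact reference:

```python
import numpy as np

# Hypothetical iCIPT2-style data: each selection threshold yields a PT2-corrected total
# energy E_var + E_PT2 together with its second-order correction E_PT2 (Hartree).
e_pt2 = np.array([-0.0121, -0.0074, -0.0042, -0.0023])       # shrinks as the CI space grows
e_tot = np.array([-75.6412, -75.6431, -75.6444, -75.6451])

# Fit E_tot = a * E_PT2 + b; the intercept b estimates the E_PT2 -> 0 (near-FCI) limit.
a, b = np.polyfit(e_pt2, e_tot, deg=1)
residuals = e_tot - (a * e_pt2 + b)

print(f"extrapolated reference energy: {b:.4f} Eh")
print(f"fit residual spread (proxy for the quoted uncertainty): {residuals.std():.1e} Eh")
```

Vertical excitation energies are then taken as differences of such extrapolated state energies, converted to eV.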
Global error statistics over the 149 states (errors in eV, relative to the iCIPT₂ references):
| Method | Mean | MAD | STD | Max |
|---|---|---|---|---|
| SDSCI | 0.02 | 0.05 | 0.08 | 0.42 |
| SDSPT₂ | 0.09 | 0.10 | 0.13 | 0.51 |
| ic-MRCISD | — | 0.04 | — | 0.30 |
| MS-NEVPT2 | — | 0.09 | — | — |
SDSCI recovers 90% of ic-MRCISD accuracy at 1/3 computational cost. SDSPT₂ is virtually indistinguishable from MS-NEVPT2 for nearly all states but displays enhanced robustness for near-degenerate manifolds and Rydberg-valence mixed cases.
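The summary statistics in the table follow the usual benchmarking conventions; in the sketch below MAD is read as the mean absolute error against the reference, an interpretation assumed rather than stated in the source:

```python
import numpy as np

def error_summary(errors_ev: np.ndarray) -> dict:
    """Signed per-state errors (method minus reference, eV) -> mean, MAD, STD, max."""
    return {
        "Mean": errors_ev.mean(),            # signed mean (systematic bias)
        "MAD":  np.abs(errors_ev).mean(),    # mean absolute deviation from the reference
        "STD":  errors_ev.std(ddof=1),       # sample standard deviation
        "Max":  np.abs(errors_ev).max(),     # worst-case absolute error
    }

# Hypothetical per-state errors for a handful of excitations (eV).
demo = np.array([0.03, -0.02, 0.07, 0.01, -0.05, 0.12])
print({k: round(v, 3) for k, v in error_summary(demo).items()})
```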
The dataset prescribes best practices for future multireference benchmarks:
- Use iCIPT₂-like protocols for references.
- Employ occupation-guided selection of compact active spaces (≤15 orbitals).
- Use state-dependent Fock operators for states with high Rydberg character.
- Apply Pople-type size-consistency corrections to contracted CI energies.
A plausible implication is that the QUEST #4X dataset will serve as a definitive testbed for new minimal MRPT2 and MRCI treatments of open-shell organic molecules.
4. QUEST-LOFT: Multi-hop Retrieval QA Benchmark
QUEST-LOFT (Scales et al., 8 Nov 2025) repackages the original QUEST (Malaviya et al. 2023) into the LOFT token-constrained QA framework. Each example comprises a compositional multi-entity question, referencing logical combinations (“AND/OR”) of Wikipedia categories and requiring retrieval from a candidate pool (328 entities, ≈128K tokens per question).
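The compositional structure can be made concrete with a toy example: atomic categories map to entity sets, and the question’s AND/OR template becomes set intersection and union. All entity and category names below are hypothetical:

```python
# Hypothetical question template: "(category A AND category B) OR category C".
category_a = {"Red Dunes", "Void Harbor", "The Long Orbit"}   # e.g., science-fiction novels
category_b = {"Red Dunes", "Crater Town"}                     # e.g., works set in Mars colonies
category_c = {"Glass Meridian"}                               # e.g., works by a given author

# AND -> intersection, OR -> union: the symbolic set algebra the benchmark demands.
answer = (category_a & category_b) | category_c
print(sorted(answer))   # ['Glass Meridian', 'Red Dunes']
```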
Challenges:
- Answers require integrating evidence from dozens of distributed Wikipedia articles; many cases demand symbolic set-algebraic reasoning over entity lists.
- LOFT’s passage filtering can reduce context and disrupt document flow.
- Data scarcity in dev/test (10 dev, 100 test) raises sensitivity to hyperparameter and prompt selection.
RAG protocol:
- Top-40 document retrieval using Gecko “text-embedding-004” vectors, ranked by query-document embedding similarity (a retrieval and output-format sketch follows this list).
- “Justified QA” format: the model enumerates candidate entities for each question, links evidence for/against from text snippets, provides a reasoning field, and assigns a binary TRUE/FALSE label.
- Optional per-entity answer re-verification based strictly on cited evidence.
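A compact sketch of the two ingredients of this protocol: embedding-based top-k retrieval and a structured per-entity answer record. The cosine ranking and the field names in the record are illustrative assumptions, not the benchmark’s exact scoring function or schema:

```python
import numpy as np

def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 40) -> np.ndarray:
    """Rank documents by cosine similarity to the query embedding; return the top-k indices."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return np.argsort(-(d @ q))[:k]

# Illustrative "justified QA" record for one candidate entity (field names are assumptions).
justified_answer = {
    "entity": "Example Entity",
    "evidence_for": ["<snippet quoted from a retrieved passage>"],
    "evidence_against": [],
    "reasoning": "The cited snippet places the entity in both required categories.",
    "label": "TRUE",   # binary membership decision, optionally re-verified per entity
}

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 768))   # stand-in for precomputed passage embeddings
query = rng.normal(size=768)          # stand-in for the question embedding
print(top_k(query, docs)[:5])         # indices of the five best-scoring passages
```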
Results on the QUEST-LOFT-128K-Revised set (Gemini 1.5 Pro):
| Method | F1 | Precision | Recall | Accuracy | EM |
|---|---|---|---|---|---|
| CiC Baseline | 0.66 | 0.78 | 0.70 | 0.38 | 0.50 |
| CiC + Justified QA | 0.74 | 0.88 | 0.74 | 0.42 | 0.53 |
| RAG Baseline | 0.67 | 0.79 | 0.72 | 0.41 | 0.55 |
| RAG + Justified QA | 0.81 | 0.92 | 0.79 | 0.55 | 0.64 |
| RAG + Justified QA + Verif. | 0.83 | 0.93 | 0.82 | 0.57 | 0.67 |
The structured output format with explicit evidence and reasoning yields substantial gains in answer quality (+0.14 F1 over the RAG baseline, with precision rising from 0.79 to 0.92) and in interpretability relative to naïve generation or purely long-context approaches.
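For reference, the set-level metrics above can be computed per question as follows; whether the corpus-level numbers are micro- or macro-averaged is left open in this sketch:

```python
def set_metrics(predicted: set, gold: set) -> dict:
    """Precision/recall/F1 over predicted vs. gold entity sets for one question, plus exact match."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "em": float(predicted == gold)}

# Hypothetical prediction for one question: two of three gold entities found, one spurious.
print(set_metrics({"A", "B", "C"}, {"A", "B", "D"}))   # precision = recall = f1 ~ 0.67, em = 0.0
```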
5. Comparative Methodological Insights
Analysis across the three QUEST contexts highlights:
- Precision in task definition is essential for benchmarking, with problem size, data partition, and correctness criteria fixed rigorously.
- Structured outputs (e.g., JSON with reasoning/evidence, or explicit CI block configurations) are critical for interpretability and downstream evaluation.
- Hybrid and parallel/distributed computational designs (OpenMP/MPI, half/full-slice cloning) must balance memory, communication cost, and practical deployment.
- Reference implementations (iCIPT₂, full state-vector kernels, embedding-based retrieval) establish quantitative standards for both accuracy and computational efficiency.
In QA, retrieval remains essential even for long-context transformer models. In quantum simulation and quantum chemistry, the reach of exact classical treatment is dictated by exponential scaling in memory and compute, while near-exact reference datasets define the outer frontier of what can be benchmarked.
6. Impact and Implications for Future Research
The QUEST benchmarks collectively define state-of-the-art expectations in their domains:
- Quantum simulation: QuEST codifies resource limits and provides actionable guidance on hardware mapping, communication bottlenecks, and scaling performance for simulations of up to roughly 38 qubits.
- Quantum chemistry: QUEST #4X sets a baseline for new MR methods, supports method calibration on open-shell contexts, and bridges the accuracy/cost gap in electronic structure modeling.
- Information retrieval and QA: QUEST-LOFT reveals the need for retrieval-augmented methods with grounded, structured reasoning, and highlights compositional multi-entity QA as a challenging open problem.
For each context, methodological rigor in dataset construction, reference protocol, and metrics ensures reproducibility and meaningful progress. A plausible implication is that future advancement in simulation algorithms, chemical solvers, and multi-step QA models will be measured first against the QUEST suite before broader application.
7. Summary Table: QUEST Benchmarks at a Glance
| Domain | Benchmark | Key Task | Reference Methodology |
|---|---|---|---|
| Quantum Simulation | QuEST | Exact n-qubit circuit simulation | OpenMP/MPI/CUDA, O(2^n) scaling, distributed state-vector slicing |
| Quantum Chemistry | QUEST #4X | VEEs in 24 radicals, open-shell states | iCIPT₂/AVTZ references, SDSCI, SDSPT₂; errors ≲ 0.1 eV |
| Information Retrieval | QUEST-LOFT | Multi-hop, entity-set QA over Wikipedia | Embedding-based RAG, structured justification, per-entity verification |
These benchmarks serve as calibration points for their fields, structuring advances in both the design of algorithms and their empirical assessment.