Metriq: Collaborative Quantum Benchmarking
- The platform's main contribution is its modular architecture that standardizes benchmarking across quantum hardware, ensuring reproducibility and transparency.
- Metriq provides a comprehensive suite of benchmarks, including system-level and application-inspired metrics, for evaluating quantum device performance.
- Its integration of metriq-gym, metriq-data, and metriq-web enables seamless aggregation, visualization, and cross-platform comparisons to drive hardware and software improvements.
Metriq is an open-source, community-driven collaborative platform for reproducible and transparent benchmarking of quantum computers across hardware modalities, device generations, and cloud providers. Its modular architecture and open dataset enable systematic assessment of both system-level and application-inspired metrics, producing curated and version-controlled results that inform hardware development, software optimization, and standardization efforts. Metriq integrates the definition, execution, and aggregation of benchmarks, emphasizing scalability, practicality, and explainability in quantum performance evaluation (Cosentino et al., 9 Mar 2026). It also serves as the primary delivery and aggregation platform for application-oriented benchmarking standards within BACQ, a key European initiative for quantum metrology (Barbaresco et al., 2024).
1. Architectural Overview and Workflow
Metriq is structured around three primary interoperable components:
- metriq-gym: This Python-based runner and SDK implements all benchmark definitions as parameterized modules, exposes a uniform CLI, and abstracts backend dispatch for both cloud quantum processors (QPU) and local simulators. Each benchmark ships with an explicit JSON-schema for input parameters, ensuring reproducibility and configuration–code separation.
- metriq-data: A publicly hosted, version-controlled repository (Git/JSON) that organizes benchmark results hierarchically via source, provider, device, benchmark version, and timestamp. All files are JSON-schema validated and include metadata for provenance, compilation details, and runtime environment.
- metriq-web: The front-end visualization and access platform, consuming the dataset to render interactive, filterable summaries, device–benchmark time series, leaderboard tables, and scalar composite scores (the “Metriq Score”). Download/export features permit downstream analysis and external tool integration.
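The parameterized-module pattern described for metriq-gym (a JSON schema per benchmark, with configuration separated from code) can be sketched as follows; the class, schema fields, and function names here are illustrative assumptions, not the actual metriq-gym API:

```python
import json
from dataclasses import dataclass

# Hypothetical JSON schema for one benchmark's input parameters;
# the real metriq-gym schemas will use different fields.
PARAMS_SCHEMA = {
    "type": "object",
    "properties": {
        "num_qubits": {"type": "integer", "minimum": 2},
        "shots": {"type": "integer", "minimum": 1},
    },
    "required": ["num_qubits", "shots"],
}

@dataclass
class BenchmarkParams:
    num_qubits: int
    shots: int

def load_params(raw_json: str) -> BenchmarkParams:
    """Parse a parameter file and minimally validate it against the schema."""
    data = json.loads(raw_json)
    for field in PARAMS_SCHEMA["required"]:
        if field not in data:
            raise ValueError(f"missing required parameter: {field}")
        minimum = PARAMS_SCHEMA["properties"][field]["minimum"]
        if not isinstance(data[field], int) or data[field] < minimum:
            raise ValueError(f"invalid value for {field}: {data[field]}")
    return BenchmarkParams(**data)

params = load_params('{"num_qubits": 5, "shots": 1000}')
print(params.num_qubits, params.shots)  # → 5 1000
```

Keeping parameters in schema-validated JSON, rather than hard-coding them, is what allows identical configurations to be replayed across backends.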
The typical workflow consists of:
- Community RFC-driven proposal and review of new benchmarks.
- Suite dispatch by operators to backends (e.g., IBM Quantum Cloud, AWS Braket, Quantinuum, Rigetti, IonQ, IQM). Jobs are polled and fetched asynchronously via CLI.
- Results, including raw outcome bitstrings, compilation artifacts, QPU timing, and calibration snapshots, are uploaded to the metriq-data repository.
- The frontend ingests new results, updates overall and per-device leaderboards, and enables detailed drill-down.
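The asynchronous dispatch-poll-fetch step above can be sketched as follows; the `Job` class and its statuses are illustrative stand-ins, not any provider's real SDK:

```python
import time

# Stand-in for a cloud job handle; real backends (IBM, Braket, etc.)
# expose their own job objects and status vocabularies.
class Job:
    def __init__(self, job_id):
        self.job_id = job_id
        self._ticks = 0

    def status(self):
        self._ticks += 1
        return "DONE" if self._ticks >= 3 else "QUEUED"

    def result(self):
        return {"job_id": self.job_id, "bitstrings": ["00", "11", "11"]}

def poll_until_done(job, interval_s=0.01, max_polls=100):
    """Poll an asynchronously dispatched job until the backend reports DONE."""
    for _ in range(max_polls):
        if job.status() == "DONE":
            return job.result()
        time.sleep(interval_s)
    raise TimeoutError(f"job {job.job_id} did not finish")

result = poll_until_done(Job("demo-001"))
print(result["bitstrings"])  # → ['00', '11', '11']
```

Decoupling dispatch from retrieval in this way is what lets a CLI submit a suite, exit, and fetch completed results later from long provider queues.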
All tools and data releases follow FAIR principles (Findable, Accessible, Interoperable, Reusable), using Apache-2.0 for software and CC-BY-4.0 licenses for data (Cosentino et al., 9 Mar 2026).
2. Benchmark Suite Structure
Metriq’s suite spans fundamental characterization and algorithmic workloads, with emphasis on benchmarks that scale with processor width and are “practically” executable under real resource and cost constraints (Cosentino et al., 9 Mar 2026, Barbaresco et al., 2024).
System-level Metrics:
- Bell State Effective Qubits (BSEQ): Measures the ability to generate and maintain CHSH-violating Bell states across the device topology; aggregates to a score via the size of the largest connected component of CHSH-violating qubit pairs (LCCS) and a normalized connectivity fraction.
- Error Per Layered Gate (EPLG): Quantifies two-qubit gate fidelity in multi-qubit chains using layered simultaneous randomized benchmarking; normalized to a base device.
- Mirror Circuits: Implements scalable randomized benchmarking using Clifford “echo” circuits; polarization is computed and aggregated using width-dependent weighting, supporting scalability up to hundreds of qubits (Proctor et al., 2021).
- Circuit Layer Operations Per Second (CLOPS): Assesses raw QPU throughput in quantum circuit execution by dividing the total number of circuit layers executed by wall-clock execution time, excluding queueing delays.
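The CHSH test underlying BSEQ can be illustrated numerically: from four measured two-qubit correlators, form the CHSH combination and check violation of the classical bound |S| ≤ 2. The correlator values below are the ideal Bell-state ones, not measured data:

```python
import math

def chsh_value(e_ab, e_abp, e_apb, e_apbp):
    """CHSH combination |E(a,b) + E(a,b') + E(a',b) - E(a',b')|."""
    return abs(e_ab + e_abp + e_apb - e_apbp)

# Ideal Bell-state correlators at the standard CHSH measurement angles:
# each is ±1/sqrt(2), giving the Tsirelson bound 2*sqrt(2).
ideal = 1 / math.sqrt(2)
s = chsh_value(ideal, ideal, ideal, -ideal)
print(round(s, 4), s > 2)  # → 2.8284 True
```

On hardware, each correlator is estimated from shot counts, so a pair counts as CHSH-violating only when the estimated S exceeds 2 beyond statistical uncertainty.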
Application-inspired Workloads:
- Quantum Machine Learning Kernel (QMLK): Inner-product kernel estimation via parameterized feature map circuits, assessing performance at varying widths.
- Wormhole-Inspired Teleportation (WIT): Implementation of toy “wormhole” circuits, with expectation values as logical-fidelity proxies.
- Linear-Ramp QAOA (LR-QAOA): QAOA for weighted MaxCut on chain graphs, assessing empirical approximation ratios against random baselines.
- Quantum Fourier Transform (QFT): Fidelity of QFT implementations using QED-C protocols, aggregated using width-weighted normalization.
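The LR-QAOA subscore compares measured cut values against a random baseline; a minimal sketch for weighted MaxCut on a chain graph follows. The rescaled-ratio convention and the sample bitstrings are illustrative assumptions, not necessarily the exact formula Metriq uses:

```python
def cut_value(bits, weights):
    """Cut value of a bitstring on a chain: edge (i, i+1) carries weights[i]."""
    return sum(w for i, w in enumerate(weights) if bits[i] != bits[i + 1])

weights = [1.0, 2.0, 1.5]        # chain on 4 qubits, 3 weighted edges
max_cut = sum(weights)           # a chain is bipartite, so all edges are cuttable
random_baseline = max_cut / 2    # uniform guessing cuts each edge with prob. 1/2

# Hypothetical measured bitstrings from a QAOA run (not real data).
samples = ["0101", "0110", "0101", "1010"]
mean_cut = sum(cut_value(b, weights) for b in samples) / len(samples)
ratio = (mean_cut - random_baseline) / (max_cut - random_baseline)
print(round(mean_cut, 3), round(ratio, 3))  # → 4.0 0.778
```

A ratio of 0 means no better than random guessing and 1 means the optimal cut, which makes the subscore comparable across graph sizes.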
Each benchmark is defined with explicit schemas for parameters, configuration tables (shot counts, widths, depths), and aggregation rules. Effective “scale” parameters (average qubit width per subscore) are used in subsequent performance aggregation.
3. Metrics Reporting and Data Model
Each benchmark run produces schema-validated JSON files incorporating:
- Provenance metadata (provider, device, datetime, benchmark/config, transpiler version)
- Compiled circuits, random seeds/code commit, calibration snapshot
- Measured values (e.g., polarization, CHSH value, layer fidelity, kernel accuracy)
- Execution metadata (QPU execution time, compiler settings, connectivity at run)
- Aggregated sub-scores and final composite indices
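A schema-validated result record of the kind listed above might look as follows; the key names and layout are hypothetical, since the exact metriq-data schema will differ:

```python
import json

# Illustrative result record following the field list above;
# section and key names are assumptions, not the real schema.
record = {
    "provenance": {
        "provider": "ibm", "device": "example_device",
        "timestamp": "2026-03-09T12:00:00Z",
        "benchmark": "mirror_circuits", "benchmark_version": "1.0",
        "transpiler_version": "x.y.z", "code_commit": "abc1234",
    },
    "execution": {"shots": 1000, "qpu_seconds": 12.5},
    "results": {"polarization": 0.42},
    "aggregate": {"subscore": 0.42},
}

REQUIRED_TOP_LEVEL = {"provenance", "execution", "results", "aggregate"}

def validate(rec):
    """Reject records missing any required top-level section."""
    missing = REQUIRED_TOP_LEVEL - rec.keys()
    if missing:
        raise ValueError(f"missing sections: {sorted(missing)}")
    return True

print(validate(record), len(json.dumps(record)) > 0)  # → True True
```

Because every record is plain JSON under version control, diffs and rollbacks reduce to ordinary Git operations on these files.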
Metriq maintains fully public version-controlled datasets, permitting rollbacks, diffs, and meta-analyses. Reports are CC-BY-4.0 licensed and indexed on the Metriq web documentation portal (Cosentino et al., 9 Mar 2026).
The data structure is directly modeled on standard collection/aggregation templates (Lall et al., 10 Feb 2025). Each metric specifies: formal definition, measurement protocol, uncertainty, hardware/software context, and relevant circuit/configuration details.
4. Performance Aggregation and the Metriq Score
To summarize device or provider performance, Metriq defines a composite Metriq Score $S$ in three stages:
- Within-benchmark aggregation: For each benchmark $b$ with subscores $v_{b,i}$ measured at circuit widths $n_{b,i}$, take a width-weighted average of the measured values,
$$m_b = \frac{\sum_i n_{b,i}\, v_{b,i}}{\sum_i n_{b,i}}.$$
- Baseline normalization: For each benchmark, normalize to a fixed baseline device $d_0$,
$$\tilde{m}_b = \frac{m_b}{m_b^{(d_0)}},$$
with the ratio inverted for "lower-is-better" metrics. Missing data is assigned $\tilde{m}_b = 0$.
- Cross-benchmark aggregation: Assign each benchmark a weight proportional to its effective scale $\bar{n}_b$ (mean qubit width),
$$S = \frac{\sum_b \bar{n}_b\, \tilde{m}_b}{\sum_b \bar{n}_b},$$
yielding a normalized scalar summary for leaderboard comparison. This formulation is fully transparent, baseline- and weight-adjustable, and directly traceable back to raw result bundles (Cosentino et al., 9 Mar 2026).
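The three aggregation stages can be sketched as follows; the weights, baseline values, and per-benchmark numbers are illustrative, not Metriq's published configuration:

```python
def within_benchmark(results):
    """Width-weighted average of (width, value) subscores for one benchmark."""
    total_w = sum(w for w, _ in results)
    return sum(w * v for w, v in results) / total_w

def normalize(value, baseline, lower_is_better=False):
    """Ratio to a fixed baseline device; inverted for lower-is-better metrics."""
    return baseline / value if lower_is_better else value / baseline

def metriq_score(benchmarks):
    """benchmarks: dict name -> (normalized_value, effective_scale).
    Cross-benchmark weight is proportional to effective scale (mean width)."""
    total_scale = sum(scale for _, scale in benchmarks.values())
    return sum(v * scale for v, scale in benchmarks.values()) / total_scale

# Hypothetical subscores for one benchmark at widths 10, 50, and 100 qubits.
mirror = within_benchmark([(10, 0.9), (50, 0.5), (100, 0.21)])
score = metriq_score({
    "mirror": (normalize(mirror, baseline=0.4), 53),
    "eplg": (normalize(0.02, baseline=0.03, lower_is_better=True), 100),
})
print(round(mirror, 3), round(score, 3))  # → 0.344 1.278
```

A score above 1 simply means the device outperforms the chosen baseline under the chosen weights; both knobs stay adjustable and auditable.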
5. Inter-Platform Comparisons and Analysis
The shared dataset enables robust cross-platform analyses. In deployments spanning leading superconducting (IBM Eagle/Heron, IQM, Rigetti) and trapped-ion (Quantinuum) devices, practical trends and correlations are observed (score ranges span the circuit widths tested; qualitative entries indicate relative standing):
| Device | BSEQ Score | EPLG Score | Mirror Circuits | QMLK Score | QFT Score |
|---|---|---|---|---|---|
| IBM Eagle/Heron | 125–135 | 37–338 | ~0.26 (max) | 0.84–0.03 | 0.17–0.59 |
| Quantinuum H2-2 | ≈58 | lowest | lower | lower | >0.99 @ 4 qubits |
| IQM/Rigetti | lower | higher | near-zero | ~0.03 | <0.16 @ 20 qubits |
- High BSEQ and low EPLG scores indicate robust multi-qubit entanglement and gate quality on best-in-class superconducting and shuttling-ion devices.
- Polarization in mirror circuits and kernel accuracies collapse for widths exceeding ~50 qubits, indicating scalability bottlenecks due to two-qubit gate infidelity and decoherence.
- Correlation analyses (e.g., Spearman ρ = 0.991 between mirror circuits and QMLK, ρ = 0.936 for BSEQ and LR-QAOA) demonstrate shared sensitivities to multi-qubit coherence and entangling-gate performance.
- Two-qubit gate error remains the dominant factor, with strong anti-correlation (|ρ| > 0.97) between vendor-reported gate errors and performance indices (Cosentino et al., 9 Mar 2026).
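A Spearman rank correlation of the kind used in these analyses can be computed with the standard library alone; the per-device score lists below are made-up illustrative data, not Metriq results, and this sketch omits tie handling:

```python
def ranks(xs):
    """1-based ranks of the values in xs (no tie handling in this sketch)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank + 1)
    return r

def spearman(xs, ys):
    """Spearman rho: Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

mirror_scores = [0.26, 0.20, 0.10, 0.02]   # hypothetical per-device values
qmlk_scores = [0.84, 0.60, 0.20, 0.03]
print(round(spearman(mirror_scores, qmlk_scores), 3))  # → 1.0
```

Rank correlation is the natural choice here because the benchmarks report on different scales: only the per-device orderings need to agree.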
6. Integration with Standards, Methodology, and Extensions
Metriq builds on established benchmarking paradigms (e.g., IBM’s Quantum Volume/CLOPS, SupermarQ, QLINPACK, QPack, CUCO) (Barbaresco et al., 2024), unifying and extending them with explicit, explainable multi-criteria aggregation, open reference implementations, and full-stack provenance. The roadmap is aligned with efforts identified in contemporary reviews (Lall et al., 10 Feb 2025):
- Agreement on “core” metric categories (architecture, gate/circuit/application-level, speed, stability)
- Hardware-specific metrics for annealers, neutral atoms, and photonic processors
- Inter-laboratory ring tests, SOP curation, and automated submission pipelines (Nextflow/Snakemake integration)
- Mandated reporting of experiment/circuit lists, seeds, environment, and code
- Community-led iterative refinement and versioning of benchmarks
Planned extensions include:
- Integrated quantum error mitigation subscores (e.g., via Mitiq)
- Logical-level and fault-tolerance benchmarks for early FTQC hardware
- Expansion to additional modalities and deeper compiler/hardware–software decoupling
- API-driven contribution and fast-track review for provider onboarding
7. Impact, Community Feedback Loop, and Future Directions
Metriq functions as an open, transparent, and evolving benchmarking nexus. Continuous aggregation of cross-platform results not only uncovers device-specific capabilities and bottlenecks but also drives the adaptation of the suite to emerging hardware and research priorities. Analyses of dataset-wide trends inform benchmark weights, highlight discriminative protocols, and feed directly into standardization bodies (e.g., ISO/IEC, IEEE P7131/P3329, AFNOR/CN QT, CEN/CLC JTC 22 WG3) (Barbaresco et al., 2024, Cosentino et al., 9 Mar 2026). Example findings include:
- Identification of entanglement and two-qubit gate quality as current principal bottlenecks for algorithmic-scale circuits
- Quantification of platform-specific cost structures for practical deployment
- Visible coupling and decoupling between speed, quality, and application-tuned metrics
Ongoing development focuses on broadening benchmark coverage, integrating error mitigation, logical benchmarks, and mid-circuit measurement capabilities, and fostering international participation and reproducibility across the quantum computing ecosystem (Cosentino et al., 9 Mar 2026, Lall et al., 10 Feb 2025).