
Empirical Studies in QST

Updated 20 January 2026
  • Empirical studies in QST systematically evaluate quantum software testing by executing programs with designed inputs to assess fault detection rates and test coverage.
  • The research examines diverse methodological dimensions, including object under test selection, fault injection strategies, and statistical measurement techniques to address quantum hardware constraints.
  • These studies inform scalability and reproducibility practices by emphasizing rigorous experimental setups, baseline comparisons, and detailed reporting of test configurations.

Empirical studies in quantum software testing (QST) are designed to systematically evaluate the effectiveness, robustness, and cost-efficiency of testing approaches for quantum programs by executing these programs (either on simulators or hardware) with designed test inputs and quantifying various performance metrics such as fault detection and test coverage. These studies address challenges posed by probabilistic program semantics, non-deterministic quantum measurement, noise, and hardware constraints, and they provide the evidentiary basis for advancing testing research and engineering in quantum software domains (Li et al., 13 Jan 2026).

1. Methodological Dimensions of Empirical Studies in QST

Empirical QST studies exhibit extensive methodological heterogeneity across ten key dimensions; the most prominent include:

  1. Objects Under Test (programs under test, PUTs): Studies typically employ quantum algorithms and subroutines (e.g., Quantum Fourier Transform, Grover Search, Phase Estimation), artificial programs, and, less frequently, real-world code (e.g., Bugs4Q, Qbugs). The prevalence of narrow and small program pools creates significant external validity concerns, while reporting granularity for program types and counts is inconsistent (Li et al., 13 Jan 2026).
  2. Fault Injection and Buggy Variants: QST studies use mutant-level (gate or subroutine mutation), version-level (codebase variants), or model-level (learning model corruption) strategies to create faults for detection benchmarking.
  3. Test Scalability and Circuit Complexity: Studies variably report the width (number of qubits), size (gate count), and depth (circuit layers) of tested circuits; high artificial program counts (median ~550 per study) are sometimes used to ensure statistical power.
  4. Test Inputs: Test case design encompasses initial quantum states (ρ), classical input arguments, and measurement operators, structured as t_in = (ρ, c, {E_m}_{m ∈ Λ(O)}). The size and diversity of test suites are rarely justified rigorously relative to the input space (Li et al., 13 Jan 2026).
  5. Test Oracles: Oracle functions D: T → {0, 1} label cases as pass or fail; approaches span Wrong Output (WOO), Output Probability (OPO), Property-Based (PBO), Dominant Output (DOO), Quantum State (QSO), and Quantum Operation (QOO) types. Most studies default to probability-based oracles, with limited formal requirement specification (Li et al., 13 Jan 2026).
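The test input tuple and an Output Probability (OPO) oracle described above can be sketched in Python. This is a minimal illustration, not the paper's implementation: the record fields and `opo_oracle` helper are hypothetical names, and the oracle uses the Pearson χ² test (as named later in this article) to compare observed measurement counts against an expected output distribution.

```python
# Illustrative sketch (hypothetical names): a test input record mirroring
# t_in = (rho, c, {E_m}) and a probability-based (OPO) oracle D: T -> {0,1}.
from dataclasses import dataclass, field
from scipy.stats import chisquare

@dataclass
class TestInput:
    initial_state: list          # amplitudes of the initial quantum state rho
    classical_args: dict         # classical input parameters c
    outcome_labels: list = field(default_factory=list)  # labels for {E_m}

def opo_oracle(observed_counts, expected_probs, shots, alpha=0.05):
    """Label a test case pass (True) or fail (False) by comparing observed
    outcome frequencies to the expected distribution via Pearson chi-square."""
    outcomes = sorted(expected_probs)
    f_obs = [observed_counts.get(o, 0) for o in outcomes]
    f_exp = [expected_probs[o] * shots for o in outcomes]
    _, p_value = chisquare(f_obs, f_exp)
    return p_value >= alpha  # no significant deviation => pass
```

For example, counts of {"00": 498, "11": 502} over 1000 shots pass against a uniform Bell-state distribution, while {"00": 900, "11": 100} fail.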

2. Experimental Setups and Statistical Practices

An essential property of empirical QST is the explicit configuration of experiment execution:

  • Execution Backends: Studies overwhelmingly run test protocols on ideal simulators (approximately 81%), with noisy simulators and physical quantum hardware each constituting around 10%. Despite access to NISQ hardware, simulators remain dominant due to cost, stability, and repeatability. Few studies provide justification for backend choice.
  • Shots and Experimental Repetitions: The number of measurement shots per circuit is highly variable, ranging from 50 to 10⁷, with adaptive or varied schemes employed in less than half the studies. Experimental repetitions, critical for statistical robustness, are often underreported (median < 50; only two studies > 1000), in contrast to standards in software engineering. Reproducibility is further undermined by inconsistent reporting of these configurations (Li et al., 13 Jan 2026).
  • Data and Tool Support: Artifact release is inconsistent (56% public), with common frameworks including Qiskit, QMutPy, and Muskit. Test harnesses, scripts, and benchmark metadata are not uniformly available or standardized, impeding reproduction and comparison (Li et al., 13 Jan 2026).
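The sensitivity of results to shot count can be made concrete with a back-of-envelope calculation (not from the cited study): under a binomial model, the standard error of an estimated outcome probability p from n shots is √(p(1−p)/n), so halving the error requires quadrupling the shots.

```python
# Hedged sketch: standard error of an estimated outcome probability as a
# function of measurement shot count, under a simple binomial model.
import math

def shot_std_error(p: float, shots: int) -> float:
    """Standard error of the estimated probability of a single outcome."""
    return math.sqrt(p * (1 - p) / shots)

# Worst case p = 0.5, across the shot range reported in the literature:
for n in (50, 1000, 10**7):
    print(f"shots={n:>8}  std. error={shot_std_error(0.5, n):.5f}")
```

At 50 shots the standard error of a 0.5 probability is about 0.07, which helps explain why oracles built on small shot budgets risk unstable pass/fail labels.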

3. Evaluation Metrics, Coverage, and Statistical Comparison

Key empirical metrics and statistical approaches include:

  • Gate-branch coverage: C_gb = |{exercised gate–branch pairs}| / |{total gate–branch pairs}| (Li et al., 13 Jan 2026)
  • Test input: t_in = (ρ, c, {E_m}_{m ∈ Λ(O)}) (Li et al., 13 Jan 2026)
  • Test oracle: D: T → {0, 1} (Li et al., 13 Jan 2026)
  • Statistical tests: Pearson χ² for OPO; Mann–Whitney U with Vargha–Delaney Â₁₂ (Li et al., 13 Jan 2026)

Studies measure effectiveness via fault-detection rate, coverage, and cost (shots, time). Comparative analyses employ naive, state-of-the-art (SOTA), adapted, ablation, and composite baselines. However, limitations include shallow or inconsistent baseline selection, limited ablation studies, and under-application of nonparametric tests. Metrics such as effect size (Vargha–Delaney Â₁₂, Cliff's δ) and shot/repetition-dependent error bounds are recommended but rarely standardized (Li et al., 13 Jan 2026).

4. Major Limitations and Open Methodological Challenges

Cross-study analysis identifies several prominent challenges:

  • Requirement Specification Gap: Formal, executable specifications for quantum program behavior are seldom articulated, resulting in ad hoc oracles that may fail to capture the true semantics of faults.
  • Oracle Soundness and Adequacy: Many studies rely on oracles that perform simple distributional checks, risking high false-negative rates or missing phase-related errors; development and reporting of property-based or phase-sensitive oracles is weak (Li et al., 13 Jan 2026).
  • Test Input and Suite Design: Input selection and suite sizing rarely invoke formal adequacy criteria or coverage metrics aligned with quantum program state space.
  • Experimental Scalability: Many studies employ small qubit counts (<10), unjustified or ad hoc shot/repetition settings, and lack systematic trade-off analysis between test cost and effectiveness.
  • Benchmarking and Artifact Reuse: Availability of standardized, QST-specific benchmarks is poor, and artifact sharing is not uniform, limiting reproducibility and cumulative progress (Li et al., 13 Jan 2026).

5. Insights and Methodological Recommendations

Based on systematic review, several recommendations address the above limitations (Li et al., 13 Jan 2026):

  1. Requirement Alignment: Explicitly specify the functional requirements and semantics of each object under test, and select oracles (e.g., QSO/QOO) that directly match these requirements.
  2. Test Input Design: Document initial state, classical input parameters, and measurement operators for all test cases. Test suite size should be justified in relation to the circuit’s input and state complexity.
  3. Oracle Typology and Soundness: Employ a taxonomy of oracles and assess the trade-off between oracle sophistication, detection power, and cost, including analysis of false-positive/negative rates.
  4. Scalability Practices: Report and justify the choice of circuit width, depth, and shot counts. Circuits spanning at least 10 qubits are recommended for scalable evaluation; experimental repetitions should be ≥30, ideally ≥100, adjustable for circuit complexity and target statistical reliability.
  5. Baseline and Statistical Rigor: Use multiple baselines covering at least naive and SOTA approaches. Apply nonparametric statistical tests and report effect sizes to ensure rigorous comparative analysis.
  6. Open Science: Release all code, benchmarks, harnesses, and datasets, archiving with digital object identifiers (DOIs) to ensure traceability.
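The repetition floor in recommendation 4 can be motivated with a standard sample-size calculation. This is hedged arithmetic, not a formula from the cited paper: using the normal approximation, the number of repetitions needed so that a 95% confidence interval on an observed fault-detection rate p has half-width at most ε is n ≥ z²·p(1−p)/ε² with z = 1.96.

```python
# Hedged sketch (not from the paper): repetitions needed for a target
# confidence-interval half-width on a fault-detection rate, via the
# normal-approximation bound n >= z^2 * p*(1-p) / eps^2.
import math

def repetitions_needed(p: float, eps: float, z: float = 1.96) -> int:
    """Repetitions for a 95% CI of half-width eps around rate p."""
    return math.ceil(z * z * p * (1 - p) / (eps * eps))

# Worst case p = 0.5 with a +/-0.1 interval already exceeds the >=30 floor:
print(repetitions_needed(0.5, 0.1))
```

The worst case (p = 0.5, ε = 0.1) yields 97 repetitions, consistent with the "≥30, ideally ≥100" guidance and with scaling repetitions to the targeted statistical reliability.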

6. Empirical Extensions: QST in Tactile Psychophysics and Cognitive Science

The principles and protocols of empirical QST extend into human sensory and cognitive domains under “quantum-like” statistical testing paradigms. For example, Hua et al. investigated modulation of two-point discrimination thresholds (2PDTs) via chemically induced global stimulation, demonstrating additive effects in tactile acuity and interpreting outcomes with reference to wide dynamic range (WDR) neural mechanisms (Hua et al., 2024). In cognitive decision-making, protocols inspired by the Stern–Gerlach experiment enable the construction and statistical validation of quantum decision models; negativity in discrete Wigner functions derived from empirical question-response sequences certifies genuine quantum-like interference effects inaccessible to classical probabilistic reasoning (Fell et al., 2019). These studies employ rigorous experimental designs, forced-choice protocols, statistical model fitting, and explicit significance testing (e.g., bootstrap-resampled confidence intervals), reflecting high standards for empirical QST across diverse domains.

7. Future Directions

The empirical QST landscape is evolving toward greater methodological standardization, larger-scale and more diverse program objects, improved oracle development, and integration with quantum hardware advances. A plausible implication is that artifacts, protocols, and benchmarks tailored to QST-specific objectives will become increasingly central. Expanding reproducible research infrastructures, aligning test metrics with formal requirements, and addressing the scalability-cost paradigm are likely to remain central challenges and focus areas for the discipline (Li et al., 13 Jan 2026).
