QuanBench: Quantum Code Generation Benchmark
- QuanBench is a benchmark suite for LLM-generated quantum programs that evaluates both functional correctness and quantum-semantic fidelity across diverse quantum tasks.
- It employs natural language prompts, canonical solutions, and unit tests for 44 tasks, including algorithm implementation, state preparation, gate decomposition, and quantum machine learning.
- Performance metrics such as Pass@K and process fidelity reveal that even top models struggle with accurate quantum code generation, highlighting areas for targeted improvement.
QuanBench is a benchmark suite for assessing the quantum code generation capabilities of LLMs. Designed to rigorously evaluate both the functional correctness and quantum-semantic fidelity of LLM-generated quantum programs, QuanBench covers a broad set of quantum programming tasks representative of contemporary scientific and engineering challenges. The benchmark emphasizes executable solutions, adherence to canonical quantum semantics, and analysis of LLM performance failure cases, offering insights that inform both the state of current models and future directions for improvement (Guo et al., 19 Oct 2025).
1. Purpose and Task Scope
QuanBench is constructed to measure an LLM’s ability to generate correct, executable, and quantum-semantically faithful code for quantum programming tasks defined in the Qiskit framework. Each benchmark task is specified by a natural language prompt and is accompanied by a canonical reference solution and unit tests.
Task categories included in QuanBench are:
- Quantum algorithm implementation: Tasks include canonical problems such as Grover’s search, Shor’s algorithm, Quantum Fourier Transform (QFT), Deutsch–Jozsa, and Bernstein–Vazirani algorithms.
- Quantum state preparation: Tasks focus on the preparation of particular entangled or computational basis states, such as Bell states and GHZ states.
- Gate decomposition: Tasks involve expressing high-level operations by decomposing them into elementary gates compatible with standard hardware (e.g., decomposition into single-qubit and CNOT gates).
- Quantum machine learning: Tasks require the implementation of parameterized quantum circuits, often used in variational quantum algorithms and hybrid models.
The entire benchmark consists of 44 carefully curated tasks, each representing a specific facet of practical quantum software development.
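As an illustration of the state-preparation category, the following is a minimal Qiskit sketch of the kind of canonical solution such a task targets. The function name `prepare_ghz` and its signature are illustrative assumptions, not artifacts taken from the benchmark itself.

```python
from qiskit import QuantumCircuit

def prepare_ghz(num_qubits: int) -> QuantumCircuit:
    """Prepare the n-qubit GHZ state (|0...0> + |1...1>) / sqrt(2)."""
    qc = QuantumCircuit(num_qubits)
    qc.h(0)                      # put qubit 0 into an equal superposition
    for i in range(num_qubits - 1):
        qc.cx(i, i + 1)          # chain CNOTs to spread the entanglement
    return qc
```

A generated solution for such a task would be judged both by unit tests on the resulting state and by its process fidelity against a canonical circuit of this form.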
2. Evaluation Metrics
Two principal metrics are used to objectively assess LLM performance on QuanBench:
- Functional Correctness (Pass@K):
- Pass@K quantifies the probability that at least one of K generated code samples for a task yields an executable, correct solution (i.e., passes all reference and simulation-based unit tests).
- A statistically unbiased estimation approach—adapted from the HumanEval benchmark—is used for Pass@K calculation.
- Empirically, current leading LLMs achieve Pass@1 rates below 40%, and Pass@5 rates approach but rarely surpass 50%.
- Quantum Semantic Equivalence (Process Fidelity):
- Process Fidelity measures the unitary operation similarity between the LLM-generated circuit and the canonical circuit, independent of possible superficial syntactic divergences (such as gate ordering or auxiliary gate insertions that do not alter quantum state evolution).
- It is formally defined as
  $$F \;=\; \frac{\left|\operatorname{Tr}\!\left(U_{\mathrm{canonical}}^{\dagger}\, U_{\mathrm{generated}}\right)\right|^{2}}{d^{2}},$$
  where $U_{\mathrm{canonical}}$ and $U_{\mathrm{generated}}$ are the canonical and generated unitaries for an $n$-qubit task and $d = 2^{n}$.
- A score of 1 indicates exact equivalence up to a global phase.
- This metric reveals discrepancies not apparent from functional tests alone, such as gate sequence permutations or phase mishandling.
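The sketch below shows how both metrics can be computed in practice; the helper names are hypothetical, and the fidelity computation uses Qiskit's `quantum_info` utilities rather than QuanBench's own tooling. The Pass@K estimator follows the unbiased HumanEval formulation referenced above.

```python
import numpy as np
from qiskit import QuantumCircuit
from qiskit.quantum_info import Operator, process_fidelity

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased HumanEval-style estimator 1 - C(n-c, k)/C(n, k):
    n samples generated per task, c of which passed all unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

def circuit_process_fidelity(generated: QuantumCircuit,
                             canonical: QuantumCircuit) -> float:
    """Process fidelity |Tr(U_c^dag U_g)|^2 / d^2 between two measurement-free
    circuits; 1.0 indicates equivalence up to a global phase."""
    return float(process_fidelity(Operator(generated), target=Operator(canonical)))
```

For example, `pass_at_k(n=5, c=1, k=5)` returns 1.0 (at least one of the five samples is correct), while `pass_at_k(n=5, c=1, k=1)` returns 0.2.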
3. LLM Performance and Observed Limitations
Comprehensive benchmarking of recent LLMs—including general-purpose (GPT-4.1, Claude 3.7, Gemini 2.5) and code-specialized (CodeLlama, DeepSeek) models—demonstrates the following:
- Even top models have difficulty achieving high accuracy: the best models report sub-40% Pass@1 and around 50% Pass@5 rates.
- Performance varies by task type: models such as DeepSeek R1 perform best on state preparation tasks, while others show moderate success in different algorithmic categories.
- Process Fidelity is often lower than expected even for functionally correct programs, highlighting subtle but important semantic deviations.
This suggests that, despite advances in code generation, current LLMs lack robust generalization for non-trivial quantum programming domains.
4. Analysis of Common Failures
QuanBench systematically categorizes frequent issues in LLM-generated quantum programs:
| Failure Type | Description | Example Consequence |
|---|---|---|
| Outdated API Usage | Use of deprecated functions (e.g., `cu1` in Qiskit) | Compilation error or unexpected runtime behavior |
| Circuit Construction | Incorrect qubit assignment or gate sequence | Logical errors, failed assertions |
| Semantic/Algorithmic | Incomplete or misimplemented algorithmic logic | Low Process Fidelity, incorrect results |
- Outdated API Usage: Many models invoke Qiskit methods that have been deprecated (e.g., `cu1` instead of `cp` for controlled-phase gates), leading to non-executable code; see the sketch at the end of this section.
- Circuit Construction Errors: Errors include inconsistent qubit indices, use of the same qubit as both control and target, and omission or repetition of crucial gates.
- Incorrect Algorithm Logic: In tasks such as Grover’s search, failures to construct correct oracle or diffusion operators are common. In state preparation, missing or misplaced phase gates are observed.
These categories highlight the dual need for training material that reflects current APIs and for deeper modeling of quantum programming semantics.
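As a concrete illustration of the API and circuit-construction failure modes (a hedged sketch, not code drawn from the benchmark), the snippet below contrasts a deprecated controlled-phase call with the current Qiskit method and notes a typical qubit-indexing mistake.

```python
from math import pi
from qiskit import QuantumCircuit

qc = QuantumCircuit(2)

# Frequently produced by models trained on older material; `cu1` has been
# deprecated and removed in recent Qiskit releases, so this fails:
# qc.cu1(pi / 4, 0, 1)

# Current API for the same controlled-phase rotation:
qc.cp(pi / 4, 0, 1)

# Typical circuit-construction error: the same qubit used as both control and
# target, which Qiskit rejects with a CircuitError:
# qc.cx(0, 0)
```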
5. Quantitative and Qualitative Findings
The benchmark results show a clear pattern: LLMs are prone to both syntactic and semantic failure modes, with semantic equivalence (Process Fidelity) being the more demanding criterion. Even for tasks where generated code passes all functional (unit test) checks, the underlying transformation may diverge significantly from the canonical implementation, as revealed by low fidelity.
The performance breakdown reinforces that reliable quantum code synthesis with LLMs currently requires substantial post-processing, code review, or domain-specific verification.
6. Future Directions and Implications
QuanBench establishes a systematic baseline for quantum code generation assessment and reveals substantial gaps between the capabilities of current LLMs and the requirements of production-quality quantum software:
- The results indicate an urgent need for targeted fine-tuning of LLMs on recent quantum programming datasets, automated management of evolving APIs, and perhaps integration of type- or circuit-theoretic constraints into LLM-generated code.
- The benchmark is positioned for extension—future releases may encompass tasks requiring more qubits, additional quantum programming frameworks (e.g., Cirq, PennyLane), and new evaluation criteria reflective of emerging quantum development paradigms.
- The framework motivates research on advanced semantic equivalence checking and on reinforcement learning approaches where LLMs are guided by process fidelity or resource-aware execution metrics.
A plausible implication is that quantum code generation represents a domain where purely statistical natural language modeling is insufficient; robust solutions will likely require domain-adaptive representation learning, symbolic verification, and strong integration with evolving quantum toolchains.
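As a minimal sketch of one building block such semantic or symbolic verification could use (an illustrative assumption, not part of QuanBench's tooling), Qiskit's `Operator` class already offers a global-phase-insensitive equivalence check:

```python
from qiskit import QuantumCircuit
from qiskit.quantum_info import Operator

def semantically_equivalent(generated: QuantumCircuit,
                            canonical: QuantumCircuit) -> bool:
    """True if the two measurement-free circuits implement the same unitary
    up to a global phase (Operator.equiv ignores global phase)."""
    return Operator(generated).equiv(Operator(canonical))
```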
7. Significance and Outlook
QuanBench provides a unified, executable, and semantically aware testing harness for quantum code synthesis tools built on LLM backends. Its multidimensional performance measures, rigorous evaluation, and failure-case diagnostics set a precedent for future work in quantum AI programming environments. It also lays a foundation for new research in LLM fine-tuning strategies, evaluation of quantum-aware LLMs, and development of hybrid verification pipelines at the intersection of AI and quantum programming (Guo et al., 19 Oct 2025).