Qiskit-HumanEval-hard Benchmark
- Qiskit-HumanEval-hard Benchmark is a rigorously designed evaluation suite that measures LLMs’ ability to synthesize and validate complex quantum programs using Qiskit.
- It features diverse tasks, including quantum circuit synthesis, simulation, advanced circuit manipulation, and hybrid workflows, each with detailed unit tests.
- The benchmark drives advances in quantum software and LLM training by pairing quantum-verifiable rewards with preference- and policy-optimization methods such as DPO and GRPO, improving pass@1 scores.
The Qiskit-HumanEval-hard Benchmark is a rigorously constructed evaluation suite designed to measure the capacity of LLMs and quantum software systems to synthesize, execute, and validate complex quantum programs using Qiskit, IBM’s open-source quantum computing SDK. Unlike prior benchmarks that focus primarily on classical code generation, Qiskit-HumanEval-hard presents domain-specific challenges that reflect both the syntactic and semantic complexity of quantum programming, with a particular emphasis on tasks that require correct modeling, simulation, and deployment of circuits on contemporary quantum hardware and simulators. The benchmark’s construction, task diversity, and evaluation protocols establish it as a key reference point for measuring progress in both quantum code synthesis and the development of robust, hardware-compatible quantum software agents.
1. Benchmark Structure and Motivation
Qiskit-HumanEval-hard is motivated by the need for a quantum-native equivalent of the classical HumanEval benchmark, capable of capturing the intricacies of quantum program generation and execution (Dupuis et al., 29 May 2024, Vishwakarma et al., 20 Jun 2024, Kheiri et al., 16 Jul 2025, Dupuis et al., 28 Aug 2025). Its tasks are hand-curated, each including:
- A natural-language prompt describing a quantum computing task
- A canonical Qiskit solution script
- One or more rigorous unit tests assessing correctness at both the code and execution level
- An entry point and typically a graded difficulty annotation (basic, intermediate, advanced)
This benchmark explicitly omits import statements in prompts, requiring models to autonomously select appropriate Qiskit modules and APIs—a design choice that tests not only completion accuracy but also deeper library understanding and self-sufficiency in the quantum code context (Dupuis et al., 28 Aug 2025).
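As a hypothetical illustration of the task format described above, a single record might look like the following sketch; the field names and contents are illustrative, not the benchmark's actual schema:

```python
# Hypothetical task record; field names are illustrative, not the benchmark's schema.
task = {
    "task_id": "qiskit-humaneval-hard/ghz_state",
    "difficulty": "basic",
    "prompt": (
        "def ghz_circuit(n: int):\n"
        '    """Return a QuantumCircuit preparing an n-qubit GHZ state."""\n'
    ),
    # Canonical solution: the prompt omits imports, so the model must know to
    # pull QuantumCircuit from qiskit on its own.
    "canonical_solution": (
        "    from qiskit import QuantumCircuit\n"
        "    qc = QuantumCircuit(n)\n"
        "    qc.h(0)\n"
        "    for i in range(n - 1):\n"
        "        qc.cx(i, i + 1)\n"
        "    return qc\n"
    ),
    "entry_point": "ghz_circuit",
    "test": "def check(candidate): ...",  # execution-based unit test (elided)
}
```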
2. Task Families and Quantum-Specific Demands
The benchmark includes over 100 tasks spanning the major subdomains of quantum software engineering (Vishwakarma et al., 20 Jun 2024). The task categories are:
- Quantum Circuit Synthesis: Construction of entangled states, state preparation, and algorithmic primitives (e.g., GHZ, Bell, teleportation, QAOA, quantum Fourier transform).
- Simulation and Execution: Running circuits via Aer, invoking Estimator and Sampler primitives, managing backend selection and configuration.
- Advanced Circuit Manipulation: Transpilation, hardware-aware routing, custom basis decompositions, integration with IBM Quantum Runtime (a transpilation sketch follows this list).
- Hybrid Classical-Quantum Workflows: Implementing quantum-classical loops, parameter sweeps, and result postprocessing.
- Visualization and Reporting: Circuit drawing, measurement result plotting, and serialization tasks.
- Error Handling and Resource Management: Correct handling of API evolution (e.g., deprecated modules), efficient use of quantum resources (measurement, reset, ancilla management).
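For instance, an advanced circuit manipulation task might require hardware-aware transpilation. A minimal sketch, assuming Qiskit ≥ 1.0 and using a generic backend model in place of real hardware:

```python
from qiskit import QuantumCircuit, transpile
from qiskit.providers.fake_provider import GenericBackendV2  # assumes Qiskit >= 1.0

# A 5-qubit generic backend model with a linear coupling map stands in for a real device.
backend = GenericBackendV2(num_qubits=5, coupling_map=[[0, 1], [1, 2], [2, 3], [3, 4]])

qc = QuantumCircuit(3)
qc.h(0)
qc.cx(0, 1)
qc.cx(0, 2)
qc.measure_all()

# Hardware-aware compilation: route the circuit onto the backend topology and
# rewrite gates into its native basis at a chosen optimization level.
tqc = transpile(qc, backend=backend, optimization_level=2)
print(tqc.count_ops())
```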
Each task is accompanied by human-authored unit tests that frequently exercise the code under simulated noise or on actual IBM Quantum devices, ensuring both functional and physical executability (Vishwakarma et al., 20 Jun 2024, Dupuis et al., 28 Aug 2025).
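A sketch of what an execution-based unit test for a GHZ-synthesis task could look like, assuming qiskit and qiskit-aer are available; the test body is illustrative, not drawn from the benchmark itself:

```python
from qiskit import QuantumCircuit, transpile
from qiskit_aer import AerSimulator


def check(candidate):
    """Illustrative execution-based test: the candidate must return a circuit
    whose measured counts are consistent with a 3-qubit GHZ state."""
    qc: QuantumCircuit = candidate(3)
    qc.measure_all()

    sim = AerSimulator()
    counts = sim.run(transpile(qc, sim), shots=4096).result().get_counts()

    # An ideal GHZ state only collapses to all-zeros or all-ones.
    assert set(counts) <= {"000", "111"}
    # Both outcomes should appear with roughly equal frequency.
    assert abs(counts.get("000", 0) - counts.get("111", 0)) < 500
```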
3. Evaluation Protocol: Metrics and Execution
Automated, execution-based evaluation forms the core of the benchmark protocol (Dupuis et al., 29 May 2024, Vishwakarma et al., 20 Jun 2024, Kheiri et al., 16 Jul 2025, Dupuis et al., 28 Aug 2025). Solutions are sandboxed and run under strict environment controls using the latest Qiskit SDK. The primary metric is pass@1: the fraction of tasks for which a single sampled completion passes all associated unit tests.
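For reference, the general pass@k estimator from the original HumanEval protocol (pass@1 is the k = 1 case, which reduces to the mean fraction of passing samples per task) is:

$$
\text{pass@}k \;=\; \mathop{\mathbb{E}}_{\text{tasks}}\!\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right]
$$

where n is the number of sampled completions per task and c is the number of those completions that pass all unit tests.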
Tests verify not just type constraints and output values, but also correct circuit compilation, hardware interaction, and numerics in measurement postprocessing. For challenging tasks involving hardware integration, the benchmark requires models to infer and invoke the correct set of Qiskit primitives, handle missing imports, and manage configuration automatically—a significant increase in required context understanding and operational fidelity (Dupuis et al., 28 Aug 2025, Dupuis et al., 29 May 2024).
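As an illustration of the primitive-invocation style of task, where the model must supply the imports and call a V2 primitive correctly, a minimal sketch assuming Qiskit ≥ 1.0:

```python
from qiskit import QuantumCircuit
from qiskit.quantum_info import SparsePauliOp
from qiskit.primitives import StatevectorEstimator  # assumes Qiskit >= 1.0

# Bell-state circuit whose ZZ expectation value should be +1.
qc = QuantumCircuit(2)
qc.h(0)
qc.cx(0, 1)

observable = SparsePauliOp("ZZ")

# V2 primitive interface: submit a (circuit, observable) "pub" and read the
# expectation value back from the result data.
estimator = StatevectorEstimator()
job = estimator.run([(qc, observable)])
expectation = job.result()[0].data.evs

assert abs(float(expectation) - 1.0) < 1e-9
```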
4. Relevance to Model Training and Reward Design
Recent research demonstrates that LLMs adapted to this benchmark exhibit marked improvements when trained via preference-based and quantum-verifiable reinforcement learning strategies (Kheiri et al., 16 Jul 2025, Dupuis et al., 28 Aug 2025). Notably:
- Direct Preference Optimization (DPO): Models are tuned on synthetic datasets of accepted (unit-test passing) versus rejected (failing) code generations, aligning them closely with functional quantum program patterns.
- Group Relative Policy Optimization (GRPO): Multiple candidate completions are generated and scored by quantum-verifiable rewards (fraction of passing tests upon execution), with score normalization (advantage) driving stable and effective weight updates (see the sketch after this list).
- Hybrid Approaches: Sequential application of DPO and GRPO yields models that can both produce high-level, maintainable code (as measured by DPO) and robust, executable code (as measured by GRPO and quantum-verifiable reward) (Dupuis et al., 28 Aug 2025).
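A minimal sketch of the quantum-verifiable reward and group-relative advantage computation described above; the function names and reward definition are illustrative, not the papers' exact implementation:

```python
import statistics
from typing import Callable, Sequence


def quantum_verifiable_reward(
    candidate_code: str,
    unit_tests: Sequence[Callable[[str], bool]],
) -> float:
    """Illustrative reward: fraction of unit tests that report success when the
    candidate code is executed (each test callable wraps a sandboxed run)."""
    passed = sum(1 for test in unit_tests if test(candidate_code))
    return passed / len(unit_tests)


def group_relative_advantages(rewards: Sequence[float]) -> list[float]:
    """GRPO-style advantage: each candidate's reward is normalized against the
    mean and standard deviation of its own sampling group, which stabilizes
    the resulting policy-gradient updates."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal groups
    return [(r - mean) / std for r in rewards]


# Example: four sampled completions for one task, scored by pass fraction.
rewards = [1.0, 0.5, 0.0, 0.5]
print(group_relative_advantages(rewards))
```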
Performance gains in these settings are substantial, with pass@1 scores on Qiskit-HumanEval-hard surpassing previous open-source baselines by significant margins; for example, ORPO-tuned models reached 56.29% pass@1, GRPO-tuned models reached 49%, and hybrid DPO+GRPO models outperformed models orders of magnitude larger (Kheiri et al., 16 Jul 2025, Dupuis et al., 28 Aug 2025).
5. Comparative Insights and Limitations
Benchmark analysis reveals distinct strengths and weaknesses across model variants and training regimens (Kheiri et al., 16 Jul 2025):
| Task Difficulty | GRPO Best | ORPO Best |
|---|---|---|
| Basic | Yes | No |
| Intermediate | No | Yes |
| Advanced | No (none solved) | No (none solved) |
- GRPO models excel at basic structural circuit construction and correctness.
- ORPO models are stronger on intermediate, more abstract tasks involving nuanced API usage.
- All current approaches fail to solve the advanced subset, indicating persistent challenges in multi-stage, highly integrated quantum code requirements.
This suggests a research opportunity in combining task-specific contextualization with robust reward signals, as well as the need for further algorithmic innovation in quantum code understanding.
6. Broader Implications for Quantum Software and AI
Qiskit-HumanEval-hard serves as a testbed not just for LLMs but for quantum SDK evolution itself, driving improvements in import resolution, API stability, and error mitigation strategies (Javadi-Abhari et al., 14 May 2024, Pathak et al., 17 Aug 2025). It also incentivizes the design of quantum programming frameworks and transpilers optimized for rigorous, end-to-end verification. The integration of quantum-verifiable learning signals—quantitative evaluation based on execution in high-fidelity simulators or real quantum hardware—is a defining methodological advance, ensuring that code recommended by AI systems is robust to both software and hardware idiosyncrasies (Dupuis et al., 28 Aug 2025).
7. Future Directions and Extensions
Emerging research directions, as motivated by the benchmark’s challenges, include:
- Expansion to cover more advanced topics, such as error-corrected quantum codes, dynamic circuits, and quantum networking protocols.
- Deeper integration of reward structures that combine syntactic, semantic, and execution-level validation.
- Longer-context and self-healing models capable of adapting to rapid SDK evolution.
- Benchmark-led improvements in SDK modularity and hardware interface abstraction to facilitate truly automated quantum-classical workflows (Javadi-Abhari et al., 14 May 2024, Pathak et al., 17 Aug 2025).
A plausible implication is that adoption of benchmarks of this rigor and scope will accelerate both AI-assistance in quantum research and the pace of SDK/hardware co-design, making execution-based, end-to-end evaluation a normative best practice in the quantum software community.
In summary, Qiskit-HumanEval-hard is a high-fidelity, execution-grounded benchmark that reflects contemporary demands in quantum programming, LLM alignment, and quantum software development. Its tasks, methodology, and evaluation protocol have already shaped research in model training, SDK development, and robust quantum code synthesis, offering a blueprint for future standards in AI-assisted quantum programming (Dupuis et al., 29 May 2024, Vishwakarma et al., 20 Jun 2024, Kheiri et al., 16 Jul 2025, Dupuis et al., 28 Aug 2025).