Efficient Parallel Compilation and Profiling of Quantum Circuits at Large Scales

Published 31 Mar 2026 in cs.DC | (2603.29598v1)

Abstract: Compiling quantum circuits is a major bottleneck in quantum computing, and given the scale required in a few years, is likely to become infeasibly long. Techniques to reduce compilation time for quantum circuits are sorely needed. Furthermore, resources to test acceleration techniques are similarly lacking due to the limited scale of circuits in benchmark suites and mismatches in characteristics of these circuits and those produced by random circuit generators. This paper resolves the latter of these problems by describing a random circuit generator which allows control of circuit density, width and depth parameters. This is used to derive 8000 experimental large-scale circuits and test a novel approach to compiler parallelisation. This separates a circuit into sub-circuits which are compiled in parallel and recombined to produce a compiled circuit. When the parallel approach was tested using Qiskit, a peak speedup of 15.56 was achieved with corresponding overheads of less than 1%.


Authors (3)

Summary

The paper presents a novel random circuit generator that controls width, depth, and density to mimic real benchmark distributions.
It introduces a parallel compilation approach by decomposing circuits into temporally ordered sub-circuits with minimal SWAP overhead.
Experimental results demonstrate up to 19.8× speedup and negligible overhead, confirming the method's scalability across different compilers.

Motivation and Problem Statement

Quantum circuit compilation has become an acute bottleneck for practical quantum computation. As circuit sizes grow—both in qubit width and gate depth—the time required for compilation often exceeds simulation and even hardware execution, especially for circuits containing >100 qubits or >100,000 gates. Scaling to millions of qubits and gates, as projected for future quantum hardware, is infeasible under current compiler designs. The lack of sufficiently large circuit benchmarks and random circuit generators that can model realistic density characteristics further impedes empirical studies of compilation performance.

Existing benchmark suites (e.g., QASMBench, MQTBench, Red Queen) offer limited variety and scale; only a small subset of their circuits are both deep and wide. Moreover, density, a key determinant of compilation cost, is rarely matched by random generators, which typically saturate circuits to 100% density, a scenario unrepresentative of hand-crafted algorithmic circuits. There is thus a critical need for scalable circuit generators and robust parallelisation strategies applicable across compilers and routing algorithms.

Figure 1: a) Scatter chart illustrating the depth and width statistics for 5 benchmark suites with gate density. b) Box plot illustrating the density distribution of the 5 benchmark suites.

Random Circuit Generation with Controlled Density

The paper introduces a random circuit generator capable of controlling width, depth, and density parameters. Density $d$, defined as $d = \frac{n_{q1} + 2n_{q2}}{\text{depth} \times \text{width}}$ (with $n_{q1}$ and $n_{q2}$ the one- and two-qubit gate counts), is regulated post-generation via probabilistic gate removal, ensuring generated circuits mimic the density distributions observed in benchmark libraries (typically $< 60\%$). The generator thus enables systematic production of circuits with up to 200 qubits and $>16$ million gates, substantially exceeding prior empirical limits and facilitating comprehensive compiler profiling.
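The density definition and the removal step can be illustrated with a minimal Python sketch. It is not the paper's generator: the layer-based circuit representation, the function names (saturated_layers, density, thin), and the choice to keep the nominal layer count as the depth after removal are all assumptions made for illustration.

```python
import random

def saturated_layers(width, depth, rng):
    """Build `depth` layers in which every qubit is busy (100% density).

    Each gate is a tuple of qubit indices: length 2 for a two-qubit gate,
    length 1 for the leftover single-qubit gate when `width` is odd.
    """
    layers = []
    for _ in range(depth):
        qubits = list(range(width))
        rng.shuffle(qubits)
        layers.append([tuple(qubits[i:i + 2]) for i in range(0, width, 2)])
    return layers

def density(layers, width):
    """d = (n_q1 + 2 * n_q2) / (depth * width), with depth = number of layers."""
    gates = [g for layer in layers for g in layer]
    n_q1 = sum(len(g) == 1 for g in gates)
    n_q2 = sum(len(g) == 2 for g in gates)
    return (n_q1 + 2 * n_q2) / (len(layers) * width)

def thin(layers, target_density, rng):
    """Keep each gate independently with probability `target_density`.

    Starting from a saturated circuit, the expected density after
    removal equals the target (assumed simplification: depth is not
    recomputed after gates are dropped).
    """
    return [[g for g in layer if rng.random() < target_density]
            for layer in layers]

rng = random.Random(0)
full = saturated_layers(width=20, depth=200, rng=rng)
sparse = thin(full, target_density=0.4, rng=rng)
print(density(full, 20), density(sparse, 20))
```

Because removal is random, the achieved density only approximates the target; the approximation tightens as the gate count grows.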

Figure 2: Compilation times for quantum circuits of varying depth, width, and gate density.

Parallelisation Approach and Algorithmic Design

Compilation is dominated by qubit routing, an NP-complete problem made harder by the nearest-neighbor constraints of realistic hardware layouts (e.g., IBM Melbourne, linear or grid topologies). The proposed parallelisation methodology decomposes a circuit into temporally ordered sub-circuits, compiles each one independently in parallel, and then inserts permutation circuits that realign the qubit mappings before the sub-circuits are concatenated again.
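The decomposition step can be sketched as a contiguous split of the ordered gate list. This is a minimal sketch assuming gates are held in a plain Python list in program order; the function name split_by_gate_count is hypothetical and not taken from the paper.

```python
def split_by_gate_count(gates, parts):
    """Cut an ordered gate list into `parts` contiguous slices.

    Slice sizes differ by at most one gate, and concatenating the
    slices in order reproduces the original list, so temporal order
    is preserved across sub-circuits.
    """
    n = len(gates)
    bounds = [i * n // parts for i in range(parts + 1)]
    return [gates[bounds[i]:bounds[i + 1]] for i in range(parts)]

sub_circuits = split_by_gate_count(list(range(10)), 3)
print([len(s) for s in sub_circuits])  # [3, 3, 4]
```

Splitting on gate count alone ignores how many of the gates in each slice actually need routing, which is why a different balancing criterion may distribute the work more evenly.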

Key attributes:

Compiler and routing algorithm agnostic: Applicable to Qiskit (SabreSwap, BasicSwap), PyTKET, and others.
Load balancing: Sub-circuit division proportional to gate count, adjustable to optimize routing workload (potential future consideration: balance via multi-qubit gate distribution).
Permutation circuit synthesis: Minimally invasive SWAP insertion (A* search) to restore logical qubit orderings across sub-circuit boundaries.
Workflow efficiency: Direct manipulation of gate instruction lists avoids unnecessary overheads of circuit objects.
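The attributes above can be tied together in a toy end-to-end sketch on a linear topology. Several things here are assumptions rather than the paper's method: the router is a naive "walk one qubit toward the other" scheme instead of SabreSwap or BasicSwap, the permutation circuit is built with bubble sort instead of A* search, and a thread pool stands in for the separate processes a real compiler would need. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def route_on_line(gates, n):
    """Route two-qubit gates on a line, starting from the trivial mapping."""
    pos = list(range(n))   # pos[logical] -> physical
    at = list(range(n))    # at[physical] -> logical
    ops = []
    for a, b in gates:
        while abs(pos[a] - pos[b]) > 1:
            p = pos[a] if pos[a] < pos[b] else pos[a] - 1
            x, y = at[p], at[p + 1]          # swap physical p and p+1
            at[p], at[p + 1] = y, x
            pos[x], pos[y] = p + 1, p
            ops.append(("swap", p, p + 1))
        ops.append(("cx", pos[a], pos[b]))
    return ops, at

def permutation_swaps(at):
    """Adjacent SWAPs that restore the trivial ordering (bubble sort)."""
    at, swaps = list(at), []
    for _ in range(len(at)):
        for p in range(len(at) - 1):
            if at[p] > at[p + 1]:
                at[p], at[p + 1] = at[p + 1], at[p]
                swaps.append(("swap", p, p + 1))
    return swaps

def compile_part(gates, n):
    ops, at = route_on_line(gates, n)
    return ops + permutation_swaps(at)

def parallel_compile(chunks, n):
    """Compile pre-split sub-circuits concurrently and concatenate them."""
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        compiled = list(pool.map(lambda c: compile_part(c, n), chunks))
    return [op for part in compiled for op in part]

def replay(ops, n):
    """Recover the logical gate sequence from the physical circuit."""
    at, logical = list(range(n)), []
    for kind, p, q in ops:
        if kind == "swap":
            at[p], at[q] = at[q], at[p]
        else:
            logical.append((at[p], at[q]))
    return logical

chunks = [[(0, 3), (1, 4)], [(2, 5), (0, 5)], [(3, 1), (4, 2)]]
ops = parallel_compile(chunks, 6)
assert replay(ops, 6) == [g for c in chunks for g in c]
```

Because every compiled sub-circuit ends by restoring the trivial ordering, the next one, which was compiled assuming that ordering, can be appended directly. The replay check confirms that the concatenated physical circuit applies the original logical gates in the original order.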
Figure 3: Flow charts illustrating the workflows for a) Qiskit and b) the parallel compilation approach using three processors.

Figure 4: Example six-qubit circuit showing 2-qubit gates incompatible with processor topology (orange outline).

Figure 5: Sub-circuit decomposition of the example circuit, each with trivial initial qubit orderings.

Figure 6: Compiled sub-circuit with SWAP insertion to enable NNA compliance; CNOT gates highlighted.

Figure 7: Compiled sub-circuits with permutation circuits appended (SWAP gates). Barriers mark permutation circuit boundaries.

Figure 8: Final concatenated compiled circuit, including SWAP gates and permutation segments.

Experimental Evaluation

Qiskit (SabreSwap, BasicSwap) Results

Compiled 400 circuits spanning 20–200 qubits, 10k–100k gates, densities of 20–100%.
Peak speedup for SabreSwap reached 12.95 on 16 cores for high-density, deep circuits; BasicSwap peaked at 15.56.
Gate, SWAP, and depth overheads remain minimal: typically <1% (e.g., $0.2\%$ gate overhead, $0.25\%$ SWAP, $0.85\%$ depth for SabreSwap).
For low-density circuits and benchmarks, peak speedup is reduced, reflecting limited parallelism scope.
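As a quick worked check of these figures, the sketch below computes parallel efficiency and a relative overhead. The overhead definition (relative increase over the monolithic compile) is an assumption about how the percentages are measured, and the gate counts in the usage lines are hypothetical, chosen only to reproduce a 0.2% overhead.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores

def overhead_percent(parallel_count, serial_count):
    """Relative increase of a metric (gates, SWAPs, depth) in percent."""
    return 100.0 * (parallel_count - serial_count) / serial_count

# SabreSwap peak from the text: 12.95 on 16 cores, about 81% efficiency.
print(round(parallel_efficiency(12.95, 16), 3))

# Hypothetical counts: 100,200 gates (parallel) vs 100,000 (monolithic).
print(round(overhead_percent(100_200, 100_000), 3))
```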
Figure 9: Speedup and overhead variation with number of processors for random circuits using SabreSwap.

Figure 10: Speedup heat maps for 100% density random circuits using SabreSwap.

Figure 11: SWAP overhead heat maps for 100% density random circuits using SabreSwap.

Compiler and Routing Algorithm Effects

PyTKET routing (RoutingPass) achieves peak speedup of 19.8 for dense, deep circuits; overheads remain negligible (<0.25%).
Speedup is strongly correlated with circuit depth and density; width correlation is weaker due to depth reduction post-decomposition.
Cross-compiler results are consistent; parallelisation is robust to compiler improvements and topology variations (grid and linear).

Figure 12: Speedup and overhead cost variation with processors for random circuits compiled using PyTKET.

Figure 13: Speedup and overhead cost variation with processors for benchmark circuits compiled using PyTKET.

Additional Analysis: Memory, Fidelity, and Practical Constraints

Memory usage: Parallel implementation often reduces peak per-process memory footprint. Aggregate memory, however, scales with the number of cores and can exceed that of monolithic compilation for extremely wide circuits, so the degree of parallelism must be chosen with care.
Circuit fidelity: Simulated outputs of parallel-compiled circuits show fidelity indistinguishable from monolithic compilation, confirming correctness preservation.
Permutation circuit impact: Overhead induced by permutation circuits is minimal for large, high-density circuits but non-negligible for short or low-density circuits.
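The summary does not say which fidelity measure the paper uses. As one common choice for simulated pure states, the sketch below computes the state fidelity $|\langle a | b \rangle|^2$ between two state vectors; the function name and the plain-list representation of amplitudes are assumptions made for illustration.

```python
import math

def state_fidelity(a, b):
    """|<a|b>|^2 for two normalised state vectors of equal length.

    Amplitudes may be real or complex; a global phase difference
    between the two states does not change the result.
    """
    overlap = sum(complex(x).conjugate() * y for x, y in zip(a, b))
    return abs(overlap) ** 2

s = 1 / math.sqrt(2)
bell = [s, 0, 0, s]
phased = [1j * amp for amp in bell]   # same state up to a global phase
print(state_fidelity(bell, phased))
```

A value close to 1 for the outputs of the monolithic and parallel compilations would indicate that both circuits prepare the same state.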
Figure 14: Memory usage analysis with Qiskit's SabreSwap algorithm.

Figure 15: Fidelity analysis of parallel vs monolithic compilation using Qiskit's SabreSwap.

Implications and Future Directions

The presented methodology addresses two practical limitations in quantum compilation at scale: empirical benchmarking (via controlled random circuit generation) and accelerated compilation (via effective parallelisation). The demonstrated speedups, negligible overheads, and compiler-agnostic applicability have direct implications for hybrid quantum-classical workflows, near-term NISQ execution pipelines, and large-scale circuit simulation.

Theoretical implications span scalable quantum software engineering, distributed circuit design, and optimal resource allocation models. Immediate future directions include:

Auto-calibration of parallel sub-circuit quantity based on circuit physical attributes.
Load balancing refinement using multi-qubit gate statistics.
Integration with distributed quantum computing frameworks and topology-aware optimisers.

Emerging quantum hardware with multi-core architectures and large memory pools will benefit substantially from these techniques. The generator's extensibility to mimic algorithmic circuit structure (e.g., Shor, Grover) is a promising avenue for more realistic profiling.

Conclusion

This work addresses both the benchmarking and the compilation bottlenecks for large-scale quantum circuits. The generator enables empirical profiling of circuits far larger than those in existing benchmark suites. The parallel compilation approach achieves significant acceleration across compilers and routing algorithms with minimal overhead, reaching a peak speedup of 19.8× with PyTKET on dense, deep circuits. Memory and fidelity analyses indicate robust practical applicability. Future research should focus on optimal load balancing, deeper integration with quantum distributed computing, and enhancing random circuit generators to more precisely emulate structured algorithm outputs.