Hybrid QPU-GPU Architectures
- Hybrid QPU-GPU architectures are heterogeneous computing systems that couple quantum processing for kernel execution with GPU-driven classical numerics.
- They enable varied integration regimes—from loose, cloud-native workflows to tightly coupled designs using PCIe/CXL—that optimize resource management and minimize latency.
- Specialized software stacks and scheduling mechanisms orchestrate the diverse roles of CPUs, GPUs, and QPUs, ensuring robust error correction, calibration, and scalable performance.
Hybrid QPU-GPU architectures are heterogeneous computing systems in which quantum processing units (QPUs) and graphics processing units (GPUs) participate in a single computational stack, typically alongside CPUs, FPGAs, storage, and networked control electronics. In the literature, this topic spans several distinct but converging design lines: client–server hybrid programming models derived from accelerator offload frameworks, HPC deployments in which QPUs are scheduled as special resources next to GPU nodes, cloud-native workflow systems that orchestrate CPUs, GPUs, and QPUs through containerized DAGs, and tightly coupled system-level designs in which the QPU is treated as a peripheral device managed by the operating system or by a low-latency real-time interconnect (McCaskey et al., 2018, Cacheiro et al., 25 May 2025, Ramsauer et al., 25 Jul 2025, Tejedor et al., 25 Mar 2026, Raj et al., 17 Apr 2026). Across these forms, the recurring functional split is stable: CPUs provide orchestration and control flow, GPUs supply high-throughput classical numerics and simulation, and QPUs execute quantum kernels or modality-specific control tasks.
1. Architectural forms and system models
The literature distinguishes several integration regimes. A hardware-oriented survey separates four configurations: loose standalone, loose co-located, tight co-located (multi-QPU), and tight on-node (Rallis et al., 24 Mar 2025). In the loose forms, QPUs are accessed as remote systems over Ethernet, InfiniBand, WAN, or cloud APIs; in the tight forms, QPUs or their controllers are attached through PCIe Gen4/5, CXL, MMIO-style integration, or low-latency local fabrics (Rallis et al., 24 Mar 2025, Ramsauer et al., 25 Jul 2025).
A system-level variant makes the QPU explicit as a classical peripheral analogous to a GPU or NIC. In this model, the QPU appears as a quantum accelerator card on the classical node, with a kernel-space driver and a Quantum Abstraction Layer (QAL) exposing the device as a character device such as /dev/qal0, with DMA, MSI/MSI-X, ioctl-based control, and anticipated mmap and SR-IOV support (Ramsauer et al., 25 Jul 2025). A related real-time design, NVQLink, defines a Logical QPU that includes the physical QPU, pulse-processing units (PPUs), a Real-time Host (RTH) equipped with CPUs and GPUs, and the interconnect joining them, so that GPUs become part of the QPU’s control stack rather than merely external post-processors (Caldwell et al., 29 Oct 2025).
At cluster scale, hybrid deployments are often organized around schedulers and workflow engines rather than board-level coupling. QMIO integrates a superconducting QPU into an HPC center through a gateway node and a Quantum Control Node (QCN), with the resource manager treating the QPU analogously to a GPU-attached resource (Cacheiro et al., 25 May 2025). The PCSS deployment uses two ORCA PT-1 photonic QPUs as network-attached devices exposed through HTTP REST, with Slurm providing multi-user control and CUDA-Q providing the hybrid programming layer (Slysz et al., 22 Aug 2025). A cloud-native alternative treats QPUs, GPUs, and CPUs as node classes inside a Kubernetes cluster, using containerized Pods, Argo DAGs, Kueue queues, and Prometheus/Grafana observability (Tejedor et al., 25 Mar 2026).
| Integration style | Defining traits | Representative systems |
|---|---|---|
| Loose or network-attached | Ethernet/WAN, REST or message-bus access, scheduler-mediated allocation | QMIO gateway/QCN (Cacheiro et al., 25 May 2025), PCSS PT-1 + Slurm/CUDA-Q (Slysz et al., 22 Aug 2025) |
| Tight co-located or on-node | PCIe/CXL/MMIO, DMA, MSI-X, kernel/device abstractions | QAL peripheral model (Ramsauer et al., 25 Jul 2025), NVQLink Logical QPU (Caldwell et al., 29 Oct 2025) |
| Cloud-native workflow fabric | Containerized DAGs across CPU/GPU/QPU nodes | Kubernetes + Argo + Kueue (Tejedor et al., 25 Mar 2026) |
This classification corrects a common oversimplification. Hybrid QPU-GPU architectures are not a single topology; they range from remote accelerator access to deeply integrated real-time machines. The principal distinction is not whether a GPU and a QPU coexist, but whether their interaction is batch-oriented, workflow-oriented, or latency-bounded.
2. Software stacks, programming abstractions, and control planes
The software layer is similarly stratified. One early formulation cast hybrid quantum programming explicitly in the classical co-processor mold, “akin to OpenCL or CUDA for GPUs,” with the host preparing quantum kernels, compilers lowering them to an intermediate representation, and accelerator backends mapping the IR to specific QPU APIs (McCaskey et al., 2018). XACC formalized this pattern with quantum kernels, an extensible IR, compiler plugins, and Accelerator abstractions that can target physical QPUs or simulators (McCaskey et al., 2018).
Later systems separate high-level workflow orchestration from low-level device control. Kubernetes-native hybrid workflows describe each stage as a containerized Argo step; GPUs are requested as extended resources such as nvidia.com/gpu: 1, while QPUs are currently steered through node labels like resource_type: qpu, with the longer-term migration path identified as Kubernetes Dynamic Resource Allocation (DRA) device classes and device claims (Tejedor et al., 25 Mar 2026). In that model, DAGs express fan-out/fan-in patterns for subcircuit generation, parallel execution, and reconstruction, and Secrets supply QPU credentials to Pods (Tejedor et al., 25 Mar 2026).
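The fan-out/fan-in structure described above can be sketched as a small manifest generator. This is only an illustration, not the cited paper's workflow: the image name, template names, the `qpu-credentials` Secret, and the alternating QPU/GPU assignment are all hypothetical. It relies on standard Argo Workflow fields (`dag.tasks`, `dependencies`, `nodeSelector`) together with the `nvidia.com/gpu` resource and the `resource_type: qpu` label mentioned in the text.

```python
import json


def build_cutting_workflow(num_fragments: int) -> dict:
    """Return an Argo Workflow manifest (as a dict) with a fan-out/fan-in DAG."""
    # Fan-out: one task per subcircuit. Alternating QPU/GPU placement is an
    # arbitrary choice made for this illustration.
    fragment_tasks = [
        {
            "name": f"fragment-{i}",
            "template": "run-on-qpu" if i % 2 == 0 else "simulate-on-gpu",
            "dependencies": ["generate"],
        }
        for i in range(num_fragments)
    ]
    # Fan-in: reconstruction waits for every fragment task.
    reconstruct = {
        "name": "reconstruct",
        "template": "cpu-step",
        "dependencies": [t["name"] for t in fragment_tasks],
    }
    dag_tasks = [{"name": "generate", "template": "cpu-step"}] + fragment_tasks + [reconstruct]

    image = "example.org/hybrid-step:latest"  # hypothetical container image
    qpu_env = [{
        "name": "QPU_TOKEN",  # hypothetical variable fed from a Kubernetes Secret
        "valueFrom": {"secretKeyRef": {"name": "qpu-credentials", "key": "token"}},
    }]
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "circuit-cutting-"},
        "spec": {
            "entrypoint": "main",
            "templates": [
                {"name": "main", "dag": {"tasks": dag_tasks}},
                {"name": "cpu-step", "container": {"image": image}},
                # GPUs are requested as an extended resource.
                {"name": "simulate-on-gpu",
                 "container": {"image": image,
                               "resources": {"limits": {"nvidia.com/gpu": 1}}}},
                # QPU steps are steered by node label rather than a resource request.
                {"name": "run-on-qpu",
                 "nodeSelector": {"resource_type": "qpu"},
                 "container": {"image": image, "env": qpu_env}},
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_cutting_workflow(4), indent=2))
```

Generating the manifest programmatically keeps the number of fragments a parameter; a migration to DRA device claims would replace the `nodeSelector` entry rather than the DAG structure.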
HPC-center deployments instead rely on batch schedulers and backend-specific runtimes. QMIO exposes framework-specific backends for Qiskit and PyTket, translates circuits into OpenQASM 2/3 or QIR, and uses a framework-agnostic QmioRuntimeService in qmio-run to send IR, shot count, output format, and repetition period over ZeroMQ to the QCN, where OQC’s QAT (Quantum Assembly Toolchain) lowers them to pulse-level execution (Cacheiro et al., 25 May 2025). At PCSS, Slurm manages QPU access through its license mechanism, while CUDA-Q and an ORCA backend present the photonic PT-1 as a multi-QPU target inside a unified C++/Python environment; a PTLayer integrates QPU invocations into classical neural-network code (Slysz et al., 22 Aug 2025).
At the lowest latency end, QAL and NVQLink push the abstraction boundary downward. QAL exports a QDMI-style interface, maintains kernel-managed queues of quantum execution sequences, and defines a split between kernel responsibilities—real-time scheduling, arbitration, context isolation—and user-space responsibilities such as complex transpilation and higher-level heuristics (Ramsauer et al., 25 Jul 2025). NVQLink extends CUDA-Q with __qpu__ kernels, cudaq::device_call, and cudaq::device_ptr<T>, so that a quantum kernel can synchronously invoke a function on a GPU, PPU, or other device with compiler-managed marshaling (Caldwell et al., 29 Oct 2025). This removes the HTTP/REST control path entirely from the latency-critical loop.
A plausible implication is that “hybrid” in this literature increasingly means not merely calling a remote QPU from a classical program, but compiling a single heterogeneous task graph across multiple device domains. The step from XACC’s accelerator abstraction to NVQLink’s device callbacks is the step from host-mediated offload to in-kernel cross-accelerator control.
3. Resource management, scheduling, and data movement
Resource management is a defining technical problem because GPUs and QPUs are scarce, nonuniform, and often exposed through different control planes. In Kubernetes-native workflows, the basic mechanism is queue- and label-based placement: the Kubernetes scheduler performs bin-packing, while Kueue adds ResourceFlavors, LocalQueues, admission control, quotas, and fairness guarantees, especially for scarce GPUs and QPUs (Tejedor et al., 25 Mar 2026). Workflows opt into queues through labels such as kueue.x-k8s.io/queue-name, and data is exchanged through a shared PersistentVolume, which simplifies orchestration but introduces I/O bottlenecks during reconstruction (Tejedor et al., 25 Mar 2026).
In Slurm-based environments, the QPU is typically modeled as a special resource. QMIO first exposed the QCN itself as a Slurm node, then replaced that design with a gateway-node architecture after observing 1–3 s overhead per submission when each circuit was treated as a separate Slurm job; the replacement keeps a persistent ZeroMQ channel for the duration of a short allocation or a batch reservation of up to ~2 hours (Cacheiro et al., 25 May 2025). The PCSS deployment uses Slurm’s license management mechanism so that jobs request GPUs through --gres and QPUs through --licenses, while fair-share scheduling and per-device FIFO queues ensure multi-user access to two PT-1 systems (Slysz et al., 22 Aug 2025).
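As a rough illustration of the `--gres` plus `--licenses` pattern, the helper below assembles an `sbatch` command line. It is a sketch under assumptions: the license name `qpu`, the script name, and the default time limit are hypothetical and would be site-specific in practice. Only the standard `sbatch` options `--gres`, `--licenses`, and `--time` are used.

```python
import shlex
from typing import List


def build_sbatch_command(script: str, gpus: int = 1, qpu_license: str = "qpu",
                         qpu_count: int = 1, time_limit: str = "00:30:00") -> List[str]:
    """Assemble an sbatch invocation that asks for GPUs and a QPU license."""
    if gpus < 0 or qpu_count < 1:
        raise ValueError("gpus must be >= 0 and qpu_count must be >= 1")
    args = ["sbatch", f"--time={time_limit}"]
    if gpus:
        # GPUs are generic resources (GRES) attached to the allocated node.
        args.append(f"--gres=gpu:{gpus}")
    # The QPU is modeled as a cluster-wide license; "qpu" is a hypothetical name.
    args.append(f"--licenses={qpu_license}:{qpu_count}")
    args.append(script)
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without Slurm.
    print(shlex.join(build_sbatch_command("hybrid_job.sh", gpus=2)))
```

Because a license is a counted, cluster-wide token rather than a per-node resource, this model suits network-attached QPUs that are not bound to any particular compute node.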
System-level proposals add topology-aware and latency-aware resource models. QAL frames the QPU as a kernel-managed queueing resource with DMA and interrupt-driven completion, explicitly raising the question of where to divide scheduling between kernel and user space (Ramsauer et al., 25 Jul 2025). QHPC generalizes this into a Unified Resource Registry (URR) that tracks not only CPU and GPU properties but QPU-specific metrics such as gate fidelities, coherence times, calibration timestamps, drift indicators, queue length, estimated wait time, and local-versus-remote access latency; on top of that it introduces a Quantum Suitability Score (QSS) that ranks candidate QPUs for a job by combining fidelity, coherence, topology match, wait time, and latency (Raj et al., 17 Apr 2026).
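The text names the inputs to the QSS but not its functional form, so the following is a purely hypothetical scoring rule: a weighted sum of normalized terms, with weights and reference scales invented for illustration. It shows only how registry fields such as fidelity, coherence, topology match, wait time, and latency could be folded into a single ranking value.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical weights (they sum to 1); the source does not specify them.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "fidelity": 0.35, "coherence": 0.15, "topology": 0.20, "wait": 0.15, "latency": 0.15,
}


@dataclass(frozen=True)
class QpuRecord:
    """A simplified registry entry for one candidate QPU."""
    name: str
    gate_fidelity: float   # in [0, 1]
    coherence_us: float    # representative coherence time, microseconds
    topology_match: float  # in [0, 1], fit of the circuit to the device graph
    wait_s: float          # estimated queue wait, seconds
    latency_ms: float      # access latency, milliseconds


def suitability_score(q: QpuRecord, weights: Dict[str, float] = DEFAULT_WEIGHTS,
                      coherence_ref_us: float = 100.0, wait_ref_s: float = 60.0,
                      latency_ref_ms: float = 10.0) -> float:
    """Combine normalized terms into a score in [0, 1] (assumed form, not the paper's)."""
    terms = {
        "fidelity": q.gate_fidelity,
        "coherence": min(q.coherence_us / coherence_ref_us, 1.0),
        "topology": q.topology_match,
        # Wait time and latency are costs, so they are mapped to decreasing terms.
        "wait": 1.0 / (1.0 + q.wait_s / wait_ref_s),
        "latency": 1.0 / (1.0 + q.latency_ms / latency_ref_ms),
    }
    return sum(weights[k] * terms[k] for k in terms)


def rank_qpus(candidates: List[QpuRecord]) -> List[QpuRecord]:
    """Return candidates ordered from most to least suitable."""
    return sorted(candidates, key=suitability_score, reverse=True)


if __name__ == "__main__":
    local = QpuRecord("local", 0.99, 80.0, 0.9, 30.0, 1.0)
    remote = QpuRecord("remote", 0.995, 120.0, 0.9, 600.0, 200.0)
    for q in rank_qpus([local, remote]):
        print(q.name, round(suitability_score(q), 3))
```

With these made-up numbers the local device ranks first despite slightly lower fidelity, because the remote device's wait time and latency dominate; changing the weights changes the trade-off.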
Data movement spans equally different regimes. At the cloud-native layer, all intermediate artifacts may be classical files on /mnt/shared or an object-like shared volume (Tejedor et al., 25 Mar 2026). At the scheduler-mediated HPC layer, requests are small IR payloads and measurement results sent over ZeroMQ or HTTP REST (Cacheiro et al., 25 May 2025, Slysz et al., 22 Aug 2025). At the system level, the baseline host-memory rendezvous model can be replaced by DMA, SR-IOV, peer-to-peer PCIe, or direct RDMA into GPU memory (Ramsauer et al., 25 Jul 2025). NVQLink demonstrates the extreme case: 100G Ethernet or InfiniBand with RoCE, GPUNetIO, and a dedicated PCIe switch connecting NIC and GPU so that packet traffic bypasses the CPU; the proof of concept characterizes the FPGA→GPU→FPGA round trip by its mean, standard deviation, and maximum observed latency (Caldwell et al., 29 Oct 2025).
These scheduling layers embody a central trade-off. Loose coupling gives simpler deployment and broader compatibility; tight coupling gives determinism and makes GPU–QPU feedback plausible on microsecond budgets. The literature does not treat one mode as universally superior; it aligns each mode to a workload class.
4. Computational roles of GPUs and QPUs
The division of labor between GPU and QPU is broader than “simulation versus hardware execution.” In the Kubernetes circuit-cutting prototype, GPU nodes are used as accelerators for classical simulation of quantum circuits, while CPUs generate fragments and reconstruct the global observable after QPU/CPU/GPU fragment execution (Tejedor et al., 25 Mar 2026). The reconstruction uses the standard gate-cutting pattern, in which fragment results are recombined as a weighted sum, which makes the orchestration problem one of massive classical fan-out and fan-in rather than a single monolithic quantum run (Tejedor et al., 25 Mar 2026).
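A minimal sketch of such a reconstruction, assuming the generic quasi-probability form in which the global expectation value is a weighted sum, over decomposition terms, of the product of the fragment expectation values for that term. The coefficients and data layout are illustrative assumptions, not the cited prototype's exact scheme.

```python
from math import prod
from typing import Sequence


def reconstruct_expectation(coefficients: Sequence[float],
                            fragment_values: Sequence[Sequence[float]]) -> float:
    """Fan-in step: combine per-term fragment results into one expectation value.

    coefficients[i]       weight of the i-th term of the decomposition
    fragment_values[i][k] expectation value measured on fragment k for term i
    """
    if len(coefficients) != len(fragment_values):
        raise ValueError("need exactly one list of fragment values per coefficient")
    # Each term multiplies its fragment results; terms are then summed with weights.
    return sum(c * prod(values) for c, values in zip(coefficients, fragment_values))


if __name__ == "__main__":
    # Two decomposition terms, two fragments each (made-up numbers).
    print(reconstruct_expectation([2.0, -1.0], [[0.5, 0.5], [0.25, 1.0]]))
```

Each inner list can be produced by an independent QPU, GPU, or CPU task, which is why the workload maps naturally onto a fan-out DAG followed by a single reduction step.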
In variational and quantum-machine-learning workloads, GPUs dominate the classical numerical envelope around the QPU. The QCQ or “QSandwich” architecture places a VQE-based state-preparation stage in the first quantum layer, a classical CNN on GPUs in the middle, and a small quantum circuit layer at the end; with PennyLane Lightning and cuQuantum, multi-GPU acceleration yields up to tenfold increases in computational speed over CPU-based methods in complex phase-transition classification tasks, and the TFIM/XXZ studies report test accuracies up to 99.5% (Chen et al., 2024). QHPC generalizes this pattern by assigning GPUs to Hamiltonian preprocessing, ML surrogates, circuit simulation, and post-processing, while QPUs execute VQE, QPE, QAOA, QSVM, and quantum neural-network kernels (Raj et al., 17 Apr 2026).
Some of the most latency-sensitive workloads invert the usual perspective and treat the GPU as part of the QPU’s internal control path. NVQLink places QEC decoding, online calibration, and QCVV analysis on RTH GPUs, with PPUs supplying syndrome streams and pulse-level control; the paper explicitly cites the QEC decoding requirements of Fusion Blossom on a 1000-qubit surface code and of AI decoders for about 100 logical qubits, arguing that GPUs are “particularly well suited” because the decoding windows are batch-parallel (Caldwell et al., 29 Oct 2025). QAL makes a similar distinction between on-card or near-device control compute and higher-level runtimes, positioning the GPU as the pre-/post-processing engine around QPU kernels and potentially as one participant in a generic accelerator abstraction layer (Ramsauer et al., 25 Jul 2025).
Annealing-based hybrids instantiate yet another role split. For larger-than-QPU lattice Ising problems, the hybrid annealing algorithm maintains the global spin configuration classically, selects a region, constructs the conditional subproblem on that region with the spins outside it held fixed, solves that subproblem on the QPU, and then accepts the update only if the global energy decreases (Raymond et al., 2022). The paper uses CPUs for this loop, but the per-iteration workload (conditional field computation, energy evaluation, greedy cleanup, or simulated-annealing post-processing) is sparse and massively parallel, which suggests a natural GPU implementation.
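The loop can be sketched as follows under simplifying assumptions: the Ising energy is taken as E(s) = Σ_i h_i s_i + Σ_(i,j) J_ij s_i s_j, couplings to spins outside the region are folded into effective fields, and an exhaustive solver stands in for the QPU call. Function names and the data layout are hypothetical and do not reproduce the cited paper's implementation.

```python
import itertools
from typing import Dict, List, Sequence, Tuple

Couplings = Dict[Tuple[int, int], float]


def energy(h: Sequence[float], J: Couplings, s: Sequence[int]) -> float:
    """Global Ising energy for spins s_i in {-1, +1}."""
    field = sum(h[i] * s[i] for i in range(len(s)))
    return field + sum(w * s[i] * s[j] for (i, j), w in J.items())


def conditional_subproblem(h, J: Couplings, s, region: Sequence[int]):
    """Fix spins outside `region`; fold their couplings into effective fields."""
    inside = set(region)
    h_eff = {i: h[i] for i in region}
    J_sub: Couplings = {}
    for (i, j), w in J.items():
        if i in inside and j in inside:
            J_sub[(i, j)] = w        # coupling stays inside the subproblem
        elif i in inside:
            h_eff[i] += w * s[j]     # neighbor j is frozen
        elif j in inside:
            h_eff[j] += w * s[i]     # neighbor i is frozen
    return h_eff, J_sub


def solve_subproblem(h_eff: Dict[int, float], J_sub: Couplings) -> Dict[int, int]:
    """Exhaustive stand-in for the QPU call; only viable for small regions."""
    nodes = sorted(h_eff)
    best, best_e = {}, float("inf")
    for values in itertools.product((-1, 1), repeat=len(nodes)):
        a = dict(zip(nodes, values))
        e = sum(h_eff[i] * a[i] for i in nodes)
        e += sum(w * a[i] * a[j] for (i, j), w in J_sub.items())
        if e < best_e:
            best, best_e = a, e
    return best


def hybrid_step(h, J: Couplings, s: Sequence[int], region: Sequence[int]) -> List[int]:
    """One iteration: solve the regional subproblem, accept only on improvement."""
    h_eff, J_sub = conditional_subproblem(h, J, s, region)
    proposal = list(s)
    for i, v in solve_subproblem(h_eff, J_sub).items():
        proposal[i] = v
    return proposal if energy(h, J, proposal) < energy(h, J, s) else list(s)


if __name__ == "__main__":
    n = 6
    h = [0.0] * n
    J = {(i, i + 1): -1.0 for i in range(n - 1)}  # ferromagnetic chain
    s = [1, -1, 1, -1, 1, -1]
    for region in ([0, 1, 2], [2, 3, 4]):
        s = hybrid_step(h, J, s, region)
    print(s, energy(h, J, s))
```

The effective-field and energy computations are simple reductions over the coupling list, which is the part the text identifies as a natural candidate for GPU offload.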
A recurring misconception is that the GPU’s role is secondary whenever real QPUs are present. The sources do not support that view. GPUs appear as simulators, optimizers, tensor-network engines, ML backends, calibration engines, QEC decoders, and even direct participants in the QPU control loop.
5. Performance, scalability, reproducibility, and observability
Published systems report very different performance signatures because they operate at different layers. The Kubernetes workflow paper does not provide latency charts or speedup curves, but it reports sustained CPU and GPU utilization during parallel execution, identifies Pod startup and queue admission as acceptable overhead for medium- to large-duration subcircuit tasks, and names PersistentVolume I/O—especially during reconstruction—as the major bottleneck (Tejedor et al., 25 Mar 2026). It also identifies exclusive GPU allocation as a concurrency limiter for small simulator tasks and argues for future fractional allocation via DRA (Tejedor et al., 25 Mar 2026).
The QCQ multi-GPU study offers the clearest accelerator-scaling numbers at the workflow level. Using PennyLane Lightning plus cuQuantum with CUDA-aware MPI and NVLink, performance improves from 1 to 2 to 4 NVIDIA A100 GPUs, although the speedup “does not fully align with an ideal linear progression,” which the authors attribute partly to the modest qubit counts in the experiment (Chen et al., 2024). In a different context, the classical CPU/GPU simulation of Sycamore-class circuits uses a single A100 GPU for state construction and CPU jobs for distributed sampling: the 53-qubit, 20-cycle state-construction phase takes 748 s, the full 2.5 million-shot workload over 100 CPU jobs completes in 01:15:36, and the projected duration with 1,000 CPU jobs is approximately 00:17:35; for the 53-qubit, 14-cycle circuit, the reported linear XEB is 0.549, higher than Google’s published 0.002 reference-data score (Wold et al., 8 Dec 2025). This result does not describe a QPU-GPU architecture directly, but it matters because it sharpens the classical baseline against which hybrid systems are judged.
Operational considerations are equally prominent. QMIO schedules daily calibrations of 1- and 2-qubit gates and T1 measurements, weekly T2 and full randomized benchmarking, and notes that calibrations take about 2 hours daily, with an additional 1-hour manual tuning window if automated runs detect issues (Cacheiro et al., 25 May 2025). The PCSS deployment emphasizes that the two photonic PT-1 QPUs fit into a standard active data-center room, with no special networking, power, or cooling requirements beyond standard facilities, because each PT-1 is a room-temperature rack device drawing about 600 W (Slysz et al., 22 Aug 2025).
Reproducibility and observability are explicit design goals in cloud-native systems. Containerization fixes software environments across nodes; declarative Argo YAML captures DAG structure, resource requirements, volume mounts, and credential handling; Prometheus collects workflow state, resource usage, and QPU-specific metrics such as latency and queue time; Grafana surfaces timelines, utilization traces, queue depth, and throughput (Tejedor et al., 25 Mar 2026). Similar monitoring logic appears in QMIO, which correlates cryostat and environmental measurements with fidelity variation over 10 months (Cacheiro et al., 25 May 2025).
A plausible implication is that hybrid QPU-GPU performance can no longer be reduced to kernel throughput alone. The relevant envelope includes scheduler latency, calibration windows, storage contention, queue depth, network determinism, and the reproducibility of the classical stack around the quantum run.
6. Conceptual tensions, misconceptions, and future directions
Several tensions recur across the literature. The first is loose coupling versus tight coupling. Network-attached QPUs are already sufficient for multi-user HPC-center deployments and hybrid algorithms such as optimization and quantum machine learning (Slysz et al., 22 Aug 2025). At the same time, system-level work argues that error correction, pulse-level feedback, and some calibration workloads demand reaction times of microseconds to tens of microseconds, which push the design toward PCIe/CXL/MMIO peripheralization, RDMA, and direct GPU participation in the control path (Ramsauer et al., 25 Jul 2025, Caldwell et al., 29 Oct 2025). These positions are not contradictory; they describe different parts of the workload space.
The second tension is specialized quantum stack versus mainstream infrastructure. One line of work argues that QPUs should be “just another kind of containerized step,” managed by Kubernetes, Argo Workflows, Kueue, and standard observability tools (Tejedor et al., 25 Mar 2026). Another line argues for a dedicated kernel subsystem such as QAL or a real-time interconnect such as NVQLink because cloud-style orchestration and HTTP interfaces are fundamentally too slow for low-level control (Ramsauer et al., 25 Jul 2025, Caldwell et al., 29 Oct 2025). This suggests a stratified architecture: cloud-native orchestration at the application timescale, and specialized OS and interconnect mechanisms in the real-time and deterministic-time domains.
A third tension concerns the moving classical baseline. The Sycamore-class simulation results show that a hybrid GPU+CPU pipeline can erode earlier quantum advantage claims substantially (Wold et al., 8 Dec 2025). The literature therefore treats GPUs not only as support hardware for QPUs but as competitors and validators. This is why several frameworks make GPU emulation a first-class fallback path: QHPC’s R2 tier, QCQ’s cuQuantum/Lightning backend, and QHPC’s HWD policies all assume that quantum stages may run on GPUs when QPU wait times or quality are unfavorable (Chen et al., 2024, Raj et al., 17 Apr 2026).
Future directions are correspondingly multi-layered. Kubernetes-native work points toward DRA device classes, fractional GPU/QPU allocation, in-memory or distributed storage instead of a single shared volume, and richer QPU resource descriptors such as qubit count, noise level, and shot budget (Tejedor et al., 25 Mar 2026). OS-level work calls for a unified accelerator management subsystem, cross-accelerator events or doorbells, topology-aware resource managers, and standardized APIs for expressing dependencies across CPU, GPU, and QPU tasks (Ramsauer et al., 25 Jul 2025). QHPC proposes a tiered fabric in which R2 GPU nodes, R3 tightly integrated CPU+GPU+QPU nodes, and R4 remote QPUs are exposed through a single Hybrid Workload Descriptor and a scheduler driven by QPU quality and latency metrics (Raj et al., 17 Apr 2026). Edge-oriented work adds a more conservative lesson: because remote QPU offload latency remains in the millisecond-to-second range with low determinism, while even on-device QPU paths are estimated to be on the order of milliseconds, safety-critical or hard real-time systems require classical fallback paths and should treat QPU contributions as asynchronous or advisory unless tighter integration is available (Dey et al., 13 Mar 2026).
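The advisory-with-fallback pattern from the edge-oriented work can be illustrated with a deadline-bounded call. This is a sketch under assumptions: `qpu_fn` and `fallback_fn` are hypothetical callables, and a thread pool stands in for whatever asynchronous transport a real system would use.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Any, Callable, Tuple


def advisory_call(qpu_fn: Callable[[], Any], fallback_fn: Callable[[], Any],
                  deadline_s: float, executor: ThreadPoolExecutor) -> Tuple[Any, str]:
    """Use the QPU result only if it arrives before the deadline.

    Returns (value, source), where source is "qpu" or "classical".
    """
    future = executor.submit(qpu_fn)
    try:
        return future.result(timeout=deadline_s), "qpu"
    except FutureTimeout:
        # Deadline missed: the control loop proceeds with the classical answer.
        # The background call is not interrupted; its late result is discarded.
        return fallback_fn(), "classical"
    except Exception:
        # Any QPU-side failure is also treated as "no advice available".
        return fallback_fn(), "classical"


if __name__ == "__main__":
    def slow_qpu() -> float:
        time.sleep(0.5)  # simulated remote round trip
        return 0.42

    with ThreadPoolExecutor(max_workers=2) as pool:
        print(advisory_call(slow_qpu, lambda: 0.40, deadline_s=0.05, executor=pool))
        print(advisory_call(lambda: 0.42, lambda: 0.40, deadline_s=0.05, executor=pool))
```

The caller always receives a value within roughly the deadline plus the fallback's own runtime, which is the property a hard real-time consumer needs. Whether the late quantum result is logged, cached, or dropped is a separate policy choice.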
Taken together, these papers suggest that hybrid QPU-GPU architecture is not a transient NISQ expedient. Even in the presence of fault-tolerant QPUs, classical infrastructure remains central for task scheduling, data movement, simulators, tensor contractions, ML, calibration, observability, and, in some designs, the QPU’s own real-time control. The architectural problem is therefore not how to “add a QPU” to a GPU system, but how to define a coherent heterogeneous machine model in which GPUs and QPUs are both first-class resources with very different latency, determinism, and programming semantics.