A Controlled Study of Memory Hierarchy Transitions in Quantum Circuit Simulation on Apple M4 Pro Unified Memory Architecture

Published 9 May 2026 in cs.PF and quant-ph | (2605.08792v2)

Abstract: State-vector quantum circuit simulation is memory-bandwidth bound, yet the interaction between memory hierarchy, access pattern, and hardware parallelism remains incompletely characterized. We address this using the Apple M4 Pro Unified Memory Architecture (UMA), where CPU and GPU share identical physical LPDDR5X DRAM ($\sim$224 GB/s STREAM bandwidth for both), eliminating memory-technology and interconnect confounds. Using a thermally isolated, multi-trial methodology across 11 simulation backends on GHZ and QFT circuits from 3 to 30 qubits, we make three central contributions. First, a Roofline analysis confirms all gate implementations have arithmetic intensity $\leq$0.38 FLOP/byte, well below the ridge point for any plausible peak compute on modern hardware, establishing structural memory-boundedness. Second, we identify a reproducible 4.46$\times$ timing discontinuity at the 28$\rightarrow$29 qubit transition, confirmed under thermally isolated conditions and cross-validated across GHZ and QFT circuits; tensordot backends exhibit the full discontinuity while direct-index backends maintain $\sim$2$\times$ per-qubit scaling throughout. Third, despite STREAM predicting only 1.85$\times$ GPU speedup (MLX CPU 119.9 GB/s vs. MLX GPU 221.9 GB/s), all three algorithm classes exceed this prediction: tensordot 3.1--4.1$\times$, flat-index 3.5--5.9$\times$, and direct-index 6--10$\times$, demonstrating that peak streaming bandwidth does not predict simulation speedup for non-contiguous memory access patterns, with the gap widening as access irregularity increases. These findings provide a hardware-characterization framework for quantum simulation workloads on UMA.


Authors (1)

Gyan Pratipat

Summary

The paper presents a controlled study demonstrating memory hierarchy-induced throughput discontinuities (DRAM cliff) in state-vector quantum simulation.
It uses thermally isolated, multi-trial measurements across 11 simulation backends on the Apple M4 Pro UMA to isolate memory access patterns and quantify performance on GHZ and QFT circuits.
Results reveal that STREAM bandwidth is inadequate to predict real-world performance, emphasizing the need for access-pattern-specific benchmarking.

Controlled Microarchitectural Characterization of State-Vector Quantum Simulation on Apple M4 Pro Unified Memory

Introduction

This paper presents a systematic characterization of memory hierarchy effects in state-vector quantum circuit simulation using the Apple M4 Pro SoC, leveraging its unified memory architecture (UMA) with physically shared LPDDR5X DRAM. The study isolates the impact of memory access patterns and parallelism on simulation throughput by eliminating confounds present in conventional heterogeneous systems (distinct CPU and GPU memory, multiple controllers, discrete interconnect). By employing an experimentally rigorous, thermally controlled methodology across multiple simulation backends and quantum circuit classes, the paper provides nuanced insight into the structural determinants of performance bottlenecks for large-scale state-vector simulation.

Methodology

Hardware and Backends

All benchmarks target the Apple M4 Pro (14-core CPU, 20-core GPU, 48 GB LPDDR5X), with memory bandwidth validated by a STREAM-like probe. The 11 simulation backends span brute-force, tensordot, flat-index, and direct-index kernels implemented in NumPy, JAX, and Apple's MLX, each run on the CPU and, where supported, the GPU. Covering each kernel class on both processors lets observed effects be attributed separately to the algorithm and to the hardware.
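To make the difference between kernel classes concrete, here is a minimal NumPy sketch of a tensordot-style and a direct-index-style single-qubit gate update. It is illustrative only and is not the paper's code: the function names, the little-endian qubit ordering, and the use of fancy indexing for the direct-index variant are all assumptions.

```python
import numpy as np

def apply_gate_tensordot(state, gate, q, n):
    """Tensordot-style update: treat the state as an n-axis tensor of shape (2,)*n."""
    psi = state.reshape((2,) * n)
    axis = n - 1 - q  # assumed little-endian: qubit q is bit q of the flat index
    psi = np.tensordot(gate, psi, axes=([1], [axis]))  # gate output axis lands at position 0
    psi = np.moveaxis(psi, 0, axis)  # move it back to the qubit's position
    return psi.reshape(-1)  # flatten (may copy if the view is non-contiguous)

def apply_gate_direct(state, gate, q, n):
    """Direct-index update: pair each index whose bit q is 0 with its bit-q-set partner."""
    idx = np.arange(1 << n)
    i0 = idx[((idx >> q) & 1) == 0]
    i1 = i0 | (1 << q)  # partner indices sit 2**q elements away
    a0, a1 = state[i0], state[i1]
    out = np.empty_like(state)
    out[i0] = gate[0, 0] * a0 + gate[0, 1] * a1
    out[i1] = gate[1, 0] * a0 + gate[1, 1] * a1
    return out

# Check that both kernels agree on a random 4-qubit complex64 state.
n = 4
rng = np.random.default_rng(0)
state = (rng.standard_normal(1 << n) + 1j * rng.standard_normal(1 << n)).astype(np.complex64)
hadamard = (np.array([[1, 1], [1, -1]]) / np.sqrt(2)).astype(np.complex64)
agree = all(
    np.allclose(apply_gate_tensordot(state, hadamard, q, n),
                apply_gate_direct(state, hadamard, q, n), atol=1e-5)
    for q in range(n)
)
```

Both functions compute the same result; they differ in how they walk memory. The tensordot form operates on reshaped blocks, while the direct-index form gathers and scatters amplitude pairs separated by a stride of 2**q elements.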

Benchmark Circuits

Experiments use both GHZ and Quantum Fourier Transform (QFT) circuits from 3 to 30 qubits to control for circuit-specific effects, with correctness validated against Qiskit Aer. Thermal isolation (a 90 s recovery period between sequential backend runs) ensures that timings reflect memory-system behavior rather than thermal throttling.
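A thermally isolated, multi-trial harness of the kind described could look roughly like the sketch below. Only the 90 s recovery period comes from the summary; the function names, the trial count, and the choice of the median as the reported statistic are assumptions.

```python
import statistics
import time

def time_backend(run, trials=5):
    """Return the median wall-clock time of run() over several trials."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def benchmark_all(backends, cooldown_s=90.0, trials=5):
    """Time each backend in sequence, idling between backends so the SoC can cool."""
    results = {}
    for i, (name, run) in enumerate(backends.items()):
        if i > 0:
            time.sleep(cooldown_s)  # recovery period between sequential backend runs
        results[name] = time_backend(run, trials)
    return results

# Tiny stand-in workloads; a real run would pass circuit-simulation callables.
demo = benchmark_all(
    {"sum": lambda: sum(range(1000)), "noop": lambda: None},
    cooldown_s=0.0,
    trials=3,
)
```

For asynchronous backends such as JAX, each callable would need to block until the result is ready (for example by calling `block_until_ready()` on the output) so that the timer covers the actual computation.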

Roofline Model Position and Structural Memory-Boundedness

Roofline analysis quantifies arithmetic intensity (AI) for all quantum gate types under all algorithmic implementations. Across all cases, AI remains below 0.38 FLOP/byte—deep within the memory-bound regime for the M4 Pro. This establishes that state-vector simulation speed is dominated by memory traffic rather than computation throughout the evaluated qubit range, regardless of backend or hardware (2605.08792).
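The Roofline argument can be sketched numerically. The bandwidth (224 GB/s) and the arithmetic-intensity bound (0.38 FLOP/byte) are taken from the text; the peak-compute figure is a hypothetical placeholder, not a measured M4 Pro value, and the function names are illustrative.

```python
def attainable_gflops(ai_flop_per_byte, bandwidth_gb_s, peak_gflops):
    """Roofline bound: min(peak compute, arithmetic intensity x bandwidth)."""
    return min(peak_gflops, ai_flop_per_byte * bandwidth_gb_s)

def ridge_point(peak_gflops, bandwidth_gb_s):
    """Arithmetic intensity (FLOP/byte) above which a kernel becomes compute-bound."""
    return peak_gflops / bandwidth_gb_s

BANDWIDTH_GB_S = 224.0  # STREAM figure quoted in the abstract
MAX_AI = 0.38           # upper bound on arithmetic intensity reported in the paper
PEAK_GFLOPS = 2000.0    # hypothetical placeholder, NOT a measured M4 Pro figure

bound = attainable_gflops(MAX_AI, BANDWIDTH_GB_S, PEAK_GFLOPS)
memory_bound = MAX_AI < ridge_point(PEAK_GFLOPS, BANDWIDTH_GB_S)
print(f"bandwidth-limited ceiling: {bound:.1f} GFLOP/s, memory-bound: {memory_bound}")
```

With these inputs the bandwidth term (0.38 × 224 ≈ 85 GFLOP/s) sets the ceiling, so the kernel is memory-bound for any peak compute above that value.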

DRAM Bandwidth Cliff: Discontinuity in Working-Set Scaling

A central result is a reproducible "DRAM cliff" at the 28→29 qubit transition. For tensordot backends, adding the 29th qubit (which doubles the state vector) multiplies runtime by 4.46× for JAX CPU tensordot and 3.16× for MLX GPU tensordot on GHZ circuits under thermal isolation. This step exceeds both the ideal 2× expected from doubling the state vector and the behavior of direct-index backends (~2× throughout). The cliff depends on the kernel's access pattern rather than on the circuit or the processor: it appears on both CPU and GPU tensordot backends, replicates for QFT circuits, and persists across all thermally isolated runs.
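One way to locate such a discontinuity in timing data is to compute the runtime ratio for each one-qubit step and flag steps well above the ideal 2×. The sketch below is an assumption about how this could be done, not the paper's analysis code; the threshold of 3.0 is arbitrary and the timings are synthetic placeholders, not measurements.

```python
def scaling_ratios(timings):
    """Map each one-qubit transition (n, n+1) to t(n+1) / t(n)."""
    qubits = sorted(timings)
    return {
        (a, b): timings[b] / timings[a]
        for a, b in zip(qubits, qubits[1:])
        if b == a + 1  # skip gaps in the measured qubit counts
    }

def find_cliffs(timings, threshold=3.0):
    """Return (n, n+1, ratio) for transitions whose ratio exceeds the threshold."""
    return [(a, b, r) for (a, b), r in scaling_ratios(timings).items() if r > threshold]

# Synthetic placeholder timings in seconds, NOT measurements from the paper.
example = {26: 1.0, 27: 2.0, 28: 4.0, 29: 18.0, 30: 36.0}
cliffs = find_cliffs(example)
print(cliffs)  # [(28, 29, 4.5)]
```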

The analysis links this cliff to the degradation of prefetching and contiguous-block access optimizations. The working set already exceeds cache before 28 qubits, but contiguous memory access in tensordot backends sustains higher effective throughput until the state vector reaches a critical size (~4.3 GB at 29 qubits for complex64 amplitudes), at which point performance collapses. Direct-index backends, with highly irregular stride patterns, do not benefit from such optimizations and therefore show consistent 2× scaling per additional qubit with no equivalent cliff.
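The state-vector sizes around the cliff follow directly from the amplitude count. A small sketch, assuming 8 bytes per complex64 amplitude and counting only the state vector itself (no temporaries or working copies); the function name is illustrative:

```python
def state_vector_bytes(n_qubits, bytes_per_amplitude=8):
    """Bytes for 2**n amplitudes: 8 bytes each for complex64, 16 for complex128."""
    return (1 << n_qubits) * bytes_per_amplitude

for n in (28, 29, 30):
    print(n, f"{state_vector_bytes(n) / 1e9:.2f} GB")
```

At 29 qubits this gives 2**29 × 8 = 4,294,967,296 bytes, which matches the ~4.3 GB quoted above.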

STREAM Bandwidth as an Insufficient Predictor

Empirical results show that GPU-over-CPU simulation speedup exceeds the ratio predicted by the STREAM bandwidth measurements (1.85× for MLX GPU versus MLX CPU, from 221.9 GB/s and 119.9 GB/s). Observed speedups are 3.1–4.1× for tensordot, 3.5–5.9× for flat-index (with artifactual spikes from mismatched cliff positions), and 6–10× for direct-index. The divergence grows with access irregularity and is largest for backends with highly non-contiguous (strided or scatter-write) memory access. STREAM bandwidth captures peak sequential throughput but not hardware-parallel request issuance or kernel-level scatter/gather optimizations, both of which dominate in irregular simulation patterns. STREAM bandwidth alone is therefore not a reliable proxy for predicting the performance of state-vector quantum workloads.
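The gap between the STREAM prediction and the observed speedups can be worked through with the figures quoted in the abstract. The variable and function names below are illustrative; the numbers are the paper's.

```python
MLX_CPU_GB_S = 119.9  # STREAM bandwidth, MLX CPU (from the abstract)
MLX_GPU_GB_S = 221.9  # STREAM bandwidth, MLX GPU (from the abstract)
predicted = MLX_GPU_GB_S / MLX_CPU_GB_S  # about 1.85

# Observed GPU-over-CPU speedup ranges reported for each algorithm class.
observed = {
    "tensordot": (3.1, 4.1),
    "flat-index": (3.5, 5.9),
    "direct-index": (6.0, 10.0),
}

def excess_over_prediction(speedup_range, predicted_ratio):
    """How many times larger the observed speedup is than the STREAM prediction."""
    low, high = speedup_range
    return low / predicted_ratio, high / predicted_ratio

for name, speedups in observed.items():
    low, high = excess_over_prediction(speedups, predicted)
    print(f"{name}: {low:.1f}x to {high:.1f}x the STREAM-predicted speedup")
```

Even the lowest observed speedup (3.1× for tensordot) exceeds the bandwidth-ratio prediction, and the excess grows from tensordot to direct-index.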

Implications for Simulation Backends and Hardware Selection

These results show that raw bandwidth is not a direct proxy for application-level throughput in non-contiguous access kernels, and that hardware evaluation and simulation-framework design need workload- and access-pattern-specific benchmarking. Notably, benchmarks run only below the cliff boundary cannot reveal the throughput collapse that tensordot-style implementations suffer above it.

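The methodology does not depend on the exact DRAM or interconnect implementation, so it can be applied beyond Apple Silicon to other UMA systems (Qualcomm Snapdragon X Elite, AMD Ryzen AI Max, Intel Lunar Lake, NVIDIA Grace Hopper). Because the study covers a single SoC vendor, the cliff position and speedup ratios on those platforms would still need to be measured.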

Methodological Contributions

The paper highlights the necessity of thermal control in benchmarking: without thermal stabilization, timing artifacts substantially distort cliff magnitude and location, as observed with JAX CPU backends in non-isolated settings. The controlled benchmark harness hence provides a validated protocol for future microarchitectural simulation studies and for cross-platform comparison.

Limitations

The study is restricted to a single SoC vendor, 48 GB of DRAM (a ceiling of 30 qubits for single-precision complex vectors), and complex64 arithmetic; production deployments often use complex128, which doubles the memory footprint of the state vector. CPU backends could benefit further from explicit multi-threading, as seen in HPC simulation frameworks, so the ratios and absolute timings may shift. The qualitative findings (cliff discontinuity, access-pattern dependence, and the inadequacy of STREAM bandwidth for prediction) are expected to hold.

Theoretical and Practical Implications

From a theoretical perspective, the work clarifies that algorithmic complexity and arithmetic intensity are insufficient to predict microarchitectural throughput transitions in state-vector simulation; instead, memory access patterns and hardware-level request parallelism become the principal determinants. For practitioners, these findings direct hardware selection and simulation backend design for quantum algorithm validation, especially as consumer SoCs with UMA proliferate in desktops and laptops.

Most importantly, the study underscores that representative, high-qubit-count benchmarking—rather than extrapolation from low-qubit or synthetic bandwidth metrics—is required for accurate simulation capacity estimation. This is critical as the quantum simulation community approaches the current classical limit for feasible state-vector simulation.

Conclusion

This controlled study rigorously establishes the existence and magnitude of a memory hierarchy-induced throughput discontinuity ("DRAM cliff") in state-vector quantum simulation on physically unified DRAM, elucidates its dependence on kernel memory access pattern rather than circuit structure or algorithmic complexity, and demonstrates that peak STREAM bandwidth is a systematically insufficient proxy for real workload-level speedup, especially for kernels with irregular memory accesses. The results have significant methodological and practical implications for quantum simulation benchmarking, backend selection, and the analysis of heterogeneous UMA platforms (2605.08792).
