Component-level breakdown of cudaMemcpyAsync per-call overhead

Determine a detailed breakdown of the approximately 5–7 microseconds per-call latency overhead observed for cudaMemcpyAsync when copying many small data chunks, identifying the contributions of user-space runtime, kernel driver, and hardware interactions in the NVIDIA CUDA stack.

Background

To isolate latency-sensitive collective communications, DualPath routes all data transfers through the compute NIC using GPUDirect RDMA and compares this approach against CUDA’s copy engine. The authors observe that cudaMemcpyAsync exhibits a per-call overhead of roughly 5–7 microseconds for small chunks, while RDMA work requests have much lower submission overhead.

They explicitly state they were unable to further break down the source of this cudaMemcpyAsync overhead due to the closed-source nature of the CUDA driver, leaving the detailed attribution of this overhead unresolved.
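The submission overhead in question can be isolated with a host-side timing loop that enqueues copies without synchronizing, so that only the user-space runtime and kernel-driver submission path is measured, not the transfer itself. The sketch below illustrates this measurement methodology; it is not taken from the paper, and the chunk size, iteration counts, and buffer names are illustrative assumptions. It requires a CUDA-capable GPU and compiles with `nvcc`.

```cuda
// Micro-benchmark sketch: host-side submission cost of cudaMemcpyAsync.
// Assumption: chunk size and iteration counts are illustrative, not from the paper.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    const size_t chunk = 4096;  // hypothetical "small chunk" size
    const int iters = 1000;

    void *host_buf, *dev_buf;
    cudaMallocHost(&host_buf, chunk);  // pinned memory, so the copy is truly async
    cudaMalloc(&dev_buf, chunk);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Warm up: the first calls pay one-time context/driver initialization costs.
    for (int i = 0; i < 16; ++i)
        cudaMemcpyAsync(dev_buf, host_buf, chunk, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    // Time only the enqueue path: no synchronization inside the loop, so the
    // measured latency is the per-call submission overhead (runtime + driver),
    // not the DMA transfer time.
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i)
        cudaMemcpyAsync(dev_buf, host_buf, chunk, cudaMemcpyHostToDevice, stream);
    auto t1 = std::chrono::steady_clock::now();
    cudaStreamSynchronize(stream);

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    printf("avg cudaMemcpyAsync submission latency: %.2f us/call\n", us);

    cudaStreamDestroy(stream);
    cudaFree(dev_buf);
    cudaFreeHost(host_buf);
    return 0;
}
```

One caveat for interpreting the result: if the stream's internal queue fills up, later enqueue calls can block until the copy engine drains work, inflating the apparent per-call cost; keeping `iters` modest or interleaving occasional synchronization bounds that effect. This measurement attributes the aggregate overhead to the submission path but, as the authors note, cannot by itself apportion it among the closed-source runtime, the kernel driver, and hardware doorbell interactions.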

References

Our measurements show that submitting a single copy operation via cudaMemcpyAsync incurs a latency overhead of approximately 5–7 µs. We failed to further break down this overhead due to the closed-source nature of CUDA driver.

DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference (2602.21548 - Wu et al., 25 Feb 2026), Section 5.2, CNIC-Assisted KV-Cache Copy