Component-level breakdown of cudaMemcpyAsync per-call overhead
Determine a detailed breakdown of the approximately 5–7 microseconds per-call latency overhead observed for cudaMemcpyAsync when copying many small data chunks, identifying the contributions of user-space runtime, kernel driver, and hardware interactions in the NVIDIA CUDA stack.
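A minimal micro-benchmark sketch of how a per-call submission latency in this range can be observed. The chunk size, iteration counts, and warm-up length are illustrative assumptions, not values from the cited paper; the timer brackets only the CPU time spent inside cudaMemcpyAsync (the submission path), deliberately excluding the asynchronous DMA completion.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    const size_t chunk = 4096;   // small chunk size (arbitrary, illustrative)
    const int iters = 10000;     // enough calls to average out timer noise

    void *h, *d;
    cudaMallocHost(&h, chunk);   // pinned host memory so the copy is truly async
    cudaMalloc(&d, chunk);
    cudaStream_t s;
    cudaStreamCreate(&s);

    // Warm up to exclude one-time context/driver initialization costs.
    for (int i = 0; i < 100; i++)
        cudaMemcpyAsync(d, h, chunk, cudaMemcpyHostToDevice, s);
    cudaStreamSynchronize(s);

    // Time only the submission path: CPU time spent inside cudaMemcpyAsync.
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++)
        cudaMemcpyAsync(d, h, chunk, cudaMemcpyHostToDevice, s);
    auto t1 = std::chrono::steady_clock::now();
    cudaStreamSynchronize(s);

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    printf("avg cudaMemcpyAsync submission latency: %.2f us\n", us);

    cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

Note that this measures only the aggregate user-space cost visible to the caller; attributing portions of it to the runtime library, the kernel driver (ioctl/doorbell path), or hardware queue interaction requires tools such as Nsight Systems or Linux perf, since the driver itself is closed source.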
References
Our measurements show that submitting a single copy operation via cudaMemcpyAsync incurs a latency overhead of approximately 5–7 $\mu s$. We were unable to further break down this overhead due to the closed-source nature of the CUDA driver.
— DualPath: Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference
(2602.21548 - Wu et al., 25 Feb 2026) in Section 5.2, CNIC-Assisted KV-Cache Copy