Modeling queuing delays and resource contention in GPU API remoting performance

Develop a cost model for GPU API remoting that explicitly incorporates queuing delays and resource contention, so that the model can account for, and more accurately predict, the deviations between theoretical/emulation results and real-hardware measurements of AI applications using RDMA or SHM backends.

Background

The paper derives a GPU-centric cost model for API remoting that expresses overhead as a function of network round-trip time and bandwidth, separating synchronous and asynchronous APIs and incorporating optimizations such as outstanding requests and shadow descriptors. The model is validated via emulation and real hardware experiments.
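For context, a minimal sketch of what such a per-call cost model might look like, under the common assumption that a synchronous call pays a fixed dispatch constant ($Start$), the wire transfer of its arguments, one network round trip, and the GPU-side execution time ($Time(api)$), while an asynchronous call with outstanding requests leaves only issue-side work on the critical path. The decomposition, function names, and units below are illustrative and are not the paper's exact eq:cost.

```python
from dataclasses import dataclass


@dataclass
class LinkProfile:
    """Profiled transport characteristics of a remoting backend (RDMA or SHM)."""
    rtt_us: float    # round-trip time in microseconds
    bw_gbps: float   # usable bandwidth in Gbit/s


def transfer_us(payload_bytes: int, link: LinkProfile) -> float:
    """Wire time for the API call's argument/result payload."""
    # bytes -> bits, then Gbit/s -> bits per microsecond
    return payload_bytes * 8 / (link.bw_gbps * 1e3)


def sync_api_cost_us(start_us: float, time_api_us: float,
                     payload_bytes: int, link: LinkProfile) -> float:
    """Synchronous call: the caller blocks, so dispatch, transfer, one round
    trip, and GPU-side execution all sit on the critical path."""
    return start_us + transfer_us(payload_bytes, link) + link.rtt_us + time_api_us


def async_api_cost_us(start_us: float, payload_bytes: int,
                      link: LinkProfile) -> float:
    """Asynchronous call with outstanding requests: the reply and GPU execution
    overlap with later calls, leaving only issue-side work on the critical path."""
    return start_us + transfer_us(payload_bytes, link)
```

Plugging in an RDMA-like profile (e.g., a few microseconds of RTT at ~100 Gbit/s) versus an SHM-like profile makes the backend-dependent gap between synchronous and asynchronous APIs explicit.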

However, when comparing against real hardware, the authors observe deviations that they attribute to factors not captured by the current model, namely queuing delays and resource contention, as well as fluctuations in the profiled constants. They explicitly state that they are unable to model these effects, leaving a gap for more comprehensive modeling that incorporates these system-level dynamics.
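One possible starting point for the proposed extension is to add an explicit queuing-delay term and a contention slowdown to each call's base cost. The sketch below uses a textbook M/M/1 waiting-time approximation and a hypothetical multiplicative contention factor purely for illustration; it is not a technique from the paper.

```python
def mm1_wait_us(arrival_rate_per_us: float, service_us: float) -> float:
    """Mean M/M/1 waiting time W_q = rho / (mu - lambda), with rho = lambda / mu.
    Returns infinity once the queue saturates (rho >= 1)."""
    mu = 1.0 / service_us
    rho = arrival_rate_per_us / mu
    if rho >= 1.0:
        return float("inf")
    return rho / (mu - arrival_rate_per_us)


def contended_api_cost_us(base_cost_us: float,
                          arrival_rate_per_us: float,
                          service_us: float,
                          contention_factor: float = 1.0) -> float:
    """Extend a per-call remoting cost with (a) queuing delay at the remoting
    endpoint and (b) a multiplicative slowdown for shared-resource contention
    (PCIe, NIC, GPU). Both terms are illustrative placeholders."""
    return base_cost_us * contention_factor + mm1_wait_us(
        arrival_rate_per_us, service_us)
```

In practice the arrival rate, service time, and contention factor would themselves have to be profiled or estimated per backend (RDMA vs. SHM), which is where the fluctuation of constants noted by the authors re-enters the model.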

References

Note that the results on real hardware may deviate from our theoretical (and emulation) model. This is due to the fact that we are unable to model the queuing delays and resource contentions, as well as the fact that the profile of several constants (e.g., $Start$ and $Time(api)$ in the cost equation, eq:cost) may have fluctuations.

Characterizing Network Requirements for GPU API Remoting in AI Applications (2401.13354 - Wang et al., 24 Jan 2024) in Section 5.2, When requirements meet real hardware