Attribution of GPU underperformance: accelerator-specific tuning versus kernel granularity

Determine how much of the limited performance improvement from GPU offloading of the annotated SPH kernels on the Grace–Hopper system stems from insufficient accelerator-specific tuning (e.g., warp-size tailoring) or from launching too many small kernels, as opposed to inherent data-transfer and runtime overheads.

Background

The GPU experiments show comparatively low throughput and limited improvements despite the Grace–Hopper architecture’s tight memory coupling. The authors discuss possible reasons, including a lack of accelerator-specific tuning and an excessive number of small kernel launches, and note that a naive mapping of individual small compute kernels onto the GPU may not yield satisfactory performance.

They explicitly state that it is unclear how much of the observed underperformance is attributable to these factors, which indicates the need for a systematic attribution study.
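
One possible way to frame such an attribution is a simple additive cost model that splits a measured run time into a launch-overhead share, a data-transfer share, and a residual that would contain compute time and any tuning inefficiency. The following is a minimal sketch under that assumption, not the authors' method; the class `OffloadRun`, the function `attribute`, all parameter names, and all numbers are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class OffloadRun:
    """One measured GPU configuration (all fields are hypothetical inputs)."""
    num_launches: int         # number of kernel launches in the run
    launch_overhead_s: float  # assumed fixed cost per launch, in seconds
    bytes_moved: float        # bytes transferred between host and device
    bandwidth_bps: float      # assumed effective transfer bandwidth, bytes/s
    total_time_s: float       # measured wall-clock time of the run


def attribute(run: OffloadRun) -> dict:
    """Split total time into launch, transfer, and residual shares."""
    launch = run.num_launches * run.launch_overhead_s
    transfer = run.bytes_moved / run.bandwidth_bps
    # Whatever is left over: compute time plus any tuning inefficiency.
    residual = run.total_time_s - launch - transfer
    return {
        "launch": launch / run.total_time_s,
        "transfer": transfer / run.total_time_s,
        "residual": residual / run.total_time_s,
    }


# Illustrative numbers only; they are not measurements from the paper.
example = OffloadRun(
    num_launches=10_000,
    launch_overhead_s=5e-6,
    bytes_moved=1e9,
    bandwidth_bps=1e11,
    total_time_s=0.1,
)
shares = attribute(example)
```

Under this model, comparing the shares of a run with many small kernels against a run where the same work is fused into fewer launches would indicate how much kernel granularity contributes. Separating tuning effects from the residual would need additional experiments, for example varying the launch configuration while holding everything else fixed.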

References

However, it is not clear from the present setups, to which degree our data suffers from a lack of accelerator-specific tuning, such as the tailoring of warp sizes or simply too many, too small kernels.

— Annotation-guided AoS-to-SoA conversions and GPU offloading with data views in C++ (2502.16517 - Radtke et al., 23 Feb 2025) in Section 6.4 Results — GPU offloading
