Attribution of GPU underperformance: accelerator-specific tuning versus kernel granularity
Determine to what extent the limited performance improvements from GPU offloading of the annotated SPH kernels on the Grace–Hopper system are due to insufficient accelerator-specific tuning (e.g., warp-size tailoring) or to launching too many small kernels, as opposed to inherent data-transfer and runtime overheads.
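One way to frame the attribution is a simple additive cost model, sketched below. It is not the method of the cited paper. It assumes that the time of an offloaded step splits linearly into a fixed cost per kernel launch, a transfer term, and a residual. The names `OffloadRun`, `per_launch_overhead`, and `attribute`, and all parameters, are hypothetical and chosen for illustration only. The residual lumps together useful compute and any loss from missing accelerator-specific tuning, so this model alone cannot separate those two.

```python
from dataclasses import dataclass


@dataclass
class OffloadRun:
    """One hypothetical measurement of an offloaded step (all fields are assumed inputs)."""
    n_launches: int     # number of kernel launches in the step
    bytes_moved: float  # bytes moved between host and device
    total_s: float      # measured wall-clock time in seconds


def per_launch_overhead(a: OffloadRun, b: OffloadRun) -> float:
    """Estimate the fixed cost per launch from two runs that do the same work
    and move the same data, but differ in launch count (e.g. fused vs. unfused)."""
    if a.n_launches == b.n_launches:
        raise ValueError("runs must differ in launch count")
    return (a.total_s - b.total_s) / (a.n_launches - b.n_launches)


def attribute(run: OffloadRun, launch_s: float, bandwidth_bps: float) -> dict:
    """Split a measured time into launch, transfer, and residual parts.
    Assumes the three parts add up linearly and do not overlap in time."""
    launch = run.n_launches * launch_s
    transfer = run.bytes_moved / bandwidth_bps
    residual = run.total_s - launch - transfer  # compute plus untuned-kernel losses
    return {"launch": launch, "transfer": transfer, "residual": residual}
```

Under these assumptions, a large `launch` share would point towards kernel granularity, while a large `transfer` share would point towards data-movement overheads. Telling tuning deficits apart from inherent compute cost would need further experiments, such as varying block sizes while holding the launch count fixed.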
References
However, it is not clear from the present setups, to which degree our data suffers from a lack of accelerator-specific tuning, such as the tailoring of warp sizes or simply too many, too small kernels.
— Annotation-guided AoS-to-SoA conversions and GPU offloading with data views in C++
(2502.16517 - Radtke et al., 23 Feb 2025) in Section 6.4 Results — GPU offloading