Portability of Fortran do concurrent GPU offload beyond NVIDIA

Determine the extent to which the performance and correctness results achieved when offloading Fortran’s do concurrent (DC) loops to NVIDIA GPUs extend to other GPU vendors, specifically Intel and AMD GPUs using their respective compilers and toolchains.

Background

The paper investigates the use of Fortran’s do concurrent (DC) construct for GPU offload as a standard-language alternative to external directives or vendor-specific APIs. Prior work has demonstrated promising results on NVIDIA GPUs across benchmarks and production codes, suggesting that DC can match or surpass the performance achieved with OpenACC or OpenMP target offload in some contexts.
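For readers unfamiliar with the construct, the sketch below shows a DC loop next to the same loop written with an OpenACC directive. The module and routine names (`dc_demo`, `axpy_dc`, `axpy_acc`) are hypothetical and chosen only for illustration. Whether the DC loop actually runs on a GPU depends on the compiler and its flags (for example, NVIDIA's nvfortran offloads DC loops with `-stdpar=gpu`). The corresponding Intel IFX and HPE CCE options are not shown here and should be checked in each vendor's documentation.

```fortran
! Minimal sketch contrasting a do concurrent loop with a directive-based loop.
! Module and routine names are hypothetical, for illustration only.
module dc_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! y = a*x + y over all elements. The do concurrent header asserts that
  ! iterations are independent, so a compiler with DC offload support may
  ! map this loop to a GPU; otherwise it runs on the CPU (possibly serially).
  subroutine axpy_dc(n, a, x, y)
    integer, intent(in) :: n
    real(dp), intent(in) :: a
    real(dp), intent(in) :: x(n)
    real(dp), intent(inout) :: y(n)
    integer :: i
    do concurrent (i = 1:n)
      y(i) = a*x(i) + y(i)
    end do
  end subroutine axpy_dc

  ! The same loop expressed with an OpenACC directive, for comparison.
  ! When OpenACC is not enabled, the !$acc line is treated as a comment.
  subroutine axpy_acc(n, a, x, y)
    integer, intent(in) :: n
    real(dp), intent(in) :: a
    real(dp), intent(in) :: x(n)
    real(dp), intent(inout) :: y(n)
    integer :: i
    !$acc parallel loop
    do i = 1, n
      y(i) = a*x(i) + y(i)
    end do
  end subroutine axpy_acc
end module dc_demo
```

The appeal of the DC form is that it is plain ISO Fortran (DC was introduced in Fortran 2008), so the same source compiles with any conforming compiler; the open question is whether non-NVIDIA compilers turn it into equally efficient GPU code.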

Despite these successes, DC GPU offload support on non-NVIDIA platforms has only recently emerged (Intel's IFX compiler for Intel GPUs and HPE's Cray Compiling Environment (CCE) for certain AMD GPUs), leaving open how well the results observed on NVIDIA GPUs carry over to other vendors. This motivates a systematic evaluation of DC portability and performance across Intel and AMD GPUs, using the HipFT production code as a test case.

References

"While there have been previous promising results on the NVIDIA platform (see the next section), how well they extend to other vendors is an open question."

Portability of Fortran's 'do concurrent' on GPUs (arXiv:2408.07843, Caplan et al., 2024), Section 1: Introduction