- The paper assesses Fortran’s do concurrent portability for GPU offload using the HipFT solar flux model across NVIDIA, Intel, and AMD platforms.
- It demonstrates that unified memory on NVIDIA GPUs can outperform manual memory management, while Intel and AMD require further compiler optimizations.
- The findings underscore the potential of standard Fortran constructs to enable efficient and portable high-performance computing across diverse GPU architectures.
Portability of Fortran's 'do concurrent' on GPUs
The paper "Portability of Fortran's do concurrent' on GPUs" explores the applicability and performance of Fortran's
do concurrent' (DC) construct for GPU offload across multiple vendors, specifically NVIDIA, Intel, and AMD. The motivation stems from the increasing need to use standard language constructs for accelerated computing to avoid dependence on vendor-specific external APIs.
Fortran's DC construct asserts that a loop's iterations can be executed in any order, which lets compilers parallelize such loops for execution on both multi-core CPUs and GPU accelerators. While DC has been successfully demonstrated on NVIDIA GPUs, support on other platforms has matured only recently. This paper investigates the current status of DC portability by employing the High Performance Flux Transport (HipFT) code, a production solar surface flux evolution model.
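As a minimal illustration of the construct (not taken from HipFT; the program and array names below are made up), a data-parallel stencil update written with DC might look like the following. With the NVIDIA HPC SDK, for example, such a loop can be offloaded to a GPU with `nvfortran -stdpar=gpu`, or run on CPU threads with `-stdpar=multicore`; other compilers expose their own offload options.

```fortran
! Minimal sketch of a do concurrent (DC) loop; illustrative only.
program dc_example
  use iso_fortran_env, only: dp => real64
  implicit none
  integer, parameter :: n = 1024
  integer :: i, j
  real(dp), allocatable :: f(:,:), fnew(:,:)

  allocate (f(n,n), fnew(n,n))
  f    = 1.0_dp
  fnew = f

  ! Each (i,j) iteration is independent, so a compiler may execute the
  ! loop in parallel on CPU threads or offload it to a GPU.
  do concurrent (j = 2:n-1, i = 2:n-1)
    fnew(i,j) = 0.25_dp*(f(i-1,j) + f(i+1,j) + f(i,j-1) + f(i,j+1))
  end do

  print *, 'checksum = ', sum(fnew)
end program dc_example
```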
Implementation and Testing
The authors provide an insightful examination of building and executing the HipFT code across various hardware configurations:
- NVIDIA GPUs: Using the NVIDIA HPC SDK, the paper compares manual memory management via OpenMP target data directives with automatic memory management (mem:managed and mem:unified). Managed memory achieves performance similar to manual management, and unified memory yields the best performance on GH200 GPUs (see the sketch after this list).
- Intel GPUs: The Intel HPC compiler, which offloads DC loops through an OpenMP target back end, currently requires manual data management for best performance. Adding OpenMP parallel loop directives to nested DC loops significantly improved performance, showing that some manual intervention is still needed until the compiler's optimizations mature.
- AMD GPUs: Using the Cray Compiler Environment, the paper runs HipFT on AMD GPUs with unified memory (MI250X and MI300A). The code compiled and passed its tests, but performance was not yet competitive, indicating that further compiler development is necessary.
- CPU Comparisons: The paper also benchmarks HipFT on CPUs as a reference point for the GPU results. Relative to these CPU baselines, modern consumer GPUs from NVIDIA and Intel achieve reasonable performance, while server-class GPUs such as NVIDIA's H100 and Intel's data center GPUs perform substantially better.
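To make the NVIDIA memory-management comparison concrete, the sketch below (illustrative only, not HipFT source; all names are hypothetical) shows the manual variant, in which OpenMP target data directives keep the working arrays resident on the GPU across repeated DC loops. In the automatic variants, the two directives are simply removed and the code is compiled in the corresponding memory mode (mem:managed or mem:unified), leaving data migration to the runtime.

```fortran
! Sketch of manual device-data management around DC loops; illustrative only.
program dc_memory_modes
  use iso_fortran_env, only: dp => real64
  implicit none
  integer, parameter :: n = 2048, nsteps = 100
  integer :: i, j, step
  real(dp), allocatable :: f(:,:), fnew(:,:)

  allocate (f(n,n), fnew(n,n))
  f    = 1.0_dp
  fnew = f

  ! Manual management: place both arrays on the device once, so the
  ! time-stepping loop runs without per-step host/device transfers.
  !$omp target enter data map(to: f, fnew)

  do step = 1, nsteps
    do concurrent (j = 2:n-1, i = 2:n-1)
      fnew(i,j) = 0.25_dp*(f(i-1,j) + f(i+1,j) + f(i,j-1) + f(i,j+1))
    end do
    do concurrent (j = 2:n-1, i = 2:n-1)
      f(i,j) = fnew(i,j)
    end do
  end do

  ! Copy the result back to the host and release the device copies.
  !$omp target exit data map(from: f) map(delete: fnew)

  print *, 'checksum = ', sum(f)
end program dc_memory_modes
```

Assuming the NVIDIA HPC SDK, the manual variant might be built with something like `nvfortran -stdpar=gpu -mp=gpu`, so that both the DC loops and the OpenMP data directives target the GPU, while the automatic variants drop the directives and add `-gpu=mem:managed` or `-gpu=mem:unified`. Exact option spellings vary by SDK release, and the Intel and Cray compilers use their own offload flags.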
Numerical Results and Analysis
The results are presented in terms of the timing of different computational components of the HipFT code:
- NVIDIA Platform: The GH200 in unified memory mode outperforms both manually managed and managed memory cases, demonstrating the potential of DC constructs in pure Fortran.
- Intel Platform: With slight code modifications, Intel GPUs provide competitive performance, indicating that DC portability to Intel GPUs is feasible, although compiler improvements are ongoing.
- AMD Platform: While functional correctness on AMD GPUs is achieved, optimizing performance remains a work in progress.
The paper's numerical results reinforce the viability of using DC in Fortran for GPU offload across different vendors, highlighting the increasing portability of standard language parallelism.
Implications and Future Directions
The findings have both practical and theoretical implications. They illustrate that with ongoing compiler developments, the goal of using standard language constructs like DC for high-performance computing across diverse hardware can be realized without significant performance trade-offs. The paper underscores the importance of continuing to improve standard language constructs and their compiler implementations.
Future work should focus on further optimizing DC support for less mature platforms like AMD and enhancing the efficiency of heterogeneous computing environments. Additionally, testing larger and more complex codes, especially those with features like GPU-aware MPI and derived type arrays, will provide a more comprehensive understanding of the limitations and capabilities of DC across platforms.
In conclusion, this research illuminates the path towards more portable and efficient accelerated computing in Fortran, fostering a more unified approach in leveraging diverse GPU architectures for scientific computing.