- The paper assesses Fortran’s do concurrent portability for GPU offload using the HipFT solar flux model across NVIDIA, Intel, and AMD platforms.
- It demonstrates that unified memory on NVIDIA GPUs can outperform manual memory management, while Intel and AMD require further compiler optimizations.
- The findings underscore the potential of standard Fortran constructs to enable efficient and portable high-performance computing across diverse GPU architectures.
Portability of Fortran's 'do concurrent' on GPUs
The paper "Portability of Fortran's do concurrent' on GPUs" explores the applicability and performance of Fortran's
do concurrent' (DC) construct for GPU offload across multiple vendors, specifically NVIDIA, Intel, and AMD. The motivation stems from the increasing need to use standard language constructs for accelerated computing to avoid dependence on vendor-specific external APIs.
Fortran's DC construct asserts that a loop's iterations can be executed in any order, which lets compilers parallelize such loops for execution on both multi-core CPUs and GPU accelerators. While DC has been successfully demonstrated on NVIDIA GPUs, support on other platforms has matured only recently. This paper investigates the current status of DC portability by employing the High Performance Flux Transport (HipFT) code, a production solar surface flux evolution model.
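As a minimal illustration of the construct (not taken from HipFT; the program and array names below are made up), a data-parallel stencil update written with DC might look like the following. With the NVIDIA HPC SDK, for example, such a loop can be offloaded to a GPU with `nvfortran -stdpar=gpu`, or run on CPU threads with `-stdpar=multicore`; other compilers expose their own offload options.

```fortran
! Minimal sketch of a do concurrent (DC) loop; illustrative only.
program dc_example
  use iso_fortran_env, only: dp => real64
  implicit none
  integer, parameter :: n = 1024
  integer :: i, j
  real(dp), allocatable :: f(:,:), fnew(:,:)

  allocate (f(n,n), fnew(n,n))
  f    = 1.0_dp
  fnew = f

  ! Each (i,j) iteration is independent, so a compiler may execute the
  ! loop in parallel on CPU threads or offload it to a GPU.
  do concurrent (j = 2:n-1, i = 2:n-1)
    fnew(i,j) = 0.25_dp*(f(i-1,j) + f(i+1,j) + f(i,j-1) + f(i,j+1))
  end do

  print *, 'checksum = ', sum(fnew)
end program dc_example
```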
Implementation and Testing
The authors provide an insightful examination of building and executing the HipFT code across various hardware configurations:
- NVIDIA GPUs: Using the NVIDIA HPC SDK, the paper compares manual memory management via OpenMP target data directives with automatic memory management (mem:managed and mem:unified). Managed memory achieves performance similar to manual management, and unified memory yields the best performance on GH200 GPUs (see the sketch after this list).
- Intel GPUs: The Intel HPC compiler, which offloads DC loops through an OpenMP target back end, currently requires manual data management for best performance. Adding OpenMP parallel loop directives to nested DC loops significantly improved performance, showing that some manual intervention is still needed until the compiler's optimizations mature.
- AMD GPUs: Using the Cray Compiler Environment, the paper runs HipFT on AMD GPUs with unified memory (MI250X and MI300A). The code compiled and passed its tests, but performance was not yet competitive, indicating that further compiler development is necessary.
- CPU Comparisons: The paper also benchmarks HipFT on CPUs as a reference point for the GPU results. Relative to these CPU baselines, modern consumer GPUs from NVIDIA and Intel achieve reasonable performance, while server-class GPUs such as NVIDIA's H100 and Intel's data center GPUs perform substantially better.
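To make the NVIDIA memory-management comparison concrete, the sketch below (illustrative only, not HipFT source; all names are hypothetical) shows the manual variant, in which OpenMP target data directives keep the working arrays resident on the GPU across repeated DC loops. In the automatic variants, the two directives are simply removed and the code is compiled in the corresponding memory mode (mem:managed or mem:unified), leaving data migration to the runtime.

```fortran
! Sketch of manual device-data management around DC loops; illustrative only.
program dc_memory_modes
  use iso_fortran_env, only: dp => real64
  implicit none
  integer, parameter :: n = 2048, nsteps = 100
  integer :: i, j, step
  real(dp), allocatable :: f(:,:), fnew(:,:)

  allocate (f(n,n), fnew(n,n))
  f    = 1.0_dp
  fnew = f

  ! Manual management: place both arrays on the device once, so the
  ! time-stepping loop runs without per-step host/device transfers.
  !$omp target enter data map(to: f, fnew)

  do step = 1, nsteps
    do concurrent (j = 2:n-1, i = 2:n-1)
      fnew(i,j) = 0.25_dp*(f(i-1,j) + f(i+1,j) + f(i,j-1) + f(i,j+1))
    end do
    do concurrent (j = 2:n-1, i = 2:n-1)
      f(i,j) = fnew(i,j)
    end do
  end do

  ! Copy the result back to the host and release the device copies.
  !$omp target exit data map(from: f) map(delete: fnew)

  print *, 'checksum = ', sum(f)
end program dc_memory_modes
```

Assuming the NVIDIA HPC SDK, the manual variant might be built with something like `nvfortran -stdpar=gpu -mp=gpu`, so that both the DC loops and the OpenMP data directives target the GPU, while the automatic variants drop the directives and add `-gpu=mem:managed` or `-gpu=mem:unified`. Exact option spellings vary by SDK release, and the Intel and Cray compilers use their own offload flags.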
Numerical Results and Analysis
The results are presented in terms of the timing of different computational components of the HipFT code:
- NVIDIA Platform: The GH200 in unified memory mode outperforms both manually managed and managed memory cases, demonstrating the potential of DC constructs in pure Fortran.
- Intel Platform: With slight code modifications, Intel GPUs provide competitive performance, indicating that DC portability to Intel GPUs is feasible, although compiler improvements are ongoing.
- AMD Platform: While functional correctness on AMD GPUs is achieved, optimizing performance remains a work in progress.
The paper's numerical results reinforce the viability of using DC in Fortran for GPU offload across different vendors, highlighting the increasing portability of standard language parallelism.
Implications and Future Directions
The findings have both practical and theoretical implications. They illustrate that with ongoing compiler developments, the goal of using standard language constructs like DC for high-performance computing across diverse hardware can be realized without significant performance trade-offs. The paper underscores the importance of continuing to improve standard language constructs and their compiler implementations.
Future work should focus on further optimizing DC support for less mature platforms like AMD and enhancing the efficiency of heterogeneous computing environments. Additionally, testing larger and more complex codes, especially those with features like GPU-aware MPI and derived type arrays, will provide a more comprehensive understanding of the limitations and capabilities of DC across platforms.
In conclusion, this research illuminates the path towards more portable and efficient accelerated computing in Fortran, fostering a more unified approach in leveraging diverse GPU architectures for scientific computing.