- The paper introduces an automated compiler enhancement that maps Fortran intrinsics to AMD AI Engines via MLIR, enabling transparent acceleration without modifying source code.
- It leverages the open-source Flang compiler and a custom xrt_wrapper dialect to transform and offload linear algebra operations efficiently.
- Performance evaluations show that workloads with repeated intrinsic calls, especially matrix multiplication, achieve performance competitive with or superior to CPU execution.
Seamless Acceleration of Fortran Intrinsics via AMD AI Engines
This paper addresses a pressing challenge in the high-performance computing (HPC) landscape: enhancing computational performance in Fortran applications by transparently harnessing the power of AMD's AI Engines (AIEs). Fortran remains a dominant language in scientific computing due to its performance capabilities and the maturity of its ecosystem. Nonetheless, leveraging cutting-edge hardware, such as the AIEs found in AMD's Ryzen AI CPUs, requires a sophisticated understanding of these architectures, presenting a significant barrier to entry for many practitioners.
The researchers propose an innovative methodology to automatically offload Fortran intrinsic procedures onto the AMD AIEs without necessitating any modifications to the original Fortran code. Utilizing the capabilities of the open-source Flang compiler and the MLIR (Multi-Level Intermediate Representation) ecosystem, the authors have extended the Flang compilation process to map Fortran intrinsics to a linear algebra (linalg) dialect within MLIR. This representation enables the efficient harnessing of the computational capabilities of the AIEs via an MLIR-to-AIE transformation flow.
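As an illustration of this mapping (a sketch, not taken verbatim from the paper; function and argument names are hypothetical), a Fortran call such as `c = matmul(a, b)` on 64x64 single-precision matrices could be represented at the linalg level roughly as:

```mlir
// Hypothetical linalg-level form of the Fortran intrinsic c = matmul(a, b)
// for 64x64 REAL(4) arrays; shapes, names, and memref types are illustrative.
func.func @matmul_64(%a: memref<64x64xf32>, %b: memref<64x64xf32>,
                     %c: memref<64x64xf32>) {
  linalg.matmul ins(%a, %b : memref<64x64xf32>, memref<64x64xf32>)
                outs(%c : memref<64x64xf32>)
  return
}
```

Once the intrinsic is expressed as a structured `linalg` operation, generic MLIR transformation passes (tiling, bufferization, lowering toward the AIE dialects) can be applied without any knowledge of the original Fortran source.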
A primary advancement in this research is the development of the xrt_wrapper MLIR dialect, enabling seamless interaction between the CPU and AIEs via the Xilinx Runtime (XRT). The approach also involves the construction of a library of pre-defined MLIR templates for a variety of linear algebra operations, which are dynamically adapted based on the specific invocation of Fortran intrinsics in user applications.
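To make the CPU-AIE handoff concrete, the host-side calls that an xrt_wrapper-style lowering would ultimately emit might resemble the following C++ sketch against the native XRT API. This is an assumption-laden illustration, not the paper's generated code: the xclbin name, kernel name, and argument layout are hypothetical, and a real deployment needs an AIE device and an installed XRT runtime.

```cpp
// Hypothetical host-side offload sequence using the native XRT C++ API.
// "mm.xclbin" and "matmul_kernel" are illustrative names, not from the paper.
#include <xrt/xrt_bo.h>
#include <xrt/xrt_device.h>
#include <xrt/xrt_kernel.h>

void offload_matmul(const float *a, const float *b, float *c, size_t n) {
  xrt::device device(0);                        // open the accelerator device
  auto uuid = device.load_xclbin("mm.xclbin");  // load the compiled AIE design
  xrt::kernel kernel(device, uuid, "matmul_kernel");

  size_t bytes = n * n * sizeof(float);
  xrt::bo bo_a(device, bytes, kernel.group_id(0));  // device-visible buffers
  xrt::bo bo_b(device, bytes, kernel.group_id(1));
  xrt::bo bo_c(device, bytes, kernel.group_id(2));

  bo_a.write(a);
  bo_b.write(b);
  bo_a.sync(XCL_BO_SYNC_BO_TO_DEVICE);  // copy inputs to the device
  bo_b.sync(XCL_BO_SYNC_BO_TO_DEVICE);

  auto run = kernel(bo_a, bo_b, bo_c);  // launch the kernel and wait
  run.wait();

  bo_c.sync(XCL_BO_SYNC_BO_FROM_DEVICE);  // copy the result back
  bo_c.read(c);
}
```

The buffer synchronization and kernel-launch boilerplate shown here is exactly the kind of repetitive glue code that the xrt_wrapper dialect is designed to generate automatically, so that Fortran users never see it.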
Performance evaluations of the proposed system demonstrate that for workloads with repeated calls to specific Fortran intrinsics, such as reductions and transpositions, the overhead associated with initial execution on the AIE diminishes on subsequent runs, resulting in performance competitive with, or superior to, CPU execution. In particular, the matrix multiplication intrinsic (matmul) was shown to benefit from AIE-specific optimizations and delivered significant performance gains over CPU execution.
From a practical standpoint, this research presents a compelling case for integrating AI engines into routine scientific workflows via automated compiler techniques, thus democratizing access to hardware acceleration for the broader Fortran community. Theoretically, it also positions MLIR as a central tool not only for language-agnostic optimizations but also for specialized hardware acceleration tasks. Moving forward, extending such techniques to include a more comprehensive set of Fortran patterns beyond intrinsics, along with integration into ML frameworks that utilize the linalg dialect, presents exciting opportunities for future research and toolchain development.
Ultimately, this research contributes significantly to the broader objective of improving accessibility and utility of emerging hardware platforms like AMD AIEs in established computational domains, showcasing how transparent acceleration frameworks can be developed within the evolving landscape of compiler technologies.