Developing a BLAS Library for the AMD AI Engine: A Technical Overview
The paper presents aieblas, an open-source library that implements the Basic Linear Algebra Subprograms (BLAS) for the AMD AI Engine (AIE). The AIE is an instance of a spatial architecture, a class of designs intended to overcome the limitations of traditional von Neumann architectures as the industry approaches the physical limits of Dennard scaling and Moore’s law.
Context and Motivation
Spatial architectures consist of large arrays of processing elements (PEs) organized in a grid and interconnected by a high-speed, reconfigurable Network-on-Chip (NoC). They are gaining traction, predominantly for machine learning (ML) workloads. Examples include the AMD/Xilinx ACAP, the SambaNova Reconfigurable Dataflow Architecture, and the Cerebras Wafer Scale Engine. These systems favor a dataflow programming model that reduces control overheads. However, their utility outside the ML domain remains underexplored, largely because of a steep learning curve and a lack of reusable software resources.
Objectives and Design Principles
The aieblas project aims to bridge this gap by providing a reusable, customizable BLAS library for the AMD AIE. Its design principles are:
- Simplification: Lowering barriers for non-experts to utilize spatial architectures.
- Flexibility: Facilitating easy expansion with new functionalities and optimizations.
- Efficiency: Favoring on-chip communications through dataflow approaches to minimize performance overheads associated with off-chip memory access.
Technical Implementation
aieblas uses JSON files as high-level specifications of BLAS routines, so users can customize a design without deep hardware knowledge. The implementation involves:
- Specifying routine characteristics and parameters via JSON.
- Automatic code generation for multiple design components, including AIE kernels, PL kernels, dataflow graphs, and CMake projects.
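The specification-driven flow above can be illustrated with a small sketch. The JSON layout, the field names (`blas_op`, `user_name`, `connections`, and so on), and the output file names are hypothetical and chosen only for illustration; they are not the actual aieblas schema or generator output. The sketch parses a specification, checks that it is internally consistent, and lists the kinds of artifacts a generator might emit.

```python
import json

# Hypothetical specification: every field name here is an illustrative
# assumption, not the real aieblas schema.
SPEC_TEXT = """
{
  "platform": "vck5000",
  "kernels": [
    {"blas_op": "axpy", "user_name": "my_axpy", "type": "float"},
    {"blas_op": "dot",  "user_name": "my_dot",  "type": "float"}
  ],
  "connections": [
    {"from": "my_axpy.out", "to": "my_dot.x"}
  ]
}
"""

SUPPORTED_OPS = {"axpy", "dot", "gemv"}  # assumed subset, for illustration


def load_spec(text):
    """Parse a JSON spec and check that it is internally consistent."""
    spec = json.loads(text)
    names = set()
    for kernel in spec["kernels"]:
        if kernel["blas_op"] not in SUPPORTED_OPS:
            raise ValueError(f"unsupported routine: {kernel['blas_op']}")
        if kernel["user_name"] in names:
            raise ValueError(f"duplicate kernel name: {kernel['user_name']}")
        names.add(kernel["user_name"])
    # Every connection endpoint must refer to a declared kernel.
    for conn in spec.get("connections", []):
        for endpoint in (conn["from"], conn["to"]):
            kernel_name = endpoint.split(".")[0]
            if kernel_name not in names:
                raise ValueError(f"unknown kernel in connection: {endpoint}")
    return spec


def planned_outputs(spec):
    """List the (hypothetical) files a generator would emit for this spec."""
    files = [f"{k['user_name']}.cpp" for k in spec["kernels"]]  # one AIE kernel each
    files += ["pl_kernels.cpp", "graph.h", "CMakeLists.txt"]
    return files


if __name__ == "__main__":
    for path in planned_outputs(load_spec(SPEC_TEXT)):
        print(path)
```

Validating the specification before generating anything keeps errors at the level the user works at (routine names and connections) rather than surfacing them later as compiler diagnostics.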
For on-chip data movement, aieblas uses windows, blocks of data stored in AIE local memory, which it favors over AXI4 streaming interfaces for their data path width. Window-based communication also decouples the communicating AIEs, which is crucial for on-chip efficiency.
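As a rough host-side analogy for window-based processing (not AIE code), the sketch below splits a vector into fixed-size blocks and hands each whole block to a kernel function. The names `WINDOW_SIZE`, `windows`, and `scale_kernel` are assumptions made for this example, and the window size is arbitrary.

```python
from typing import Iterator, List

WINDOW_SIZE = 4  # elements per window; an arbitrary value chosen for illustration


def windows(data: List[float], size: int = WINDOW_SIZE) -> Iterator[List[float]]:
    """Yield consecutive fixed-size blocks of `data` (the last may be shorter)."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def scale_kernel(window: List[float], alpha: float) -> List[float]:
    """Stand-in for a compute kernel: it receives one whole window at a time."""
    return [alpha * value for value in window]


def run(data: List[float], alpha: float) -> List[float]:
    """Process the input window by window and concatenate the results."""
    out: List[float] = []
    for window in windows(data):
        out.extend(scale_kernel(window, alpha))
    return out


if __name__ == "__main__":
    print(run([1.0, 2.0, 3.0, 4.0, 5.0], 2.0))
```

Because the kernel only ever sees a complete block, the producer can fill the next block while the consumer works on the current one; that is the sense in which block-based handoff loosens the coupling between the two sides.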
Performance Insights
The performance of aieblas has been evaluated on the AMD VCK5000 board:
- For single routines such as scaled vector addition (axpy) and matrix-vector multiplication (gemv), variants that generated their input data on chip outperformed variants that relied on PL kernels for data movement, underscoring the cost of off-chip memory accesses.
- Composing multiple routines, such as axpydot, using a dataflow approach yielded performance improvements by minimizing expensive off-chip memory accesses and facilitating pipelined execution.
- Despite these gains, execution times were still up to ten times higher than those of optimized multicore CPU implementations, suggesting the need for further optimization of off-chip memory access and better use of spatial parallelism.
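To make the composition result concrete, the sketch below contrasts two ways of computing axpydot in plain Python. It assumes the definition from the updated BLAS proposal, z = w - alpha * v followed by beta = z . u; the function names are illustrative and this is not aieblas code. In the unfused version the intermediate vector z is fully materialized between the two routines, which stands in for a round trip through off-chip memory. In the fused version each element of z flows directly into the dot product, which mirrors the pipelined dataflow composition.

```python
from typing import Iterable, Iterator, List


def axpy(alpha: float, x: Iterable[float], y: Iterable[float]) -> Iterator[float]:
    """Yield alpha * x[i] + y[i] one element at a time."""
    for xi, yi in zip(x, y):
        yield alpha * xi + yi


def dot(x: Iterable[float], y: Iterable[float]) -> float:
    """Return the inner product of x and y."""
    return sum(xi * yi for xi, yi in zip(x, y))


def axpydot_unfused(alpha: float, w: List[float], v: List[float], u: List[float]) -> float:
    """Run axpy to completion, store z, then run dot on the stored vector."""
    z = list(axpy(-alpha, v, w))  # z = w - alpha * v, fully materialized
    return dot(z, u)


def axpydot_fused(alpha: float, w: List[float], v: List[float], u: List[float]) -> float:
    """Stream each element of z straight into dot; z is never stored."""
    return dot(axpy(-alpha, v, w), u)


if __name__ == "__main__":
    w, v, u = [1.0, 2.0, 3.0], [1.0, 1.0, 1.0], [1.0, 2.0, 3.0]
    print(axpydot_unfused(2.0, w, v, u), axpydot_fused(2.0, w, v, u))
```

Both versions compute the same value; the difference lies only in whether the intermediate result is stored or consumed as it is produced.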
Implications and Future Directions
The development of aieblas is an important step towards making spatial architectures more accessible and useful beyond ML applications. The initial results emphasize the critical role of on-chip communication and pipelined execution in achieving high performance. However, realizing the full potential of the AIE will require:
- Enhanced off-chip memory access optimizations.
- Support for multi-AIE routines to exploit the full spectrum of available spatial parallelism.
- Integration of more complex routines to extend BLAS coverage.
By releasing aieblas as an open-source library, the authors aim to cultivate community-driven development and optimization, potentially catalyzing advancements in the use of spatial architectures across various scientific domains.
Conclusion
The aieblas project represents a meaningful contribution to the software ecosystem of AMD AI Engine-based spatial architectures. While the initial implementation shows promising results, substantial room for optimization and expansion exists. Future developments focusing on memory access, routine parallelism, and community involvement will be pivotal in realizing the broader applicability and effectiveness of spatial architectures.