
Developing a BLAS library for the AMD AI Engine (2410.00825v1)

Published 1 Oct 2024 in cs.DC and cs.ET

Abstract: Spatial (dataflow) computer architectures can mitigate the control and performance overhead of classical von Neumann architectures such as traditional CPUs. Driven by the popularity of Machine Learning (ML) workloads, spatial devices are being marketed as ML inference accelerators. Despite providing a rich software ecosystem for ML practitioners, their adoption in other scientific domains is hindered by the steep learning curve and lack of reusable software, which makes them inaccessible to non-experts. We present our ongoing project AIEBLAS, an open-source, expandable implementation of Basic Linear Algebra Routines (BLAS) for the AMD AI Engine. Numerical routines are designed to be easily reusable, customized, and composed in dataflow programs, leveraging the characteristics of the targeted device without requiring the user to deeply understand the underlying hardware and programming model.

Authors (2)
  1. Tristan Laan (2 papers)
  2. Tiziano De Matteis (13 papers)

Summary

Developing a BLAS Library for the AMD AI Engine: A Technical Overview

The paper presents aieblas, an open-source library implementing the Basic Linear Algebra Subprograms (BLAS) specifically for the AMD AI Engine (AIE). The AIE is an instance of a spatial (dataflow) architecture, a class of designs intended to overcome the limitations of traditional von Neumann architectures as the industry reaches the end of Dennard scaling and the slowing of Moore's law.

Context and Motivation

Spatial architectures, comprising large arrays of processing elements (PEs) organized in a grid and interconnected via a high-speed, reconfigurable Network-On-Chip (NoC), are gaining traction predominantly for ML tasks. Examples include AMD/Xilinx ACAP, Sambanova Reconfigurable Dataflow Architecture, and the Cerebras Wafer Scale Engine. These systems favor a dataflow programming model that effectively reduces control overheads. However, their utility outside the ML domain remains underexplored, largely due to a steep learning curve and lack of reusable software resources.

Objectives and Design Principles

The aieblas project aims to bridge this gap by providing a reusable, customizable BLAS library for the AMD AIE. Its design is centered on three principles:

  1. Simplification: Lowering barriers for non-experts to utilize spatial architectures.
  2. Flexibility: Facilitating easy expansion with new functionalities and optimizations.
  3. Efficiency: Favoring on-chip communications through dataflow approaches to minimize performance overheads associated with off-chip memory access.

Technical Implementation

aieblas leverages JSON files for high-level specifications of BLAS routines, enabling user-specific customization without necessitating deep hardware knowledge. The implementation involves:

  • Specifying routine characteristics and parameters via JSON.
  • Automatic code generation for multiple design components, including AIE kernels, PL kernels, dataflow graphs, and CMake projects.
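To illustrate the specification step, a routine description might look like the following sketch. The field names and structure here are illustrative assumptions, not the library's actual schema; the point is that a user declares the routine, data type, and I/O connections at a high level, and aieblas generates the corresponding AIE kernels, PL kernels, dataflow graph, and build files.

```json
{
  "routines": [
    {
      "name": "axpy",
      "type": "float",
      "vector_size": 4096,
      "arguments": {
        "alpha": "scalar",
        "x": "input_stream",
        "y": "input_stream",
        "out": "output_stream"
      }
    }
  ]
}
```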

aieblas streamlines on-chip data handling using windows, blocks of data held in an AIE's local memory, which permit wider data accesses than the AXI4 streaming interfaces. Windows also decouple communication between AIEs, which is crucial for on-chip efficiency.

Performance Insights

The performance of aieblas has been evaluated on the AMD VCK5000 board:

  • For single routines like vector addition (axpy) and matrix-vector multiplication (gemv), implementations generating data on-chip outperformed those fed by PL (programmable logic) kernels, underscoring the cost of off-chip memory accesses.
  • Composing multiple routines, such as axpydot, using a dataflow approach yielded performance improvements by minimizing expensive off-chip memory accesses and facilitating pipelined execution.
  • Despite these gains, execution times were still up to ten times higher than optimized multicore CPU implementations, suggesting the necessity for further optimizations in off-chip memory access and spatial parallelism.
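For reference, the mathematical semantics of the evaluated routines can be sketched in pure Python. This is not the aieblas implementation, only a plain restatement of what each kernel computes; the axpydot composition shown (z := w − αv followed by β := zᵀu) follows the common definition of that composed kernel, which the dataflow approach can pipeline without writing z back to off-chip memory.

```python
def axpy(alpha, x, y):
    """BLAS Level 1: y := alpha*x + y, element-wise."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def gemv(alpha, A, x, beta, y):
    """BLAS Level 2: y := alpha*A@x + beta*y, row by row."""
    return [alpha * sum(aij * xj for aij, xj in zip(row, x)) + beta * yi
            for row, yi in zip(A, y)]

def axpydot(alpha, v, w, u):
    """Composed kernel: z := w - alpha*v, then beta := z^T u.
    In a dataflow implementation, z streams directly from the
    axpy stage into the dot-product stage, never leaving the chip."""
    z = [wi - alpha * vi for wi, vi in zip(w, v)]
    beta = sum(zi * ui for zi, ui in zip(z, u))
    return z, beta
```

Executed on a CPU this is just arithmetic; the performance argument in the paper is about where the intermediate z lives, not about the operations themselves.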

Implications and Future Directions

The development of aieblas signifies an important step towards making spatial architectures more accessible and useful beyond ML applications. The initial results emphasize the critical role of on-chip communication and pipelined execution in achieving high performance. However, achieving the full potential of AIE will require:

  • Enhanced off-chip memory access optimizations.
  • Support for multi-AIE routines to exploit the full spectrum of available spatial parallelism.
  • Integration of more complex routines to extend BLAS coverage.

By releasing aieblas as an open-source library, the authors aim to cultivate community-driven development and optimization, potentially catalyzing advancements in the use of spatial architectures across various scientific domains.

Conclusion

The aieblas project represents a meaningful contribution to the software ecosystem of AMD AI Engine-based spatial architectures. While the initial implementation shows promising results, substantial room for optimization and expansion exists. Future developments focusing on memory access, routine parallelism, and community involvement will be pivotal in realizing the broader applicability and effectiveness of spatial architectures.
