- The paper presents a novel approach to client-side ML training by leveraging AMD's NPU and bare-metal programming tools.
- The NPU offload achieves a more than 2.8x speedup on GEMM operations compared to CPU-only execution.
- The study employs AMD's XDNA architecture and IRON toolchain to optimize data routing and balance power efficiency with computational throughput.
Unlocking the AMD Neural Processing Unit for Client-side ML Training
This paper details the implementation of client-side ML model training on the AMD Neural Processing Unit (NPU) using bare-metal programming tools. It demonstrates the potential for customizing ML workloads on consumer devices while optimizing for throughput and power efficiency. Central to the investigation is the deployment of an LLM, GPT-2, on the NPU using the bare-metal IRON toolchain, achieving notable improvements in speed and energy efficiency.
Introduction
Increasing demand for localized ML capabilities has driven hardware innovations such as AMD's Ryzen AI processors, which feature dedicated NPUs. These units are designed to run ML workloads efficiently at low power. Unlike general-purpose GPUs, NPUs streamline computation by targeting specific ML operations, building on specialized architectures such as AMD's XDNA.
AMD XDNA Architecture
The AMD NPU's XDNA architecture is a spatial compute fabric composed of AI Engines arranged in a grid, each capable of executing tasks independently.
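As a rough mental model of this layout, the sketch below builds a tile grid with the three roles the architecture exposes: shim tiles that interface with host memory, memory tiles that stage data, and compute tiles that run AI Engine kernels. The column and row counts are illustrative placeholders, not the exact XDNA configuration.

```python
# Conceptual model of an XDNA-style tile grid (hypothetical names, not the IRON API).
from dataclasses import dataclass
from enum import Enum

class TileRole(Enum):
    SHIM = "shim"        # array edge: DMA interface to host/DDR memory
    MEMORY = "memory"    # staging buffers shared within a column
    COMPUTE = "compute"  # AI Engine cores with local memory

@dataclass(frozen=True)
class Tile:
    col: int
    row: int
    role: TileRole

def build_grid(cols: int, compute_rows: int) -> list[Tile]:
    """Lay out a cols-wide array: one shim row, one memory row,
    and `compute_rows` rows of AI Engine compute tiles per column."""
    tiles = []
    for col in range(cols):
        tiles.append(Tile(col, 0, TileRole.SHIM))
        tiles.append(Tile(col, 1, TileRole.MEMORY))
        for r in range(compute_rows):
            tiles.append(Tile(col, 2 + r, TileRole.COMPUTE))
    return tiles

if __name__ == "__main__":
    grid = build_grid(cols=4, compute_rows=4)  # example dimensions only
    compute = [t for t in grid if t.role is TileRole.COMPUTE]
    print(f"{len(grid)} tiles total, {len(compute)} compute tiles")
```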
Figure 1: XDNA architecture showcasing compute, memory, and shim cores interconnected for optimal data routing.
XDNA facilitates AI Engine programmability through bare-metal access, empowering developers to customize implementations beyond what integrated libraries offer.
Methodology
AMD's IRON toolchain grants programmers direct control over the NPU's operations, utilizing MLIR and Python for low-level hardware management.
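As a rough illustration of what programming at this level involves, the runnable sketch below mirrors the structure an IRON design expresses: explicit data movement between tiles and a kernel body bound to a compute core. Every name here (Design, stream, on_core, gemm_kernel) is a hypothetical stand-in for illustration only, not the actual MLIR-AIE/IRON API.

```python
import numpy as np

class Design:
    """Hypothetical container mirroring what an IRON design describes."""
    def __init__(self):
        self.streams, self.cores = [], {}

    def stream(self, src, dst, shape, dtype=np.float32):
        # In real IRON designs this role is played by object FIFOs serviced by
        # the tile DMAs; here we only record the intended connection.
        self.streams.append((src, dst, shape, dtype))

    def on_core(self, tile):
        # Bind a Python function as the kernel running on one compute tile.
        def register(fn):
            self.cores[tile] = fn
            return fn
        return register

design = Design()
design.stream(src=("shim", 0), dst=("compute", 0), shape=(64, 64))  # A, B in
design.stream(src=("compute", 0), dst=("shim", 0), shape=(64, 64))  # C out

@design.on_core(("compute", 0))
def gemm_kernel(a_tile: np.ndarray, b_tile: np.ndarray) -> np.ndarray:
    # Stand-in for the vectorized GEMM micro-kernel an AI Engine would run.
    return a_tile @ b_tile

# Host-side "simulation" of the registered kernel on one tile of data.
a = np.random.rand(64, 64).astype(np.float32)
b = np.random.rand(64, 64).astype(np.float32)
c = design.cores[("compute", 0)](a, b)
```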
Figure 2: IRON tool-flow detailing tool components and intermediate output stages.
By implementing specialized matrix multiplication (GEMM) kernels for the GEMM-heavy layers of GPT-2, the researchers offloaded the most compute-intensive operations onto the NPU, reducing the computational burden on the CPU.
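To illustrate which parts of GPT-2 this targets, the sketch below marks the GEMMs in the forward pass of one simplified transformer block (single head, layer norm omitted) and routes them through a single `npu_gemm` entry point. `npu_gemm` is a hypothetical placeholder that falls back to NumPy so the sketch runs; the paper's actual kernel executes on the NPU.

```python
import numpy as np

def npu_gemm(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Placeholder for the NPU-offloaded GEMM kernel (CPU fallback here)."""
    return a @ b

def gpt2_block_forward(x, w_qkv, w_proj, w_fc1, w_fc2):
    """One simplified transformer block: every heavy op is a GEMM that can be
    offloaded, while cheap element-wise work stays on the CPU."""
    qkv = npu_gemm(x, w_qkv)                     # attention input projection
    q, k, v = np.split(qkv, 3, axis=-1)
    att = npu_gemm(q, k.swapaxes(-1, -2)) / np.sqrt(q.shape[-1])
    att = np.exp(att - att.max(-1, keepdims=True))
    att /= att.sum(-1, keepdims=True)            # softmax stays on the CPU
    ctx = npu_gemm(att, v)
    x = x + npu_gemm(ctx, w_proj)                # attention output projection
    h = npu_gemm(x, w_fc1)                       # MLP up-projection
    h = 0.5 * h * (1 + np.tanh(0.7978845608 * (h + 0.044715 * h**3)))  # GELU on CPU
    return x + npu_gemm(h, w_fc2)                # MLP down-projection

d, T = 64, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((T, d), dtype=np.float32)
w = lambda m, n: rng.standard_normal((m, n), dtype=np.float32) / np.sqrt(m)
y = gpt2_block_forward(x, w(d, 3 * d), w(d, d), w(d, 4 * d), w(4 * d, d))
```

During training, the corresponding backward-pass matrix products are GEMMs of similar shapes, so they benefit from the same offload path.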
Implementation
CPU and NPU Design
A hybrid strategy splits the work: the CPU handles general processing while the NPU executes the intensive matrix multiplications. This division minimizes NPU reconfiguration overhead and improves overall performance.
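A minimal sketch of this dispatch idea, assuming a single preloaded NPU kernel that accepts one fixed tile shape; the class name, the shape check, and the 256-element tile dimensions are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

NPU_TILE_SHAPE = (256, 256, 256)  # (M, K, N) the preloaded NPU kernel accepts

class HybridGemm:
    def __init__(self, npu_kernel=None):
        # npu_kernel would be a handle to the NPU design loaded once up front,
        # so the array is never reconfigured per call.
        self.npu_kernel = npu_kernel

    def __call__(self, a: np.ndarray, b: np.ndarray) -> np.ndarray:
        m, k = a.shape
        k2, n = b.shape
        assert k == k2
        fits_npu = (
            self.npu_kernel is not None
            and m % NPU_TILE_SHAPE[0] == 0
            and k % NPU_TILE_SHAPE[1] == 0
            and n % NPU_TILE_SHAPE[2] == 0
        )
        if fits_npu:
            return self.npu_kernel(a, b)  # large, well-shaped GEMMs -> NPU
        return a @ b                      # small or irregular work stays on CPU

hybrid = HybridGemm(npu_kernel=None)      # no NPU handle -> pure CPU fallback
c = hybrid(np.ones((512, 256), np.float32), np.ones((256, 512), np.float32))
```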
Figure 3: Data movement across memory levels, illustrating the interconnection of AI Engines.
Figure 4: Detailed transformations of matrix data layout to optimize DMA and VMAC operations.
Tiling dimensions were selected to optimize load balance and memory use, ensuring efficient parallel processing across NPU cores.
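The sketch below shows the tiling principle on the host side: the output matrix is decomposed into independent (TM x TN) tiles so work can be spread evenly over compute cores, with the K dimension accumulated per tile. The 64-element tile sizes are placeholders, not the dimensions chosen in the paper.

```python
import numpy as np

TM, TK, TN = 64, 64, 64  # illustrative tile sizes

def tiled_gemm(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    M, K = a.shape
    _, N = b.shape
    assert M % TM == 0 and K % TK == 0 and N % TN == 0
    c = np.zeros((M, N), dtype=a.dtype)
    # Each (i, j) output tile is an independent unit of work that could be
    # assigned to one compute core; the k-loop accumulates partial products.
    for i in range(0, M, TM):
        for j in range(0, N, TN):
            acc = np.zeros((TM, TN), dtype=a.dtype)
            for k in range(0, K, TK):
                acc += a[i:i+TM, k:k+TK] @ b[k:k+TK, j:j+TN]
            c[i:i+TM, j:j+TN] = acc
    return c

# Quick self-check against the reference result
a = np.random.rand(256, 128).astype(np.float32)
b = np.random.rand(128, 192).astype(np.float32)
assert np.allclose(tiled_gemm(a, b), a @ b, atol=1e-3)
```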
Evaluation
The NPU implementation outperformed CPU-only execution on matrix multiplication by more than 2.8×. End-to-end application throughput improved by 1.7× on mains power and 1.2× on battery power, demonstrating gains in both speed and energy efficiency.
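The reported speedups are ratios of averaged wall-clock runtimes. A minimal measurement sketch is shown below, assuming an `npu_gemm`-style offload callable like the earlier placeholder; on a machine without the NPU path only the CPU baseline runs.

```python
import time
import numpy as np

def measure(fn, *args, reps: int = 10) -> float:
    fn(*args)                                  # warm-up (first-use setup cost)
    t0 = time.perf_counter()
    for _ in range(reps):
        fn(*args)
    return (time.perf_counter() - t0) / reps

a = np.random.rand(1024, 1024).astype(np.float32)
b = np.random.rand(1024, 1024).astype(np.float32)

t_cpu = measure(lambda x, y: x @ y, a, b)
print(f"CPU GEMM: {t_cpu * 1e3:.2f} ms")
# t_npu = measure(npu_gemm, a, b)              # with a real NPU-backed kernel
# print(f"speedup: {t_cpu / t_npu:.2f}x")
```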

Figure 5: Comparison of GEMM runtimes based on matrix sizes, illustrating the NPU's efficiency with larger computations.
Figure 6: Runtime breakdown for GEMM operations, highlighting the primary contributions of NPU kernel performance.
Conclusion
This exploration underscores the benefits of programming NPUs directly for client-side ML tasks. Through bare-metal programming, the authors achieved substantial performance gains, paving the way for more sophisticated applications of NPUs in consumer devices. Future work could extend the NPU's role beyond isolated computations to entire ML workflows for further efficiency improvements.
The findings highlight the significance of hardware specialization in advancing ML capabilities at the edge, setting a precedent for future developments in AI computing efficiency and customization.