
PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference (1901.10351v2)

Published 29 Jan 2019 in cs.ET and cs.AR

Abstract: Memristor crossbars are circuits capable of performing analog matrix-vector multiplications, overcoming the fundamental energy efficiency limitations of digital logic. They have been shown to be effective in special-purpose accelerators for a limited set of neural network applications. We present the Programmable Ultra-efficient Memristor-based Accelerator (PUMA) which enhances memristor crossbars with general purpose execution units to enable the acceleration of a wide variety of Machine Learning (ML) inference workloads. PUMA's microarchitecture techniques exposed through a specialized Instruction Set Architecture (ISA) retain the efficiency of in-memory computing and analog circuitry, without compromising programmability. We also present the PUMA compiler which translates high-level code to PUMA ISA. The compiler partitions the computational graph and optimizes instruction scheduling and register allocation to generate code for large and complex workloads to run on thousands of spatial cores. We have developed a detailed architecture simulator that incorporates the functionality, timing, and power models of PUMA's components to evaluate performance and energy consumption. A PUMA accelerator running at 1 GHz can reach area and power efficiency of $577~GOPS/s/mm^2$ and $837~GOPS/s/W$, respectively. Our evaluation of diverse ML applications from image recognition, machine translation, and language modelling (5M-800M synapses) shows that PUMA achieves up to $2,446\times$ energy and $66\times$ latency improvement for inference compared to state-of-the-art GPUs. Compared to an application-specific memristor-based accelerator, PUMA incurs small energy overheads at similar inference latency and added programmability.

Citations (358)

Summary

  • The paper introduces a novel architecture that merges analog in-memory computing with programmable digital units for versatile ML inference.
  • It features a custom ISA and compilation framework that optimizes workload partitioning and instruction scheduling across thousands of cores.
  • Empirical evaluations demonstrate up to 2,446x energy efficiency and 66x latency improvements over GPUs, promising practical benefits for energy-constrained environments.

Overview of PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference

The paper presents PUMA, a Programmable Ultra-efficient Memristor-based Accelerator, designed for machine learning inference. It exploits memristor crossbars' inherent analog matrix-vector multiplication capabilities to overcome energy efficiency limitations associated with digital logic. PUMA enhances these crossbars with general-purpose execution units to create a versatile accelerator capable of handling a wide array of ML inference workloads.
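The analog matrix-vector multiply at the heart of the crossbar is essentially Ohm's and Kirchhoff's laws at work: weights are stored as conductances, input voltages are applied to the rows, and each column current is a dot product. A minimal numerical sketch of this mapping follows; the conductance range, 8-bit quantization, and linear weight-to-conductance encoding are illustrative assumptions, not PUMA's actual device parameters.

```python
import numpy as np

def crossbar_mvm(weights, inputs, bits=8, g_min=1e-6, g_max=1e-4):
    """Idealized memristor-crossbar matrix-vector multiply.

    Weights are linearly mapped to quantized conductances in
    [g_min, g_max] siemens; each output column current is the dot
    product of the input voltages with that column's conductances
    (Kirchhoff's current law). The linear mapping is then inverted
    to recover the logical result, up to quantization error.
    """
    levels = 2 ** bits - 1
    w_min, w_max = weights.min(), weights.max()
    # Quantize weights to the available conductance levels
    q = np.round((weights - w_min) / (w_max - w_min) * levels) / levels
    g = g_min + q * (g_max - g_min)   # conductance matrix (S)
    currents = inputs @ g             # analog column-wise summation
    # Undo the weight-to-conductance mapping
    scale = (w_max - w_min) / (g_max - g_min)
    return (currents - inputs.sum() * g_min) * scale + inputs.sum() * w_min

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
x = rng.standard_normal(4)
out = crossbar_mvm(W, x)  # approximates x @ W up to quantization error
```

The point of the sketch is that the multiply-accumulate happens "for free" in the physics of the array, which is where the energy advantage over digital logic comes from; real designs must additionally handle DAC/ADC conversion, device non-linearity, and noise.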

Key Contributions

  1. Architecture Design: PUMA integrates a specialized Instruction Set Architecture (ISA) to retain the efficiency of in-memory computing and analog circuitry while maintaining programmability. The architecture employs microarchitecture techniques that support different machine learning applications, thereby diversifying beyond the limited scope typical of memristor-based accelerators previously designed for specific neural network applications.
  2. Compilation: A comprehensive compilation framework is essential for PUMA to convert high-level code to its specialized ISA. The compiler effectively partitions computational graphs and optimizes instruction scheduling and register allocation, ensuring that complex workloads can be efficiently managed across thousands of spatial cores.
  3. Simulation and Evaluation: The authors developed a detailed simulator that captures the functionality, timing, and power aspects of PUMA's components. Empirical evaluations highlight PUMA's capability to deliver substantial improvements in energy efficiency (up to 2,446x) and latency (up to 66x) over conventional GPUs. This efficiency does not come at the expense of programmability, unlike application-specific accelerators.
  4. Performance Metrics: With its architecture operating at 1 GHz, PUMA achieves area and power efficiency of 577 GOPS/s/mm² and 837 GOPS/s/W, respectively. These figures demonstrate PUMA's competitiveness against other ML accelerators, such as Google's TPU, in operations per second per unit area and per watt.
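The compiler's graph-partitioning step can be illustrated at its simplest level: a layer's weight matrix must be tiled into crossbar-sized blocks, each tile assigned to a core, and the per-tile partial results summed. The sketch below shows only that tiling-and-reduction skeleton; the tile size and row-major layout are illustrative assumptions, not the PUMA compiler's actual partitioning heuristics.

```python
import numpy as np

def tile_matrix(W, tile=128):
    """Split weight matrix W into tile x tile blocks (zero-padding the
    edges), as a spatial compiler might map a layer onto fixed-size
    crossbars. Returns a list of (row_block, col_block, weights)."""
    rows = -(-W.shape[0] // tile)  # ceiling division
    cols = -(-W.shape[1] // tile)
    padded = np.zeros((rows * tile, cols * tile))
    padded[:W.shape[0], :W.shape[1]] = W
    return [
        (r, c, padded[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile])
        for r in range(rows) for c in range(cols)
    ]

def tiled_mvm(tiles, x, out_dim, tile=128):
    """Recombine per-tile MVM results: tiles sharing a column block
    accumulate into the same output slice (partial-sum reduction),
    as the cores' partial results would be combined on-chip."""
    x_pad = np.zeros((max(r for r, _, _ in tiles) + 1) * tile)
    x_pad[:len(x)] = x
    y = np.zeros((max(c for _, c, _ in tiles) + 1) * tile)
    for r, c, block in tiles:
        y[c * tile:(c + 1) * tile] += x_pad[r * tile:(r + 1) * tile] @ block
    return y[:out_dim]

rng = np.random.default_rng(1)
W = rng.standard_normal((300, 200))
x = rng.standard_normal(300)
y = tiled_mvm(tile_matrix(W), x, W.shape[1])  # equals x @ W
```

Beyond this tiling, the real compiler must also schedule instructions and allocate registers so that thousands of such tiles execute and communicate efficiently, which is the harder part of the problem.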

Implications and Future Prospects

Practical Implications: The architecture's ability to efficiently handle a range of ML tasks while minimizing energy consumption and latency positions it as a potential game-changer for deploying ML inference in energy-constrained environments, such as edge devices and IoT. The significant reduction in data movement costs due to in-memory operations is pertinent for workloads that are traditionally memory-bound.

Theoretical Implications: PUMA demonstrates that hybrid analog-digital designs can outperform purely digital systems for specific computational tasks, challenging the default assumption that general-purpose computation must be fully digital. It encourages rethinking data-centric operations in light of the potential benefits of near-memory and in-memory computing models.

Speculation for AI Development: As AI workloads continue to grow in complexity and scale, architectures like PUMA could inspire further exploration into analog computing's role in AI hardware, particularly in areas requiring high parallelism and efficiency. Additionally, its open-source simulator and compiler could seed new research into hybrid inference accelerators, fostering innovation in energy-efficient AI.

Conclusion: PUMA builds on the promise of memristor technology to deliver a versatile, powerful, and energy-efficient platform for ML inference, highlighting the capabilities of hybrid computing architectures. By coupling analog crossbars with a programmable architecture, PUMA offers efficiencies that may help set a new direction for AI hardware research and argues for further investigation of emerging memory technologies in general-purpose computational settings.
