PIMCOMP: An End-to-End DNN Compiler for Processing-In-Memory Accelerators (2411.09159v1)

Published 14 Nov 2024 in cs.AR

Abstract: Various processing-in-memory (PIM) accelerators based on diverse devices, micro-architectures, and interfaces have been proposed to accelerate deep neural networks (DNNs). How to deploy DNNs onto PIM-based accelerators is key to exploiting PIM's high performance and energy efficiency. The scale of DNN models, the diversity of PIM accelerators, and the complexity of deployment far exceed manual deployment capability. Hence, an automatic deployment methodology is indispensable. In this work, we propose PIMCOMP, an end-to-end DNN compiler tailored for PIM accelerators, achieving efficient deployment of DNN models on PIM hardware. PIMCOMP can adapt to various PIM architectures by using an abstract configurable PIM accelerator template with a set of pseudo-instructions, which is a high-level abstraction of the hardware's fundamental functionalities. Through a generic multi-level optimization framework, PIMCOMP realizes an end-to-end conversion from a high-level DNN description to pseudo-instructions, which can be further converted to specific hardware intrinsics/primitives. The compilation addresses two critical issues in PIM-accelerated inference from a system perspective: resource utilization and dataflow scheduling. PIMCOMP adopts a flexible unfolding format to reshape and partition convolutional layers, uses a weight-layout-guided computation-storage mapping approach to enhance resource utilization, and balances the system's computation, memory access, and communication characteristics. For dataflow scheduling, we design two scheduling algorithms with different inter-layer pipeline granularities to support varying application scenarios while ensuring high computational parallelism. Experiments demonstrate that PIMCOMP improves throughput, latency, and energy efficiency across various architectures. PIMCOMP is open-sourced at https://github.com/sunxt99/PIMCOMP-NN.

Summary

  • The paper introduces PIMCOMP, an end-to-end DNN compiler that streamlines mapping complex neural networks onto varied processing-in-memory hardware.
  • It employs a genetic algorithm to optimize weight replication, and offers dual-mode scheduling to serve both high-throughput and low-latency requirements.
  • Experimental evaluations demonstrate up to 149.5x higher throughput, 21.8x lower latency, and 5.6x energy savings compared to baseline methods.

An Analysis of PIMCOMP: An End-to-End DNN Compiler for Processing-In-Memory Accelerators

The paper presents PIMCOMP, an end-to-end deep neural network (DNN) compiler specifically designed for processing-in-memory (PIM) accelerators. The authors address a pivotal challenge faced by modern computing architectures: the need to efficiently deploy DNNs on PIM-based systems. The diversity of existing PIM architectures and the ever-increasing scale and complexity of DNN models necessitate a sophisticated and automated deployment approach. This paper's contribution lies in introducing and evaluating an infrastructure that handles the efficient deployment of DNNs across diverse PIM hardware platforms.

Overview of PIMCOMP

PIMCOMP introduces a comprehensive framework that bridges high-level DNN descriptions to hardware-specific instructions tailored for PIM architectures. At its core, the work marries system-level optimization with a modular compilation design, divided into three main stages: layer partitioning, layout-computation mapping, and dataflow scheduling.
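
To ground the idea of a hardware-agnostic target, here is a minimal sketch of what such a pseudo-instruction layer could look like. The opcode names and fields are invented for illustration; PIMCOMP defines its own instruction set, which backends lower to device-specific intrinsics/primitives.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

# Invented opcode set standing in for PIMCOMP's pseudo-instructions; a
# backend would lower each entry to its device's intrinsics/primitives.
class Op(Enum):
    MVM  = auto()   # crossbar matrix-vector multiply
    VEC  = auto()   # vector post-processing (activation, pooling, ...)
    MEM  = auto()   # local buffer load/store
    COMM = auto()   # inter-core send/receive

@dataclass
class PseudoInst:
    op: Op
    core: int                     # PIM core that executes the instruction
    args: dict = field(default_factory=dict)

# A fragment of one lowered layer: load inputs, multiply, activate, forward.
program = [
    PseudoInst(Op.MEM,  core=0, args={"dst": "input_buf", "rows": 128}),
    PseudoInst(Op.MVM,  core=0, args={"crossbar": 3}),
    PseudoInst(Op.VEC,  core=0, args={"fn": "relu"}),
    PseudoInst(Op.COMM, core=0, args={"to_core": 1}),
]
for inst in program:
    print(inst.op.name, inst.core, inst.args)
```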

  • Layer Partitioning: PIMCOMP introduces array groups (AGs), the basic units for partitioning convolutional layers. This granularity enables flexible deployment and efficient mapping onto the crossbar arrays typical of PIM architectures (see the partitioning sketch after this list).
  • Layout-Computation Mapping: Leveraging a genetic algorithm, PIMCOMP tackles weight replication by jointly optimizing the distribution of array groups and the allocation of computational tasks. This maximizes hardware resource utilization by finding weight layouts suited to varying hardware scales and scenarios (see the replication-search sketch below).
  • Dataflow Scheduling: Two modes, high-throughput (HT) and low-latency (LL), address different application requirements. HT mode targets throughput-bound scenarios by processing layers sample by sample, while LL mode serves latency-sensitive settings by pipelining at the granularity of convolutional operators (the two granularities are contrasted in the toy timing model below).
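
To make the array-group idea concrete, here is a toy sketch of tiling a convolutional layer's unfolded weight matrix onto fixed-size crossbars. The crossbar geometry, and the reading of an AG as the stack of crossbars covering all weight rows for one slice of output channels, are illustrative assumptions; PIMCOMP's actual bookkeeping lives in the open-source code.

```python
import math

# Toy tiling of a conv layer onto fixed-size crossbars. Assumed reading:
# unfold a KxK conv into a (K*K*C_in) x C_out weight matrix; an array
# group (AG) holds every row of one XBAR_COLS-wide output-channel slice.

XBAR_ROWS, XBAR_COLS = 128, 128   # assumed crossbar geometry

def partition_conv(k, c_in, c_out):
    rows, cols = k * k * c_in, c_out
    n_ags = math.ceil(cols / XBAR_COLS)         # one AG per output-channel slice
    xbars_per_ag = math.ceil(rows / XBAR_ROWS)  # crossbars stacked over the rows
    return n_ags, xbars_per_ag

# Example: a 3x3 conv with 64 input and 128 output channels.
ags, per_ag = partition_conv(k=3, c_in=64, c_out=128)
print(f"{ags} array group(s), {per_ag} crossbars each "
      f"({ags * per_ag} crossbars per replica)")
# -> 1 array group(s), 5 crossbars each (5 crossbars per replica)
```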
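
The replication search can be sketched as a small genetic algorithm: a chromosome is a list of per-layer replica counts, and fitness is the throughput of the bottleneck layer, zeroed when the layout exceeds the crossbar budget. All numbers and operators below are invented for illustration and do not reproduce PIMCOMP's actual formulation.

```python
import random

random.seed(0)
LAYER_OPS   = [90, 40, 120, 60]   # assumed per-layer workload (a.u.)
LAYER_XBARS = [5, 3, 8, 4]        # crossbars used by one replica of each layer
BUDGET      = 60                  # total crossbars available

def fitness(reps):
    used = sum(r * x for r, x in zip(reps, LAYER_XBARS))
    if used > BUDGET:
        return 0.0                # infeasible: over the crossbar budget
    # Throughput is set by the slowest (least-replicated, busiest) layer.
    return 1.0 / max(ops / r for ops, r in zip(LAYER_OPS, reps))

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(reps):
    reps = list(reps)
    i = random.randrange(len(reps))
    reps[i] = max(1, reps[i] + random.choice((-1, 1)))
    return reps

pop = [[1] * len(LAYER_OPS) for _ in range(20)]
for _ in range(200):              # evolve: crossover + mutation + elitism
    kids = [mutate(crossover(*random.sample(pop, 2))) for _ in range(20)]
    pop = sorted(pop + kids, key=fitness, reverse=True)[:20]

print("replicas:", pop[0], "bottleneck throughput:", round(fitness(pop[0]), 4))
```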
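
Finally, a toy timing model contrasting the two granularities: in HT mode each layer consumes a whole sample before the next layer starts, so steady-state throughput is set by the slowest layer once the pipeline fills; in LL mode a layer may begin as soon as the first operator-sized chunk of its input is ready, shrinking single-sample latency. All timings and the chunk-dependence model are invented assumptions.

```python
CHUNKS = 8                        # LL granularity: chunks per layer output
LAYER_TIME = [4.0, 6.0, 5.0]      # per-sample compute time per layer (a.u.)

def ht_latency_and_period():
    # Sample-granularity pipeline: one sample traverses layers sequentially;
    # in steady state a new result appears once per slowest-layer time.
    return sum(LAYER_TIME), max(LAYER_TIME)

def ll_latency():
    # Operator-granularity pipeline: layer i may process chunk c once
    # layer i-1 has produced chunk c (simplified dependence model).
    done_prev = [(c + 1) * LAYER_TIME[0] / CHUNKS for c in range(CHUNKS)]
    for t in LAYER_TIME[1:]:
        step, done, clock = t / CHUNKS, [], 0.0
        for c in range(CHUNKS):
            clock = max(clock, done_prev[c]) + step
            done.append(clock)
        done_prev = done
    return done_prev[-1]

latency, period = ht_latency_and_period()
print(f"HT: {latency:.2f} latency, one sample per {period:.2f} in steady state")
print(f"LL: {ll_latency():.2f} single-sample latency")
# -> HT: 15.00 latency, one per 6.00; LL: ~7.12 single-sample latency
```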

Numerical Results and Evaluation

PIMCOMP demonstrates pronounced improvements across different PIM architectures and networks when benchmarked against existing methods such as SongC and Polyhedral, with substantial gains in throughput, latency, and energy efficiency.

  • Throughput: The compiler achieves significant throughput increases, up to 149.5x over prior methods. The gains stem from fine-grained operator-level scheduling and flexible unfolding formats that optimize both memory access and data reuse.
  • Latency and Energy Efficiency: In the low-latency scenario, PIMCOMP reduces inference latency by up to 21.8x. Energy efficiency benefits from the reduced inference time, balanced use of compute resources, and optimized dataflow, with savings as high as 5.6x compared to baselines.

These results underline PIMCOMP's ability to harness PIM's in-situ computing capability, improving both performance and resource efficiency.

Implications and Future Directions

PIMCOMP's impact is underscored by its alignment with future AI and hardware integration trends. By abstracting the underlying diverse architectures with a unifying compiler strategy, PIMCOMP not only addresses current computational bottlenecks in AI deployment but also lays foundational work for the continued adoption of PIM technologies in various real-world applications. Its adaptable, open-source architecture offers substantial utility for both academic research and industrial applications seeking optimized AI workload deployment.

In the long run, exploring mixed-precision and cross-layer optimizations could further enhance PIMCOMP's efficacy, and supporting broader machine learning workloads beyond DNNs would widen its application scope. Further reducing compilation time while preserving execution quality would also bring practical benefits. As such, PIMCOMP exemplifies the ongoing synergy between AI model development and hardware innovation, paving the way for future advances in computational efficiency.
