
SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems (2201.05072v4)

Published 13 Jan 2022 in cs.AR, cs.DC, and cs.PF

Abstract: Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures. Near-bank PIM architectures place simple cores close to DRAM banks and can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth and low memory access latency, thereby being a good fit to accelerate the widely-used, memory-bound Sparse Matrix Vector Multiplication (SpMV) kernel. This paper provides the first comprehensive analysis of SpMV on a real-world PIM architecture, and presents SparseP, the first SpMV library for real PIM architectures. We make three key contributions. First, we implement a wide variety of software strategies on SpMV for a multithreaded PIM core and characterize the computational limits of a single multithreaded PIM core. Second, we design various load balancing schemes across multiple PIM cores, and two types of data partitioning techniques to execute SpMV on thousands of PIM cores: (1) 1D-partitioned kernels to perform the complete SpMV computation only using PIM cores, and (2) 2D-partitioned kernels to strike a balance between computation and data transfer costs to PIM-enabled memory. Third, we compare SpMV execution on a real-world PIM system with 2528 PIM cores to state-of-the-art CPU and GPU systems to study the performance and energy efficiency of various devices. SparseP software package provides 25 SpMV kernels for real PIM systems supporting the four most widely used compressed matrix formats, and a wide range of data types. Our extensive evaluation provides new insights and recommendations for software designers and hardware architects to efficiently accelerate SpMV on real PIM systems.

Authors (6)
  1. Christina Giannoula (24 papers)
  2. Ivan Fernandez (13 papers)
  3. Juan Gómez-Luna (57 papers)
  4. Nectarios Koziris (18 papers)
  5. Georgios Goumas (14 papers)
  6. Onur Mutlu (279 papers)
Citations (23)

Summary

  • The paper introduces SparseP, a novel library that accelerates sparse matrix-vector multiplication on real PIM systems.
  • It details innovative load balancing and data partitioning methods using 1D and 2D schemes to mitigate data movement bottlenecks.
  • The comparative analysis demonstrates that PIM systems can outperform traditional CPUs and GPUs in efficiency and scalability.

SparseP: Efficiency in Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems

The research paper "SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems" presents a comprehensive analysis of Sparse Matrix Vector Multiplication (SpMV) on state-of-the-art Processing-In-Memory (PIM) architectures, together with SparseP, a specialized software library for executing SpMV on such systems. The work is significant because PIM systems are increasingly recognized for their ability to mitigate the data movement bottleneck inherent in traditional von Neumann architectures, especially for memory-bound computations like SpMV.
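To make concrete why SpMV is memory-bound, the following is a minimal sketch of the kernel over the Compressed Sparse Row (CSR) format, one of the four compressed formats SparseP supports. This is an illustrative helper, not SparseP's implementation (SparseP's kernels run on PIM cores, not on a host CPU):

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.
    row_ptr[i]..row_ptr[i+1] delimit the nonzeros of row i."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Each row touches only its nonzeros; the irregular, data-dependent
        # gathers from x are what make this kernel memory-bound.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y
```

The low arithmetic intensity (two flops per nonzero loaded) is exactly the profile that near-bank PIM cores, sitting next to DRAM banks, are positioned to exploit.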

Key Contributions

The paper makes the following critical contributions to the field of computing and architecture:

  1. Comprehensive SpMV Implementations: Various software strategies are implemented to optimize SpMV on PIM systems. These include different compressed matrix formats, load balancing schemes across both multiple PIM cores and threads within a PIM core, as well as synchronization approaches. This extensive exploration allows for a deep understanding of the computational limits in using a single multithreaded PIM core.
  2. Load Balancing and Data Partitioning: The authors propose diverse load balancing schemes across multiple PIM cores, which include one-dimensional (1D) and two-dimensional (2D) data partitioning techniques. These approaches aim to strike a balance between computation and data transfer costs, providing insights into executing SpMV on thousands of PIM cores efficiently.
  3. Comparative Analysis: The paper conducts a careful performance and energy-efficiency comparison of SpMV execution on a real-world PIM system against conventional CPU and GPU platforms. This comparison is essential for assessing whether PIM systems are viable alternatives for applications that rely on SpMV.
  4. Public Release of SparseP: By making SparseP available to the public, the research supports ongoing and future efforts in exploring PIM systems' applicability to various computational challenges beyond SpMV.
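One of the load-balancing ideas explored for 1D partitioning is assigning contiguous row blocks to PIM cores so that each core receives roughly the same number of nonzeros rather than the same number of rows. A simple greedy sketch of that idea (illustrative only; the function name and structure are assumptions, not SparseP's code):

```python
def partition_rows_by_nnz(row_ptr, n_cores):
    """Split the rows of a CSR matrix into n_cores contiguous blocks with
    roughly equal nonzero counts. Returns n_cores+1 row boundaries."""
    total_nnz = row_ptr[-1]
    target = total_nnz / n_cores
    bounds = [0]
    for c in range(1, n_cores):
        goal = c * target
        i = bounds[-1]
        # Advance to the first row boundary whose cumulative nnz reaches goal.
        while i < len(row_ptr) - 1 and row_ptr[i] < goal:
            i += 1
        bounds.append(i)
    bounds.append(len(row_ptr) - 1)
    return bounds
```

For example, a 4-row matrix with per-row nonzero counts [4, 1, 1, 4] split across 2 cores yields boundaries [0, 2, 4]: each core gets 5 nonzeros, whereas an equal-rows split would give one core 5 and the other 5 here but badly skews for power-law matrices.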

Numerical Results and Findings

In their evaluation, the authors used 26 matrices with diverse sparsity patterns to derive new insights into the efficient acceleration of the SpMV kernel. The evaluation on platforms such as Intel Xeon CPUs, NVIDIA Tesla GPUs, and the real-world UPMEM PIM system provides quantitative evidence of the advantages that PIM systems offer, particularly in achieving higher fractions of maximum theoretical performance.

Key findings include the observation that 1D-partitioned SpMV execution on PIM faces scalability limitations due to the cost of broadcasting the input vector to all PIM cores. Conversely, 2D-partitioned kernels manage data transfer overheads better, but require carefully tuned partitioning strategies. The comparison reveals that the PIM system achieves a significantly higher fraction of the machine's peak performance on the SpMV kernel than traditional CPU and GPU systems, underscoring the capability of PIM systems to mitigate data movement bottlenecks effectively.
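The 1D-versus-2D trade-off can be illustrated with a small sketch. In a 2D scheme the matrix is split into a grid of tiles; each tile consumes only a slice of the input vector (reducing the broadcast traffic that limits 1D kernels), at the cost of partial sums that must be merged across tile columns. This is an illustrative model in plain Python, not SparseP's PIM kernels, and the function name and grid parameters are assumptions:

```python
import math

def spmv_2d_tiles(triplets, x, n_rows, grid_r, grid_c):
    """SpMV over a grid_r x grid_c tiling of a matrix given as (i, j, v) triplets.
    Each tile reads only a 1/grid_c slice of x; partial results per tile row
    must be merged (here, by the 'host' loop at the end)."""
    rb = math.ceil(n_rows / grid_r)   # rows per tile block
    cb = math.ceil(len(x) / grid_c)   # columns per tile block
    # One partial result vector per (tile-row, tile-column) pair.
    partial = [[[0.0] * rb for _ in range(grid_c)] for _ in range(grid_r)]
    for i, j, v in triplets:
        tr, tc = i // rb, j // cb
        partial[tr][tc][i - tr * rb] += v * x[j]  # touches only x-slice tc
    # Host-side merge of partial sums across tile columns.
    y = [0.0] * n_rows
    for tr in range(grid_r):
        for tc in range(grid_c):
            for r, val in enumerate(partial[tr][tc]):
                i = tr * rb + r
                if i < n_rows:
                    y[i] += val
    return y
```

The merge step is the overhead that 2D partitioning trades for reduced input-vector traffic, which is why the paper finds that 2D kernels need fine-tuned grid choices to pay off.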

Implications and Future Work

The implications of this research extend both practically and theoretically. Practically, it implies the potential restructuring of workloads traditionally run on CPUs and GPUs to be executed on PIM systems, particularly those that are memory-bound and involve sparse computations. Theoretically, this paper serves to support future design efforts in computer architecture, encouraging the development of systems that integrate processing closer to memory to enhance performance and energy efficiency.

The research opens pathways for future development of adaptive algorithms that dynamically adjust partitioning and load balancing based on real-time insights into matrix characteristics and PIM architecture capabilities. Additionally, exploring tighter integration between PIM systems and existing computing infrastructure could further propel the adoption of these systems in practice.

In conclusion, the SparseP library and its rigorous evaluation present an essential advance in leveraging Processing-In-Memory architectures for efficient execution of sparse matrix operations, providing a foundation for both future architectural innovations and software optimizations in the domain.
