
Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems (2204.00900v1)

Published 2 Apr 2022 in cs.AR, cs.DC, and cs.PF

Abstract: Several manufacturers have already started to commercialize near-bank Processing-In-Memory (PIM) architectures. Near-bank PIM architectures place simple cores close to DRAM banks and can yield significant performance and energy improvements in parallel applications by alleviating data access costs. Real PIM systems can provide high levels of parallelism, large aggregate memory bandwidth and low memory access latency, thereby being a good fit to accelerate the widely-used, memory-bound Sparse Matrix Vector Multiplication (SpMV) kernel. This paper provides the first comprehensive analysis of SpMV on a real-world PIM architecture, and presents SparseP, the first SpMV library for real PIM architectures. We make two key contributions. First, we design efficient SpMV algorithms to accelerate the SpMV kernel in current and future PIM systems, while covering a wide variety of sparse matrices with diverse sparsity patterns. Second, we provide the first comprehensive analysis of SpMV on a real PIM architecture. Specifically, we conduct our rigorous experimental analysis of SpMV kernels in the UPMEM PIM system, the first publicly-available real-world PIM architecture. Our extensive evaluation provides new insights and recommendations for software designers and hardware architects to efficiently accelerate the SpMV kernel on real PIM systems. For more information about our thorough characterization on the SpMV PIM execution, results, insights and the open-source SparseP software package [26], we refer the reader to the full version of the paper [3, 4]. The SparseP software package is publicly and freely available at https://github.com/CMU-SAFARI/SparseP.

Authors (6)
  1. Christina Giannoula
  2. Ivan Fernandez
  3. Juan Gómez-Luna
  4. Nectarios Koziris
  5. Georgios Goumas
  6. Onur Mutlu

Summary

Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems

This paper, "Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Systems," addresses the execution efficiency of Sparse Matrix Vector Multiplication (SpMV) on real Processing-In-Memory (PIM) systems, focusing on the UPMEM PIM architecture. SpMV is pivotal to a variety of computational tasks in scientific computing, machine learning, and graph analytics, and is characterized by irregular memory access patterns stemming from the compressed formats used to store sparse matrices. The paper contributes novel strategies for optimizing SpMV execution on PIM architectures, which promise to alleviate the data movement bottleneck inherent in traditional processor-centric systems.
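
As a concrete reference point, below is a minimal sequential CSR SpMV kernel in C. This is a generic illustration, not SparseP code; the indirect read x[col_idx[k]] is the irregular, input-dependent access that makes SpMV memory-bound and that near-bank PIM cores aim to serve at low latency.

```c
/* Sparse matrix in Compressed Sparse Row (CSR) format. */
typedef struct {
    int     nrows;    /* number of rows */
    int    *row_ptr;  /* nrows + 1 entries; row i spans [row_ptr[i], row_ptr[i+1]) */
    int    *col_idx;  /* column index of each non-zero */
    double *vals;     /* value of each non-zero */
} csr_t;

/* y = A * x for a matrix A stored in CSR. */
void spmv_csr(const csr_t *A, const double *x, double *y) {
    for (int i = 0; i < A->nrows; i++) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            sum += A->vals[k] * x[A->col_idx[k]];  /* irregular, input-dependent read */
        y[i] = sum;
    }
}
```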

The authors present two significant contributions: first, the design of efficient SpMV algorithms compatible with both existing and prospective PIM systems; and second, a comprehensive performance analysis of SpMV on a real PIM system. This analysis is built around the authors' SparseP library, which comprises 25 SpMV kernels covering popular compressed matrix formats (CSR, COO, BCSR, BCOO) and a range of data types, alongside data partitioning and load-balancing strategies tailored to PIM-enabled memory.
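
A minimal sketch of one such strategy: 1D row partitioning that balances non-zeros, rather than rows, across PIM cores. The function name partition_rows_by_nnz and its signature are hypothetical, illustrating the idea under the CSR layout above rather than SparseP's actual API.

```c
/* Split the rows of a CSR matrix into ncores contiguous ranges so that
 * each range holds roughly nnz / ncores non-zeros. row_split must have
 * ncores + 1 entries; core c then owns rows [row_split[c], row_split[c+1]). */
void partition_rows_by_nnz(const int *row_ptr, int nrows, int ncores,
                           int *row_split) {
    int nnz = row_ptr[nrows];
    int row = 0;
    row_split[0] = 0;
    for (int c = 1; c < ncores; c++) {
        long target = (long)nnz * c / ncores;   /* cumulative-nnz target for cut c */
        while (row < nrows && row_ptr[row + 1] <= target)
            row++;                              /* advance past whole rows */
        row_split[c] = row;
    }
    row_split[ncores] = nrows;
}
```

Balancing on non-zeros rather than row count matters because row lengths in real sparse matrices are highly skewed; an equal-rows split can leave one core with most of the work.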

Key Findings and Recommendations

  1. Load Balancing and Synchronization: A critical factor in optimizing SpMV performance on PIM systems is effective load balancing across PIM cores and threads. An uneven distribution of non-zero elements, and thus of memory accesses, across threads degrades performance. The analysis also finds that fine-grained locking does not improve performance, because concurrent accesses to the same DRAM bank are serialized anyway; a simple lock-free scheme suffices, as sketched after this list.
  2. Data Structure Design: The compressed format of the sparse matrix directly determines how the data can be partitioned, and thus the load balance across PIM cores. The authors advocate adaptive algorithms that accommodate varying sparsity patterns and PIM hardware characteristics, adjusting the trade-off between computation and data transfer costs.
  3. Hardware and System Suggestions: The paper recommends enhancements to PIM hardware for better synchronization support, optimized data transfer operations, and faster communication channels between the host and PIM-enabled memory. These enhancements are needed to address the data transfer bottlenecks that currently limit the gains available from the high parallelism of PIM systems.
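
To make the synchronization point in item 1 concrete, here is a plain-pthreads sketch (not UPMEM's tasklet API) of lock-free parallel SpMV: when each thread owns a disjoint slice of the output vector, no per-element locks are needed at all, and a single join point replaces fine-grained synchronization. The structure and names here are illustrative assumptions, not the paper's code.

```c
#include <pthread.h>

typedef struct {            /* same CSR layout as the earlier sketch */
    int     nrows;
    int    *row_ptr, *col_idx;
    double *vals;
} csr_t;

typedef struct {
    const csr_t  *A;
    const double *x;
    double       *y;
    int row_begin, row_end;   /* disjoint row slice owned by this thread */
} spmv_task_t;

static void *spmv_worker(void *arg) {
    spmv_task_t *t = (spmv_task_t *)arg;
    for (int i = t->row_begin; i < t->row_end; i++) {
        double sum = 0.0;
        for (int k = t->A->row_ptr[i]; k < t->A->row_ptr[i + 1]; k++)
            sum += t->A->vals[k] * t->x[t->A->col_idx[k]];
        t->y[i] = sum;        /* exclusive writer: no lock needed */
    }
    return NULL;
}

/* row_split has nthreads + 1 entries (e.g., from the partitioner above). */
void spmv_parallel(const csr_t *A, const double *x, double *y,
                   const int *row_split, int nthreads) {
    pthread_t   tid[64];      /* assumes nthreads <= 64 for brevity */
    spmv_task_t task[64];
    for (int t = 0; t < nthreads; t++) {
        task[t] = (spmv_task_t){ .A = A, .x = x, .y = y,
                                 .row_begin = row_split[t],
                                 .row_end   = row_split[t + 1] };
        pthread_create(&tid[t], NULL, spmv_worker, &task[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(&tid[t], NULL);  /* single join point, no fine-grained locks */
}
```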

Implications and Future Work

The paper provides substantial insights into optimizing memory-bound computations such as SpMV, with broader implications for other irregular workloads on PIM systems. It suggests concrete hardware enhancements, including improved synchronization schemes and more capable DRAM banks, to fully exploit the parallelism that PIM systems offer.

The findings can inform software developers designing more efficient sparse linear algebra kernels, and hardware architects are encouraged to weigh these insights when developing future memory-centric computing systems. As real PIM systems and their ecosystems mature, the optimization strategies validated in this paper can guide the development of architectures and algorithms aimed at higher energy efficiency and performance scalability.

Conclusion

By integrating novel algorithmic strategies for data distribution with detailed hardware recommendations, the paper not only advances the state of knowledge on SpMV in PIM environments but also sets a foundation for future explorations into memory-centric computational paradigms. The open-source release of the SparseP library further facilitates ongoing research and development in this field, promoting broader adoption and experimentation in real-world PIM contexts.
