PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices (2404.08871v1)

Published 13 Apr 2024 in cs.DC and cs.AR

Abstract: Recent dual in-line memory modules (DIMMs) are starting to support processing-in-memory (PIM) by associating their memory banks with processing elements (PEs), allowing applications to overcome the data movement bottleneck by offloading memory-intensive operations to the PEs. Many highly parallel applications have been shown to benefit from these PIM-enabled DIMMs, but further speedup is often limited by the huge overhead of inter-PE communication. This mainly comes from the slow CPU-mediated inter-PE communication methods, which incur significant performance overheads, making it difficult for PIM-enabled DIMMs to accelerate a wider range of applications. Prior studies have tried to alleviate the communication bottleneck, but they lack enough flexibility and performance to be used for a wide range of applications. In this paper, we present PID-Comm, a fast and flexible collective inter-PE communication framework for commodity PIM-enabled DIMMs. The key idea of PID-Comm is to abstract the PEs as a multi-dimensional hypercube and allow multiple instances of collective inter-PE communication between the PEs belonging to certain dimensions of the hypercube. Leveraging this abstraction, PID-Comm first defines eight collective inter-PE communication patterns that allow applications to easily express their complex communication patterns. Then, PID-Comm provides high-performance implementations of the collective inter-PE communication patterns optimized for the DIMMs. Our evaluation using 16 UPMEM DIMMs and representative parallel algorithms shows that PID-Comm greatly improves the performance by up to 4.20x compared to the existing inter-PE communication implementations. The implementation of PID-Comm is available at https://github.com/AIS-SNU/PID-Comm.

Summary

The paper presents a novel virtual hypercube model that enables user-defined and multi-instance collective communication among processing elements.
It leverages techniques such as PE-assisted reordering and in-register modulation to minimize CPU overhead and reduce host memory access delays.
Evaluations across benchmarks show up to 5.19× speedup in microbenchmarks and a 1.99× geometric mean improvement in real-world applications.

PID-Comm: A Collective Communication Framework for Processing-in-DIMM Devices

The paper presents PID-Comm, a framework that addresses the inefficiency of inter-PE communication in PIM-enabled DIMMs. Recent advances in PIM technology have made processing-in-DIMM devices a compelling option for memory-intensive applications. Their benefit, however, is constrained by the high overhead of inter-PE communication, which has to be mediated by the host CPU.

Core Contributions

The paper argues that commodity PIM-enabled DIMMs are limited in performance because they lack direct inter-PE communication paths and use the host CPU inefficiently for data transfer. To address these issues, PID-Comm introduces a fast and flexible inter-PE collective communication framework built on a multi-dimensional virtual hypercube abstraction. It pairs this communication model with optimized implementations of the collective primitives to reduce the cost of CPU-mediated transfers.

Key Architectural Features

Virtual Hypercube Communication Model:
- User-defined Hypercube: Users define a hypercube configuration where each dimension's length can be set to a power-of-two value, providing flexibility for a variety of applications.
- Cube Slices: This concept abstracts multiple communication groups based on selected dimensions, enabling multiple instances of collective communication.
- Multi-instance Invocation: Users can invoke multiple communication instances in parallel, optimizing data exchange patterns across different dimensions.

Performance Optimization Techniques:
- PE-assisted Reordering: This technique reduces the host CPU's computational burden by allowing PEs to partially reorder data before and after host interaction.
- In-register Modulation: Host data modulation is confined within CPU vector registers, eliminating the need for time-consuming host memory access.
- Cross-domain Modulation: This eliminates domain transfers for non-arithmetic operations such as AlltoAll and AllGather, substantially reducing execution time.
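
The virtual hypercube and cube-slice ideas listed above can be illustrated with a small sketch. This is not PID-Comm's API: the function name `cube_slices`, its parameters, and the linear PE numbering (first dimension varying fastest) are assumptions made only for illustration. The sketch groups PE ids into independent communication groups, one group per cube slice, given the dimensions along which communication happens.

```python
from itertools import product


def cube_slices(shape, comm_dims):
    """Group PE ids of a virtual hypercube into communication groups.

    shape:     length of each hypercube dimension (each a power of two)
    comm_dims: indices of the dimensions along which PEs communicate

    PEs that agree on every non-selected dimension form one group.
    The PE numbering (first dimension varies fastest) is an assumption.
    """
    for n in shape:
        if n < 1 or n & (n - 1):
            raise ValueError("each dimension length must be a power of two")

    fixed_dims = [d for d in range(len(shape)) if d not in comm_dims]
    groups = {}
    for coord in product(*(range(n) for n in shape)):
        # Linearize the coordinate into a PE id.
        pe_id, stride = 0, 1
        for d, c in enumerate(coord):
            pe_id += c * stride
            stride *= shape[d]
        # PEs sharing the same coordinates on the fixed dimensions
        # belong to the same slice.
        key = tuple(coord[d] for d in fixed_dims)
        groups.setdefault(key, []).append(pe_id)
    return [sorted(g) for _, g in sorted(groups.items())]


if __name__ == "__main__":
    # 8 PEs arranged as a 4 x 2 hypercube.
    print(cube_slices((4, 2), [0]))  # two groups of four PEs
    print(cube_slices((4, 2), [1]))  # four groups of two PEs
```

Selecting dimension 0 of a 4 x 2 hypercube yields two groups of four PEs, and selecting dimension 1 yields four groups of two, which is what allows several instances of one collective to run side by side.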

Evaluation and Results

The framework was rigorously evaluated using 16 UPMEM DIMMs across a variety of benchmarks, including DLRM, GNNs, BFS, CC, and MLP. The performance improvements were significant:

- Microbenchmarks: PID-Comm achieved up to 5.19× speedup in AlltoAll and 4.46× in ReduceScatter for collective communication primitives.
- Applications: Real-world applications saw notable speedup, ranging from 1.20× to 3.99×, with a geometric mean improvement of 1.99×, illustrating the practical efficacy of PID-Comm.
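
For readers unfamiliar with the two primitives named in the microbenchmark results, the following reference sketch shows their conventional semantics within a single communication group, as commonly defined in collective communication libraries. It says nothing about how PID-Comm implements them on UPMEM hardware; the function names and the list-of-lists data layout are assumptions chosen for readability.

```python
def all_to_all(buffers):
    """buffers[src][dst] is the chunk that PE `src` sends to PE `dst`.

    Returns received[dst][src]: every PE ends up with one chunk
    from each PE in the group (a transpose of the chunk matrix).
    """
    n = len(buffers)
    return [[buffers[src][dst] for src in range(n)] for dst in range(n)]


def reduce_scatter(buffers):
    """buffers[src][dst] is PE `src`'s contribution to chunk `dst`.

    Returns result[dst]: the element-wise sum of chunk `dst`
    over all PEs, so each PE holds one reduced chunk.
    """
    n = len(buffers)
    return [
        [sum(vals) for vals in zip(*(buffers[src][dst] for src in range(n)))]
        for dst in range(n)
    ]


if __name__ == "__main__":
    # Two PEs, toy values only.
    print(all_to_all([["a0", "a1"], ["b0", "b1"]]))
    print(reduce_scatter([[[1, 2], [3, 4]], [[10, 20], [30, 40]]]))
```

AlltoAll only moves data, while ReduceScatter also performs arithmetic on it. That distinction is why the cross-domain modulation described earlier applies to non-arithmetic primitives such as AlltoAll.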

Detailed Analysis

The paper breaks down execution time and shows the benefit of each optimization as it is added. PE-assisted reordering was particularly effective in balancing the computational load between the host and the PEs, whereas in-register modulation and cross-domain modulation targeted host memory access overhead and domain transfer overhead, respectively.

Implications and Future Directions

The framework shows that a software-only solution can make PIM-enabled DIMMs viable for a wide range of applications without hardware modifications. Future research could integrate hardware accelerators such as Intel's DSA to offload work from the host CPU, or extend PID-Comm's principles to other PIM architectures such as HBM-PIM and AxDIMM.

In conclusion, PID-Comm offers a communication model designed to narrow the performance gap caused by inter-PE communication overhead in PIM-enabled systems. Its flexible, multi-dimensional hypercube model and its three optimization techniques (PE-assisted reordering, in-register modulation, and cross-domain modulation) point toward more efficient and scalable PIM-based memory systems for memory-intensive applications.
