Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture (2105.03814v7)

Published 9 May 2021 in cs.AR, cs.DC, and cs.PF

Abstract: Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this data movement bottleneck requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as processing-in-memory (PIM). Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. This paper provides the first comprehensive analysis of the first publicly-available real-world PIM architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present PrIM, a benchmark suite of 16 workloads from different application domains (e.g., linear algebra, databases, graph processing, neural networks, bioinformatics).

Authors (6)
  1. Juan Gómez-Luna (57 papers)
  2. Izzat El Hajj (17 papers)
  3. Ivan Fernandez (13 papers)
  4. Christina Giannoula (24 papers)
  5. Geraldo F. Oliveira (38 papers)
  6. Onur Mutlu (279 papers)
Citations (74)

Summary

An Expert Analysis of UPMEM's Processing-In-Memory Architecture

The paper provides a comprehensive analysis of the UPMEM processing-in-memory (PIM) architecture, a significant step toward PIM systems built from commercially available hardware. It experimentally characterizes the architecture, presents PrIM, a benchmark suite tailored for PIM, and compares the system's performance and energy efficiency against modern CPUs and GPUs.

Architecture and Evaluation

The UPMEM PIM system integrates processing capabilities directly within memory, placing DRAM Processing Units (DPUs) on DRAM chips fabricated in conventional 2D DRAM technology. By computing where the data resides, the architecture sidesteps the data movement bottleneck between main memory and CPU cores. Each DRAM chip contains multiple DPUs; each DPU has its own instruction memory (IRAM), a working scratchpad (WRAM), and an associated DRAM bank (MRAM), and can run up to 24 concurrently executing hardware threads, called tasklets.
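
To make the tasklet model concrete, the sketch below shows what a minimal DPU kernel might look like when written with UPMEM's C-based SDK. It is an illustrative example rather than code from the paper: the buffer name, element count, and block size are assumptions, while `me()`, `NR_TASKLETS`, `mem_alloc()`, and the `mram_read()`/`mram_write()` transfer functions are the SDK's documented DPU-side primitives.

```c
// Illustrative DPU-side kernel (UPMEM SDK, C): each tasklet increments its
// slice of a vector held in the DPU's MRAM bank, staging data through a
// small WRAM buffer. Symbol names and sizes are assumptions for this sketch.
#include <stdint.h>
#include <defs.h>   // me(); NR_TASKLETS is set at compile time (-DNR_TASKLETS=...)
#include <mram.h>   // __mram_noinit, mram_read(), mram_write()
#include <alloc.h>  // mem_alloc() for WRAM heap allocation

#define ELEMS_PER_DPU 2048          // assumed problem size per DPU
#define BLOCK_ELEMS   64            // 64 x 4 B = 256 B per MRAM<->WRAM transfer

__mram_noinit uint32_t vector[ELEMS_PER_DPU];   // input/output array in MRAM

int main(void) {
    uint32_t tid = me();            // tasklet ID, 0 .. NR_TASKLETS-1

    // Per-tasklet WRAM staging buffer (mem_alloc returns 8-byte-aligned WRAM).
    uint32_t *buf = (uint32_t *) mem_alloc(BLOCK_ELEMS * sizeof(uint32_t));

    // Tasklets process interleaved blocks of the MRAM-resident vector.
    for (uint32_t base = tid * BLOCK_ELEMS; base < ELEMS_PER_DPU;
         base += NR_TASKLETS * BLOCK_ELEMS) {
        mram_read(&vector[base], buf, BLOCK_ELEMS * sizeof(uint32_t));  // MRAM -> WRAM
        for (uint32_t i = 0; i < BLOCK_ELEMS; i++)
            buf[i] += 1;            // simple integer op, native on the DPU pipeline
        mram_write(buf, &vector[base], BLOCK_ELEMS * sizeof(uint32_t)); // WRAM -> MRAM
    }
    return 0;
}
```

The block-wise staging through WRAM reflects a common pattern in this programming model, since DPU arithmetic operates on WRAM and register operands, not directly on MRAM.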

In principle, this architecture can significantly reduce latency and energy consumption for memory-bound workloads. However, the DPUs are in-order cores with limited arithmetic capability, so the architecture is best suited to tasks built from simple arithmetic routines. The paper's microbenchmark analysis bears this out: DPUs can sustain high bandwidth to their local memory banks, yet execution becomes compute-bound even at low operational intensity, because operations beyond simple integer arithmetic and logic (e.g., integer multiplication and division, and floating-point operations) lack native hardware support and must be emulated in software.

Benchmark Suite Analysis

The paper introduces PrIM, a suite of 16 benchmarks spanning diverse domains, from dense and sparse linear algebra to databases, graph processing, neural networks, and bioinformatics. The suite is used to assess how well the UPMEM system handles varied workload characteristics, with particular attention to memory access, communication, and synchronization patterns. The findings indicate that workloads with high memory intensity, simple (ideally integer-only) per-element computation, and little need for communication across DPUs benefit most from the UPMEM architecture.
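
As a rough illustration of how PrIM-style workloads are typically driven from the host CPU, the sketch below uses the UPMEM host library (`dpu.h`) to allocate a set of DPUs, load a DPU binary, copy each DPU's input slice into its MRAM, launch the kernel, and copy results back. The binary path, symbol name, DPU count, and sizes are placeholders assumed for this sketch, not values taken from the paper or the PrIM code.

```c
// Illustrative host-side driver (UPMEM SDK, C): distribute a vector across
// DPUs, run a kernel such as the one sketched earlier, and gather results.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <dpu.h>                     // UPMEM host API: dpu_alloc, dpu_load, dpu_launch, ...

#define NR_DPUS       64             // assumed number of DPUs to allocate
#define ELEMS_PER_DPU 2048           // must match the DPU program's layout

int main(void) {
    struct dpu_set_t set, dpu;
    uint32_t each_dpu;
    uint32_t (*data)[ELEMS_PER_DPU] = calloc(NR_DPUS, sizeof(*data));

    DPU_ASSERT(dpu_alloc(NR_DPUS, NULL, &set));           // allocate a set of DPUs
    DPU_ASSERT(dpu_load(set, "./vector_inc_dpu", NULL));   // load DPU binary (placeholder path)

    // CPU -> DPU: copy each DPU's slice into the MRAM symbol "vector".
    DPU_FOREACH(set, dpu, each_dpu) {
        DPU_ASSERT(dpu_copy_to(dpu, "vector", 0, data[each_dpu], sizeof(data[each_dpu])));
    }

    DPU_ASSERT(dpu_launch(set, DPU_SYNCHRONOUS));          // run all DPUs and wait

    // DPU -> CPU: retrieve results from MRAM.
    DPU_FOREACH(set, dpu, each_dpu) {
        DPU_ASSERT(dpu_copy_from(dpu, "vector", 0, data[each_dpu], sizeof(data[each_dpu])));
    }

    printf("first element after kernel: %u\n", data[0][0]);
    DPU_ASSERT(dpu_free(set));
    free(data);
    return 0;
}
```

These explicit CPU-to-DPU and DPU-to-CPU copies are the main-memory data movement that the benchmarks must amortize, which is why workloads with substantial on-DPU work per transferred byte fare best.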

Performance Comparisons and Future Directions

Comparative analysis against modern CPU and GPU systems indicates that the UPMEM system outperforms the CPU on most of the PrIM benchmarks, with substantial gains in energy efficiency stemming from reduced data movement. Against the GPU, however, its performance advantage is confined to workloads with low computational intensity and minimal inter-DPU communication.

For future PIM architectures, the authors suggest adding direct communication channels between DPUs, providing native hardware support for more complex arithmetic operations, and making better use of the DPU-internal memory hierarchy to support a broader range of applications. Complementary software work, such as further compiler optimizations and tuned libraries for common operations, could build on these hardware improvements.

Conclusion

In essence, the UPMEM PIM system represents meaningful progress toward efficient, scalable, memory-centric processing architectures. While limitations remain, particularly relative to GPUs on computation-heavy tasks, the architecture's promise is evident for memory-bound applications. The findings and the PrIM benchmark suite presented in this evaluation lay a foundation for future developments in PIM technology and may help drive a shift toward memory-centric computing paradigms.
