UPMEM PiM System Overview
- UPMEM PiM System is a processing-in-memory architecture that integrates programmable DPUs within standard DDR4 DRAM to reduce data movement bottlenecks.
- It employs a C-programmable, parallel SPMD model with explicit memory management and DMA operations, targeting memory-bound workloads such as databases and neural network inference.
- Benchmarking shows significant speedups (up to 93× over CPUs) and energy savings in streaming operations, while highlighting challenges in complex arithmetic and inter-DPU communication.
The UPMEM Processing-in-Memory (PiM) System is the first commercially available general-purpose PiM architecture, realized as DRAM modules augmented with lightweight, programmable processing units within each DRAM chip. Unlike prior “processing-using-memory” proposals, which were largely confined to simulation or highly specialized analog/3D-stacked demonstrators, the UPMEM system implements an integrated, C-programmable, parallel architecture at the scale of standard DDR4 DRAM. Its principal purpose is to alleviate the memory bandwidth and data movement bottlenecks that dominate the execution time and energy consumption of modern memory-bound workloads, including databases, neural network inference, and graph analytics (Gómez-Luna et al., 2021).
1. Architectural Principles and System Organization
The UPMEM PiM design tightly couples computation with commodity DRAM by embedding general-purpose “DRAM Processing Units” (DPUs) into each DRAM chip. Each DPU is a 32-bit in-order RISC processor with a deep 14-stage pipeline and fine-grained multithreading across up to 24 hardware threads (“tasklets”). Pipeline latency is best hidden with at least 11 tasklets per DPU.
The memory organization provides each DPU with exclusive ownership of a 64 MB MRAM bank, a 64 KB scratchpad (WRAM), and a 24 KB instruction RAM (IRAM). There are no caches; rather, explicit load/store and DMA-style primitives are used to access the memory hierarchy. The architecture forgoes any direct hardware communication between DPUs—any inter-DPU coordination (e.g., collective reductions) is handled via the host CPU by means of explicit serial or parallel transfers across MRAM banks.
Programming the DPUs follows a single-program-multiple-data (SPMD) paradigm. Kernel code, written in C and compiled with a dedicated LLVM-based toolchain, is offloaded to thousands of DPUs. Explicit SDK-level primitives handle data movement, synchronization, memory allocation, and inter-tasklet communication; a minimal host-side sketch follows the table below.
| Component | Function/Spec | Notes |
|---|---|---|
| DPU core | 32-bit, 14-stage, in-order, C-programmable | 24 tasklets; full throughput at ≥11 |
| MRAM (per DPU) | 64 MB bank (DRAM) | Direct host or DPU access via DMA/manual load-store |
| WRAM (per DPU) | 64 KB scratchpad | Single-cycle; used for fast computation |
| IRAM (per DPU) | 24 KB | Program storage |
| Inter-DPU comm. | None (host-mediated only) | Limits scalability for synchronized workloads |
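As a concrete illustration of the host-side offload flow, here is a minimal sketch against the UPMEM SDK host API (`dpu_alloc`, `dpu_load`, `dpu_copy_to`, `dpu_launch`); the device binary name `vecadd_dpu`, the MRAM symbol `input`, and the buffer size are illustrative assumptions, not from the source:

```c
#include <dpu.h>
#include <stdint.h>

#define ELEMS_PER_DPU 1024  /* illustrative slice size */

int main(void) {
    struct dpu_set_t set, dpu;
    static uint32_t input[ELEMS_PER_DPU];  /* host-side staging buffer */

    /* Allocate all available DPUs and load the device binary onto each. */
    DPU_ASSERT(dpu_alloc(DPU_ALLOCATE_ALL, NULL, &set));
    DPU_ASSERT(dpu_load(set, "vecadd_dpu", NULL));

    /* Copy an input slice into each DPU's MRAM via the "input" symbol
       declared in the device program (serial, per-DPU transfers). */
    DPU_FOREACH(set, dpu) {
        DPU_ASSERT(dpu_copy_to(dpu, "input", 0, input, sizeof(input)));
    }

    /* Launch every DPU and block until all kernels complete. */
    DPU_ASSERT(dpu_launch(set, DPU_SYNCHRONOUS));

    DPU_ASSERT(dpu_free(set));
    return 0;
}
```

For bulk data, the SDK also provides `dpu_prepare_xfer`/`dpu_push_xfer` to batch the per-DPU copies into one parallel transfer, which matters because host-DPU transfers are the system's only communication channel.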
This structural design enables massive internal DRAM bandwidth (measured sustained per-DPU bandwidths near theoretical peaks), but sharply limits per-core compute throughput due to the simple processor microarchitecture and lack of native support for complex arithmetic (multiplication/division/floating-point) (Gómez-Luna et al., 2021, Hyun et al., 2023).
2. Performance Characterization and Microarchitectural Limits
Extensive microbenchmarking reveals that the system is fundamentally compute-bound for almost all practical workloads. Streaming kernels performing 32-bit addition reach their throughput limit of approximately 58 million ops/sec at 350 MHz using as few as 11 tasklets per DPU. More complex operations, such as 32/64-bit multiplication or floating point, are emulated in software and therefore run roughly an order of magnitude slower, since the DPU lacks the corresponding hardware functional units.
The memory system, in contrast, is highly capable:
- WRAM bandwidth: Up to 2.8 GB/s per DPU (measured with STREAM benchmarks).
- MRAM bandwidth: Saturates at ~628 MB/s per DPU for large DMA transfers; latency follows a simple parametric model,

  $$T_{\text{MRAM}}(s) = \alpha + \beta \cdot s \quad \text{cycles},$$

  where $\alpha$ is the fixed setup cost (in cycles), $\beta$ the per-byte transfer cost (in cycles/byte), and $s$ the transfer size in bytes.
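Connecting the model to the observed ceiling (a short derivation added for clarity, not from the source): sustained bandwidth is transfer size over transfer time, so the fixed cost $\alpha$ amortizes away for large transfers:

$$\mathrm{BW}_{\text{MRAM}}(s) = \frac{f \cdot s}{\alpha + \beta \cdot s} \;\xrightarrow{\,s \to \infty\,}\; \frac{f}{\beta}.$$

At $f = 350$ MHz, the measured ~628 MB/s ceiling would imply $\beta \approx 0.56$ cycles/byte, a back-of-the-envelope inference rather than a figure from the source.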
Fundamental formulas for achievable throughput, assuming the pipeline is kept full, are:

- Arithmetic throughput:

  $$\text{Throughput} = \frac{f}{n},$$

  where $f$ is the DPU clock frequency and $n$ is the number of instructions executed per operation.

- WRAM bandwidth:

  $$\mathrm{BW}_{\text{WRAM}} = \frac{b \cdot f}{n},$$

  where $b$ is the number of bytes accessed per loop iteration and $n$ the instructions per iteration.
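As a quick sanity check (a worked example added here, assuming roughly six pipeline instructions per element for a streaming 32-bit add: load, add, store, plus address and loop overhead), the throughput formula reproduces the measured figure:

$$\text{Throughput} \approx \frac{350\ \text{MHz}}{6\ \text{instructions/op}} \approx 58\ \text{MOPS},$$

in line with the ~58 million adds/sec cited above.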
The system thus excels at workloads with low operational intensity, i.e., few compute operations per byte moved, where the ample internal bandwidth keeps the simple DPU pipelines fed and data movement ceases to be the limiter (Gómez-Luna et al., 2021).
3. Suitability, Programming Recommendations, and Benchmarking
UPMEM’s programming environment features explicit control over memory management and synchronization. Key recommendations derived from empirical findings are:
- Launch ≥11 tasklets per DPU to hide pipeline stalls and maintain execution throughput.
- Aggregate MRAM-to-WRAM DMA operations into large blocks where possible, to amortize the fixed per-transfer setup cost (see the kernel sketch after this list).
- Minimize any global inter-DPU communication; algorithms should be “embarrassingly parallel” to avoid scalability bottlenecks arising from host-mediated synchronization.
- Favor workloads with regular data access and simple arithmetic; avoid complex math or frequent tasklet preemption, as these incur significant overhead.
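The sketch below illustrates these recommendations in a DPU-side kernel written against the UPMEM DPU runtime (`me()`, `NR_TASKLETS`, `mem_alloc`, `mram_read`/`mram_write`); element counts, block size, and symbol names are illustrative, and the kernel is assumed to be built with `-DNR_TASKLETS=12` or more to satisfy the ≥11 rule:

```c
#include <alloc.h>   /* mem_alloc: WRAM heap allocator */
#include <defs.h>    /* me(), NR_TASKLETS */
#include <mram.h>    /* mram_read, mram_write, __mram_ptr */
#include <stdint.h>

#define BLOCK_BYTES   2048   /* large DMA blocks amortize the fixed MRAM cost */
#define ELEMS_PER_DPU 65536  /* illustrative per-DPU workload */

/* Buffers resident in this DPU's private 64 MB MRAM bank. */
__mram_noinit uint32_t input[ELEMS_PER_DPU];
__mram_noinit uint32_t output[ELEMS_PER_DPU];

int main(void) {
    /* Every tasklet runs main(); each grabs its own WRAM scratch buffer. */
    uint32_t *buf = (uint32_t *) mem_alloc(BLOCK_BYTES);
    const uint32_t per_blk = BLOCK_BYTES / sizeof(uint32_t);

    /* Tasklets stride across MRAM in block-sized chunks: regular access,
       simple arithmetic, and no communication with other DPUs. */
    for (uint32_t blk = me(); blk < ELEMS_PER_DPU / per_blk; blk += NR_TASKLETS) {
        uint32_t off = blk * per_blk;
        mram_read(&input[off], buf, BLOCK_BYTES);    /* MRAM -> WRAM DMA */
        for (uint32_t i = 0; i < per_blk; i++)
            buf[i] += 1;                             /* streaming 32-bit add */
        mram_write(buf, &output[off], BLOCK_BYTES);  /* WRAM -> MRAM DMA */
    }
    return 0;
}
```

Each 2 KB DMA block amortizes the fixed MRAM access cost (the $\alpha$ term in the latency model above) over many elements, and no tasklet ever touches another DPU's bank, keeping the kernel embarrassingly parallel.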
To systematically evaluate real-world performance, the PrIM benchmark suite was developed, consisting of 16 programs across application domains:
| Domain | Example Workloads |
|---|---|
| Dense/sparse algebra | GEMV, Vector Add, SpMV |
| Databases | Select, Unique |
| Data analytics | Binary Search, Time Series |
| Graph processing | BFS |
| Neural networks | MLP |
| Bioinformatics | Needleman-Wunsch |
| Image processing | Histogram |
| Parallel primitives | Reduction, Scan, Transpose |
Benchmarks are chosen as memory-bound according to the roofline model, exhibiting low arithmetic intensity and poor data reuse, making them archetypal PiM targets (Gómez-Luna et al., 2021).
4. Comparative Analysis: CPU, GPU, and UPMEM PiM System
The UPMEM PiM platform has been systematically compared to modern processors:
- Against CPUs: For 13 out of 16 PrIM benchmarks with simple streaming access and minimal inter-DPU communication, the 2,556‑DPU system achieves an average speedup of 23× and up to 93× relative to a high-end Intel Xeon.
- Against GPUs: For 10 workloads characterized by streaming accesses, the PiM system is on average 2.54× faster than an NVIDIA Titan V.
- Limitations: In workloads dominated by floating-point computation (e.g., SpMV) or requiring frequent global coordination (e.g., BFS, Needleman-Wunsch alignment), UPMEM’s performance falls behind; the absence of direct DPU-to-DPU communication and slow software-emulated arithmetic become the bottlenecks (Gómez-Luna et al., 2021).
Energy measurements show PiM implementations also achieve significant energy reductions—up to 1.64× over CPUs—directly correlated with the reduction in main memory–to–CPU data movement.
5. Insights for Future System and Software Architects
UPMEM’s results yield several insights for architects and developers:
- Compute-Bound Regime: Even minimal arithmetic intensity (as low as 0.25 op/byte) saturates the available DPU compute throughput, requiring future PiM designs to either augment DPU arithmetic capacity or better match compute/memory balance.
- Instruction Set and Hardware Enhancements: Supporting complex operations (multiplication, division, floating point, SIMD) natively could improve performance on a broader range of workloads (Gómez-Luna et al., 2021).
- Scalability and Communication: The lack of direct DPU-to-DPU communication severely restricts scalability for workloads requiring global synchronization or data reduction; hardware-accelerated DPU-to-DPU channels are recommended (a host-mediated reduction is sketched after the table below).
- Software/Algorithm Co-Design: Algorithms should be restructured to exploit distributed data and computation, using partitioning and programming patterns that minimize communication and data movement.
- Memory/Tier Usage: Exploiting both the WRAM for hot/small data and MRAM for capacity is critical. DMA patterns and buffer reuse strategies should be carefully tuned.
| Bottleneck/Insight | Mitigation in UPMEM | Recommendation for Future |
|---|---|---|
| Low compute throughput | ≥11 tasklets/DPU | Add wide/complex ALUs |
| Expensive inter-DPU communication | Avoid; use host mediation | Direct DPU-to-DPU links |
| Limited instruction set | Software emulation | Hardware float/mult/div |
| Bandwidth over-provisioning | Exploit via streaming ops | Rebalance compute/mem |
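To make the host-mediated communication pattern concrete, here is a sketch of a global reduction using the UPMEM host API; the device binary name `reduce_dpu` and the device symbol `partial_sum` are assumptions, not from the source:

```c
#include <dpu.h>
#include <stdint.h>
#include <stdio.h>

/* Gather one 64-bit partial sum from each DPU and combine on the host.
   Assumes the (hypothetical) DPU kernel leaves its result in a variable
   named "partial_sum" visible to the host. */
int main(void) {
    struct dpu_set_t set, dpu;
    uint32_t nr_dpus, each;

    DPU_ASSERT(dpu_alloc(DPU_ALLOCATE_ALL, NULL, &set));
    DPU_ASSERT(dpu_load(set, "reduce_dpu", NULL));
    DPU_ASSERT(dpu_launch(set, DPU_SYNCHRONOUS));

    DPU_ASSERT(dpu_get_nr_dpus(set, &nr_dpus));
    uint64_t partial[nr_dpus];

    /* One parallel transfer fetches all per-DPU results at once. */
    DPU_FOREACH(set, dpu, each) {
        DPU_ASSERT(dpu_prepare_xfer(dpu, &partial[each]));
    }
    DPU_ASSERT(dpu_push_xfer(set, DPU_XFER_FROM_DPU, "partial_sum", 0,
                             sizeof(uint64_t), DPU_XFER_DEFAULT));

    uint64_t total = 0;  /* final reduction happens on the host CPU */
    for (uint32_t i = 0; i < nr_dpus; i++)
        total += partial[i];
    printf("total = %llu\n", (unsigned long long)total);

    DPU_ASSERT(dpu_free(set));
    return 0;
}
```

Because every round trip of this kind crosses the host memory bus, algorithms that need it frequently (e.g., per-iteration reductions) lose much of the data-movement savings that motivate PiM in the first place.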
6. Broader Impact and Applications
By bringing computation to memory, the UPMEM PiM system enables substantial performance improvement in domains historically limited by bandwidth:
- Large-scale neural network inference (e.g., MLP): Achieves up to 259× speedup in batch inference versus Intel Xeon baseline, on workloads sized to the available PiM memory (Carrinho et al., 10 Aug 2025).
- Data analytics and DBMS: With the PIMDAL library, observed speedups for complex TPC-H queries are 3.9× on average relative to CPUs; key operations such as selection, aggregation, sorting, and joins benefit from memory proximity. The main constraints are explicit data transfers and limited communication between PiM units (Frouzakis et al., 2 Apr 2025).
- Homomorphic encryption, graph analytics, and scientific computing similarly benefit where data movement dominates and the arithmetic mix is favorable.
- Algorithmic and software toolchains (e.g., SimplePIM, DaPPA, ATiM) further lower the programming barrier and improve optimization, automating mapping, memory management, and code generation to exploit UPMEM strengths (Chen et al., 2023, Oliveira et al., 2023, Shin et al., 27 Dec 2024).
A plausible implication is that future PiM architectures will evolve toward greater generality, richer arithmetic, and improved inter-core communication, allowing the paradigm to subsume an even wider set of memory-intensive workloads, as suggested by the workload-tailored speedups and the insights from domain benchmark suites (Gómez-Luna et al., 2021, Shin et al., 27 Dec 2024).
7. Limitations and Open Challenges
While UPMEM demonstrates the feasibility and benefits of real-world PiM, intrinsic limitations remain:
- The performance for complex arithmetic (multiplication, division) and floating-point operations is constrained by software emulation.
- Inter-core communication is bottlenecked by the need to shuttle data through the host, making synchronized or collective operations expensive and limiting strong scaling for workloads with high coordination requirements.
- Explicit programming of memory management (WRAM/MRAM transfers) and synchronization is still required, though emerging software frameworks and autotuning compilers are beginning to lessen this burden.
- Architectural trade-offs (e.g., scratchpad vs. cache, pipeline depth, thread scheduling) define the sweet spot for each workload category and motivate ongoing microarchitectural exploration (Hyun et al., 2023).
In sum, the UPMEM PiM system marks a major advance in deployable memory-centric computing, delivering strong performance and energy-efficiency results for a broad set of memory-bound workloads. The architectural insights and benchmarking methodology articulated in the core references provide practical guidance for optimizing both software and hardware for future generations of PiM-enabled systems (Gómez-Luna et al., 2021).