UPMEM System: In-Memory PIM Architecture
- The UPMEM system is a commercially available processing-in-memory (PIM) architecture that integrates thousands of lightweight DRAM Processing Units (DPUs) into DDR4 modules.
- It addresses memory bandwidth bottlenecks by colocating computation within DRAM to accelerate workloads like machine learning and database analytics.
- The system enables scalable parallelism, improved energy efficiency, and flexible programming via frameworks such as ATiM, DaPPA, and SimplePIM.
The UPMEM system is the first commercially available general-purpose processing-in-memory (PIM) architecture, integrating thousands of lightweight processor cores (DRAM Processing Units, DPUs) directly within commodity DDR4 DRAM modules. The architecture is designed to address the memory bandwidth bottlenecks that hamper traditional processor-centric systems, especially for memory-bound workloads such as machine learning, database analytics, and large-scale graph processing. By colocating computation with data storage inside DRAM, UPMEM enables high aggregate memory bandwidth, reduced energy consumption for data movement, and scalable parallelism on workloads characterized by low arithmetic intensity and high data transfer volumes.
1. Hardware Architecture and Memory Hierarchy
UPMEM’s fundamental unit is a PIM-enabled DDR4 DIMM, which can be plugged into standard memory slots alongside or in place of conventional DRAM modules. Each DIMM contains multiple DRAM-PIM chips (typically 8 or 16), and every chip embeds eight independent DPU cores. At full configuration, a server can host up to 20 PIM DIMMs, yielding as many as 2,560 DPUs and 160 GB of natively addressable DRAM for in-memory compute (Gómez-Luna et al., 2021, Shin et al., 2024, Oliveira et al., 2023, Carrinho et al., 10 Aug 2025).
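The totals quoted above follow from simple multiplication. The sketch below is a sanity check only: it assumes the upper end of the "8 or 16" chips-per-DIMM range and the 64 MB of MRAM per DPU listed in this section, and the function name is illustrative.

```python
# Figures taken from the surrounding text; 16 chips per DIMM is the upper
# end of the quoted "8 or 16" range (an assumption for the full configuration).
DIMMS_PER_SERVER = 20
CHIPS_PER_DIMM = 16
DPUS_PER_CHIP = 8
MRAM_PER_DPU_MB = 64


def system_totals(dimms=DIMMS_PER_SERVER, chips=CHIPS_PER_DIMM,
                  dpus_per_chip=DPUS_PER_CHIP, mram_mb=MRAM_PER_DPU_MB):
    """Return (total DPUs, total MRAM in GB) for a given configuration."""
    dpus = dimms * chips * dpus_per_chip
    mram_gb = dpus * mram_mb / 1024
    return dpus, mram_gb


total_dpus, total_mram_gb = system_totals()
print(total_dpus, total_mram_gb)  # 2560 160.0
```

With 8 chips per DIMM instead, the same function gives 1,280 DPUs and 80 GB.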
Each DPU is a 32-bit in-order RISC processor, clocked between 350 and 450 MHz on recent hardware revisions, with a 14-stage fine-grained multithreaded pipeline and up to 24 hardware threads (tasklets) sharing a modest register file. DPUs are tightly coupled with:
- WRAM (Working SRAM): 64 KB per DPU, acting as a software-managed scratchpad for high-speed, low-latency computation.
- IRAM (Instruction RAM): 24 KB per DPU, stores instructions and constants.
- MRAM (PIM-DRAM): 64 MB per DPU, main DRAM bank directly accessible by both the host and the DPU via explicit DMA moves.
The memory hierarchy is flat within each DPU: MRAM-WRAM transfers are orchestrated by a lightweight DMA engine whose per-transfer latency, measured in cycles, combines a fixed setup cost with a size-dependent term. Practical per-DPU bandwidths typically fall near 600–700 MB/s, and parallel DMA across all DPUs achieves system-wide bandwidths in excess of 1 TB/s (Frouzakis et al., 2 Apr 2025, Gómez-Luna et al., 2021).
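As a rough consistency check on the aggregate figure, the per-DPU bandwidth can be multiplied by the DPU count. This is a back-of-the-envelope sketch that assumes every DPU sustains its local MRAM-WRAM bandwidth simultaneously and independently; the function name is illustrative.

```python
def aggregate_bandwidth_tbps(num_dpus, per_dpu_mbps):
    """Aggregate MRAM-WRAM bandwidth in TB/s, assuming perfect scaling across DPUs."""
    return num_dpus * per_dpu_mbps / 1e6  # MB/s -> TB/s


# 2,560 DPUs at the 600-700 MB/s per-DPU range quoted in the text.
low = aggregate_bandwidth_tbps(2560, 600)
high = aggregate_bandwidth_tbps(2560, 700)
print(f"{low:.2f} to {high:.2f} TB/s")
```

Both ends of the range land above 1 TB/s, which is consistent with the system-wide figure cited above.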
There is no direct hardware interconnect among DPUs; any inter-DPU exchange requires round-tripping data through the host, which orchestrates data distribution, kernel launches, and synchronizations.
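The cost of host-mediated exchange can be illustrated with a small simulation of an allreduce. This is plain Python with lists standing in for per-DPU buffers; it does not use the UPMEM SDK, and the function name and traffic counter are illustrative assumptions.

```python
def host_mediated_allreduce(dpu_partials):
    """Sum per-DPU partial vectors on the host and hand every DPU the result.

    dpu_partials: list of equal-length lists, one per simulated DPU.
    Returns (per_dpu_results, values_moved), where values_moved counts the
    elements that crossed the host<->DPU boundary in both directions.
    """
    if not dpu_partials:
        return [], 0
    width = len(dpu_partials[0])
    # Gather: each DPU's partial vector is copied to the host.
    gathered = [list(p) for p in dpu_partials]
    # Reduce: element-wise sum computed by the host CPU.
    total = [sum(column) for column in zip(*gathered)]
    # Broadcast: the reduced vector is copied back to every DPU.
    results = [list(total) for _ in dpu_partials]
    values_moved = 2 * len(dpu_partials) * width
    return results, values_moved


results, moved = host_mediated_allreduce([[1, 2], [3, 4], [5, 6]])
print(results[0], moved)  # [9, 12] 12
```

The traffic grows linearly with the number of DPUs, which is why collectives of this kind are a recurring scalability concern.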
2. Programming Model and Software Stack
The UPMEM programming model is SPMD (single-program multiple-data), operating similarly to GPU or distributed memory programming but at DRAM-native granularity. Host applications are written in C/C++ and use the UPMEM SDK to manage:
- Allocation and initialization of DPU groups (via `dpu_alloc`, `dpu_load`).
- Explicit data movement into each DPU's MRAM (`dpu_copy_to`, `dpu_push_xfer`).
- Launch of parallel DPU kernels and synchronization (`dpu_launch`, `dpu_sync`).
- Result collection from DPUs to host DRAM (`dpu_copy_from`).
Each DPU runs an SPMD C kernel compiled with Clang/LLVM extensions. Kernels can spawn up to 24 tasklets, which cooperate using the shared WRAM and synchronize via lightweight barriers and mutexes. Data must be explicitly DMA’d between MRAM and WRAM; there are no hardware-managed caches or automatic streaming mechanisms (Carrinho et al., 10 Aug 2025, Oliveira et al., 2023, Shin et al., 2024).
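The tasklet pattern described above (stage a block from MRAM into WRAM, compute on it, combine per-tasklet results after a barrier) can be modelled in a few lines. This is a sequential Python simulation, not DPU code: the tasklet count, block size, and block-cyclic assignment are assumptions chosen for illustration, and a list slice stands in for the DMA transfer.

```python
NR_TASKLETS = 16   # assumption; the text allows up to 24 tasklets per DPU
BLOCK_ELEMS = 256  # assumption: elements staged in WRAM per simulated DMA transfer


def dpu_vector_sum(mram, nr_tasklets=NR_TASKLETS, block=BLOCK_ELEMS):
    """Simulate one DPU summing the vector held in its MRAM."""
    partial = [0] * nr_tasklets  # per-tasklet partial sums, kept in WRAM
    num_blocks = (len(mram) + block - 1) // block
    # On hardware the tasklets run concurrently; here they are looped over.
    for tasklet_id in range(nr_tasklets):
        # Block-cyclic assignment: tasklet t handles blocks t, t+T, t+2T, ...
        for b in range(tasklet_id, num_blocks, nr_tasklets):
            wram_buf = mram[b * block:(b + 1) * block]  # stands in for MRAM->WRAM DMA
            partial[tasklet_id] += sum(wram_buf)
    # After a barrier, a single tasklet combines the partial sums.
    return sum(partial)


print(dpu_vector_sum(list(range(1000))))  # 499500
```

In a real kernel each tasklet would own a fixed WRAM buffer and reuse it for every block, so the block size is bounded by the WRAM budget rather than chosen freely.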
To simplify programming, several frameworks have emerged:
- ATiM: An autotuning tensor compiler for UPMEM that integrates code generation into TVM/TensorIR, providing high-level scheduling, PIM-specific optimizations (DMA-aware boundary elimination, loop-bound tightening), and tuning primitives for both host and DPU schedules (Shin et al., 2024).
- DaPPA: A data-parallel pattern framework exposing higher-level C++ templates (map, reduce, filter, etc.), dynamic pipeline composition, and code generation via templates and stage-dependent C++ AST analysis (Oliveira et al., 2023).
- SimplePIM: Array-based iterators for map/reduce/zip, high-level collective primitives (scatter/gather/allreduce), and automated management of alignment, transfers, and threading (Chen et al., 2023).
These frameworks reduce programmer effort by as much as 94% in lines of code, and on average deliver performance at or above that of hand-tuned code across representative workloads.
3. Workload Mapping, Data Partitioning, and Tiling Strategies
Optimal use of UPMEM requires careful partitioning of global data and computation, as the absence of hardware caching, limited scratchpad size (64 KB WRAM), and explicit MRAM-WRAM-DPU data flows demand precise control:
- Data partitioning: Tasks are divided across DPUs; for a vector of N elements spread over D DPUs, each DPU processes roughly N/D elements. For matrices (e.g., neural network layers), tiled 2D blocking strategies distribute blocks across DPUs, trading off replication ratio against synchronization.
- Replication ratio formula:
- Minimizing data movement: Input matrices may be transposed on the host so that block transfers are contiguous and 8-byte aligned, maximizing DMA efficiency (Carrinho et al., 10 Aug 2025).
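A minimal sketch of even 1D partitioning with alignment padding follows. It assumes 4-byte elements, the 8-byte alignment mentioned above, and a uniform padded transfer size for every DPU; the function name and the uniform-size policy are illustrative choices, not a description of any specific framework.

```python
def partition(n_elems, n_dpus, elem_bytes=4, align=8):
    """Split n_elems across n_dpus; return a list of (start, count, padded_bytes).

    Every DPU gets the same transfer size (the padded size of the largest
    chunk). That uniformity is an assumption made here to model a single
    parallel transfer of fixed length.
    """
    per_dpu = -(-n_elems // n_dpus)                # ceil(n_elems / n_dpus)
    raw_bytes = per_dpu * elem_bytes
    padded_bytes = -(-raw_bytes // align) * align  # round up to the alignment
    chunks = []
    for d in range(n_dpus):
        start = min(d * per_dpu, n_elems)
        count = min(per_dpu, n_elems - start)      # trailing DPUs may get fewer (or zero)
        chunks.append((start, count, padded_bytes))
    return chunks


for chunk in partition(10, 4):
    print(chunk)
# (0, 3, 16)
# (3, 3, 16)
# (6, 3, 16)
# (9, 1, 16)
```

The last DPU receives a single valid element but still transfers 16 bytes, which shows how padding overhead grows when the data is split across more DPUs than it can usefully fill.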
Kernels are mapped so the bulk of computation occurs within WRAM, and only final results move back to the host, drastically reducing host↔MRAM traffic. For matrix-matrix and matrix-vector multiplies, hand-tuning and/or autotuning frameworks search for block, tile, and reduction sizes that maximize utilization and minimize synchronization overhead.
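One constraint any tile-size search has to respect is the WRAM budget. The sketch below computes the largest per-buffer tile that fits when every tasklet owns a fixed number of buffers. The 64 KB WRAM size comes from the text; the reserved space, buffer count, and function name are hypothetical parameters for illustration.

```python
WRAM_BYTES = 64 * 1024  # per-DPU scratchpad size quoted in the text


def max_tile_elems(nr_tasklets, buffers_per_tasklet, elem_bytes,
                   reserved_bytes=0, align=8, wram_bytes=WRAM_BYTES):
    """Largest per-buffer tile (in elements) such that all tasklet buffers fit in WRAM."""
    budget = wram_bytes - reserved_bytes           # space left after stacks, globals, etc.
    per_buffer = budget // (nr_tasklets * buffers_per_tasklet)
    per_buffer -= per_buffer % align               # keep buffers a multiple of the alignment
    return max(per_buffer // elem_bytes, 0)


# 16 tasklets, 3 buffers each (two inputs and one output), 4-byte elements,
# with 16 KB assumed to be reserved for stacks and other data.
print(max_tile_elems(16, 3, 4, reserved_bytes=16 * 1024))  # 256
```

An autotuner would treat this value as an upper bound and then search below it for the tile size that best balances DMA efficiency against tasklet occupancy.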
4. Application Domains and Performance Characterization
UPMEM’s architecture is well-suited for memory-bound workloads that demand high bandwidth but perform little arithmetic per byte moved. Key domains include:
- Machine learning inference and training: For fully-connected neural network layers, in-memory GEMV/GEMM, multilayer perceptrons, and distributed SGD variants, UPMEM demonstrates speedups over an Intel Xeon CPU (VGG-scale MLP inference) and matches or outperforms GPUs in bandwidth-bound regimes, especially when the GPU's memory is oversubscribed (Carrinho et al., 10 Aug 2025, Oliveira et al., 2022, Rhyner et al., 2024).
- Data analytics and relational DBMS logic: PIMDAL reports a mean performance improvement over a high-end CPU for TPC-H queries, with selection, aggregation, sort, and join operators carefully engineered to maximize DMA size and thread-level parallelism (≥11 tasklets per DPU) and to offload communication logic to the host (Frouzakis et al., 2 Apr 2025).
- Graph analytics: Triangle counting, BFS, and approximate analytics leverage partitioning and sampling to minimize host↔DPU traffic. PIM outperforms the CPU on dynamic workloads—updates on edge lists (COO format) impose negligible shuffle overhead for PIM compared to CSR-based CPU methods (Asquini et al., 7 May 2025).
- Homomorphic encryption and number-theoretic workloads: By tiling polynomials across DPUs and using residue number systems, DRAMatic demonstrates substantial speedups and energy reductions for NTT and BGV-mult kernels, with residual bottlenecks in (software) multiplication (Klinger et al., 12 Feb 2026).
- Spatial query processing: R-tree range queries, which are not streaming-friendly, are mapped by CPU-side level-broadcast and low-level leaf partitioning, achieving energy reductions and kernel speedups over CPU baselines (Jannat et al., 15 Apr 2026).
A summary of representative UPMEM workload mappings and the baselines against which speedups are reported:
| Workload | Baseline and notes |
|---|---|
| MLP inference (VGG) | vs. CPU; batch inference |
| TPC-H Q1 selection/aggr | vs. host Xeon CPU |
| Triangle counting, dyn. | vs. CPU; WikipediaEdit, dynamic COO edge streams |
| NTT (HE) | vs. CPU for vector add (polys) |
| R-tree spatial range | kernel vs. CPU kernel; energy reduction |
End-to-end performance is typically bottlenecked by data movement (host↔MRAM, MRAM↔WRAM), as explicit memory management requires loading data to scratchpad and orchestrating output staging.
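A simple additive model makes the data-movement bottleneck concrete. The sketch assumes transfers and kernel execution do not overlap and that host↔MRAM bandwidth is the same in both directions. The function name and the example figures are hypothetical, not measurements.

```python
def end_to_end_breakdown(bytes_in, bytes_out, host_bw, kernel_seconds):
    """Return (total_seconds, transfer_fraction) for one offload round.

    host_bw: sustained host<->MRAM bandwidth in bytes/second (assumed equal
    in both directions). Transfers and kernel time are assumed not to overlap.
    """
    transfer = (bytes_in + bytes_out) / host_bw
    total = transfer + kernel_seconds
    return total, transfer / total


# Hypothetical figures: 8 GB copied in at 4 GB/s, then a 2-second kernel.
total, frac = end_to_end_breakdown(8e9, 0, 4e9, 2.0)
print(total, frac)  # 4.0 0.5
```

In this made-up case half of the end-to-end time is spent on the host-to-MRAM copy, so a faster kernel alone could at most halve the total.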
5. Limitations, Scalability, and Bottlenecks
Despite its highly parallel and bandwidth-centric design, the UPMEM system is constrained by several architectural limits:
- No native hardware support for multiply/divide, FP: All non-add/sub integer and floating-point operations are emulated in software (tens to hundreds of cycles of latency), which sharply curtails performance on computation-heavy kernels. For example, in MLP inference, software-emulated FP32 multiplication increases kernel latency; in homomorphic encryption, the lack of a native 32-bit multiplier leaves UPMEM markedly slower than optimized GPUs on multiplication-dominated workloads (Carrinho et al., 10 Aug 2025, Gupta et al., 2023, Klinger et al., 12 Feb 2026).
- Scratchpad size (WRAM): With only 64 KB per DPU, holding large working sets entirely in scratchpad is impractical for complex kernels. MRAM serves as the high-capacity bank but with higher access latency.
- No direct inter-DPU communication: All cross-DPU synchronization—critical for reduction and collective operations—must be managed via the host, which adds latency and limits scalability. This impacts distributed SGD scaling, global graph reductions, and multi-stage database operations (Frouzakis et al., 2 Apr 2025, Rhyner et al., 2024).
- Manual data movement management: Programmers and compiler frameworks must explicitly tile, pad, and align all MRAM↔WRAM and host↔MRAM exchanges, as there is no cache coherence or intelligent memory controller.
- Host orchestration overhead: Over-partitioning or allocating excess DPUs can increase latency due to DMA setup, zero-padding, and idle stalls.
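To see why emulated multiplication is expensive, consider the textbook shift-and-add algorithm, which builds a 32-bit product from up to 32 conditional additions. This is a generic illustration of software multiplication, not the routine the UPMEM toolchain actually emits. The function name and step counter are illustrative.

```python
MASK32 = 0xFFFFFFFF


def mul32_shift_add(a, b):
    """Multiply two unsigned 32-bit integers using only shifts, adds, and bit tests.

    Returns (low 32 bits of the product, number of loop iterations).
    """
    a &= MASK32
    b &= MASK32
    acc = 0
    steps = 0
    while b:
        if b & 1:                     # add the shifted multiplicand for each set bit
            acc = (acc + a) & MASK32
        a = (a << 1) & MASK32
        b >>= 1
        steps += 1
    return acc, steps


print(mul32_shift_add(6, 7))                  # (42, 3)
print(mul32_shift_add(0xFFFFFFFF, 0xFFFFFFFF))  # (1, 32)
```

Each iteration costs several simple instructions, so a full-width multiply takes on the order of a hundred instructions, consistent with the "tens to hundreds of cycles" range cited in the list above.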
6. Energy Efficiency and Practical Considerations
Empirical studies report that UPMEM deployments achieve energy reductions over CPU-based baselines, especially in memory-bound or streaming workloads (Jannat et al., 15 Apr 2026, Chen et al., 2023, Gómez-Luna et al., 2021). Energy savings are less pronounced (sometimes below parity) on workloads with high arithmetic complexity or global communication. At full utilization (2,560 DPUs), a system can draw several hundred watts, but energy per operation compares favorably to both CPUs and GPUs when arithmetic intensity is low.
Programming productivity is substantially enhanced with autotuning compilers and pattern-based frameworks, as evidenced by substantial code size reductions and performance on par with or superior to hand-tuned SDK implementations (Oliveira et al., 2023, Shin et al., 2024, Chen et al., 2023).
PIM-aware compiler and IR design, as in ATiM and SimplePIM, automates the management of DMA, tiling, blocking, and low-level synchronization, significantly reducing the complexity of porting and optimizing data-centric algorithms for UPMEM.
7. Future Directions and Architectural Evolution
Several research works recommend that future PIM architectures evolve in the following dimensions:
- Native hardware multiply/divide and FP support: To broaden applicability and close performance gaps on compute-intensive workloads.
- On-module/inter-DPU hardware collectives: Low-latency mesh or tree interconnects within a DIMM to enable scalable tree-reductions/allgathers/allreduces without host intervention. This is identified as critical for linear scalability in distributed ML and aggregation-heavy analytics (Frouzakis et al., 2 Apr 2025, Rhyner et al., 2024).
- Larger and/or hybrid scratchpad-cache hierarchies: Increased WRAM to 128–256 KB per DPU, or a hybrid model with small local caches or vector units, to exploit data locality and better support pointer-chasing and irregular access patterns.
- Enhanced DMA and host–PIM memory interfaces: Direct “zero-copy” access and broadcast/multicast DMA to reduce round-trip latency and maximize host-bus utilization (Klinger et al., 12 Feb 2026).
- Richer programming models: Deeper compiler integration (e.g., TVM/MLIR for tensor and data analytics), dynamic code generation, and fine-grained scheduling to exploit cross-layer optimizations (Shin et al., 2024, Chen et al., 2023).
- Algorithm-architecture co-design: To maximize the utility of PIM hardware, future ML and analytic frameworks should expose and exploit the bandwidth and memory locality profile, preferring communication-avoiding reductions and data-parallel, bandwidth-bound patterns.
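The appeal of hardware tree collectives can be sketched by counting communication rounds. The simulation below performs a pairwise tree reduction in plain Python. It models only the round count, not any proposed interconnect, and the function name is illustrative.

```python
def tree_reduce(values):
    """Pairwise tree reduction; returns (total, number of rounds)."""
    vals = list(values)
    rounds = 0
    while len(vals) > 1:
        # Neighbouring pairs are combined in parallel within one round.
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])  # an unpaired element carries over unchanged
        vals = nxt
        rounds += 1
    return (vals[0] if vals else 0), rounds


print(tree_reduce(range(8)))     # (28, 3)
print(tree_reduce([1] * 2560))   # (2560, 12)
```

For 2,560 DPUs a tree needs 12 rounds of neighbour exchanges, whereas a host-mediated reduction has to move every DPU's partial result across the memory bus.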
UPMEM’s deployment as the first large-scale commercial PIM system provides a key experimental platform, illuminating the benefits and limitations of in-memory computing and shaping architectural innovation for memory-centric workload acceleration (Gómez-Luna et al., 2021, Hyun et al., 2023).