Processing-in-Memory Architectures
- Processing-in-memory (PIM) architectures are designs that integrate computation directly within memory modules using techniques like near-memory logic and in-array processing.
- The PIM paradigm is split into Processing-Near-Memory (PnM) for low-latency offloading with embedded accelerators and Processing-Using-Memory (PuM) for efficient in-place bitwise operations.
- Successful PIM deployment requires robust software support, specialized programming models, and effective memory coherence to accelerate analytics and deep learning tasks.
Processing-in-memory (PIM) architectures depart from conventional, processor-centric design: they target the data movement bottleneck by performing computation where data is stored. PIM encompasses multiple device and architectural technologies that either embed programmable logic adjacent to memory (processing-near-memory, PnM) or directly leverage the physical properties of memory arrays for computation (processing-using-memory, PuM). This article surveys the principles, design space, system-level implications, and research trajectory of PIM, drawing on recent research.
1. Taxonomy of PIM Architectures
PIM architectures are broadly classified into two fundamental paradigms based on integration approach and computational model (Oliveira et al., 2022, Mutlu et al., 2020):
- Processing-Near-Memory (PnM):
- Integrates general-purpose or domain-specialized logic elements (ALUs, SIMD blocks, soft-cores, accelerators) into the logic layer of a 3D-stacked memory module, such as Hybrid Memory Cube (HMC) or High-Bandwidth Memory (HBM).
- Enables rich instruction sets, including PIM-Enabled Instructions (PEI), and can offload both simple and complex tasks with near-DRAM bandwidth and low energy per bit moved.
- Exemplified by systems like HMC 2.0 with embedded accelerators for graph traversal and machine learning inference.
- Processing-Using-Memory (PuM):
- Exploits the native analog or digital operations of memory cell arrays (e.g., DRAM, ReRAM, PCM) for in-place computation.
- In DRAM, involves primitives such as triple-row activation for majority logic (MAJ), from which bulk bitwise AND/OR are derived, with NOT provided by dual-contact cells (cf. Ambit), and in situ copy (RowClone).
- Emerging NVM crossbars (ReRAM, PCM) support analog multiply-accumulate via current summation.
- Operations map to entire rows or columns, often executed bit-serially (SIMDRAM) or via highly parallel stateful logic (AritPIM, MIMDRAM) (Leitersdorf et al., 2022, Oliveira, 27 Aug 2025).
The distinction is summarized below:
| Paradigm | Computational Layer | Primitives | Device Examples |
|---|---|---|---|
| Processing-Near-Memory (PnM) | Logic layer beside/under DRAM | ISA-defined instructions, ALUs | HMC, HBM, UPMEM DIMMs |
| Processing-Using-Memory (PuM) | In-array (bitcell/crossbar) | Row activation, MAJ, AND, OR | DRAM, ReRAM, PCM |
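To make the PuM row of the table concrete, the following is a minimal functional sketch of how a majority primitive yields bulk AND and OR. It models a DRAM row as a Python integer and ignores all timing, charge-sharing, and reliability effects; the names `maj`, `bulk_and`, `bulk_or`, and `ROW_BITS` are illustrative and not taken from any cited system.

```python
# Functional model only: a "row" is an integer whose bits stand for bitcells.
ROW_BITS = 16                      # assumed row width for the example
ALL_ONES = (1 << ROW_BITS) - 1     # control row preset to all 1s
ALL_ZEROS = 0                      # control row preset to all 0s


def maj(a: int, b: int, c: int) -> int:
    """Bitwise majority of three rows, the logical result of activating them together."""
    return (a & b) | (b & c) | (a & c)


def bulk_and(a: int, b: int) -> int:
    """AND across the whole row: majority with an all-zeros control row."""
    return maj(a, b, ALL_ZEROS)


def bulk_or(a: int, b: int) -> int:
    """OR across the whole row: majority with an all-ones control row."""
    return maj(a, b, ALL_ONES)


row_a = 0b1100_1010_1111_0000
row_b = 0b1010_0110_0000_1111
print(f"{bulk_and(row_a, row_b):016b}")
print(f"{bulk_or(row_a, row_b):016b}")
```

The choice of control row is what selects the operation, which is why a single in-array primitive can cover both AND and OR over every column at once.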
2. Architectural Principles and Integration Techniques
Die-Stacked Memory and Logic-Layer Integration
3D-stacked memories, such as HMC and HBM, utilize through-silicon vias (TSVs) to achieve massive internal bandwidth (up to several hundred GB/s per stack) (Oliveira et al., 2022). The logic layer beneath the DRAM dies hosts PIM compute units, which directly access the data arrays via TSVs, minimizing access latency (10–20 ns) and reducing energy per bit to ≈0.02 nJ/bit (vs. ≈0.2 nJ/bit for off-chip transfers) (Ghose et al., 2018). PnM architectures exploit this locality for streaming computation and high-throughput parallelism (e.g., graph algorithms, in-memory analytics) (Kim et al., 2023).
Cell-Operation Primitives and Data Mapping
PuM techniques are device-physics-aware and often operate at the granularity of entire DRAM rows or NVM crossbar bitlines. DRAM-centric techniques, such as triple-row activation for MAJ and RowClone/row-buffer movement (RBM), allow bulk logic and copy operations without data crossing the I/O interface. Data must often be physically mapped in a vertical layout, so that each row holds the same bit position of many operands, with transposition units correcting layout mismatches in software or hardware (as in SIMDRAM) (Oliveira et al., 2022, Leitersdorf et al., 2022, Oliveira, 27 Aug 2025).
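The vertical layout and bit-serial execution described above can be sketched in software. The example below is a functional model under simplifying assumptions: rows are Python integers, each bit position of a row is one SIMD lane, and the full adder is built only from majority and NOT using the standard identity sum = MAJ(NOT cout, cin, MAJ(a, b, NOT cin)). The helper names are hypothetical and do not reproduce the SIMDRAM framework's actual interface.

```python
from typing import List


def to_vertical(values: List[int], width: int) -> List[int]:
    """Transpose so that row i packs bit i of every element (element j in lane j)."""
    rows = []
    for i in range(width):
        row = 0
        for j, v in enumerate(values):
            row |= ((v >> i) & 1) << j
        rows.append(row)
    return rows


def from_vertical(rows: List[int], count: int) -> List[int]:
    """Inverse transposition back to one integer per element."""
    values = []
    for j in range(count):
        v = 0
        for i, row in enumerate(rows):
            v |= ((row >> j) & 1) << i
        values.append(v)
    return values


def maj(a: int, b: int, c: int) -> int:
    """Bitwise majority across all lanes."""
    return (a & b) | (b & c) | (a & c)


def bit_serial_add(a_rows: List[int], b_rows: List[int], lanes: int) -> List[int]:
    """Add all lanes in parallel, one bit position per step, using only MAJ and NOT."""
    mask = (1 << lanes) - 1
    carry = 0
    out = []
    for a, b in zip(a_rows, b_rows):           # least-significant bit first
        carry_out = maj(a, b, carry)
        not_carry_in = ~carry & mask
        not_carry_out = ~carry_out & mask
        out.append(maj(not_carry_out, carry, maj(a, b, not_carry_in)))
        carry = carry_out
    return out                                  # final carry dropped: wraps modulo 2**width


WIDTH = 8
a = [3, 7, 250, 128]
b = [5, 9, 10, 128]
result = from_vertical(
    bit_serial_add(to_vertical(a, WIDTH), to_vertical(b, WIDTH), len(a)), len(a)
)
print(result)
```

The number of steps depends on the operand width rather than on the number of elements, which is the property that makes wide rows attractive for this style of computation.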
SRAM-PIM approaches embed small logic near each bitcell for highly parallel bit-serial compute, and advanced designs (PIMSAB, DB-PIM) develop specialized macroarchitectures and sparse dataflow co-design (Arora et al., 2023, Duan et al., 25 May 2025).
3. Performance, Energy, and System Modeling
Quantitative evaluation of PIM must capture compute, data movement, and queuing dynamics:
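- Energy per Bit Moved: $E_{\text{bit}} = E_{\text{move}} / N_{\text{bits}}$, with off-chip movement dominating cost (≈10 pJ/bit, DDRx baseline).
- Bandwidth Utilization: $U = BW_{\text{achieved}} / BW_{\text{peak}}$; PnM can approach near-unity, while CPU-DRAM often stalls at 0.4–0.6 (Oliveira et al., 2022).
- Total Latency: $T_{\text{total}} = T_{\text{compute}} + T_{\text{transfer}} + T_{\text{queue}}$; PnM reduces the data transfer time $T_{\text{transfer}}$, while PuM can drive it to zero for true in-place operations.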
- Key System Models: Accurate modeling further incorporates memory controller contention, bank conflicts, and in-DRAM sequence latency for PuM, as well as network-on-chip modeling for manycore PIM deployments (Sharma et al., 2024, Arora et al., 2023).
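A first-order version of these metrics can be written down directly. The sketch below assumes that total latency decomposes additively into compute, transfer, and queuing time, and that movement energy scales linearly with bits moved. The per-bit energies are the approximate figures quoted earlier in this article for in-stack versus off-chip transfers; the bandwidth values and the kernel parameters are purely illustrative assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    bytes_moved: float      # data transferred between memory and compute
    compute_s: float        # time spent computing
    queue_s: float = 0.0    # time lost to contention and queuing


def transfer_time_s(bytes_moved: float, bandwidth_gb_s: float) -> float:
    """Transfer time at a sustained bandwidth given in GB/s."""
    return bytes_moved / (bandwidth_gb_s * 1e9)


def total_latency_s(k: Kernel, bandwidth_gb_s: float) -> float:
    """Assumed additive model: compute + transfer + queuing."""
    return k.compute_s + transfer_time_s(k.bytes_moved, bandwidth_gb_s) + k.queue_s


def movement_energy_j(bytes_moved: float, nj_per_bit: float) -> float:
    """Energy spent moving data, given an energy cost per bit in nJ."""
    return bytes_moved * 8 * nj_per_bit * 1e-9


def bandwidth_utilization(achieved_gb_s: float, peak_gb_s: float) -> float:
    """Fraction of peak bandwidth actually sustained."""
    return achieved_gb_s / peak_gb_s


kernel = Kernel(bytes_moved=1e9, compute_s=0.002)   # hypothetical streaming kernel
OFF_CHIP_GB_S, IN_STACK_GB_S = 25.0, 320.0           # illustrative bandwidths

print(total_latency_s(kernel, OFF_CHIP_GB_S), total_latency_s(kernel, IN_STACK_GB_S))
print(movement_energy_j(kernel.bytes_moved, 0.2), movement_energy_j(kernel.bytes_moved, 0.02))
print(bandwidth_utilization(12.5, OFF_CHIP_GB_S))
```

A model of this kind omits the controller contention, bank conflicts, and in-DRAM command sequencing noted above, so it is suited to ranking candidates for offload rather than to predicting absolute performance.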
Table: Representative Throughput and Energy Gains Reported
| Workload | Speedup (vs CPU) | Energy Savings | Reference |
|---|---|---|---|
| Bulk bitwise (Ambit) | 44× | 35× | (Mutlu et al., 2020) |
| PnM DB operators | 4–14× | 85.7% reduction | (Kim et al., 2023) |
| Graph analytics (Tesseract) | ~14× | 87% reduction | (Mutlu et al., 2020) |
| SRAM-PIM DNN (PIMSAB) | 3–4× (vs GPU) | 4× | (Arora et al., 2023) |
| SRAM-PIM DNN (DB-PIM) | 8× | 85% reduction | (Duan et al., 25 May 2025) |
4. Workload Characterization and Benchmark Methodologies
Processing-in-memory benefits are acutely workload-dependent. Two notable public suites have emerged:
- DAMOV: An architecture-independent benchmark suite and classification methodology for data-movement bottlenecks, identifying 144 memory-bound functions from 74 applications (HPC, databases, graph analytics) and empirically mapping them onto six classes (DRAM bandwidth-bound, DRAM latency-bound, L3 contention-bound, etc.) (Oliveira, 27 Aug 2025, Oliveira et al., 2022). This enables predictive selection of PIM-suitable kernels.
- PrIM: The UPMEM-based PIM benchmark suite spanning 16 workloads (linear algebra, graph, DNN, analytics) for real hardware, enabling direct performance/energy measurement on modern commercial PnM systems (Gómez-Luna et al., 2021).
Graph analytics, sparse matrix computations, key-value filters, and DNN layers with high last-level cache MPKI and low temporal locality are prime candidates for PnM offload. PuM excels at bitwise-dominated workloads such as bitmap index scans and bulk filtering.
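The selection criteria in the preceding paragraph can be expressed as a simple screening heuristic. The sketch below is not the DAMOV methodology, which uses its own metrics and six classes; it only encodes the rule of thumb stated here. All field names and threshold values are hypothetical and would need calibration against profiling data.

```python
from dataclasses import dataclass


@dataclass
class KernelProfile:
    name: str
    llc_mpki: float            # last-level cache misses per kilo-instruction
    temporal_locality: float   # 0.0 (no reuse) to 1.0 (high reuse)
    bitwise_fraction: float    # share of work that is bulk bitwise operations


# Hypothetical cut-offs chosen only for illustration.
MPKI_THRESHOLD = 10.0
LOCALITY_THRESHOLD = 0.5
BITWISE_THRESHOLD = 0.8


def suggest_target(p: KernelProfile) -> str:
    """Return a coarse placement suggestion: 'PuM', 'PnM', or 'CPU'."""
    if p.bitwise_fraction >= BITWISE_THRESHOLD:
        return "PuM"    # bitwise-dominated, e.g. bitmap index scans
    if p.llc_mpki >= MPKI_THRESHOLD and p.temporal_locality < LOCALITY_THRESHOLD:
        return "PnM"    # memory-bound with little reuse
    return "CPU"        # cache-friendly work stays on the host


profiles = [
    KernelProfile("bitmap_scan", llc_mpki=25.0, temporal_locality=0.1, bitwise_fraction=0.95),
    KernelProfile("graph_traversal", llc_mpki=40.0, temporal_locality=0.2, bitwise_fraction=0.05),
    KernelProfile("dense_gemm", llc_mpki=1.5, temporal_locality=0.9, bitwise_fraction=0.0),
]
for p in profiles:
    print(p.name, suggest_target(p))
```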
5. Software, Programming, and System Support
Programming Models and Compilers
Robust PIM adoption hinges on full-stack system and programming support:
- APIs and High-Level Libraries: PnM exposes offload primitives (e.g., PIM_Put, PIM_Get) often embedded in OpenCL/OpenACC; PuM systems like SIMDRAM provide a compiler/controller transforming programmer-facing add/multiply/relational ops into DRAM row-activation sequences (Oliveira et al., 2022, Oliveira, 27 Aug 2025).
- Compiler Optimizations: Data placement analyses, kernel fusion (to reduce transfer call overhead), auto-vectorization, and PIM-specific code generation (e.g., via polyhedral transformations) are required for efficient mapping of applications (Yang et al., 19 Nov 2025, Oliveira et al., 2022). Frameworks such as DCC co-optimize data rearrangement and compute scheduling for ML workloads (Yang et al., 19 Nov 2025).
- Operating System Extensions: PIM-aware page allocation tags, NUMA policies to co-locate data with PIM logic, virtual memory support for PIM address translation, and hardware/software coherence mechanism integration (LazyPIM, CoNDA) are essential features (Oliveira et al., 2022, Boroumand et al., 2017).
Coherence and Consistency
Maintaining shared-memory and coherence semantics in hybrid PIM–CPU environments is nontrivial:
- LazyPIM: Speculative execution of PIM kernels with address tracking in compact Bloom filter "coherence signatures," followed by batched, compressed conflict checking at kernel commit time. Trade-off: occasional rollbacks, including spurious ones caused by signature false positives, in exchange for dramatically reduced off-chip traffic: 30.9% less than coarse-grained locks, >86% less than a non-coherent CPU, and <10% performance loss relative to an ideal PIM baseline (Boroumand et al., 2017, Oliveira et al., 2022).
Trade-offs exist between strict hardware-enforced coherence (with frequent invalidations) and software-managed consistency (simpler, but programmer-burdened).
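The signature idea behind speculative coherence can be illustrated with a small Bloom filter. The sketch below is a simplified model, not the LazyPIM hardware design: it records the cache lines read by a PIM kernel in a signature and, at commit, probes it with the lines the CPU wrote in the meantime. The class and function names, the signature size, the hash count, and the 64-byte line size are all assumptions made for the example.

```python
import hashlib
from typing import Iterable, Iterator

LINE_BYTES = 64   # assumed cache-line size


class Signature:
    """Bloom filter over cache-line addresses: no false negatives, rare false positives."""

    def __init__(self, bits: int = 1024, hashes: int = 3) -> None:
        self.bits = bits
        self.hashes = hashes
        self.field = 0

    def _positions(self, line: int) -> Iterator[int]:
        for i in range(self.hashes):
            digest = hashlib.blake2b(f"{i}:{line}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "little") % self.bits

    def add(self, line: int) -> None:
        for pos in self._positions(line):
            self.field |= 1 << pos

    def may_contain(self, line: int) -> bool:
        return all((self.field >> pos) & 1 for pos in self._positions(line))


def commit_or_rollback(pim_reads: Iterable[int], cpu_writes: Iterable[int]) -> str:
    """Check for conflicts once, at kernel commit, instead of on every access."""
    read_sig = Signature()
    for addr in pim_reads:
        read_sig.add(addr // LINE_BYTES)
    conflict = any(read_sig.may_contain(addr // LINE_BYTES) for addr in cpu_writes)
    return "rollback" if conflict else "commit"


print(commit_or_rollback([0x1000, 0x1040], [0x1040]))   # CPU wrote a line the kernel read
print(commit_or_rollback([0x1000, 0x1040], []))         # no CPU writes during the kernel
```

Because a Bloom filter never misses an inserted address, a true conflict always forces a rollback; the cost of compactness is that an unrelated write can occasionally alias and trigger an unnecessary one.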
6. Case Studies in Hardware Prototypes and Commercial Devices
- UPMEM: Commercial DDR4 PIM-DIMMs with per-bank RISC cores (DPUs), supporting up to 2,560 DPUs per server, each with its own DRAM bank. The programming model is SPMD, backed by a C toolchain, host-to-DPU DMA interface, and explicit DMA between local/scratch memories (Gómez-Luna et al., 2021, Hyun et al., 2023). Performance studies with the PrIM suite demonstrate near-linear scaling for streaming or synchronization-light kernels, with compute-bound behavior dominating at modest operational intensity.
- SRAM-PIM (PIMSAB, DB-PIM): Hierarchical tile architectures with bit-serial compute and synergetic scheduling/compiler co-design, achieving several-fold speedups and energy reductions compared to leading GPUs or DRAM-PIM baselines (Arora et al., 2023, Duan et al., 25 May 2025).
- Shared-PIM: Advanced in-DRAM PIM designs that decouple compute and data movement through bank-level low-latency buses and shared rows, yielding simultaneous 5× transfer latency reductions and up to 40% kernel-level speedup at ~7% area overhead (Mamdouh et al., 2024).
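The SPMD model described for UPMEM above follows a scatter, compute, gather pattern: the host partitions data across banks, every processing unit runs the same kernel on its own bank, and the host combines the partial results. The sketch below simulates that pattern in plain Python. It does not use the UPMEM SDK, and `scatter`, `run_spmd`, and `partial_sum` are hypothetical names introduced for illustration.

```python
from typing import Callable, List, Sequence


def scatter(data: Sequence[int], num_dpus: int) -> List[List[int]]:
    """Host side: split the input into one contiguous chunk per DPU bank."""
    chunk = -(-len(data) // num_dpus)   # ceiling division
    return [list(data[i * chunk:(i + 1) * chunk]) for i in range(num_dpus)]


def run_spmd(banks: List[List[int]], kernel: Callable[[int, List[int]], int]) -> List[int]:
    """Every DPU runs the same kernel on its local bank (simulated sequentially here)."""
    return [kernel(dpu_id, bank) for dpu_id, bank in enumerate(banks)]


def partial_sum(dpu_id: int, bank: List[int]) -> int:
    """Example kernel: reduce the local bank, with no inter-DPU communication."""
    return sum(bank)


data = list(range(1000))
banks = scatter(data, num_dpus=8)
partials = run_spmd(banks, partial_sum)   # gathered back to the host
total = sum(partials)                     # final reduction on the host
print(total)
```

Kernels of this shape, with independent per-bank work and a single final reduction, are the synchronization-light case for which near-linear scaling is reported.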
7. Open Research Challenges and Future Directions
Current research points to several essential frontiers:
- Programming, Abstractions, and Debugging Tools: Domain-specific languages/execution models exposing PIM with transparency, debugging toolchains for in-memory execution flows, and portable ISA extensions (e.g., to RISC-V, ARM) for interoperable PIM across vendors (Oliveira et al., 2022, Oliveira, 27 Aug 2025).
- Standardization and Systemization: Unification of PIM ISAs, standardized OS and hypervisor support for live migration and virtualization, and uniform simulation frameworks (e.g., Ramulator-PIM, PiMulator) for end-to-end study and validation (Aghaei et al., 26 Nov 2025, Oliveira et al., 2022).
- Reliability and Security: Mitigation of process-variation-induced and RowHammer faults, robust ECC and error-tolerant mechanisms (as in FAT-PIM), and secure isolation between independent PIM processes (Zubair et al., 2022, Oliveira et al., 2022).
- Heterogeneous and Hierarchical Integration: Co-design with GPU/TPU accelerators, integration of PIM with different memory technologies (DRAM, NVM, SRAM), multi-level chiplet or interposer/3D stacking, and dynamic dataflow mapping for mixed and irregular workloads (Sharma et al., 2024).
- Commercialization and Productivity: Demonstration of compelling cost-performance-energy for real-world applications, open hardware/software ecosystem, automated code generation, and programmer productivity frameworks (e.g., DaPPA) (Oliveira, 27 Aug 2025, Yang et al., 19 Nov 2025).
- Virtual Memory and Data Placement: Transparent and efficient hardware/software split for memory management, PIM-aware virtual memory and page allocation, and dynamic locality-driven data remapping mechanisms (Lee et al., 2024, Tian et al., 9 Oct 2025).
- Simulation and Benchmarking: Ongoing need for reliable, scalable, and physically validated simulation for diverse PIM architectures, standard benchmark suites (DAMOV, PrIM), and rapid design-space exploration tools for co-design studies (Aghaei et al., 26 Nov 2025).
PIM is thus at a critical juncture: commercial systems demonstrate feasibility, but adoption at scale is contingent on holistic advances at the architectural, system, toolchain, and programming levels. The trajectory of PIM research suggests that, as tool support and standards mature, PIM architectures can evolve into a mainstream data-centric computing paradigm, narrowing the memory-compute gap that constrains contemporary systems (Oliveira et al., 2022, Mutlu et al., 2020).