
Processing-in-Memory (PIM) Overview

Updated 20 November 2025
  • Processing-in-Memory (PIM) is an architectural paradigm that integrates computation with memory, reducing the energy and latency overhead of data movement.
  • It combines in-memory and near-memory computing using substrates such as commodity DRAM, 3D-stacked memory, and emerging technologies like RRAM.
  • PIM accelerates data-intensive workloads in ML, graph analytics, and database queries while addressing the traditional von Neumann bottleneck.

Processing-in-Memory (PIM) is an architectural paradigm focused on mitigating the limitations of processor-centric computing by tightly integrating computation with memory resources. This data-centric approach primarily addresses the rapidly scaling overheads of data movement between memory and processors, which now dominate the energy and latency profiles of data-intensive applications. PIM enables computation either "in" memory (using the physical properties of memory devices to perform logic) or "near" memory (embedding programmable logic close to memory arrays), thereby alleviating the von Neumann bottleneck and exposing massive internal parallelism and memory bandwidth. The paradigm spans diverse substrates, including commodity DRAM, 3D-stacked memory, SRAM macros, emerging resistive memories, and even quantum-dot cellular automata, with corresponding trade-offs in programmability, performance, system complexity, and workload mapping.

1. Motivation and Historical Foundations

The performance, energy, and scalability bottlenecks of modern systems stem largely from the separation of computation (CPU, GPU, accelerator) and data storage (DRAM, NVM), enforced by the narrow, high-latency off-chip memory bus. Off-chip data movement dominates both latency (tens–hundreds of nanoseconds per DRAM access vs. a few nanoseconds for in-core operations) and energy (e.g., a 64-byte DRAM fetch can consume ~640 pJ, more than 100× the energy of a 64-bit add) (Mutlu et al., 2020). Large-scale workloads such as ML/AI, graph analytics, and database engines consistently show that memory access and data movement account for 35–62% of total system energy, while caches and on-chip compute resources are poorly utilized when arithmetic intensity is low (Gómez-Luna et al., 2021).

The central thesis of PIM is to eliminate or drastically reduce this movement by enabling computation where the data resides. Early PIM concepts appeared in the 1960s and 1970s (e.g., logic-in-memory), followed by 1990s proposals such as Active Pages, but practical realization was limited by device, cost, and system-design barriers. Today’s renewed interest builds on 3D-stacked integration, industry prototypes (UPMEM, HBM-PIM, AxDIMM, AiM), nonvolatile memory technologies (RRAM/PCM), and new system software (Mutlu et al., 2020, Mutlu et al., 2019, Ghose et al., 2019).

2. PIM Architectural Taxonomy: Approaches and Primitives

2.1. Processing-Using-Memory (PUM)

PUM architectures exploit the intrinsic analog or digital behavior of memory cell arrays to perform computation directly inside storage arrays, typically with minimal logic overhead. Canonical mechanisms include:

  • DRAM analog operations (Ambit, RowClone): RowClone performs in-DRAM bulk copy/initialization by activating source and destination rows back-to-back within a subarray, yielding up to 11.6× speedup and 74.4× energy reduction for 4 KB transfers. Ambit uses triple-row activation for bulk bitwise MAJ, AND, and OR (with NOT via dual-contact cells) at subarray-level bandwidths of up to hundreds of TB/s; energy per bitwise DRAM operation is on the order of 0.1 pJ/bit (Mutlu et al., 2020, Mutlu et al., 2019). A functional sketch of these primitives follows this list.
  • Bulk bitwise processing with RRAM/memristors: Bulk-bitwise primitives (e.g., IMPLY logic) in stateful resistive arrays support high-throughput operations for database analytics and filtering, and can surpass CPU-based systems by up to 608× for TPC-H queries with up to 18.6× energy savings (Perach et al., 2022).
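
The functional behavior of these in-DRAM primitives can be modeled in ordinary C. The sketch below is a minimal model that computes the same Boolean results on 64-bit words; it does not model DRAM timing, row activation, or charge sharing, and the data values are illustrative only.

```c
/* Functional model of Ambit-style bulk bitwise operations, simulated on
 * 64-bit words in plain C. Real Ambit issues triple-row activations inside
 * a DRAM subarray; this sketch only reproduces the logic it computes. */
#include <stdint.h>
#include <stdio.h>

/* Triple-row activation computes a bitwise majority of three rows. */
static uint64_t maj3(uint64_t a, uint64_t b, uint64_t c) {
    return (a & b) | (b & c) | (a & c);
}

int main(void) {
    uint64_t rowA = 0xF0F0F0F0F0F0F0F0ULL;
    uint64_t rowB = 0xFF00FF00FF00FF00ULL;

    /* AND/OR follow from majority with a control row of all-0s / all-1s. */
    uint64_t and_row = maj3(rowA, rowB, 0x0ULL);   /* A AND B             */
    uint64_t or_row  = maj3(rowA, rowB, ~0x0ULL);  /* A OR B              */
    uint64_t not_row = ~rowA;  /* NOT via dual-contact cells in real Ambit */

    printf("AND: %016llx\nOR:  %016llx\nNOT: %016llx\n",
           (unsigned long long)and_row,
           (unsigned long long)or_row,
           (unsigned long long)not_row);
    return 0;
}
```

The majority-with-control-row identity (MAJ(A, B, 0) = A AND B; MAJ(A, B, 1) = A OR B) is what allows a single triple-row-activation command to implement multiple bitwise operations.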

2.2. Processing-Near-Memory (PNM)

PNM architectures embed programmable logic (simple in-order cores, vector units, or custom accelerators) in the logic die of 3D-stacked memories or at the bank periphery in conventional DIMMs (Mutlu et al., 2020, Leitersdorf et al., 2022):

  • 3D-stacked HMC/HBM PIM: The logic layer, interconnected via TSVs, hosts per-bank or per-vault microcontrollers. Internal stack bandwidths reach 400–500 GB/s, and the logic-layer budget allows up to ∼200 mm² area and 5–10 W (Mutlu et al., 2020). Tesseract (ISCA’15) achieves a 13.8× speedup and 87% energy savings on graph analytics (Mutlu et al., 2020).
  • Digital SRAM-PIM: SRAM arrays augmented with small logic (bitwise gates, adder trees) perform parallel dot-products, as in DB-PIM, which achieves up to 8.01× speedup and 85.28% energy reduction compared to a dense digital PIM baseline (Duan et al., 25 May 2025). A bit-serial dot-product sketch is given below.
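
The digital SRAM-PIM dot-product idea can be illustrated with a bit-serial model: weight and activation bit-planes are ANDed column-wise, a population count stands in for the adder tree, and partial sums are shift-accumulated. This is a minimal C sketch of the arithmetic only (array sizes and bit widths are arbitrary choices), not the DB-PIM macro design.

```c
/* Bit-serial dot-product model in the spirit of digital SRAM-PIM. */
#include <stdint.h>
#include <stdio.h>

#define N 64   /* one 64-wide row: 64 weight/activation pairs */

int main(void) {
    uint8_t w[N], x[N];
    for (int i = 0; i < N; i++) {
        w[i] = (uint8_t)(i & 0xF);
        x[i] = (uint8_t)((i * 3) & 0xF);
    }

    /* Reference result computed the conventional way. */
    int64_t ref = 0;
    for (int i = 0; i < N; i++) ref += (int64_t)w[i] * x[i];

    /* Bit-serial evaluation: for each (weight bit j, activation bit k),
     * AND the two bit-planes, popcount the 64-wide result (adder tree),
     * and accumulate shifted by j + k. */
    int64_t acc = 0;
    for (int j = 0; j < 8; j++) {
        for (int k = 0; k < 8; k++) {
            uint64_t plane = 0;
            for (int i = 0; i < N; i++) {
                uint64_t wbit = (w[i] >> j) & 1u;
                uint64_t xbit = (x[i] >> k) & 1u;
                plane |= (wbit & xbit) << i;   /* per-column AND gate */
            }
            /* __builtin_popcountll is a GCC/Clang intrinsic. */
            acc += (int64_t)__builtin_popcountll(plane) << (j + k);
        }
    }

    printf("reference=%lld bit-serial=%lld\n", (long long)ref, (long long)acc);
    return 0;
}
```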

2.3. Commodity and Emerging PIM Substrates

  • Commercial PIM (UPMEM): General-purpose DPUs (post-2021) with co-located DRAM, scratchpad memory, and low-power in-order cores. Aggregate internal bandwidth exceeds 2 TB/s with more than 2,500 DPUs per server (Gómez-Luna et al., 2021, Gómez-Luna et al., 2021); a minimal host-side offload sketch follows this list.
  • Novel substrates: QCA-based PIM, as realized in the Akers logic array, demonstrates nanoscale, ultra-low-power PIM concepts (Chougule et al., 2016).
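
The typical UPMEM offload flow (allocate DPUs, load a DPU binary, copy inputs in, launch, copy results back) can be sketched with the SDK's host-side dpu.h API. The binary path "./dpu_kernel" and the symbols "input"/"output" are hypothetical placeholders that a matching DPU-side program would have to define; a single DPU is allocated for brevity, whereas real deployments use thousands.

```c
/* Minimal UPMEM host-side offload sketch: allocate a DPU, load a kernel,
 * copy input, launch synchronously, and copy the result back.
 * "./dpu_kernel", "input", and "output" are hypothetical; a matching
 * DPU-side program must define these symbols. Real deployments allocate
 * many DPUs and use the prepare/push transfer API for per-DPU data. */
#include <dpu.h>
#include <stdint.h>
#include <stdio.h>

#define NR_ELEMS 1024

int main(void) {
    struct dpu_set_t set;
    uint32_t in[NR_ELEMS], out[NR_ELEMS];
    for (uint32_t i = 0; i < NR_ELEMS; i++) in[i] = i;

    DPU_ASSERT(dpu_alloc(1, NULL, &set));               /* allocate 1 DPU  */
    DPU_ASSERT(dpu_load(set, "./dpu_kernel", NULL));    /* load DPU binary */

    DPU_ASSERT(dpu_copy_to(set, "input", 0, in, sizeof(in)));
    DPU_ASSERT(dpu_launch(set, DPU_SYNCHRONOUS));       /* run the kernel  */
    DPU_ASSERT(dpu_copy_from(set, "output", 0, out, sizeof(out)));

    printf("out[0]=%u out[%d]=%u\n", out[0], NR_ELEMS - 1, out[NR_ELEMS - 1]);
    DPU_ASSERT(dpu_free(set));
    return 0;
}
```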

3. Programming Models, Offloading Methodologies, and Runtime Systems

PIM requires explicit or compiler-guided workload mapping, synchronization mechanisms, and consistency primitives:

  • Granularity: Offloading ranges from single instructions (PEI), bulk primitives (RowClone, Ambit), function-level (pragma-annotated offload, e.g., TensorFlow Lite), to application-level (standalone PIM appliances) (Ghose et al., 2019).
  • Static tools (A³PIM): A static code analyzer partitions code regions by memory intensity, arithmetic intensity, and inter-region connectivity, achieving up to 7.14× speedup over CPU-only execution (Jiang et al., 23 Feb 2024); an illustrative offload heuristic is sketched after this list.
  • Software transactional memory (PIM-STM): Rich software transactional memory abstraction tailored to PIM DPUs demonstrates up to 14.5× speedup and up to 5× energy efficiency on UPMEM hardware compared to CPU-STM (Lopes et al., 17 Jan 2024).
  • Programming frameworks (SimplePIM): Building on insights from the PrIM benchmarks, SimplePIM exposes distributed arrays, iterator-style APIs, and collective patterns (reductions, gathers, host–DPU communication), offering up to 1.43× speedup over expert-tuned baselines with up to a sixfold reduction in code complexity (Chen et al., 2023).
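
As a concrete illustration of intensity-based offloading decisions, the sketch below scores code regions by memory traffic and arithmetic intensity and routes memory-bound, compute-light regions to PIM. The metrics and thresholds are hypothetical simplifications, not the published A³PIM model.

```c
/* Illustrative offload-decision heuristic in the spirit of static PIM
 * offloaders such as A³PIM. Thresholds and metrics are hypothetical. */
#include <stdio.h>

struct region {
    const char *name;
    double bytes_per_op;     /* memory traffic per executed operation */
    double flops_per_byte;   /* arithmetic intensity (roofline-style) */
};

enum target { TARGET_CPU, TARGET_PIM };

static enum target place(const struct region *r) {
    /* Memory-bound and arithmetically light -> offload to PIM. */
    if (r->bytes_per_op > 4.0 && r->flops_per_byte < 0.25)
        return TARGET_PIM;
    return TARGET_CPU;
}

int main(void) {
    struct region regions[] = {
        { "stream_copy", 8.0, 0.0 },   /* pure data movement         */
        { "spmv",        6.0, 0.2 },   /* sparse, memory-bound       */
        { "dense_gemm",  0.5, 8.0 },   /* compute-bound, keep on CPU */
    };
    for (unsigned i = 0; i < sizeof(regions) / sizeof(regions[0]); i++)
        printf("%-12s -> %s\n", regions[i].name,
               place(&regions[i]) == TARGET_PIM ? "PIM" : "CPU");
    return 0;
}
```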

4. Systems-Level Considerations: Data Movement, Locality, Memory Management

Effective use of PIM hinges on minimizing off-chip transfers, exploiting data locality, and managing memory coherency and address spaces:

  • Host-to-PIM/DRAM–PIM data movement: In commercial systems, dedicated data-copy engines (PIM-MMU) substantially accelerate DRAM↔PIM transfers, improving transfer throughput and energy efficiency by an average of 4.1× and yielding a 2.2× end-to-end speedup for PrIM workloads on real PIM hardware (Lee et al., 10 Sep 2024).
  • Locality-aware data migration: DL-PIM augments 3D-stacked PIM with dynamic per-vault migration and indirection tables, reducing per-request average latency by 54% (HMC) and increasing IPC by up to 15% for high-reuse workloads. Adaptive controller policies cap bandwidth overhead and prevent performance degradation in low-locality scenarios (Tian et al., 9 Oct 2025).
  • Skew resistance and load balancing: Architectural mechanisms such as PIM-tree achieve up to 69.7× higher throughput than prior PIM index designs by switching between push (PIM-executed) and pull (host-executed) queries depending on detected skew, thereby maintaining balanced load under worst-case key distributions (Kang et al., 2022); a simplified push/pull routing sketch follows this list.
  • Memory coherence and page management: Optimistic, signature-based coherence (CoNDA) and region-based page tables (IMPICA) are necessary to maintain correctness without incurring massive coherence traffic (Ghose et al., 2019).
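
The push/pull idea behind skew-resistant designs such as PIM-tree can be illustrated with a simple router: keys that exceed a hotness threshold are served on the host (pull), while cold keys are pushed to the PIM module that owns them. The hash, counters, and threshold below are hypothetical simplifications, not the actual PIM-tree mechanism.

```c
/* Illustrative push/pull routing for skewed index queries. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NR_PIM_MODULES 16
#define HOT_THRESHOLD  8     /* accesses per window before a key is "hot" */
#define TABLE_SIZE     1024

static uint32_t hits[TABLE_SIZE];   /* per-bucket access counters */

static uint32_t bucket(uint64_t key) {
    return (uint32_t)((key * 2654435761u) % TABLE_SIZE);
}

/* Returns the PIM module to push to, or -1 if the host should serve (pull). */
static int route(uint64_t key) {
    uint32_t b = bucket(key);
    if (++hits[b] >= HOT_THRESHOLD)
        return -1;                          /* hot key: serve on host      */
    return (int)(key % NR_PIM_MODULES);     /* cold key: push to its owner */
}

int main(void) {
    memset(hits, 0, sizeof(hits));
    /* Skewed stream: key 42 dominates, others appear once. */
    uint64_t stream[] = { 42, 7, 42, 42, 99, 42, 42, 42, 42, 42, 5, 42 };
    for (unsigned i = 0; i < sizeof(stream) / sizeof(stream[0]); i++) {
        int target = route(stream[i]);
        if (target < 0)
            printf("key %llu -> host (pull)\n", (unsigned long long)stream[i]);
        else
            printf("key %llu -> PIM module %d (push)\n",
                   (unsigned long long)stream[i], target);
    }
    return 0;
}
```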

5. Empirical Performance, Energy Efficiency, and Application Domains

PIM systems, across multiple substrates, consistently demonstrate order-of-magnitude energy savings and speedup for selected data-intensive kernels:

  • Large-scale ML and DNN Inference: PIM-DRAM achieves up to 19.5× speedup over a Titan Xp GPU for convolutional layers by integrating bitwise multiplication primitives and accumulation in commodity DRAM with negligible (sub-1%) area overhead (Roy et al., 2021). UPMEM PIM accelerates large-scale SGD, obtaining 1.94×–10.65× speedup (and substantial energy reduction) vs. CPU/GPU on memory-bound kernels (Rhyner et al., 10 Apr 2024).
  • Database analytics: Bulk-bitwise PIM in RRAM crossbar arrays yields 56×–608× end-to-end speedup on TPC-H queries and up to 18.6× energy reduction (Perach et al., 2022).
  • General PrIM benchmarks: On a 2,556-DPU UPMEM system, 10/16 PrIM kernels beat a Titan V GPU by an average of 2.54× (and Xeon by 23.2×) under memory-bound, low-arithmetic-precision workloads (Gómez-Luna et al., 2021, Gómez-Luna et al., 2021).
  • Graph analytics and filtering: Logic-layer PIM (Tesseract, HMC) sustains double-digit speedups and 80%+ energy cuts for BFS, PageRank, and filtering both in commercial and academic prototypes (Mutlu et al., 2020, Mutlu et al., 2020).
  • Programming productivity: SimplePIM demonstrates that high-level iterator APIs with runtime-optimized data movement and collective patterns can bring PIM performance within range of (or above) expert hand-tuned implementations on modern hardware (Chen et al., 2023).

6. Ongoing Challenges, Limitations, and Future Directions

Despite demonstrated gains, PIM adoption and effectiveness are subject to several constraints:

  • Limited compute flexibility and instruction support: Bitwise and simple integer operations are efficient natively; floating-point and complex arithmetic often incur heavy emulation penalties (Gómez-Luna et al., 2021). SRAM-PIM advances such as DB-PIM with value-/bit-level sparsity pruning raise macro utilization, but full generality is not achieved (Duan et al., 25 May 2025).
  • Inter-node/inter-core communication: Most PIM substrates lack a true low-latency on-die interconnect for aggregation or synchronization, which notably limits the scaling of distributed ML algorithms (Rhyner et al., 10 Apr 2024).
  • Software toolchains and programming models: The absence of unified, portable APIs and runtime abstractions (language-level, compiler support, debug tools) remains a major barrier. Emerging programming frameworks (SimplePIM, PrIM) and toolflows (A³PIM, DAMOV) partially address this, but full-system support remains underdeveloped (Chen et al., 2023, Oliveira et al., 2022).
  • Coherence and virtual memory integration: Efficient, scalable cache coherence and zero-copy data sharing between PIM and host processors remain unresolved challenges. Hardware/software co-design of PIM-aware MMUs (PIM-MMU) and relaxed coherence (CoNDA, LazyPIM) are active areas (Lee et al., 10 Sep 2024, Ghose et al., 2019).
  • Security and isolation: PIM raises distinct concerns in isolation, error protection, and side-channel attack surface (e.g., RowHammer, NVM endurance) (Mutlu et al., 2020, Oliveira et al., 2022).
  • Scaling to new domains: Future directions include hybrid logic+analog PIM, integration with emerging NVMs, support for direct atomic operations and collective communication primitives, and explicit hardware/software co-design for important algorithmic domains (decentralized ML, transactional memory, fairness-aware data migration) (Duan et al., 25 May 2025, Oliveira et al., 2022, Tian et al., 9 Oct 2025).

PIM is transitioning from a purely research concept to practical deployments, with empirical evidence for substantial improvements in system throughput and efficiency for data-centric workloads. Recognition of its limitations—both in scope and generality—fuels targeted research in hardware, programming infrastructure, operating systems, and security to fully realize the benefits of memory-centric, data-driven computing (Mutlu et al., 2020, Gómez-Luna et al., 2021, Oliveira et al., 2022).
