
Processing-In-Memory Devices

Updated 26 November 2025
  • Processing-In-Memory (PIM) devices are architectures that integrate computation within or near memory to reduce data movement overheads.
  • They leverage in-DRAM operations, analog crossbars, and specialized logic to accelerate bitwise and matrix computations in memory-bound workloads.
  • PIM devices offer energy-efficient neural inference and high-throughput analytics, though challenges in scalability, standardization, and integration persist.

Processing-In-Memory (PIM) Devices comprise a class of hardware architectures in which computational primitives are physically integrated within or adjacent to memory arrays, enabling processing to be performed where the data resides. The motivation is to overcome the bandwidth and energy inefficiencies imposed by the separation of processor and memory in the conventional von Neumann model, which manifests as the so-called "memory wall" in memory-bound workloads. PIM approaches exploit advancements in memory device physics, circuit integration, and architectural design to execute data-parallel operations—sometimes even vector-matrix multiplications and full digital logic—directly within DRAM, SRAM, emerging non-volatile memories, or allied logic layers. This article provides an in-depth account of PIM device taxonomy, circuit techniques, system-level integration, software and programming abstractions, and benchmarks, as established by recent research and experimental prototypes.

1. PIM Device Taxonomy and Architectural Paradigms

PIM devices are primarily classified by the locus and nature of compute capability:

  • Processing-Using-Memory (PuM): Computation occurs via direct exploitation of the electrical or analog properties of memory cells and their peripheral circuits. Exemplars include in-DRAM bitwise logic (AND, OR, NOT) using triple-row activation (Ambit), row/column copy via back-to-back activations (RowClone), and analog resistive RAM crossbars implementing vector-matrix multiplication via Kirchhoff’s law (Mutlu et al., 2020, Oliveira et al., 2022, Chakraborty et al., 15 Sep 2025). A behavioral sketch of triple-row activation appears after this list.
  • Processing-Near-Memory (PnM): Dedicated logic (e.g., small RISC cores, SIMD accelerators, application-specific units) is co-located with memory, frequently in the logic tier of a 3D-stacked DRAM (HBM/HMC), or integrated on-DIMM (UPMEM). These processing elements access memory banks via low-latency, high-bandwidth internal channels, typically through through-silicon vias or in-package buses (Gómez-Luna et al., 2021, Gómez-Luna et al., 2021, Mutlu et al., 2020).
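
The following is a minimal behavioral sketch (in Python, not a circuit model) of the triple-row-activation mechanism mentioned in the PuM bullet above: simultaneously activating three rows settles each bitline to the per-bit majority of the three cells, and presetting one row to all-0s or all-1s specializes the majority into bitwise AND or OR. The function names and list-of-bits representation are illustrative choices, not part of any published interface.

```python
from typing import List

Row = List[int]  # one DRAM row as a vector of 0/1 bits

def triple_row_activate(a: Row, b: Row, c: Row) -> Row:
    """Charge sharing across three simultaneously activated rows settles each
    bitline to the per-bit majority of the three cells (MAJ3)."""
    return [1 if (x + y + z) >= 2 else 0 for x, y, z in zip(a, b, c)]

def in_dram_and(a: Row, b: Row) -> Row:
    # Control row preset to all zeros: MAJ3(a, b, 0) == a AND b
    return triple_row_activate(a, b, [0] * len(a))

def in_dram_or(a: Row, b: Row) -> Row:
    # Control row preset to all ones: MAJ3(a, b, 1) == a OR b
    return triple_row_activate(a, b, [1] * len(a))

if __name__ == "__main__":
    a, b = [1, 1, 0, 0], [1, 0, 1, 0]
    assert in_dram_and(a, b) == [1, 0, 0, 0]
    assert in_dram_or(a, b) == [1, 1, 1, 0]
```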

Recent hybrid architectures introduce nonvolatile memories (MRAM, ReRAM) either as computational substrate or as high-density storage layers. Heterogeneous-Hybrid PIM (HH-PIM) combines both MRAM and SRAM banks, distributed among high-performance and low-power modules, with dynamic data placement optimizing energy and latency (Jeon et al., 2 Apr 2025).
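As a rough illustration of such data placement, the sketch below greedily assigns the most frequently accessed blocks to a capacity-limited fast (SRAM) module and spills the remainder to MRAM. The block abstraction, ranking metric, and greedy policy are assumptions for illustration only and do not reproduce the HH-PIM placement algorithm.

```python
from dataclasses import dataclass

@dataclass
class Block:
    name: str
    size: int      # bytes
    accesses: int  # expected accesses over the next interval

def place_blocks(blocks, sram_capacity):
    """Greedy placement: the hottest bytes go to the fast SRAM module,
    the rest spill to the dense, low-leakage MRAM module."""
    placement, free = {}, sram_capacity
    # Rank by access density so hot, small blocks win the SRAM capacity first.
    for blk in sorted(blocks, key=lambda b: b.accesses / b.size, reverse=True):
        if blk.size <= free:
            placement[blk.name] = "SRAM"
            free -= blk.size
        else:
            placement[blk.name] = "MRAM"
    return placement

if __name__ == "__main__":
    blocks = [Block("weights", 4096, 10), Block("acts", 1024, 500), Block("scratch", 512, 300)]
    print(place_blocks(blocks, sram_capacity=2048))
```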

2. Device and Circuit Techniques

PIM devices employ a diverse array of underlying circuit techniques to execute compute primitives:

  • DRAM-based PuM: Ambit and related designs utilize sense amplifier dynamics and charge sharing via simultaneous multi-row activation to realize majority, AND/OR, and NOT gates in-situ. The operation's throughput is dictated by DRAM timing constraints (e.g., $t_{RCD}$, $t_{RAS}$) and internal bank parallelism, yielding internal bitwise bandwidth far exceeding the off-chip interface (Mutlu et al., 2020, Mutlu et al., 2019).
  • Nonvolatile Memory Crossbars: RRAM and PCM arrays implement analog current summing (vector-matrix multiplication; an idealized numerical sketch appears after this list), or, in the digital regime, stateful logic (IMPLY, MAGIC, FELIX). For instance, memristive PIM arrays execute broadcasting/shifting and multi-stage carry-save multiplication for word-level arithmetic in $O(N \log N)$ cycles per $N$-bit operand, leveraging partitioned crossbars and dynamic in-row wiring (Leitersdorf et al., 2021, Eliahu et al., 2022).
  • SRAM-based PIM: Recent designs augment standard 6T-SRAM with additional resistive elements (6T-2R) to enable analog MACs directly in cache arrays, using power-rail summing buses and gated ground for isolation, while preserving standard read/write pathways (Chakraborty et al., 15 Sep 2025, Duan et al., 25 May 2025). Digital approaches implement local logic (e.g., AND, OR, adders) within the crossbar structure and use skip logic to exploit both value-level and bit-level sparsity in stored weights.
  • Peripheral Optimization: RRAM-based analog accelerators are often limited by the overhead of high-resolution ADCs and DACs. Innovations such as neural approximation of shift-and-add networks and ADCs minimize conversion frequency and energy while maintaining accuracy (Cao et al., 2022).
  • Advanced Bank- and Array-Level Routing: Shared-PIM augments DRAM subarrays with shared rows and bank-wide buses, allowing concurrent computation and data movement, thus minimizing pipeline stalls due to copy operations (Mamdouh et al., 28 Aug 2024).
  • Quantum-dot Cellular Automata (QCA): Implement Akers’ array logic at the nanoscale by leveraging quantum-dot polarization for both computation and storage within the same cell footprint (Chougule et al., 2016).
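
The idealized model below (referenced from the nonvolatile-crossbar bullet above) illustrates the analog compute primitive shared by many of these designs: each column current accumulates the input voltages weighted by cell conductances, and an ADC quantizes the result. Device non-idealities such as IR drop, variability, and drift are deliberately omitted, and the array size and ADC resolution are arbitrary assumptions.

```python
import numpy as np

def crossbar_vmm(voltages, conductances, adc_bits=8):
    """voltages: shape (rows,); conductances: shape (rows, cols).
    Returns ADC-quantized column currents I_j = sum_i V_i * G_ij."""
    currents = voltages @ conductances                  # Kirchhoff current summing per column
    full_scale = float(np.abs(currents).max()) or 1.0   # assume the ADC range tracks the data
    levels = 2 ** adc_bits - 1
    quantized = np.round(currents / full_scale * levels) / levels * full_scale
    return quantized

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v = rng.uniform(0.0, 1.0, size=64)            # word-line voltages (inputs)
    g = rng.uniform(0.0, 1e-4, size=(64, 16))     # cell conductances (weights)
    print(crossbar_vmm(v, g)[:4])
```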

3. System Integration, Scaling, and Limitations

Integrating PIM capabilities at system scale presents multiple challenges and design trade-offs:

  • Area and Power: Analog and digital PIM often incur area overhead due to peripherals (e.g., ADCs, specialized inter-bank buses). For instance, in 6T-2R SRAM PIM, SAR ADCs can account for ≈70% of macro area and form the primary bottleneck on achievable throughput (Chakraborty et al., 15 Sep 2025). Shared-PIM yields a 7.16% area overhead for full bank-level copy-compute concurrency (Mamdouh et al., 28 Aug 2024).
  • Energy and Throughput: State-of-the-art devices achieve energy efficiencies up to hundreds of TOPS/W (6T-2R NVM-in-Cache: 491.78 TOPS/W at 0.4 TOPS peak throughput) (Chakraborty et al., 15 Sep 2025). RRAM-based Neural-PIM achieves >5× energy and >3× throughput improvement over predecessors by reducing A/D conversion overhead (Cao et al., 2022).
  • Scalability: Performance and energy efficiency scale with memory depth and kernel size, but are ultimately constrained by analog-to-digital conversion bottlenecks, device programming energy (e.g., RRAM write voltage requirements), and data locality. Higher-precision arithmetic or increased bank utilization requires larger or faster peripherals.
  • Programmability and Memory Management: Distributed PIM architectures (e.g., UPMEM’s bank-coupled DPUs) highlight the necessity of system support for memory allocation, address mapping, and data transfer acceleration (e.g., PIM-MMU for efficient host↔PIM data movement) (Lee et al., 10 Sep 2024, Lee et al., 19 May 2025).
  • Data Locality and Placement: Latency in 3D-stacked arrays with disaggregated compute is often dominated by network transfer and remote bank queuing. Adaptive data migration (DL-PIM) can reduce average memory latency by ≈50% and improve overall speedup for high-reuse workloads by relocating hot blocks to local compute regions, subject to hardware programmable indirection (Tian et al., 9 Oct 2025).
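
The toy model below sketches the general idea of migration-by-indirection described in the last bullet: a directory counts accesses per block and, once a block shows sustained reuse from a remote bank, updates an indirection entry so that subsequent accesses are served locally. The threshold, data structures, and naming are illustrative assumptions rather than the DL-PIM mechanism itself.

```python
from collections import defaultdict

class IndirectionDirectory:
    """Per-block indirection: remembers where each block currently lives and
    migrates blocks that are repeatedly accessed from a remote bank."""

    def __init__(self, hot_threshold=8):
        self.location = {}              # block address -> bank holding it
        self.counts = defaultdict(int)  # block address -> recent access count
        self.hot_threshold = hot_threshold

    def access(self, block, home_bank, requester_bank):
        bank = self.location.get(block, home_bank)
        self.counts[block] += 1
        # Migrate once the block shows sustained remote reuse.
        if bank != requester_bank and self.counts[block] >= self.hot_threshold:
            self.location[block] = requester_bank   # update the indirection entry
            self.counts[block] = 0
            bank = requester_bank
        return bank

if __name__ == "__main__":
    directory = IndirectionDirectory(hot_threshold=3)
    for _ in range(5):
        served_from = directory.access(block=0x40, home_bank=7, requester_bank=2)
    print(served_from)  # after enough reuse the block is served from bank 2
```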

4. Algorithms, Programming Models, and Toolchains

The practical realization and adoption of PIM depend on matched software abstractions, compiler infrastructure, and workload adaptation:

  • Algorithm–Architecture Co-design: Efficient DNN inference in digital SRAM-PIM is enabled via hybrid-grained pruning and CSD-dyadic block encoding, maximizing effective sparsity, skipping unnecessary computations, and achieving up to 8× speedup and 85% energy savings for representative neural workloads (Duan et al., 25 May 2025). A toy illustration of such skip logic appears after this list.
  • Programming Models: Frameworks such as SimplePIM raise productivity by abstracting data partitioning, DRAM/MRAM transfers, collective communication, and kernel invocation behind high-level host APIs and handle-based kernel launches, reducing code size by up to 83% relative to hand-optimized implementations while matching or improving performance (Chen et al., 2023).
  • ISAs and Compilation: abstractPIM proposes a technology-independent intermediate representation, with per-technology microcode mapping for flexible deployment across diverse logic families (MAGIC, IMPLY, CRS). This decouples software compatibility from PIM array hardware, though at some execution cycle penalty (Eliahu et al., 2022).
  • OS and Memory Management: Custom allocators (PIM-malloc) utilize distributed per-core heap metadata and hardware-accelerated caches, attaining >66× faster dynamic memory allocation and 28× boost in dynamic graph workloads (Lee et al., 19 May 2025). Efficient MMUs (PIM-MMU) and dual-mode address mapping (HetMap) recover nearly all of the host-side memory-level parallelism lost due to static partitioning (Lee et al., 10 Sep 2024).
  • Coherence and Consistency: Conventional fine-grained, high-frequency cache coherence is impractical for PIM because of the off-chip traffic it generates. Speculative transactional approaches (LazyPIM) instead use compressed (Bloom-filter) coherence signatures, checking for conflicts only at transaction boundaries and enabling 19.6% average application speedup and 30.9% reduction in off-chip coherence traffic relative to prior best schemes (Boroumand et al., 2017).
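
To make the signature idea concrete, the sketch below implements a toy Bloom-filter "coherence signature": each side inserts the cache-line addresses it touched, and a commit-time check flags (possibly false-positive) conflicts. The hash function, signature size, and conflict test are illustrative assumptions, not the LazyPIM implementation.

```python
import hashlib

class Signature:
    """Toy Bloom-filter signature over cache-line addresses."""

    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.field = 0  # bit field stored as a Python int

    def _positions(self, addr):
        for i in range(self.hashes):
            digest = hashlib.blake2b(f"{addr}:{i}".encode(), digest_size=4).digest()
            yield int.from_bytes(digest, "little") % self.bits

    def add(self, addr):
        for p in self._positions(addr):
            self.field |= 1 << p

    def may_contain(self, addr):
        return all((self.field >> p) & 1 for p in self._positions(addr))

def conflicts(pim_write_sig, cpu_read_addrs):
    # A (possibly false-positive) conflict: some CPU-read line may be in the
    # PIM write signature; a real system would roll back and retry the commit.
    return any(pim_write_sig.may_contain(a) for a in cpu_read_addrs)

if __name__ == "__main__":
    sig = Signature()
    sig.add(0x1000)
    print(conflicts(sig, [0x2000, 0x1000]))  # True: address 0x1000 overlaps
```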
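
Similarly, the hybrid-grained pruning item above relies on skip logic that avoids issuing work for zero-weight regions; the toy dot product below skips whole all-zero weight blocks before any multiply-accumulate is performed. The block size and the plain zero-block test are assumptions for illustration and do not reproduce the CSD-dyadic encoding.

```python
import numpy as np

def blockwise_sparse_dot(activations, weights, block=8):
    """Dot product that skips all-zero weight blocks (value-level sparsity)."""
    acc = 0.0
    for start in range(0, len(weights), block):
        w = weights[start:start + block]
        if not w.any():          # skip logic: no MACs issued for this block
            continue
        acc += float(np.dot(activations[start:start + block], w))
    return acc

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    a = rng.standard_normal(64)
    w = rng.standard_normal(64)
    w[8:40] = 0.0                # a pruned (all-zero) region of the weights
    assert np.isclose(blockwise_sparse_dot(a, w), float(np.dot(a, w)))
    print(blockwise_sparse_dot(a, w))
```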

5. Performance and Workload Benchmarks

Evaluation on both synthetic benchmarks and real workload suites demonstrates the unique strengths and boundaries of PIM architectures:

  • DRAM-PIM and Digital Crossbars: Bulk operations (bitwise logic, row-copy) on DRAM-PIM deliver >35× energy and >40× throughput improvements over CPU baselines. Digital PIM outperforms GPU for memory-bound arithmetic but falls behind as data reuse or MAC intensity increases (Leitersdorf et al., 2023, Mutlu et al., 2020).
  • Emerging NVM PIM: 6T-2R NVM-in-Cache matches FP32 inference accuracy on ResNet-18 within 0.57% and achieves 4.37 TOPS/mm² compute density (Chakraborty et al., 15 Sep 2025). RRAM Neural-PIM architectures exhibit negligible accuracy loss on deep networks while reducing A/D conversion count and energy (Cao et al., 2022).
  • Commercial Systems: UPMEM’s DPUs with bank-local DRAM deliver up to 23× CPU and 2.5× GPU speedup on memory-bound workloads, with energy savings up to 5× on select benchmarks. Performance is optimal for local, compute-bound kernels with minimal inter-DPU synchronization (Gómez-Luna et al., 2021, Gómez-Luna et al., 2021).
  • Applications: PIM excels at low-data-reuse, high-bandwidth workloads: streaming analytics, graph traversals, bitwise database queries, neural inference with structured sparsity, and dynamic graph updates. Compute-intensive, high-reuse, or tightly coupled workloads, such as dense linear algebra, remain poorly suited to current digital PIM approaches.

6. Adoption Challenges and Future Directions

Despite substantial advances, several critical issues moderate widespread PIM adoption:

  • Device Reliability and Scaling: Analog and NVM-based PIM architectures must counter variability, drift, and physical disturbance (e.g., RowHammer), particularly as geometries scale (Mutlu et al., 2020).
  • Interface and Standardization: Absence of unified PIM ISAs, cache coherence protocols, and heterogeneous memory addressability impedes integration into conventional memory hierarchies (Oliveira et al., 2022).
  • Toolchain Maturity: Robust hardware–software co-design, with compiler-level orchestration of PIM offload, data layout, and kernel transformation, remains under development. Existing toolchains (e.g., DAMOV, Ramulator-PIM, PrIM) and standard benchmarks are foundational enablers (Oliveira et al., 2022, Gómez-Luna et al., 2021).
  • Virtualization and Security: PIM in shared and multi-tenant settings requires extensions for protected memory domains, secure virtualization, and side-channel mitigation (Oliveira et al., 2022).

Looking ahead, key directions include further analog peripheral co-design, aggressive exploitation of sparsity and data placement, extension to broader classes of in-memory analytics and AI workloads, development of cross-technology compilation flows, and integration with near-data accelerators in multi-die systems. The emergence of drop-in PIM extensions to commodity SRAM and DRAM, heterogeneous hybrid NVM/volatile memory arrays, and advanced bank/interconnect topologies anticipates a transition toward memory-centric, workload-adaptive computing architectures.

