Processing-In-Memory (PIM) Overview

Updated 11 May 2026
  • Processing-In-Memory (PIM) is a computing paradigm that integrates computation into memory arrays to reduce data transfers and mitigate the memory wall.
  • It employs methods like Processing-Near-Memory and Processing-Using-Memory to perform in-place computing, yielding significant speedups and energy savings.
  • PIM enables advances in deep learning, database analytics, and graph processing by enhancing throughput and reducing energy consumption across various workloads.

Processing-In-Memory (PIM) is a computer architecture paradigm that integrates computational capabilities directly within or near memory arrays to minimize data movement between processing units and memory. PIM architectures target a fundamental inefficiency of von Neumann systems, termed the “memory wall”: the energy and latency of transporting data over narrow interfaces dominate many workloads, especially those that are memory-bound and data-intensive. By embedding computation at or near the place where data is stored, PIM changes how throughput, energy efficiency, and system performance scale, making it central to current research in systems architecture, circuit design, and applications across deep learning, database analytics, and scientific computing (Mutlu et al., 2020, Ghose et al., 2019, Duan et al., 25 May 2025).

1. Motivations and Architectural Principles

Data movement costs dominate modern systems. Empirical analyses demonstrate that transferring a 64 B cache line from DRAM to a CPU may consume several hundred picojoules, while arithmetic on that data in the CPU requires orders of magnitude less energy (Oliveira et al., 2022, Mutlu et al., 2020). In high-throughput and data-centric domains (e.g., machine learning, graph analytics, data warehousing), shuttling data between physically separated compute and memory subsystems imposes prohibitive energy and latency penalties, accounting for up to 80–90% of system energy (Ghose et al., 2019).

PIM architectures challenge this separation by placing compute logic near or inside the memory itself. Two main approaches exist:

  1. Processing-Near-Memory (PNM): Co-locates programmable logic (in-order cores or accelerators) with DRAM via 3D stacking or package-level integration. Example platforms include logic layers in 3D-stacked DRAM (HBM, HMC) and commercial systems such as UPMEM DIMMs (Mutlu et al., 2020, Oliveira et al., 2022, Mutlu et al., 2019).
  2. Processing-Using-Memory (PuM): Performs computation by exploiting the analog/electrical properties of the memory array itself, executing logic at the cell or subarray level using specialized timing or sense amplifier tricks (e.g., RowClone, Ambit, SIMDRAM) (Mutlu et al., 2020, Oliveira et al., 2022, Mutlu et al., 2019).

The computational model of PIM replaces traditional movement of data-to-compute with the principle “compute-where-the-data-lives,” essentially transforming memory into an active substrate.

2. Canonical PIM Mechanisms and Substrates

A deep taxonomy of PIM designs includes:

| PIM Class | Mechanism | Example Substrates |
|---|---|---|
| Processing-Near-Memory (PNM) | Logic layer in 3D DRAM | HMC/HBM, UPMEM |
| Processing-Using-Memory (PuM) | DRAM cell/sense amp analog | Ambit, RowClone |
| Bit-serial SRAM/Nonvolatile PIM | Compute-enabled SRAM/RRAM | PIMSAB, DB-PIM |

PNM achieves high effective bandwidth by leveraging wide internal interfaces (TSVs) and integrating programmable cores or fixed-function accelerators (Mutlu et al., 2020, Oliveira et al., 2022). For instance, Tesseract provides graph-processing engines per vault, and UPMEM exposes thousands of DPUs (Data Processing Units) per system, with local memory and simple RISC-like cores (Mutlu et al., 2019, Barkhordar et al., 9 Feb 2026).

PuM exploits physical effects. RowClone performs in-DRAM page copying using back-to-back activations. Ambit implements bulk bitwise AND/OR via triple-row activation, which computes a bitwise majority function at the subarray level, and implements NOT using dual-contact cells. Bulk-bitwise operations are conducted in parallel across thousands of bits, delivering very high throughput and marked energy efficiency improvements (e.g., 44× speedup and 35× energy reduction over CPU for Ambit) (Mutlu et al., 2019, Mutlu et al., 2020, Perach et al., 2022).
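The majority-based logic described above can be illustrated with a small software model. This is only a sketch: rows are modeled as Python integers, `ROW_BITS` is an illustrative width, and the function names are hypothetical rather than part of any Ambit interface. It shows that a three-input majority with one control row fixed to all zeros or all ones yields AND or OR.

```python
ROW_BITS = 16                      # illustrative; real DRAM rows hold thousands of bits
ALL_ZEROS = 0
ALL_ONES = (1 << ROW_BITS) - 1


def maj3(a: int, b: int, c: int) -> int:
    """Bitwise majority of three rows, the function a triple-row activation computes."""
    return (a & b) | (b & c) | (a & c)


def pim_and(a: int, b: int) -> int:
    """AND: majority with the control row preset to all zeros."""
    return maj3(a, b, ALL_ZEROS)


def pim_or(a: int, b: int) -> int:
    """OR: majority with the control row preset to all ones."""
    return maj3(a, b, ALL_ONES)


def pim_not(a: int) -> int:
    """NOT: modeled here as a plain bitwise inversion within the row width."""
    return ~a & ALL_ONES


row_a = 0b1100_1010_1111_0000
row_b = 0b1010_0110_0000_1111
and_result = pim_and(row_a, row_b)
or_result = pim_or(row_a, row_b)
```

Every bit position is processed by the same operation, which is why the hardware version scales with row width rather than with instruction count.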

Bit-serial SRAM-PIM (e.g., PIMSAB, DB-PIM) leverages crossbar architectures and custom in-SRAM digital logic (Duan et al., 25 May 2025, Arora et al., 2023), and uses algorithm-architecture co-design for further efficiency gains, notably through aggressive sparsity management.

3. Algorithm-Architecture Co-Design and Workload Mapping

Achieving PIM’s potential involves significant co-design at both hardware and software layers. The Dyadic Block PIM (DB-PIM) approach exemplifies this trend by enforcing algorithm-driven value-level and bit-level sparsity, tightly aligning data pruning, bit encoding (Canonical Signed Digit), and custom digital SRAM macros (e.g., IPU, DBMU, CSD adder tree). Hybrid pruning combines:

  • Value-level mask $M^{val}$ achieving block-wise skip for coarse pruning.
  • Bit-level mask $M^{bit}$ targeting sparsity in encoded weights, exploiting the structured nature of real applications where zeros cluster at the value and bit levels (Duan et al., 25 May 2025).

Macros like the Input Pre-processing Unit dynamically skip all-zero input-columns, while the hierarchy of Dyadic Block Multiply/Adder units maps precisely onto the compressed sparsity structure, yielding up to 8.01× speedup and 85.28% energy savings for dense neural network workloads.
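To make the bit-level sparsity argument concrete, the sketch below converts an integer weight to Canonical Signed Digit form, where each digit is -1, 0, or +1 and no two adjacent digits are nonzero. This is the textbook CSD (non-adjacent form) conversion, not DB-PIM's hardware encoder, and the function names are hypothetical.

```python
def to_csd(n: int) -> list[int]:
    """Return the CSD digits of n, least-significant first, each in {-1, 0, 1}."""
    digits = []
    while n != 0:
        if n & 1:
            # n % 4 == 1 -> digit +1; n % 4 == 3 -> digit -1 (forces the next digit to 0)
            d = 2 - (n & 3)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def from_csd(digits: list[int]) -> int:
    """Reconstruct the integer from its CSD digits."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


def nonzero_count(digits: list[int]) -> int:
    """Number of nonzero digits, i.e. partial products a bit-serial unit must process."""
    return sum(1 for d in digits if d != 0)


weight = 119                         # binary 01110111 has six 1-bits
csd = to_csd(weight)                 # 128 - 8 - 1
binary_ones = bin(weight).count("1")
csd_nonzeros = nonzero_count(csd)
```

For this weight the nonzero digit count drops from six to three, which is the kind of reduction a bit-level mask can then exploit.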

For workloads with regular, high data-parallelism (linear algebra, convolution, reduction), mapping onto bank-level parallel PIM (typified by UPMEM, HMC/HBM, or even crossbar RRAM) enables near-optimal exploitation of on-die memory bandwidth (Barkhordar et al., 9 Feb 2026, Arora et al., 2023).
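A minimal sketch of such a mapping, assuming a dense matrix-vector product whose rows are split into contiguous blocks, one per bank. The loop over banks stands in for hardware that would run them concurrently, and all names (`partition_rows`, `bank_gemv`, `pim_gemv`) are illustrative rather than taken from any PIM SDK.

```python
def partition_rows(n_rows: int, n_banks: int) -> list[tuple[int, int]]:
    """Split n_rows into n_banks contiguous [start, end) ranges of near-equal size."""
    base, extra = divmod(n_rows, n_banks)
    bounds, start = [], 0
    for bank in range(n_banks):
        size = base + (1 if bank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def bank_gemv(rows: list[list[float]], x: list[float]) -> list[float]:
    """Work done by one bank: dot products over its locally stored rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]


def pim_gemv(matrix: list[list[float]], x: list[float], n_banks: int) -> list[float]:
    """Host view: broadcast x, let each bank compute its slice, concatenate the results."""
    y = []
    for lo, hi in partition_rows(len(matrix), n_banks):
        y.extend(bank_gemv(matrix[lo:hi], x))   # sequential here, parallel in hardware
    return y


matrix = [[1, 2], [3, 4], [5, 6]]
result = pim_gemv(matrix, [1, 1], n_banks=2)
```

Each bank only reads rows it already holds, so the only data crossing the memory interface is the input vector and the output slice.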

Database operations leverage bulk-bitwise crossbar PIMs for in-place filter, GROUP-BY, and aggregation (e.g., tree-of-bitwise adders), achieving full OLAP workloads inside the memory module (Perach et al., 2023, Perach et al., 2022). Data is often laid out in bit-sliced form to enable maximal bit-parallel exploitation and mapped physically to minimize transfer and maximize locality.
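The bit-sliced layout can be sketched in software as follows. Here each bit plane is a Python integer whose bit i belongs to record i, and an equality filter is evaluated with one bitwise operation per plane. This is an illustrative model under those assumptions, not the layout or operator set of any specific crossbar design, and the function names are hypothetical.

```python
def bit_slice(values: list[int], width: int) -> list[int]:
    """Build bit planes: bit i of planes[b] is bit b of values[i]."""
    planes = [0] * width
    for i, v in enumerate(values):
        for b in range(width):
            if (v >> b) & 1:
                planes[b] |= 1 << i
    return planes


def filter_equal(planes: list[int], constant: int, n_rows: int) -> int:
    """Return a match bitmap for `column == constant` using only bulk bitwise ops."""
    all_rows = (1 << n_rows) - 1
    match = all_rows
    for b, plane in enumerate(planes):
        if (constant >> b) & 1:
            match &= plane                  # records whose bit b is 1
        else:
            match &= ~plane & all_rows      # records whose bit b is 0
    return match


def matching_rows(match: int, n_rows: int) -> list[int]:
    """Decode the match bitmap into record indices (host-side convenience)."""
    return [i for i in range(n_rows) if (match >> i) & 1]


column = [5, 3, 5, 7, 0]
planes = bit_slice(column, width=3)
hits = matching_rows(filter_equal(planes, 5, len(column)), len(column))
```

The number of bitwise steps depends on the column width, not on the number of records, which is what makes the layout attractive for wide memory arrays.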

In ML and transactional workloads, UPMEM-like architectures enable the offloading of both compute-intensive and memory-bound phases (e.g., parallel SGD, transactional memory) by decomposing the workload into fine-grained partitions assigned to large numbers of lightweight near-memory cores (Rhyner et al., 2024, Lopes et al., 2024).

4. System-Level Software, Programming Models, and Toolchains

Adopting PIM requires a rethinking of both programming interfaces and system software:

  • Programming Models: Support for familiar abstractions (e.g., iterator-based APIs, transactional memory, offloading directives), as well as PIM-enabled instructions (PEIs) for fine-grained, semantics-preserving offload, is essential for deployment (Ghose et al., 2019, Oliveira et al., 2022, Chen et al., 2023, Lopes et al., 2024).
  • Compiler and Runtime: Automated partitioning of code into PIM and host-resident segments leverages static analysis of arithmetic intensity, parallelism, and data locality (Jiang et al., 2024). Tools like A³PIM analyze memory access patterns, port pressure, and communication costs, then statically map code to PIM or CPU while mitigating cross-segment data movement and synchronization penalties. This yields mean speedups of 2.63× vs. CPU-only and 4.45× vs. PIM-only execution (Jiang et al., 2024).
  • OS and Memory Management: Support for PIM-aware virtual memory (IMPICA), data placement, and page migration is required to maximize PIM utility and adapt to mm-scale hardware budgets and mapping constraints (Oliveira et al., 2022, Lee et al., 2024).
  • Data Transfers: Efficient management of host⇔PIM transfers is critical in commercial systems, where address space partitioning and mapping (e.g., via PIM-MMU’s hardware copy engine and scheduling; Lee et al., 2024) mitigate bottlenecks arising from conventional memory controller limitations.
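The compiler-and-runtime point above, placing code segments by arithmetic intensity and communication cost, can be illustrated with a toy roofline-style cost model. This is a sketch under stated assumptions and not the A³PIM algorithm: the `Device` and `Segment` fields, the `place` function, and all numeric parameters are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    peak_ops_per_s: float      # assumed peak compute throughput
    mem_bytes_per_s: float     # assumed bandwidth to the data it operates on


@dataclass(frozen=True)
class Segment:
    name: str
    ops: float                 # arithmetic operations in the segment
    bytes_accessed: float      # bytes read and written by the segment
    bytes_crossing: float      # bytes exchanged with the host if offloaded


def roofline_time(seg: Segment, dev: Device) -> float:
    """Lower-bound runtime: limited by either compute or memory bandwidth."""
    return max(seg.ops / dev.peak_ops_per_s, seg.bytes_accessed / dev.mem_bytes_per_s)


def place(seg: Segment, host: Device, pim: Device, link_bytes_per_s: float) -> str:
    """Pick the side with the lower estimated time, charging PIM for host<->PIM traffic."""
    t_host = roofline_time(seg, host)
    t_pim = roofline_time(seg, pim) + seg.bytes_crossing / link_bytes_per_s
    return "pim" if t_pim < t_host else "host"


# Hypothetical machine: fast host cores, high-bandwidth but slower PIM cores.
host = Device(peak_ops_per_s=1e12, mem_bytes_per_s=5e10)
pim = Device(peak_ops_per_s=1e11, mem_bytes_per_s=1e12)
streaming = Segment("scan", ops=1e9, bytes_accessed=8e9, bytes_crossing=1e6)
compute = Segment("dense_math", ops=1e12, bytes_accessed=1e8, bytes_crossing=1e6)
placements = {s.name: place(s, host, pim, link_bytes_per_s=2e10) for s in (streaming, compute)}
```

With these assumed parameters the low-intensity scan is sent to PIM and the compute-heavy segment stays on the host, which mirrors the qualitative behavior a static partitioner aims for.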

Benchmarks and simulation frameworks at device, circuit, architectural, and full-system levels (e.g., NVSim, PIMSim, gem5, PIMeval+PIMbench) underpin experimental evaluation and design exploration (Aghaei et al., 26 Nov 2025).

5. Quantitative Impact and Representative Workloads

A broad spectrum of application domains demonstrates PIM’s quantitative impact:

  • Deep Learning Inference: DB-PIM achieves speedups up to 8× and energy reductions >85%, exploiting value/bit-level sparsity for CNN layers while retaining full integer/floating-point precision (Duan et al., 25 May 2025).
  • Database Analytics: Bitwise crossbar PIMs accelerate full OLAP workloads (filter, GROUP-BY, JOIN), outperforming MonetDB by 4–8× in speed and reducing energy by 4–18× using memristive or RRAM arrays (Perach et al., 2023, Perach et al., 2022).
  • Graph Analytics: In real PIM systems (e.g., UPMEM), linear algebraic algorithms (SpMV, SpMSpV) yield 10–50× kernel-level speedup, up to 2–4× end-to-end speedup vs. best CPU baselines, and balanced energy usage, although utilization is limited by memory partitioning and communication constraints (Barkhordar et al., 9 Feb 2026).
  • ML Training: Distributed SGD and ADMM on PIM can match or surpass high-end CPUs/GPUs in memory-bound regimes, especially for quantized or fixed-point kernels, with 3× energy gains and ~8× strong scaling up to several thousand DPUs (Rhyner et al., 2024).
  • Transactional and Data-Structure Operations: PIM-STM and PIM-tree support high-throughput transactional memory and index operations, scaling to batches of millions of queries with 10–70× throughput gains under skewed access (Lopes et al., 2024, Kang et al., 2022).

6. Cross-Layer Challenges and Open Research Directions

PIM adoption confronts significant cross-layer challenges:

  • Reliability and Manufacturing: Integration of logic with DRAM or emerging memories (PCM, RRAM, MRAM, QCA) increases variability sensitivity and process challenges (Chougule et al., 2016). Some specialized technologies (QCA, for example) demonstrate compelling theoretical properties (e.g., 0.33 meV/operation, nanosecond delays), but require cryogenic operation and high-fidelity fabrication.
  • Thermal and Power Management: 3D-stacked PIM cores are subject to more severe thermal constraints due to confined die stacking; power budgets for logic are tightly limited (Mutlu et al., 2019).
  • Programming Models: The tension between composable, easy-to-adopt abstractions and expressive interfaces that expose PIM details for optimization is unresolved; DSLs, autocompilation, and PIM libraries (e.g., SimplePIM) are under active development (Chen et al., 2023).
  • Coherence and Synchronization: Scalable, low-overhead cache coherence for mixed CPU+PIM execution is required; mechanisms including region-based invalidation, signature-based conflict detection (CoNDA, LazyPIM), and decoupled consistency are being explored (Ghose et al., 2019, Oliveira et al., 2022).
  • Inter-PIM Communication: UPMEM-style DPUs lack a direct interconnect, limiting decentralized, collective computation (e.g., all-reduce); hardware networks among near-memory compute units would allow more scalable distributed algorithms (Barkhordar et al., 9 Feb 2026, Rhyner et al., 2024).
  • Simulation, Benchmarking, and Standardization: Tools for reproducible, validated evaluation across abstraction levels (device ⇄ algorithm) and standardized APIs/ISAs remain an active area of work (Aghaei et al., 26 Nov 2025).
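The inter-PIM communication limitation in the list above can be made concrete with a small model of a host-mediated all-reduce, which is what a collective looks like when near-memory cores cannot talk to each other directly. This is a plain-Python sketch with hypothetical function names; it does not use any vendor SDK, and the 4-byte element size is an assumption.

```python
def host_mediated_allreduce(dpu_buffers: list[list[int]]) -> list[list[int]]:
    """Sum per-DPU vectors on the host, then write the result back to every DPU."""
    gathered = [list(buf) for buf in dpu_buffers]          # DPU -> host copies
    reduced = [sum(column) for column in zip(*gathered)]   # reduction runs on the host
    return [list(reduced) for _ in dpu_buffers]            # host -> DPU broadcast


def transfer_bytes(n_dpus: int, vec_len: int, elem_bytes: int = 4) -> int:
    """Bytes crossing the host interface: one gather plus one broadcast per DPU."""
    return 2 * n_dpus * vec_len * elem_bytes


partial_sums = [[1, 2], [3, 4], [5, 6]]
after = host_mediated_allreduce(partial_sums)
traffic = transfer_bytes(n_dpus=3, vec_len=2)
```

The traffic term grows linearly with the number of DPUs, and all of it passes through the host, which is why this pattern can become a scaling bottleneck for iterative algorithms.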

Future research must tightly integrate algorithm–architecture co-design, develop PIM-aware compilers and runtimes, and address manufacturability and reliability for emerging memory technologies. The overall direction points to a shift from processor-centric toward memory-centric system design.

7. Broader Impact and Outlook

PIM architectures are positioned to deliver:

  • Orders-of-magnitude reductions in data-movement energy (often 10–100×).
  • 2–10× speedups for memory-bound and bandwidth-constrained workloads.
  • Architectural and software co-optimization as a foundational design principle.

Adoption will depend on robust support for programming, simulation, and system integration, coupled with continued progress in memory logic integration and cross-layer reliability.

PIM’s evolution marks a shift to data-centric computing, where storage arrays become active computation substrates and system organization is reimagined for memory-side intelligence (Mutlu et al., 2020, Oliveira et al., 2022, Mutlu et al., 2019, Duan et al., 25 May 2025).
