Processing-in-Memory (PIM) Logic
- Processing-in-Memory (PIM) logic embeds computation within or near memory arrays to minimize data movement and boost throughput.
- It leverages two main paradigms—processing-using-memory and processing-near-memory—using technologies like DRAM charge-sharing and memristive crossbars for in-situ Boolean and arithmetic operations.
- PIM logic achieves significant energy, latency, and throughput improvements while posing challenges in device reliability, integration, and development of robust programming models.
Processing-in-Memory (PIM) logic refers to architectural, circuit, and device-level techniques that enable direct computation within or adjacent to memory arrays, minimizing data movement between memory and conventional processing units. PIM logic exploits the intrinsic properties of memory devices—whether volatile (e.g., DRAM, SRAM, eDRAM) or nonvolatile (e.g., memristors, STT-MRAM, DWM, QCA)—to embed the ability to perform Boolean and arithmetic computation either inside the memory cells ("processing-using-memory," or PUM) or in specialized logic integrated close to the memory array ("processing-near-memory," or PNM). Adoption of PIM logic has demonstrated order-of-magnitude throughput and energy improvements for data-movement-bound workloads, driving active research across the memory hierarchy, memory device physics, architecture, algorithms, and programming models (Mutlu et al., 2020, Mutlu et al., 2019, Hoffer et al., 29 Jun 2025, Leitersdorf et al., 2022).
1. Core Principles and Device Foundations
PIM logic is implemented via two fundamentally distinct approaches: processing-using-memory (PUM) and processing-near-memory (PNM).
Processing-Using-Memory (PUM):
PUM exploits analog or digital in-situ properties of the memory array to carry out Boolean logic or fixed arithmetic in-place:
- DRAM Charge-Sharing: Simultaneous activation of multiple rows in DRAM causes capacitive charge sharing on the bitline; the sense amplifier resolves a majority/AND/OR function. For example, activating three rows A, B, C yields MAJ(A, B, C) = AB + BC + CA on each bitline.
Majority behavior is harnessed for AND/OR operations based on controlled initialization of source rows: MAJ(A, B, 0) = A AND B, and MAJ(A, B, 1) = A OR B (Mutlu et al., 2020, Mutlu et al., 2019).
- Memristive/Resistive Crossbars: Each cell at position (i, j) has conductance G_ij; applying wordline voltages V_i yields a bitline current I_j = Σ_i G_ij·V_i. Stateful logic (e.g., MAGIC, IMPLY, FELIX) reuses the device for both storage and gate operations; row- or column-level partitioning accelerates parallel operations such as multiplication, popcount, or convolution (Leitersdorf et al., 2021, Leitersdorf et al., 2022).
- Gain-Cell eDRAM (GC-eDRAM): Dual-port 3T architectures enable stateful logic by using a nondestructive read and a LOGIC pulse; in-array NOR/NOT is performed via out-of-band write–read sequences, preserving information and enabling full bit-plane operations (Hoffer et al., 29 Jun 2025).
- Quantum-dot Cellular Automata (QCA), DWM: In QCA, the cell's polarization acts as data and logic. Pipelinable Akers arrays supporting Boolean primitives (F(X,Y,Z)=X¬Z+YZ) yield ultralow area/power in the sub-Kelvin regime (Chougule et al., 2016). Racetrack memories (DWM) use transverse resistance measurements to sense the aggregate bitwise state in a segment, enabling simultaneous multi-bit logic (Ollivier et al., 2021).
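The charge-sharing mechanism above can be sketched functionally. This is a minimal behavioral model, not a circuit simulation: `tra`, `bitwise_and`, and `bitwise_or` are illustrative names for the triple-row-activation primitive and its AND/OR specializations via control-row initialization.

```python
# Behavioral sketch of DRAM triple-row activation (TRA), Ambit-style:
# the sense amplifier resolves each bitline to the majority of the
# three simultaneously activated rows.

def tra(row_a, row_b, row_c):
    """Charge sharing on each bitline resolves to the bitwise
    majority of the three activated rows."""
    return [int(a + b + c >= 2) for a, b, c in zip(row_a, row_b, row_c)]

def bitwise_and(row_a, row_b, zeros):
    # MAJ(A, B, 0) = A AND B: control row pre-initialized to all 0s.
    return tra(row_a, row_b, zeros)

def bitwise_or(row_a, row_b, ones):
    # MAJ(A, B, 1) = A OR B: control row pre-initialized to all 1s.
    return tra(row_a, row_b, ones)

A = [1, 0, 1, 1]
B = [0, 0, 1, 0]
print(bitwise_and(A, B, [0] * 4))  # [0, 0, 1, 0]
print(bitwise_or(A, B, [1] * 4))   # [1, 0, 1, 1]
```

Note that every bit position computes concurrently, which is what gives TRA its row-wide bit-level parallelism.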
Processing-Near-Memory (PNM):
PNM integrates programmable or fixed-function logic within the logic layer of 3D-stacked DRAM (e.g., HMC, HBM), on-die eDRAM, or as embedded processors (e.g., UPMEM DPUs), directly accessing the high-bandwidth, low-latency internal memory channels. This model supports full RISC ALUs, SIMD units, and programmable PIM kernels (Mutlu et al., 2020, Gómez-Luna et al., 2021).
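The offload pattern behind PNM can be illustrated with a minimal host-side sketch; `offload` and the bank partitioning below are hypothetical stand-ins for the structure of DPU-style execution, not the actual UPMEM SDK API.

```python
# Conceptual sketch of the PNM offload model: the host scatters data
# across memory banks, each near-memory core runs a kernel strictly on
# its local bank, and the host reduces the per-bank partial results.

def offload(banks, kernel):
    """Each near-memory core computes on its local bank only;
    in hardware these kernel invocations run in parallel."""
    return [kernel(bank) for bank in banks]

# Example: a reduction over data partitioned across 4 banks.
data = list(range(32))
banks = [data[i::4] for i in range(4)]   # host scatters data
partials = offload(banks, sum)           # per-bank local compute
total = sum(partials)                    # host-side final reduction
print(total)  # 496
```

The key property modeled here is that each kernel touches only bank-local data, so aggregate bandwidth scales with the number of banks rather than with a single shared memory channel.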
2. Logic Primitives, Partitioning, and Parallelism
PIM logic exposes a set of Boolean gates and arithmetic operators, either as fundamental in-array operations or via composition:
| Device/Technique | Native Gates | Parallelism | Comments |
|---|---|---|---|
| DRAM TRA (Ambit) | MAJ, AND, OR, NOT | Full row width per bank | Requires triple-row activation, sense amp |
| Memristor Crossbar | NOR, IMPLY, Min₃ | Row width × partitions | MAGIC/FELIX primitives, dynamic partitioning |
| SRAM/GC-eDRAM | NOR, NOT | 64–4096 bits/subarray | GC-eDRAM uses LOGIC pulse, 3 ns latency |
| QCA (Akers Array) | XNOR, MUX, FA | Data-parallel in cell grid | 0.23 μm² XOR, ultralow energy |
| Racetrack (DWM) | Multi-input logic | Segment width per TR op | Transverse read (TRD), polynomial add/mult in-memory |
| 3D-stacked logic (PNM) | ALU, SIMD, custom | 8–2048 units per channel | In-order cores co-located with DRAM banks |
Key accelerative mechanisms include:
- Row/column partitioning: Divides the array into independently controlled sections for semi-parallel execution; e.g., 32-way partitioned memristive arrays cut multiplication and sorting latency roughly in proportion to the partition count (Leitersdorf et al., 2022, Leitersdorf et al., 2021).
- Broadcast and reduction trees: Log-depth fan-in/fan-out tree patterns for propagating values, popcount, or arithmetic combining (Leitersdorf et al., 2022).
- In-row arithmetic: Carry-save add–shift, tree popcount, and bit-parallel prefix networks for high-throughput multiply/divide/add, supporting up to 39× acceleration over serial implementations (Leitersdorf et al., 2021, Leitersdorf et al., 2022).
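The log-depth reduction idea above can be sketched in a few lines; `tree_popcount` is a hypothetical helper that models each level of the tree as a single concurrent in-array step (and assumes a power-of-two row length).

```python
# Sketch of a log-depth in-row reduction (popcount): at each step,
# disjoint pairs combine, and all pairwise adds in one step would
# execute concurrently in-array. An n-bit popcount thus takes
# O(log n) steps instead of O(n) serial additions.

def tree_popcount(bits):
    vals = list(bits)  # each cell starts holding its own bit
    steps = 0
    while len(vals) > 1:
        # one "step": all adjacent pairs combine in parallel
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
        steps += 1
    return vals[0], steps

count, steps = tree_popcount([1, 0, 1, 1, 0, 1, 0, 1])
print(count, steps)  # 5 3
```

Eight bits reduce in three steps; a 1024-bit row would need only ten, which is the source of the tree-reduction speedups cited above.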
3. System Architectures and Programming Abstractions
PIM logic is integrated at multiple system levels:
- Monolithic in-DRAM logic: Minor peripheral circuit modifications (e.g., new ACTIVATE sequences, dual-contact rows for NOT) enable PUM without area penalty, as in RowClone/Ambit (Mutlu et al., 2019, Mutlu et al., 2020).
- 3D-stacked PIM: Logic die beneath multiple DRAM dies hosts small PIM cores, often with local scratchpads, running code offloaded from the host. UPMEM's DPUs offer an in-order pipeline, 64KB WRAM per DPU, and direct access to local DRAM banks (Gómez-Luna et al., 2021).
- Memristive/Nonvolatile PIM units: Crossbars and GC-eDRAM arrays are integrated as standalone accelerators or as part of a heterogeneous memory system (Hoffer et al., 29 Jun 2025, Leitersdorf et al., 2022).
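Under the idealized linear model given earlier (I_j = Σ_i G_ij·V_i), the crossbar's analog multiply-accumulate can be sketched as a plain matrix-vector product; the conductance and voltage values below are arbitrary illustrations.

```python
# Idealized model of analog matrix-vector multiply in a memristive
# crossbar: cell (i, j) stores conductance G[i][j]; driving wordline i
# with voltage V[i] produces bitline current I[j] = sum_i G[i][j]*V[i]
# (Ohm's law plus Kirchhoff's current law on each bitline).

def crossbar_mvm(G, V):
    rows, cols = len(G), len(G[0])
    return [sum(G[i][j] * V[i] for i in range(rows)) for j in range(cols)]

G = [[1.0, 0.5],
     [0.2, 0.8]]   # cell conductances (illustrative values)
V = [0.3, 0.6]     # wordline voltages
print(crossbar_mvm(G, V))
```

Every bitline current forms in a single analog step, which is why crossbars excel at the matrix-vector products that dominate neural-network inference; real devices add nonidealities (wire resistance, device variation) that this sketch omits.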
Programming models range from explicit offload APIs and microcode flows to high-level compiler toolchains:
- SIMDRAM: Compiles high-level operations down to MAJ/NOT primitives and sequences ACTIVATE/PRECHARGE commands (Oliveira et al., 2022).
- abstractPIM: Introduces IR/microcode separation for cross-technology portability; the IR specifies technology-agnostic gate sequences, while microcode backends specialize for in-array logic (e.g., MAGIC, FELIX, IMPLY) (Eliahu et al., 2022).
- Hardware–software co-design: Dyadic block PIM (DB-PIM) marries block/bit-level pruning algorithms with specialized digital SRAM-PIM macros, maximizing parallel utilization and skipping zero blocks/columns (Duan et al., 25 May 2025).
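The SIMDRAM-style lowering can be made concrete with a small sketch, assuming only the in-DRAM MAJ and NOT primitives; `maj` and `lower_xor` are illustrative names, not SIMDRAM's actual micro-op format.

```python
# Hedged sketch of lowering a Boolean operation to in-DRAM primitives:
# any Boolean function can be composed from MAJ (majority-of-3) and NOT.

def maj(a, b, c):
    return int(a + b + c >= 2)

def lower_xor(a, b):
    """XOR composed only from MAJ and NOT:
       XOR(a, b) = AND(NOT(AND(a, b)), OR(a, b))."""
    and_ab = maj(a, b, 0)          # AND via control row of 0s
    or_ab = maj(a, b, 1)           # OR via control row of 1s
    not_and = 1 - and_ab           # NOT via dual-contact row
    return maj(not_and, or_ab, 0)  # final AND gives XOR

for a in (0, 1):
    for b in (0, 1):
        assert lower_xor(a, b) == (a ^ b)
print("XOR lowered to MAJ/NOT verified")
```

A compiler in this style emits such gate sequences as ACTIVATE/PRECHARGE command schedules, with each MAJ mapping to one triple-row activation.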
4. Quantitative Results and Case Studies
Adoption of PIM logic yields dramatic performance and efficiency improvements across a spectrum of data-centric workloads:
- Bulk copy/bitwise logic (DRAM): RowClone achieves substantially lower latency and energy for 4KB copy; Ambit delivers markedly higher throughput and lower DRAM energy on AND/OR/NOT, translating into notable application speedups in database queries (Mutlu et al., 2020, Mutlu et al., 2019).
- Memristive crossbars: MatPIM accelerates binary matrix–vector products and convolution using log-depth tree reductions (Leitersdorf et al., 2022). MultPIM reduces fixed-point multiplication latency from O(N²) to O(N log N) in the bit width, a significant speedup over RIME-style partitioned designs (Leitersdorf et al., 2021).
- GC-eDRAM: 99.5% logic-gate success at 5 μs retention, 13.5 fJ per NOR at 4.6 Mb/mm² density, outperforming 6T SRAM and 1T-1C eDRAM in both energy and throughput (Hoffer et al., 29 Jun 2025).
- Real PIM hardware (UPMEM): The PIM-tree index structure yields substantially greater throughput than prior DRAM-only skip lists, maintaining low communication and load balance under skewed queries (Kang et al., 2022). DPU-based architectures outperform Xeon CPUs and Titan V GPUs on memory-bound graph and CRUD workloads (Gómez-Luna et al., 2021).
- Arithmetic acceleration: Bit-parallel fixed- and floating-point arithmetic in memristive arrays achieves roughly an order of magnitude higher throughput and up to two orders of magnitude better energy efficiency than modern GPUs (Leitersdorf et al., 2022).
5. Challenges, Trade-offs, and Adoption
Despite the clear gains, PIM logic introduces adoption and integration challenges at multiple levels (Mutlu et al., 2020, Ghose et al., 2018, Eliahu et al., 2022, Oliveira et al., 2022):
- Device-level: Variability, endurance, and retention for NVMs (memristors, DWM, QCA, GC-eDRAM); RowHammer and ECC in DRAM; clocking, process variation in QCA.
- Control and periphery: Area-efficient partition and section decoding (partitionPIM half-gates), compressed command/control messaging, retention versus array size trade-off, and managing refresh overhead in volatile devices (Leitersdorf et al., 2022, Hoffer et al., 29 Jun 2025).
- System-level: Virtual memory support (in-memory pointer chasing, region-based page tables), cache coherence (speculative signature-based protocols, LazyPIM), and minimal OS/ISA extensions (Boroumand et al., 2017, Ghose et al., 2018).
- Programming models: Abstractions for partitioned, parallel primitives; cross-stack co-design for bit/weight sparsity; backward-compatible IR/microcode separation to ease hardware evolution (Eliahu et al., 2022, Duan et al., 25 May 2025).
- Economic practicality: Achieving high parallelism with minimal logic/peripheral additions (e.g., 1–2% array area for DRAM-based PUM, sub-10% for DWM/GC-eDRAM, small die fraction in 3D-stacked PNM).
A major bottleneck remains the transition of the programming and runtime stack from processor-centric to data-centric, requiring system and compiler support, robust toolchains (DAMOV, SIMDRAM), and suitable benchmarks for real evaluation (Oliveira et al., 2022).
6. Emerging Directions and Outlook
Recent research extends PIM logic into increasingly complex domains, including:
- General in-memory arithmetic: Carry-lookahead and carry-save schemes for fixed-point and IEEE-754 floating-point addition, multiplication, and division in memory (Leitersdorf et al., 2022, Leitersdorf et al., 2021).
- AI/ML acceleration: Sparse digital SRAM PIM for compressed CNN inference (DB-PIM), ultrafast binarized/ternary neural nets in crossbar and gain-cell PIM (Duan et al., 25 May 2025, Hoffer et al., 29 Jun 2025).
- Systematic compiler flows: Technology-independent IR to microcode translation, potentially with auto-tuning for device capabilities and dynamic reconfiguration (Eliahu et al., 2022).
- PIM-specific database/data-structure kernels: Parallel index/search, graph traversal, sorting, data compression, and bitwise analytic filtering (Kang et al., 2022, Mutlu et al., 2020).
- Hardware trust and security: Integration of PIM with ECC, RowHammer mitigation, and digital true random number generators (Mutlu et al., 2020).
These advances, coupled with continued demonstration of order-of-magnitude efficiency and throughput gains, position PIM logic as a central driver of future high-density, energy-efficient, and data-centric architectures. They also demand rigorous cross-layer research—from emerging device physics to software abstractions—to realize the full potential of programmable and scalable processing-in-memory systems.