Processing Near Memory Architectures

Updated 22 June 2026
  • Processing Near Memory (PNM) is an architectural paradigm that embeds computational capability within or near memory to reduce latency, energy, and bandwidth bottlenecks.
  • It encompasses processing-using-memory (PUM), which exploits the intrinsic analog and digital properties of memory arrays, and processing-near-memory (PnM), which adds programmable logic adjacent to memory for in-place computation.
  • PNM systems demonstrate substantial performance and efficiency gains, with examples showing up to 34× speedup and significant energy reductions over conventional architectures.

Processing Near Memory (PNM) is an architectural paradigm that integrates computational capability directly within or adjacent to memory modules to minimize the energy and latency costs of data movement between processor and memory subsystems. PNM encompasses a spectrum of techniques, including processing-using-memory (PUM)—which exploits the intrinsic analog or device-level physics of memory arrays for computation—and processing-near-memory (PnM), which adds programmable logic close to the memory array. This approach fundamentally addresses bottlenecks in traditional von Neumann architectures by enabling computation to occur close to, or directly within, memory, thereby leveraging the massive internal bandwidth and parallelism inherent in memory chips while reducing or eliminating the need to shuttle data over bandwidth-limited memory buses (Mutlu et al., 2020, Oliveira et al., 2024, Oliveira et al., 2022).

1. Architectural Principles of Processing Near Memory

Processing Near Memory architectures are characterized by the physical co-location of computation logic with memory storage elements. In DRAM-based PNM/PUM systems, computation can be mapped in two main ways:

  • Processing-Using-Memory (PUM): Computation is realized entirely within the native memory array by harnessing analog effects (e.g., charge sharing in DRAM bitlines and sense amplifiers) or resistive characteristics (e.g., current summing in crossbar ReRAM/PCM cells). Typical primitives include bulk bitwise operations (AND/OR/NOT/MAJ in DRAM (Mutlu et al., 2020, Oliveira et al., 2022)), vector-matrix multiplications in crossbar NVMs, and copy/initialization via internal DRAM commands (RowClone (Seshadri et al., 2016)).
  • Processing-Near-Memory (PnM): Computation logic (e.g., simple RISC cores, ALUs, SIMD engines) is placed alongside the memory array (e.g., in 3D-stacked DRAM logic layers or on-die adjacent to banks), allowing software-programmable kernels with greater functional flexibility but at higher area and power cost (Mutlu et al., 2020, Mutlu et al., 2019).

The primary goal of PNM is to mitigate the memory wall by executing operations in place (in situ), so that the cost of each memory access is amortized over massively parallel in-array computation.

2. Hardware Mechanisms and Substrate Innovations

DRAM-Based Processing-Using-Memory

  • Bulk Bitwise and Data Movement Primitives: Ambit-style triple-row activation computes a bitwise majority (MAJ) function by combining charge from three DRAM rows on a shared bitline; AND and OR follow by fixing one of the three rows to all zeros or all ones. RowClone issues back-to-back activations to copy or zero rows entirely inside the memory chip (Seshadri et al., 2016, Mutlu et al., 2020).
  • Fine-Grained Array Partitioning: MIMDRAM introduces mat-level isolation within subarrays, enabling software-controlled MIMD parallelism at the granularity of 512–1024 bits via added wordline isolation transistors and latches (Oliveira et al., 2024).
  • Lookup Table (LUT) Architectures: PuM-LUT approaches such as pLUTo and Lama integrate per-mat LUTs and independent column selection logic, providing scalable support for multi-bit (e.g., 8-bit) arithmetic and arbitrary deterministic functions while significantly reducing the number of ACT (activation) commands and increasing operand precision (Ferreira et al., 2021, Khabbazan et al., 4 Feb 2025).
  • In-DRAM Data Movement: Low-cost inter-mat or inter-subarray copy mechanisms (e.g., MIMDRAM’s GB-MOV/LC-MOV) enable intra-DRAM communication without egress to the memory controller.
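
The majority-based construction of AND and OR described above can be illustrated with a small functional model. This is only a sketch under stated assumptions: each DRAM row is modeled as a Python integer used as a bit vector, the analog charge sharing is replaced by Boolean logic, and `ROW_BITS`, `maj3`, `bulk_and`, `bulk_or`, and `row_copy` are hypothetical names rather than part of Ambit or RowClone.

```python
# Functional model only: real hardware relies on analog charge sharing across
# three simultaneously activated rows. All names here are hypothetical.

ROW_BITS = 16                    # toy row width; real DRAM rows are far wider
MASK = (1 << ROW_BITS) - 1       # all-ones control row


def maj3(a, b, c):
    """Bitwise majority: an output bit is 1 when at least two input bits are 1."""
    return ((a & b) | (b & c) | (a & c)) & MASK


def bulk_and(a, b):
    """AND as a majority with an all-zeros control row."""
    return maj3(a, b, 0)


def bulk_or(a, b):
    """OR as a majority with an all-ones control row."""
    return maj3(a, b, MASK)


def row_copy(rows, src, dst):
    """RowClone-style copy, modeled as a row-to-row assignment inside the array."""
    rows[dst] = rows[src]


row_a = 0b1100_1010_0000_1111
row_b = 0b1010_0110_1111_0000
assert bulk_and(row_a, row_b) == (row_a & row_b)
assert bulk_or(row_a, row_b) == (row_a | row_b)
```

The model processes a whole row per call, which mirrors why these primitives are described as bulk operations: one command sequence acts on every bitline of the row at once.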

NVM and Emerging Devices

  • Crossbar Computation: MRAM, ReRAM, and PCM arrays realize analog vector-matrix multiply (MVM) through Ohm’s and Kirchhoff’s laws. Operations such as y = Ax leverage device conductance for O(1)-time accumulation per bitline (Fernandez et al., 2022, Wong et al., 17 Feb 2026).
  • Hybrid Analog–Digital PUM: DARTH-PUM demonstrates integration of analog and digital PUM units in one tile, facilitating both high-throughput MVM (analog) and general Boolean/bit-serial arithmetic (digital) through shared peripheral resources and mode arbitration (Wong et al., 17 Feb 2026).
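
As a rough numerical illustration of the crossbar computation above, the following sketch evaluates the bitline currents that Ohm's law (per cell) and Kirchhoff's current law (per bitline) would produce in an idealized array. It assumes ideal devices with no noise, drift, or ADC quantization; `crossbar_mvm` and the example values are hypothetical. With `conductances[i][j]` indexed by wordline `i` and bitline `j`, the result corresponds to y = Ax with A taken as the transpose of the conductance matrix.

```python
def crossbar_mvm(conductances, voltages):
    """Return ideal bitline currents I[j] = sum_i G[i][j] * V[i].

    conductances: list of rows (one per wordline), each a list of cell
                  conductances in siemens (one per bitline).
    voltages:     one input voltage in volts per wordline.
    """
    n_rows = len(conductances)
    n_cols = len(conductances[0])
    if len(voltages) != n_rows:
        raise ValueError("need exactly one input voltage per wordline")

    currents = [0.0] * n_cols
    for i in range(n_rows):
        for j in range(n_cols):
            # Ohm's law gives the cell current; Kirchhoff's current law sums
            # the cell currents that share bitline j.
            currents[j] += conductances[i][j] * voltages[i]
    return currents


G = [[1.0, 2.0],
     [3.0, 4.0]]
V = [1.0, 0.5]
print(crossbar_mvm(G, V))  # [2.5, 4.0]
```

In hardware, the two nested loops collapse into a single analog step because all cells conduct simultaneously; the software loop only reproduces the arithmetic, not the timing.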

3. Software Stack, Programming Models, and OS Support

  • Extended ISA and Compiler Passes: MIMDRAM and related systems feature custom bbop ISA extensions for registration, operand placement, and fine-grained operator execution. Compiler toolchains (e.g., LLVM passes) perform loop vectorization, dependency graph scheduling, and mapping to physical memory mats/subarrays for maximal MIMD utilization (Oliveira et al., 2024, Oliveira, 27 Aug 2025).
  • Pattern-Based High-Level Frameworks: Data-parallel frameworks such as DaPPA and SimplePIM expose map/reduce/zip/scan/stencil primitives, compiling directly into PNM/PUM kernels, while automatically managing data partitioning, memory placement, and scheduling (Chen et al., 2023, Oliveira, 27 Aug 2025).
  • OS and Allocation Support: Physical operand placement is non-trivial; PUM primitives require that all operands reside within the same subarray and are row-aligned. Kernel-level allocators (PUMA) implement subarray-aware, worst-fit, and alignment-guaranteed memory allocation, resolving placement constraints that generic malloc or huge pages cannot satisfy (Oliveira et al., 2024).

| Software Layer | Key Functionality | Example | Reference |
|---|---|---|---|
| ISA Extensions | In-memory ops, operand placement, data movement | bbop_* instructions | (Oliveira et al., 2024) |
| Compiler IR/Passes | Loop vectorization, DAG scheduling, pattern expansion | LLVM passes, DaPPA | (Oliveira, 27 Aug 2025) |
| OS Allocators | Subarray-, row-aligned, PUM-aware memory allocation | PUMA | (Oliveira et al., 2024) |
| Runtime Libraries | Primitives for data movement and task orchestration | SimplePIM | (Chen et al., 2023) |
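
To make the operand-placement constraint concrete, here is a minimal sketch of a subarray-aware, worst-fit allocator. It is not PUMA's actual implementation: the class `SubarrayAllocator`, its method names, and the row-granularity bookkeeping are assumptions chosen only to show the two properties named above, namely that all operands of one request land in the same subarray and that each operand occupies whole rows.

```python
class SubarrayAllocator:
    """Toy worst-fit allocator that keeps all operands of a request in one subarray."""

    def __init__(self, num_subarrays, rows_per_subarray):
        # free_rows[s] holds the free row indices of subarray s, in ascending order.
        self.free_rows = {
            s: list(range(rows_per_subarray)) for s in range(num_subarrays)
        }

    def alloc_operands(self, num_operands, rows_each=1):
        """Reserve whole rows for num_operands operands inside a single subarray.

        Returns a list of (subarray_id, [row indices]) tuples, one per operand.
        """
        needed = num_operands * rows_each
        # Worst fit: choose the subarray that currently has the most free rows.
        target = max(self.free_rows, key=lambda s: len(self.free_rows[s]))
        if len(self.free_rows[target]) < needed:
            raise MemoryError("no single subarray can hold all operands")
        rows = self.free_rows[target]
        taken, self.free_rows[target] = rows[:needed], rows[needed:]
        return [
            (target, taken[k * rows_each:(k + 1) * rows_each])
            for k in range(num_operands)
        ]


allocator = SubarrayAllocator(num_subarrays=2, rows_per_subarray=8)
src_a, src_b, dst = allocator.alloc_operands(3, rows_each=2)
# All three operands share one subarray, so an in-DRAM primitive could combine them.
assert src_a[0] == src_b[0] == dst[0]
```

Worst fit spreads requests across subarrays, which tends to leave room for later multi-operand requests; a generic allocator that ignores subarray boundaries gives no such guarantee.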

4. Quantitative Performance and Energy Results

State-of-the-art PNM/PUM systems achieve substantial speedup and energy efficiency improvements over conventional and even GPU baselines, subject to workload suitability.

  • MIMDRAM: 34× the performance and 14.3× the energy efficiency of SIMDRAM, and 30.6× and 6.8× the energy efficiency of a CPU and a GPU, respectively. SIMD utilization is 15.6× higher than SIMDRAM. Up to 13.2× speedup over the CPU at full bank/subarray parallelism, constrained primarily by multiply/divide throughput and reduction operations (Oliveira et al., 2024).
  • LUT-Based Architectures (Lama): 8.5× performance, 6.9× energy efficiency over previous PuM for bulk 8-bit multiplication; 4.8–9.8× speedup and 5.8–19.2× energy reduction over GPU/TPU for attention-based model inference (Khabbazan et al., 4 Feb 2025).
  • MRAM PUM (MATSA): 6.15–7.35× performance and 4.21–11.3× energy gains over CPU/GPU/PNM on dynamic time warping, attributed to full in-memory dataflow and crossbar-level vector parallelism (Fernandez et al., 2022).
  • Hybrid Architectures (DARTH-PUM): 14.8–59.4× application speedup and 39.6–110.7× energy efficiency over analog+CPU baselines for neural, cryptographic, and LLM encoder workloads, enabled by mixed analog (MVM) and digital (bit-serial logic) units (Wong et al., 17 Feb 2026).
  • Area Overheads: Leading DRAM-based fine-grained processing-using-DRAM (PUD) substrates add 1–2% array area; LUT-based schemes (Lama) report ≈2.5% area overhead per stack; NVM-based tiles are dominated by ADC/DAC periphery costs rather than array area (Oliveira et al., 2024, Khabbazan et al., 4 Feb 2025, Wong et al., 17 Feb 2026).

5. System-Level Integration, Programming Challenges, and Methodological Tools

  • Physical Alignment and Memory Mapping: End-to-end system frameworks (PiDRAM, PUMA) illustrate the need for integrative hardware/software/OS support to maintain operand alignment within subarrays, ensuring that in-DRAM primitives fire successfully (Olgun et al., 2021, Oliveira et al., 2024). This includes modified system calls, subarray mapping tables, and custom allocation APIs.
  • Coherence and Consistency: LazyPIM-style coherence models and explicit cache flush/invalidate mechanisms are required to synchronize in-memory updates with CPU caches. Software frameworks often enforce PUM regions to remain un-cached or apply batch synchronization (Mutlu et al., 2020, Seshadri et al., 2016, Olgun et al., 2021).
  • Bottleneck Classification and Benchmarking: DAMOV characterizes bottleneck classes via locality, arithmetic intensity, and cache-miss patterns, guiding offload decisions suited for PNM/PUM acceleration. Performance is quantifiable both in theoretical roofline models and microbenchmarks; suites like PrIM and DAMOV are public (Oliveira, 27 Aug 2025, Gómez-Luna et al., 2021).
  • Energy and Density Limiting Factors: Non-ideal device effects (e.g., sense amp noise, resistance drift, ADC/DAC energy in analog NVM, finite mat/bank sizes) are limiting factors for both DRAM and NVM substrates, affecting operation reliability and overall system scalability (Wong et al., 17 Feb 2026, Yuksel et al., 2023).
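
The roofline reasoning mentioned above can be sketched in a few lines. This is a deliberately simplified illustration, not the DAMOV methodology, which also considers locality and cache-miss behavior. The function names and all peak/bandwidth numbers below are hypothetical placeholders rather than measurements of any real system.

```python
def attainable_gflops(arith_intensity, peak_gflops, bandwidth_gbs):
    """Classic roofline bound: min(compute roof, arithmetic intensity * bandwidth).

    arith_intensity is in FLOP per byte moved, bandwidth_gbs in GB/s.
    """
    return min(peak_gflops, arith_intensity * bandwidth_gbs)


def prefers_pnm(arith_intensity, host, pnm):
    """Return True when the near-memory roofline bound exceeds the host bound.

    host and pnm are (peak_gflops, bandwidth_gbs) tuples.
    """
    return attainable_gflops(arith_intensity, *pnm) > attainable_gflops(
        arith_intensity, *host
    )


# Hypothetical machines: the host has high compute but limited off-chip
# bandwidth; the PNM device has modest compute but high internal bandwidth.
HOST = (1000.0, 100.0)
PNM = (200.0, 1000.0)

for ai in (0.25, 50.0):
    print(ai, attainable_gflops(ai, *HOST), attainable_gflops(ai, *PNM),
          prefers_pnm(ai, HOST, PNM))
```

Under these assumed numbers, the low-intensity kernel is bandwidth-bound on the host and benefits from offload, while the high-intensity kernel is compute-bound and is better left on the host.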

6. Limitations, Open Problems, and Future Directions

  • Operand Placement and Scalability: Current PNM requires strict operand colocation and alignment, which poses challenges for irregular or pointer-heavy workloads. Proposed avenues include runtime prefetchers, hardware address translation, and enhanced mat-range schedulers (Oliveira et al., 2024, Oliveira et al., 2024).
  • Reduction and Multiplication Bottlenecks: DRAM-embedded reductions remain expensive at scale due to data movement and bit-serial logic constraints. Hybrid bit-parallel/near-memory ALUs, or further diversification of analog/digital execution units, are open research topics (Oliveira et al., 2024, Wong et al., 17 Feb 2026).
  • Compiler and Programming Model Generality: Existing compilers support mainly affine (regular) loop structures; full generality (pointer-chasing, arbitrary recursion) is an outstanding problem. Pattern-based frameworks such as DaPPA only partially abstract hardware nuances (Oliveira, 27 Aug 2025).
  • Security and Fault Isolation: PUM units may share physical mats or arrays, raising questions of strong software/hardware isolation, secure task execution, and DRAM-level DRM support. Isolation at mat or subarray granularity is an emergent research focus (Oliveira et al., 2024).
  • Application Domain Expansion: While current PNM/PUM excels at map/reduce-heavy, memory-intensive, and data-parallel kernels (database OLAP, DNN inference, genomics, time series), extending practical benefits to control-intensive or irregular compute workloads remains challenging (Fernandez et al., 2022, Mutlu et al., 2020).

7. Representative Applications and Impact

  • Operating Systems and Database Workloads: Fork, zero-fill, memcpy, and range-predicate acceleration by in-memory bulk copy/AND/OR yields up to 4–66× system improvements on key microbenchmarks and real-world OS/database kernels (Mutlu et al., 2020, Seshadri et al., 2016).
  • Machine Learning and Model Inference: Matrix multiplication, CNN/LLM kernels, and quantized models map efficiently to analog/digital hybrid PUM, achieving order-of-magnitude energy and performance gains (Khabbazan et al., 4 Feb 2025, Wong et al., 17 Feb 2026).
  • Bioinformatics and Graph Analytics: Seed filtering, BFS, and dense bitmask/adjacency kernels benefit from row-parallel PUM/SIMD substrates (Oliveira et al., 2022, Fernandez et al., 2022).
  • Programming Productivity: High-level runtimes enable 3–6× reduction in effective PIM code and 10–43% speedups over hand-optimized PIM software, as evidenced by SimplePIM/UPMEM evaluations (Chen et al., 2023, Gómez-Luna et al., 2021).

Processing Near Memory constitutes a significant architectural trend enabling high-throughput, energy-efficient, and scalable computation for data-centric applications by moving computation to where data naturally resides. Major advances in fine-grained mat-level partitioning, LUT- and crossbar-based in-memory computation, hybrid analog-digital integration, and comprehensive toolchains for software support are converging to make PNM/PUM architectures increasingly general, high-performance, and accessible for a broad spectrum of workloads (Oliveira et al., 2024, Oliveira, 27 Aug 2025, Khabbazan et al., 4 Feb 2025, Wong et al., 17 Feb 2026).
