Processing-Near-Memory Architectures
- Processing-near-memory is a computing paradigm that integrates digital logic with memory to minimize data movement and enhance bandwidth utilization.
- It employs strategies like 3D-stacked DRAM, DIMM-level processing, and SRAM-peripheral acceleration to support applications in genomics, AI/ML, and big data.
- PNM architectures achieve significant energy savings and speedups by processing data close to where it resides, thereby mitigating the traditional memory wall.
Processing-near-memory (PNM), also commonly referred to as near-memory processing (NMP) or near-memory computing (NMC), is an architectural paradigm that places digital computational logic in close electrical and physical proximity to memory devices. By colocating programmable processing elements (PEs), accelerators, or simple cores with memory arrays—most commonly in the logic layer of 3D-stacked DRAM, in the periphery of conventional DRAM/SRAM banks, or on the buffer chips of DIMMs—PNM aims to drastically reduce data-movement costs, exploit the memory's high internal bandwidth, and achieve orders-of-magnitude improvements in energy efficiency and performance for data-intensive, memory-bound workloads. Unlike in-memory computing (IMC), also called processing-using-memory (PuM), which leverages the intrinsic analog or circuit properties of memory cells for specialized operations, PNM/NMP architectures rely on embedded digital logic and support general-purpose or domain-tailored computation via conventional instruction sets or offloaded kernels.
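To make the offload model concrete, the following is a minimal, purely illustrative Python sketch of the host/near-memory split: operands reside in a bank-local buffer, and the host either streams every element across the memory channel or launches a reduction on the adjacent PE, in which case only the scalar result crosses the channel. The `NearMemoryBank` class and its methods are hypothetical and do not correspond to any vendor's API.

```python
# Illustrative model of the PNM offload flow: data stays bank-local, the host
# launches a kernel, and only the small result crosses the memory channel.
# Class and method names are hypothetical, not a real device interface.

class NearMemoryBank:
    """Toy stand-in for a memory bank with an adjacent processing element."""

    def __init__(self, data):
        self.local = list(data)          # data already resident in the bank
        self.channel_bytes = 0           # traffic crossing the host channel

    def host_read_all(self):
        """Host-side processing: every element traverses the channel."""
        self.channel_bytes += 8 * len(self.local)   # 8 B per 64-bit word
        return list(self.local)

    def offload_reduce(self):
        """PNM-style offload: the adjacent PE reduces locally; 8 B return."""
        result = sum(self.local)         # executed by the near-memory PE
        self.channel_bytes += 8          # only the scalar result moves
        return result


bank = NearMemoryBank(range(1_000_000))

bank.channel_bytes = 0
host_sum = sum(bank.host_read_all())
print("host-side sum crosses the channel with", bank.channel_bytes, "bytes")

bank.channel_bytes = 0
pnm_sum = bank.offload_reduce()
print("offloaded sum crosses the channel with", bank.channel_bytes, "bytes")
assert host_sum == pnm_sum
```

The same pattern—keep operands resident, ship a small kernel descriptor in, ship a small result out—underlies the DIMM-, stack-, and SRAM-level designs surveyed below.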
1. Architectural Foundations and Taxonomy
PNM architectures are characterized by the explicit addition of digital logic—ranging from lightweight in-order cores and SIMD/vector engines to custom accelerators and specialized processing units—placed close to DRAM or emerging non-volatile memories (NVMs). Three principal integration strategies have emerged:
- 3D-Stacked DRAM with Logic Layer: Devices such as High-Bandwidth Memory (HBM2/3) and the Hybrid Memory Cube (HMC) integrate a “logic die” beneath multiple DRAM layers using through-silicon vias (TSVs), offering internal bandwidths on the order of hundreds of GB/s per stack (Mutlu et al., 2020). Processing elements (PEs) on the logic die interface directly with the memory banks, enabling efficient execution of offloaded kernels such as vector-matrix operations, graph analytics, or hashing.
- DIMM- or Module-Level Processing: Buffer chips on DIMMs (as in RecNMP (Ke et al., 2019) and NMP-PaK (Kim et al., 12 May 2025)) or periphery logic in “smart” DRAM chips (e.g., UPMEM’s DPUs) perform gather-reduce, arithmetic, and filtering operations in hardware, exposing a standard DDR interface to the host.
- SRAM-Peripheral Near-Memory Acceleration: In edge and embedded domains, custom NMP macros, e.g., NM-Caesar and NM-Carus, integrate ALUs or vector units on the SRAM periphery for low-latency, energy-proportional inference and signal processing, leveraging standard memory interface protocols (Caon et al., 2024).
A high-level taxonomy is as follows:
| Integration Site | Compute Type | Example Architectures |
|---|---|---|
| Logic Layer (3D stack) | General-purpose or domain-specific (DSA) | HMC, HBM-PIM |
| Module buffer, DIMM periphery | Fixed-function or parametric | RecNMP, NMP-PaK, AxDIMM, UPMEM DPUs |
| SRAM periphery (on-chip SRAM) | Digital NMC | NM-Caesar, NM-Carus |
| Interposer/MCP | ASIC/FPGA fabric | APACHE (FHE), BLADE |
PNM solutions differ fundamentally from PuM/IMC in supporting programmable operations with full digital logic, rich runtime programmability, and system-level resource management.
2. Key Principles and Performance Motivations
The primary bottleneck motivating PNM is the dominant cost—both in energy and latency—of moving data between conventional memory arrays and separate processors (the so-called "memory wall") (Mutlu et al., 2019). Quantitative analyses show that moving a 64-byte cache line from DRAM can consume ∼200 pJ, whereas a 64-bit integer operation in a CPU or lightweight PE consumes ≈1 pJ. Bandwidth imbalances exacerbate this: traditional DDR4-3200 offers ∼25 GB/s/channel (with tens of ns latency), while the internal bandwidth of 3D-stacked memory exceeds 400–800 GB/s/stack (and <10–20 ns latency from array to logic die) (Mutlu et al., 2020, Singh et al., 2019).
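The sketch below simply recomputes the ratios implied by these figures (≈200 pJ per 64-byte line, ≈1 pJ per 64-bit operation, ~25 GB/s per channel versus 400–800 GB/s per stack); it is arithmetic over the cited numbers, not a new measurement.

```python
# Back-of-the-envelope ratios from the figures quoted above (approximate,
# cited values; not measurements).

cacheline_bytes = 64
dram_move_pj_per_line = 200.0        # ~200 pJ to move a 64 B line from DRAM
op_pj = 1.0                          # ~1 pJ per 64-bit integer operation

pj_per_byte = dram_move_pj_per_line / cacheline_bytes       # ~3.1 pJ/B
pj_to_feed_one_op = 8 * pj_per_byte                         # 8 B operand
print(f"moving one 64-bit operand costs {pj_to_feed_one_op:.1f} pJ, "
      f"~{pj_to_feed_one_op / op_pj:.0f}x the compute energy")

channel_gbs = 25.0                   # DDR4-3200, per channel
internal_gbs = (400.0, 800.0)        # internal 3D-stack bandwidth range
print("internal/external bandwidth ratio: "
      f"{internal_gbs[0] / channel_gbs:.0f}x-{internal_gbs[1] / channel_gbs:.0f}x")
```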
Migrating computation near memory yields:
- Substantially higher memory bandwidth utilization (e.g., 5.2%→44% in NMP-PaK (Kim et al., 12 May 2025))
- Reduced stall time and effective elimination of memory-channel bottlenecks in memory-bound pipelines (e.g., the iterative compaction step of de novo assembly in NMP-PaK)
- Large reductions in total memory traffic due to in-place filtering, aggregation, or batchwise computation
For data-centric AI/ML and database workloads, PNM architectures achieve 10–50× speedups and 5–35× reductions in energy compared to compute-remote baselines, with bandwidth efficiency routinely >80% (Mutlu et al., 2020, Ke et al., 2019, Mutlu et al., 2019).
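A simple bounded-speedup model clarifies where such gains come from: only the memory-bound fraction of runtime benefits from the higher internal bandwidth, so heavily memory-bound kernels approach the bandwidth ratio while compute-dominated ones do not. The sketch below is an illustrative Amdahl-style estimate with assumed fractions, not a result from the cited papers.

```python
# Amdahl-style bound: only the memory-bound fraction of runtime is accelerated
# by the internal/external bandwidth ratio. Fractions below are illustrative.

def pnm_speedup(mem_bound_fraction: float, bandwidth_ratio: float) -> float:
    """Upper-bound whole-application speedup from accelerating the
    memory-bound fraction by the given bandwidth ratio."""
    f, k = mem_bound_fraction, bandwidth_ratio
    return 1.0 / ((1.0 - f) + f / k)

for f in (0.5, 0.8, 0.95):
    print(f"memory-bound fraction {f:.0%}: "
          f"speedup <= {pnm_speedup(f, 20.0):.1f}x at 20x internal bandwidth")
```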
3. Exemplary Microarchitectures and Domain-Specific Adaptations
A range of PNM microarchitectures have been proposed and prototyped to address specific domains:
- Irregular Graph and Genomics Workloads: NMP-PaK implements channel-level NMP engines, with 3-stage pipelined PEs and scratchpad buffers, to address the large memory footprint and irregular access patterns of de novo genome assembly. It achieves a 14× memory-footprint reduction and a 16× speedup over state-of-the-art CPU baselines, and delivers 8.3× the per-resource throughput of distributed supercomputing clusters (Kim et al., 12 May 2025).
- Edge AI and Embedded Signal Processing: The NM-Caesar and NM-Carus macros provide RISC-V programmable, area-and-energy-efficient near-memory computing for tinyML and DNN inference at the edge. NM-Carus achieves 306.7 GOPS/W in 8-bit matmul, sustaining 50× throughput and 33× energy reduction versus RV32IMC (Caon et al., 2024).
- Sparse Vector Analytics: SpANNS employs near-memory compute-enabled DIMMs on a CXL Type-2 platform, orchestrated by a controller, for efficient sparse ANNS over hybrid inverted indices. It outperforms CPU baselines by 15.2×–21.6× for high-dimensional IR via in-place SpMV, priority-queue filtering, and direct in-DIMM inner products (see the partition-compute-merge sketch after this list) (Zhang et al., 6 Jan 2026).
- Database and Big Data Query Acceleration: Migratory NMP architectures position lightweight RISC PEs in each memory node and execute SELECT/JOIN queries by migrating small thread contexts rather than bulk data, yielding 10³–10⁵× speedups in TPC-style analytic queries by minimizing data movement (Upchurch, 2020).
- Homomorphic Encryption: APACHE delivers multi-scheme FHE acceleration by layering external I/O, near-memory compute, and in-array in-memory reduction, achieving up to 35.47× throughput relative to state-of-the-art ASIC accelerators (Ding et al., 2024).
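The sketch below illustrates the generic partition-compute-merge pattern that DIMM-level designs such as SpANNS exploit: each near-memory engine scores a shard of sparse vectors locally and returns only a small top-k list, which the host then merges. The data layout, sharding, and function names are simplified assumptions for illustration, not SpANNS's hybrid inverted-index implementation.

```python
# Partition-compute-merge pattern for near-memory sparse search: each "DIMM"
# shard scores a sparse query locally and returns only its top-k; the host
# merges the small partial lists. Layout and names are hypothetical.
import heapq

def score_shard(shard, query, k):
    """Runs 'inside' one near-memory engine: sparse dot products + local top-k."""
    scores = []
    for doc_id, doc in shard:
        s = sum(w * doc.get(term, 0.0) for term, w in query.items())
        if s > 0.0:
            scores.append((s, doc_id))
    return heapq.nlargest(k, scores)

def pnm_topk(shards, query, k):
    """Host side: launch per-shard kernels, then merge the partial top-k lists."""
    partials = [score_shard(shard, query, k) for shard in shards]  # parallel in HW
    return heapq.nlargest(k, (hit for part in partials for hit in part))

# Two toy shards of sparse (term -> weight) documents.
shards = [
    [(0, {"dna": 1.0, "graph": 0.5}), (1, {"graph": 2.0})],
    [(2, {"dna": 0.8}), (3, {"dna": 0.3, "graph": 0.3})],
]
print(pnm_topk(shards, query={"dna": 1.0, "graph": 1.0}, k=2))
```

Because each engine returns only k (score, id) pairs, channel traffic scales with k and the number of shards rather than with the index size—the same property the in-DIMM inner-product and filtering hardware relies on.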
4. Programming Models, Coherence, and Resource Management
PNM architectures require new system- and application-level abstractions for efficient offload, scheduling, data placement, and coherence:
- Resource Management: Surveyed strategies include static code annotation/ISA extensions, compiler-driven (profile-based) kernel identification, and online or ML-driven offload scheduling (random forests, multi-armed bandits, reinforcement learning). Policies optimize offload decisions under bandwidth, power, thermal, and coherence constraints (Khan et al., 2020, Corda et al., 2021, Majumder et al., 2021, Pandey et al., 2023).
- Memory Allocation and Data Placement: Graph- or communication-aware strategies colocate highly-communicating data structures with appropriate PNM engines, sometimes leveraging graph-partitioners or RL agents for continual adaptation (Majumder et al., 2021).
- Coherence Mechanisms: Coarse-grained (block-level) and adaptive rollback coherence models (e.g., CoNDA, MRCN) reduce coherence overheads by amortizing transactional checks, using hardware conflict maps and sub-block rollback for higher throughput (up to 25% improvement over CoNDA) (Kabat et al., 2023).
- Interface and Programming: PNM macro ISAs expose computational primitives via memory-mapped registers, extended DRAM/SRAM opcodes, or host-side DMA/burst interfaces. Compiler and toolchain support (LLVM/Polly, OpenMP extensions, custom GCC backends) is gradually being standardized (Oliveira et al., 2022, Caon et al., 2024).
- Practical Profiling and Suitability Prediction: Tools such as NMPO enable host-only prediction of PNM suitability for kernels using a small set of hardware counters, offering 85.6% classification accuracy and orders-of-magnitude faster profiling than simulation (Corda et al., 2021); a simplified counter-threshold sketch follows this list.
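As a simplified stand-in for counter-based suitability prediction of the NMPO flavor, the sketch below applies hand-picked thresholds to a few host-side counters (LLC misses, arithmetic intensity, channel-bandwidth utilization). NMPO itself trains an ML classifier; the counter names and thresholds here are assumptions chosen only for illustration.

```python
# Hypothetical counter-based offload-suitability check, in the spirit of
# host-only predictors such as NMPO (which uses a trained ML model); the
# counter names and thresholds below are illustrative assumptions.

def pnm_suitable(counters: dict) -> bool:
    """Favor offload for kernels that are cache-unfriendly, of low arithmetic
    intensity, and already saturating the memory channel."""
    llc_mpki = counters["llc_misses"] / (counters["instructions"] / 1000.0)
    arithmetic_intensity = counters["flops"] / max(counters["dram_bytes"], 1)
    bw_util = counters["dram_bytes_per_s"] / counters["peak_channel_bytes_per_s"]
    return llc_mpki > 10.0 and arithmetic_intensity < 1.0 and bw_util > 0.5

sample = {
    "instructions": 2_000_000_000,
    "llc_misses": 60_000_000,            # 30 MPKI -> cache-unfriendly
    "flops": 1_000_000_000,
    "dram_bytes": 8_000_000_000,         # 0.125 FLOP/byte -> memory-bound
    "dram_bytes_per_s": 18e9,
    "peak_channel_bytes_per_s": 25.6e9,  # one DDR4-3200 channel
}
print("offload to PNM?", pnm_suitable(sample))
```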
5. Quantitative Impact and Empirical Results
Key empirical measurements from the cited architectures demonstrate the transformative impact of PNM:
| Domain / Architecture | Speedup (vs. baseline) | Energy Reduction | Additional Metric(s) |
|---|---|---|---|
| NMP-PaK (de novo assembly) | 16× | – | 14× smaller memory footprint; 2.4× fewer memory ops |
| RecNMP (recommendation) | 4.2× | 45.8% savings | 9.8× lower SLS latency |
| NM-Carus (TinyML) | 53.9× faster execution | 35.6× lower energy | 306.7 GOPS/W peak efficiency |
| SpANNS (sparse IR/ANNS) | 15.2×–21.6× | – | QPS scales linearly |
| APACHE (FHE, multi-scheme) | 10.6×–35.5× | – | NTT/compute utilization >90% |
| Migratory NMP (DB SELECT/JOIN) | 10³–10⁵× | – | Intra-memory traffic only |
Across these designs, data movement is minimized, compute and memory-bandwidth utilization improve substantially, and overall system throughput often approaches or exceeds that of large-scale distributed alternatives under comparable resource budgets.
6. Limitations, Challenges, and Future Directions
Despite these gains, PNM adoption faces several ongoing technical challenges:
- Thermal Management: 3D stacking raises power density issues in logic layers; future packaging must integrate active/TSV cooling and adaptive throttling (Mutlu et al., 2020).
- Coherence Overheads: Balancing coherence granularity and rollback costs is nontrivial, especially with fine-grained sharing; hardware transactional models and conflict maps (MRCN) help, but scalability in multi-tenant or highly dynamic workloads remains an open challenge (Kabat et al., 2023).
- Generality and Programmability: PNM is highly architecture- and workload-specific; cross-platform compiler support, standardization of offload/call APIs, and robust runtime management frameworks remain underdeveloped (Oliveira et al., 2022, Khan et al., 2020).
- Data Placement and Adaptation: Efficient mapping of data and computation requires continual adaptation (e.g., via RL or MAB agents; see the illustrative bandit sketch after this list), especially as working sets shift or in multi-tenant environments (Majumder et al., 2021, Pandey et al., 2023).
- Emerging Device Integration: Leveraging non-volatile memories (e.g., RTM in FIRM (Hameed et al., 2022)) introduces unique mapping and access-scheduling considerations (e.g., shift minimization, SALP exploitation), but promises radical improvements in background/standby power.
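The toy epsilon-greedy bandit below illustrates the kind of online adaptation loop such agents implement: candidate placements are treated as arms, observed throughput is the reward, and a constant step size lets the value estimates track drift when the working set shifts. The reward model and placement names are synthetic, not taken from the cited systems.

```python
# Toy epsilon-greedy bandit choosing where a recurring kernel runs (host vs.
# one of two PNM stacks) as the workload drifts. The reward model is synthetic;
# the cited managers use real runtime feedback (bandwidth, power, throughput).
import random

placements = ["host", "pnm_stack0", "pnm_stack1"]
value = {p: 0.0 for p in placements}     # running value estimate per placement

def observe_throughput(placement, phase):
    """Synthetic reward: the best placement changes when the workload shifts."""
    best = "pnm_stack0" if phase < 500 else "pnm_stack1"
    return (1.0 if placement == best else 0.4) + random.uniform(-0.05, 0.05)

epsilon = 0.1
for phase in range(1000):
    if random.random() < epsilon:
        choice = random.choice(placements)                  # explore
    else:
        choice = max(placements, key=lambda p: value[p])    # exploit
    reward = observe_throughput(choice, phase)
    value[choice] += 0.1 * (reward - value[choice])         # constant step tracks drift

print({p: round(v, 2) for p, v in value.items()})
```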
Standardization of PNM interfaces (e.g., CXL-type modules), robust programming systems, and thermal/mechanical design innovation are central to future maturation and deployment of PNM architectures, especially for distributed AI/ML, genomics, high-speed analytics, and edge inference applications.
7. Cross-Domain Applications and Research Trajectories
Processing-near-memory has achieved significant impact in genomics (NMP-PaK, FIRM), machine learning inference (NM-Carus, RecNMP), large-scale similarity search and IR (SpANNS), database acceleration (MNMS), homomorphic encryption (APACHE), and storage-class memory for crash-consistent systems (NearPM (Seneviratne et al., 2022)). Methodologies for PNM-aware workload characterization (e.g., DAMOV), domain-specific accelerators, and ML-enhanced resource management demonstrate that PNM is migrating from research prototype to practical deployment.
A plausible implication is the continuing convergence of storage-class memory hierarchies, distributed memory-centric computation, and domain-specialized PNM fabric. These trends, coupled with toolchain and interface standardization, will likely shape the next generation of memory- and data-centric computer systems for scientific computing, datacenter-scale AI, and beyond (Mutlu, 2023).