Non-Volatile In-Memory Computing PEs
- Non-volatile in-memory computing PEs are advanced processing elements that integrate non-volatile memory with computation to eliminate traditional data movement bottlenecks.
- They leverage diverse NVM technologies like ReRAM, PCM, STT-RAM, and FeFET to enable native in-place logic and efficient multiply-accumulate operations for AI and data-intensive tasks.
- These PEs deliver significant gains in energy efficiency, latency, and scalability, making them essential for neural network acceleration, big data analytics, and edge computing.
Non-Volatile In-Memory-Computing Processing Elements (PEs) constitute a fundamental advance in computer architecture wherein memory elements based on non-volatile device technologies also serve as loci for computation, thereby fusing storage and logic to eliminate the traditional von Neumann data movement bottleneck. These PEs leverage a diversity of non-volatile memory (NVM) technologies—such as resistive RAM (ReRAM), phase-change memory (PCM), spin-transfer torque RAM (STT-RAM), ferroelectric FETs (FeFETs), and two-terminal oxide or 2D material-based devices—to realize data-centric, highly-parallel, energy-efficient, and scalable compute substrates. They enable native in-place logic, multiply-accumulate (MAC), search, and machine learning workloads, with architectural manifestations ranging from crossbar tiles for neural network acceleration to decentralized data processing units and hierarchical cache/memory organizations.
1. Architectural Principles and Core PE Structures
Non-volatile in-memory-computing PEs typically combine dense non-volatile memory cell arrays with peripheral circuits (sense amplifiers, decoders, drivers, local logic) and a lightweight controller for instruction sequencing ["An Overview of In-memory Processing with Emerging Non-volatile Memory for Data-intensive Applications" (Li et al., 2019)]. The following architectural motifs dominate:
- Crossbar-based PE: M×N crossbar array, each cross-point hosting an NVM device (e.g., RRAM, PCM, MRAM), with Ohm’s and Kirchhoff’s laws exploited for analog MAC. Peripheral DACs drive rows; bit-line currents are sensed by ADCs (Haensch et al., 2022, Chong et al., 6 Nov 2025).
- Data Processing Unit (DPU): Each DPU tightly couples NVM storage with a local ALU/neural engine. Registers and execution state are mapped within the NVM array, eliminating the SRAM/DRAM hierarchy (Dubeyko, 2019).
- Hierarchical In-Memory PEs: Processing capabilities embedded throughout memory hierarchy—from high-retention STT-RAM in main memory (processing-in-memory, PiM) to relaxed-retention STT-RAM within L₁/L₂ caches (processing-in-cache, PiC), each with local compute circuits and retention-state management (Gajaria et al., 29 Jul 2024).
- Reconfigurable or Time-Domain PEs: FeFET-based content-addressable memory arrays modulate the delay of logic chains for MAC in the time domain, with precision controlled by device threshold and on-die calibration (Mattar et al., 4 Apr 2025).
Peripheral logic may include analog-to-digital converters, local accumulators, adder trees, content-address features (for associative operations), and calibration/control macros to counteract analog and device-level nonidealities.
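The crossbar motif above can be captured in a minimal sketch (illustrative Python, not drawn from any cited design): bit-line currents follow Ohm's law per cell and Kirchhoff's current law per column, and the peripheral ADC quantizes each column current. The function name, array sizes, and quantization model are assumptions for illustration only.

```python
import numpy as np

def crossbar_mac(G, v_in, adc_bits=8, i_max=None):
    """Ideal analog MAC on an MxN crossbar: each bit-line current is the
    dot product of row drive voltages and per-column conductances
    (Ohm's law per cell, Kirchhoff's current law per column)."""
    i_out = v_in @ G                        # I_j = sum_i V_i * G_ij
    if i_max is None:
        i_max = np.abs(i_out).max() or 1.0  # ADC full-scale range
    # The column ADC quantizes each current to 2^adc_bits levels
    levels = 2 ** adc_bits - 1
    return np.round(i_out / i_max * levels) / levels * i_max

rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(128, 128))  # cell conductances (siemens)
v = rng.uniform(0.0, 0.2, size=128)           # row drive voltages (volts)
print(np.allclose(crossbar_mac(G, v, adc_bits=12), v @ G, rtol=1e-3))  # True
```

With a 12-bit ADC the quantized result tracks the exact vector–matrix product closely; dropping `adc_bits` shows how peripheral precision, not the array itself, bounds accuracy.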
2. Device Physics and Supported Operations
The choice of non-volatile memory device fundamentally sets the operation modes, endurance, and precision of the in-memory PE:
- Resistive devices (ReRAM/OxRAM, PCM): Multi-level cell programming supports analog weights for MAC. Ohmic linearity enables single-step vector–matrix multiply, I_j = Σ_i G_ij·V_i, where G_ij is the cell conductance and V_i is the input voltage (Haensch et al., 2022, Chong et al., 6 Nov 2025).
- STT-RAM: Exploits parallel/antiparallel magnetic tunnel junction states for bitwise logic, with retention controlled by free-layer volume/barrier. PiM with long retention serves as persistent DRAM substitute; PiC with relaxed retention minimizes cache write energy (Gajaria et al., 29 Jul 2024).
- FeFET and Piezoelectric FETs (PeFETs): Ferroelectric polarization encodes binary/multilevel weight, read non-destructively by exploiting polarization-dependent threshold shifts or, in PeFETs, via strain-induced bandgap modulation (PiER effect). This supports ternary or multilevel scalar multiplication in a single cell (Thakuria et al., 2022, Mattar et al., 4 Apr 2025).
- 2D materials (e.g., MoS₂ mem-transistors): Gate-modulated ion migration at elevated temperatures enables multi-level analog conductance states for use as synaptic weights in crossbar arrays (Mallik et al., 2023).
Supported operations include:
- Bitwise logic (AND/OR/XOR/NOR), typically via parallel row activation and sense/compare schemes
- Integer add/accumulate (chained bitwise adders or PCM-specific accumulators)
- Analog MAC for neural network workloads
- Associative/tag search (CAM or FeFET arrays)
- Ternary or multilevel scalar multiplication (via PeFETs or multi-level PCM/ReRAM)
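The first of these operation classes, bitwise logic via parallel row activation, can be sketched behaviorally (illustrative Python; the thresholded-current model and function name are assumptions, not a specific cited circuit): activating two word lines sums the cell currents on each bit line, and a sense amplifier with different reference thresholds recovers AND and OR, from which XOR follows.

```python
import numpy as np

def inmemory_bitwise(row_a, row_b):
    """Behavioral model of two-row activation: each bit-line current is
    the sum of the two activated cells' contributions; sense amplifiers
    threshold the sum (>1 -> AND, >0 -> OR), and XOR = OR & ~AND."""
    total = row_a.astype(int) + row_b.astype(int)  # summed bit-line current
    and_ = total > 1
    or_ = total > 0
    xor = or_ & ~and_
    return and_, or_, xor

a = np.array([1, 1, 0, 0], dtype=bool)
b = np.array([1, 0, 1, 0], dtype=bool)
and_, or_, xor = inmemory_bitwise(a, b)
```

The same single array read yields all three results, which is why these PEs report multiple bitwise ops per activation.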
3. Performance, Energy, and Endurance Metrics
Characteristic metrics reported across the cited devices and architectures are summarized below:
| PE Type | Array Size | Operation | Peak Throughput | Energy/Op | Retention | Endurance |
|---|---|---|---|---|---|---|
| RRAM/PCM Crossbar | 128×128 | Analog MAC | ~10¹² MAC/s/tile | ~0.04 pJ (MAC) | ~yrs (PCM) | 10⁸–10⁹ cycles |
| STT-RAM PiC | 32 KB L₁ | Bitwise/add | 16× 32-bit ops/cycle | E_add ≈ 5.8 pJ | 75 μs (relaxed) | >10¹⁵ writes |
| FeFET (TD-nvIMC) | 16×8×32 | Bin. MAC (XOR/AND) | 232 GOPS | ~0.53 fJ (MAC) | >10 yrs | Not specified |
| PeFET STeP-CiM | 256×256 | Ternary MAC | > near-memory SRAM/PeFET | 15–91% lower (rel.) | >10 yrs | >10¹² cycles |
| 2D MoS₂ mem-Xbar | custom | 3-bit MAC | O(1) time/col. | 0.1–1 pJ (update) | ≥1,500 s @ 450K | >10 cycles* |
| 3D RRAM pillar | 4× (stacked) | 2–3 op. logic | — | 11 fJ (bit logic) | >10 yrs | >10⁹ (OxRAM) |
*Short retention unique to thermally-driven MoS₂ operation in mem mode.
Notable system-level results:
- PICNIC’s RRAM-PE achieves <0.5 pJ per MAC and, when integrated into a 3D cluster network for LLM inference, realizes a 3.95× speedup and 30–57× higher energy efficiency than Nvidia A100/H100 GPUs at similar throughput (Chong et al., 6 Nov 2025).
- STeP-CiM PeFET arrays report 91% lower MAC latency and 15–91% lower energy over near-memory SRAM/PeFET designs (Thakuria et al., 2022).
- Hierarchical STT-RAM systems: PiC_L₁ yields a 2.84× energy reduction and a 2.27× latency speedup vs. a CPU+STT-RAM cache baseline; PiM yields 5.4×–20× gains on bitwise, low-reuse workloads (Gajaria et al., 29 Jul 2024).
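A quick back-of-envelope check ties the table's crossbar figures together (illustrative arithmetic only; the per-tile numbers are the approximate values quoted above, not independent measurements):

```python
# Per-tile array compute power for a 128x128 RRAM/PCM crossbar,
# using the table's approximate throughput and energy figures.
macs_per_s = 1e12           # ~10^12 MAC/s per tile
energy_per_mac = 0.04e-12   # ~0.04 pJ per MAC
tile_power_w = macs_per_s * energy_per_mac   # ≈ 0.04 W (~40 mW) per tile

# Energy efficiency, counting 2 ops (multiply + add) per MAC:
tops_per_w = 2 * macs_per_s / tile_power_w / 1e12   # ≈ 50 TOPS/W
```

At ~40 mW of array compute per tile, a thousand-tile die spends tens of watts in the arrays alone, which is why peripheral ADC/DAC overhead (Section 5) dominates the practical power budget.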
4. Scalability, System Integration, and Programming Model
Scalability features and integration strategies include:
- Fine-grain concurrency: PiNVSM’s DPU array co-locates compute with every data segment; transformations broadcast by keyword are applied concurrently across thousands to millions of DPUs (Dubeyko, 2019).
- Tile Array and 3D Integration: Modern designs use hundreds to thousands of PE crossbar tiles per die, with 3D stacking (logic+memory+photonic/digital dies) and silicon-photonic interconnects for chiplet-level communication and system scaling (Chong et al., 6 Nov 2025).
- Hierarchical Embedded Compute: STT-RAM PEs are allocated to caches and main memory with programmable retention, supporting dynamic trade-offs between latency, energy, and persistence (Gajaria et al., 29 Jul 2024).
- Endurance and Variability Mitigation: Reliability schemes span ECC, wear-leveling controllers, program-and-verify, and device-aware mapping of slow-sense states to fast array paths (Song et al., 2022, Li et al., 2019).
- Programming Models:
- DPUs: Keyword-based addressing. Data structures are flattened to key–value records, and queries/updates are message-broadcast.
- Crossbars: Weight programming by array pulses; compute via analog MACs; algorithms retrained to tolerate non-ideality.
- PeFETs/TD-nvIMC: Controlled by high-level MAC or associative instructions; time-domain or polarization-based inputs map directly to logic operations.
Key system-level implications include radical shifts in storage-compute semantic binding, explicit mapping of neural nets onto physical PE arrays (including activation-slicing in LLMs), and the need for new runtime/library/ISA abstractions.
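The explicit mapping of neural-network layers onto fixed-size PE arrays can be sketched as follows (illustrative Python; the tiling scheme and function name are generic assumptions, not a specific cited compiler): a large weight matrix is partitioned into crossbar-sized blocks, each block computes a partial MAC on its input slice, and partial sums are accumulated digitally across tiles that share output rows.

```python
import numpy as np

def tile_matvec(W, x, tile=128):
    """Map a large weight matrix onto fixed-size crossbar tiles:
    W is partitioned into tile x tile blocks; each block performs a
    partial MAC on its slice of x, and partial sums are accumulated
    digitally across tiles covering the same output rows."""
    m, n = W.shape
    y = np.zeros(m)
    for r in range(0, m, tile):
        for c in range(0, n, tile):
            # one physical crossbar tile holds W[r:r+tile, c:c+tile]
            y[r:r+tile] += W[r:r+tile, c:c+tile] @ x[c:c+tile]
    return y

rng = np.random.default_rng(2)
W = rng.normal(size=(300, 260))   # layer larger than one 128x128 tile
x = rng.normal(size=260)
y = tile_matvec(W, x)
```

Activation slicing in LLM inference follows the same pattern along the input dimension, with the digital accumulation step absorbing the per-tile partial sums.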
5. Non-Idealities, Limitations, and Error Mitigation
The principal technical limitations and device non-idealities are:
- Device Variability and Drift: RRAM/PCM suffer cycle-to-cycle and device-to-device variations (up to 20%), drift in cell conductance, and non-linear pulse-to-weight response. Countermeasures: program-and-verify, mixed/differential encoding, retraining with noise models, and algorithmic compensation (Haensch et al., 2022, Li et al., 2019).
- IR-Drop, Sneak Paths, and Peripheral Overhead: Large crossbar arrays are limited by IR voltage drops across metal lines (attenuating signals), unwanted current leakage (sneak paths), and high area share of ADCs/DACs (up to 80% at high precision) (Haensch et al., 2022).
- Write-Endurance: PCM/ReRAM endure only 10⁸–10¹² writes; wear-leveling and usage scheduling are required.
- ADC/DAC Calibration: Continuous drift in analog periphery requires closed-loop calibration. In PICNIC, digital feedback corrects per-ADC offset, and analog peripheral macro calibrates DACs (Chong et al., 6 Nov 2025).
- Retention-Energy Tradeoff: In relaxed-retention STT-RAM PiC, lowering energy via free-layer scaling causes accelerated data decay, which must be countered by retention-aware scheduling and hardware-managed evictions (Gajaria et al., 29 Jul 2024).
- Programming and ISA Support: Required for handling new computation models (e.g., keyword-driven dataflow in PiNVSM, multi-op instructions in PiC/PiM) and exploiting array-aware data mapping.
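The program-and-verify countermeasure listed above can be sketched behaviorally (illustrative Python; the 5% multiplicative pulse noise and the function name are modelling assumptions, not measured device data): each write pulse lands near the target conductance with device noise, and the loop re-reads and re-pulses until the cell is within tolerance.

```python
import numpy as np

rng = np.random.default_rng(1)

def program_and_verify(target_g, tol=0.02, max_pulses=50, sigma=0.05):
    """Iterative program-and-verify: each pulse applies the ideal
    correction scaled by multiplicative device noise (~sigma), then the
    cell is re-read; the loop stops once conductance is within tol of
    the target or the pulse budget is exhausted."""
    g = 0.0
    for _ in range(max_pulses):
        step = target_g - g                  # ideal correction from read-back
        g += step * rng.normal(1.0, sigma)   # noisy pulse response
        if abs(g - target_g) <= tol * abs(target_g):
            break
    return g

g = program_and_verify(1e-5)   # program a cell to ~10 uS
```

Because the residual error shrinks by roughly the noise factor on every pulse, a handful of verify cycles suffices even at ~5% cycle-to-cycle variation; the same loop structure underlies multi-level programming of RRAM/PCM weights.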
6. Application Domains and Impact
Non-volatile in-memory PEs find broad applicability in data-intensive and AI domains:
- Neural Network Acceleration: CMOS+crossbar PE designs have demonstrated >98% software-equivalent inference accuracy on MNIST and CIFAR-10. RRAM and PCM crossbars are optimized for CNN/RNN MAC workloads, including real-time LLM inference (Haensch et al., 2022, Chong et al., 6 Nov 2025).
- Big Data Analytics: The data-centric PiNVSM model enables at-memory computation over decomposed data structures (e.g., graphs, B-trees), with DPU parallelism matching data footprint (Dubeyko, 2019).
- IoT and Edge Computing: NVM PEs support power-off persistence, instant-on wake-up, and always-on inference at device level (Dubeyko, 2019, Mattar et al., 4 Apr 2025).
- Associative Search, Database, Sparse Linear Algebra: CAM-FeFET and PeFET arrays with ternary or time-domain MAC are suited to search, match, and arithmetic over ternary/binary inputs at scale (Mattar et al., 4 Apr 2025, Thakuria et al., 2022).
- Hierarchical System Optimizations: PiC in STT-RAM L₁/L₂ caches is superior where data reuse is high or CPU stalls dominate, while PiM excels in embarrassingly parallel, low-control-flow kernels (Gajaria et al., 29 Jul 2024).
7. Future Directions and Open Research Challenges
Current research identifies several frontiers:
- Materials and Device Innovations: Promising directions include electrochemical RAM (ECRAM), HfO₂ FeFETs for tightly integrated multi-level gates, SOT-MRAM for independent read/write paths, and high-precision analog conductance (Haensch et al., 2022).
- System and Circuit Co-Design: 3D stacking, chiplet clustering with dynamic power gating, and photonic interconnects for further scaling and energy reduction (Chong et al., 6 Nov 2025).
- Compiler/ISA/Runtimes: Need for automated partitioning of code segments to PiC/PiM, multi-op ISA entries, and transparent data alignment, especially in hierarchical NVM architectures (Gajaria et al., 29 Jul 2024).
- Error Robust Training: Re-training neural networks with non-ideality/noise models native to RRAM/PCM devices, employing precision-adaptive algorithms, and device-aware fine-tuning (Haensch et al., 2022).
- Thermal and IR Management: 2D MoS₂ and 3D pillar arrays require attention to process-induced variation, low-thermal-budget BEOL integration, and runtime heat dissipation (Mallik et al., 2023, Ezzadeen et al., 2020).
- Integration with Conventional CMOS: Ensuring compatibility of 2D nanomaterials and monolithic stacks with logic-process constraints and back-end annealing budgets.
- Scaling and Area-Efficiency: Pillar-based 3D vertical RRAM achieves >70× area density over planar, but integration and variability control require further process development (Ezzadeen et al., 2020).
Continued progress depends on advances at every level: device materials, array and peripheral design, architecture/runtime co-optimization, and algorithmic robustness under device-specific constraints. Non-volatile in-memory PEs represent a strategic direction for breaking the latency and energy bottlenecks of data-intensive and AI-centric computation.