Non-Volatile In-Memory Computing PEs
- Non-volatile in-memory computing PEs are advanced processing elements that integrate non-volatile memory with computation to eliminate traditional data movement bottlenecks.
- They leverage diverse NVM technologies like ReRAM, PCM, STT-RAM, and FeFET to enable native in-place logic and efficient multiply-accumulate operations for AI and data-intensive tasks.
- These PEs deliver significant gains in energy efficiency, latency, and scalability, making them essential for neural network acceleration, big data analytics, and edge computing.
Non-Volatile In-Memory-Computing Processing Elements (PEs) constitute a fundamental advance in computer architecture wherein memory elements based on non-volatile device technologies also serve as loci for computation, thereby fusing storage and logic to eliminate the traditional von Neumann data movement bottleneck. These PEs leverage a diversity of non-volatile memory (NVM) technologies—such as resistive RAM (ReRAM), phase-change memory (PCM), spin-transfer torque RAM (STT-RAM), ferroelectric FETs (FeFETs), and two-terminal oxide or 2D material-based devices—to realize data-centric, highly-parallel, energy-efficient, and scalable compute substrates. They enable native in-place logic, multiply-accumulate (MAC), search, and machine learning workloads, with architectural manifestations ranging from crossbar tiles for neural network acceleration to decentralized data processing units and hierarchical cache/memory organizations.
1. Architectural Principles and Core PE Structures
Non-volatile in-memory-computing PEs typically combine dense non-volatile memory cell arrays with peripheral circuits (sense amplifiers, decoders, drivers, local logic) and a lightweight controller for instruction sequencing ["An Overview of In-memory Processing with Emerging Non-volatile Memory for Data-intensive Applications" (Li et al., 2019)]. The following architectural motifs dominate:
- Crossbar-based PE: M×N crossbar array, each cross-point hosting an NVM device (e.g., RRAM, PCM, MRAM), with Ohm’s and Kirchhoff’s laws exploited for analog MAC. Peripheral DACs drive rows; bit-line currents are sensed by ADCs (Haensch et al., 2022, Chong et al., 6 Nov 2025).
- Data Processing Unit (DPU): Each DPU tightly couples NVM storage with a local ALU/neural engine. Registers and execution state are mapped within the NVM array, eliminating the SRAM/DRAM hierarchy (Dubeyko, 2019).
- Hierarchical In-Memory PEs: Processing capabilities embedded throughout memory hierarchy—from high-retention STT-RAM in main memory (processing-in-memory, PiM) to relaxed-retention STT-RAM within L₁/L₂ caches (processing-in-cache, PiC), each with local compute circuits and retention-state management (Gajaria et al., 29 Jul 2024).
- Reconfigurable or Time-Domain PEs: FeFET-based content-addressable memory arrays modulate the delay of logic chains for MAC in the time domain, with precision controlled by device threshold and on-die calibration (Mattar et al., 4 Apr 2025).
Peripheral logic may include analog-to-digital converters, local accumulators, adder trees, content-address features (for associative operations), and calibration/control macros to counteract analog and device-level nonidealities.
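The crossbar motif above can be captured in a minimal sketch (illustrative Python, not drawn from any cited design): bit-line currents follow Ohm's law per cell and Kirchhoff's current law per column, and the peripheral ADC quantizes each column current. The function name, array sizes, and quantization model are assumptions for illustration only.

```python
import numpy as np

def crossbar_mac(G, v_in, adc_bits=8, i_max=None):
    """Ideal analog MAC on an MxN crossbar: each bit-line current is the
    dot product of row drive voltages and per-column conductances
    (Ohm's law per cell, Kirchhoff's current law per column)."""
    i_out = v_in @ G                        # I_j = sum_i V_i * G_ij
    if i_max is None:
        i_max = np.abs(i_out).max() or 1.0  # ADC full-scale range
    # The column ADC quantizes each current to 2^adc_bits levels
    levels = 2 ** adc_bits - 1
    return np.round(i_out / i_max * levels) / levels * i_max

rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(128, 128))  # cell conductances (siemens)
v = rng.uniform(0.0, 0.2, size=128)           # row drive voltages (volts)
print(np.allclose(crossbar_mac(G, v, adc_bits=12), v @ G, rtol=1e-3))  # True
```

With a 12-bit ADC the quantized result tracks the exact vector–matrix product closely; dropping `adc_bits` shows how peripheral precision, not the array itself, bounds accuracy.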
2. Device Physics and Supported Operations
The choice of non-volatile memory device fundamentally sets the operation modes, endurance, and precision of the in-memory PE:
- Resistive devices (ReRAM/OxRAM, PCM): Multi-level cell programming supports analog weights for MAC. Ohmic linearity enables single-step vector–matrix multiply, I_j = Σ_i G_ij·V_i, where G_ij is the cell conductance and V_i is the input voltage (Haensch et al., 2022, Chong et al., 6 Nov 2025).
- STT-RAM: Exploits parallel/antiparallel magnetic tunnel junction states for bitwise logic, with retention controlled by free-layer volume/barrier. PiM with long retention serves as persistent DRAM substitute; PiC with relaxed retention minimizes cache write energy (Gajaria et al., 29 Jul 2024).
- FeFET and Piezoelectric FETs (PeFETs): Ferroelectric polarization encodes binary/multilevel weight, read non-destructively by exploiting polarization-dependent threshold shifts or, in PeFETs, via strain-induced bandgap modulation (PiER effect). This supports ternary or multilevel scalar multiplication in a single cell (Thakuria et al., 2022, Mattar et al., 4 Apr 2025).
- 2D materials (e.g., MoS₂ mem-transistors): Gate-modulated ion migration at elevated temperatures enables multi-level analog conductance states for use as synaptic weights in crossbar arrays (Mallik et al., 2023).
Supported operations include:
- Bitwise logic (AND/OR/XOR/NOR), typically via parallel row activation and sense/compare schemes
- Integer add/accumulate (chained bitwise adders or PCM-specific accumulators)
- Analog MAC for neural network workloads
- Associative/tag search (CAM or FeFET arrays)
- Ternary or multilevel scalar multiplication (via PeFETs or multi-level PCM/ReRAM)
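The first of these operation classes, bitwise logic via parallel row activation, can be sketched behaviorally (illustrative Python; the thresholded-current model and function name are assumptions, not a specific cited circuit): activating two word lines sums the cell currents on each bit line, and a sense amplifier with different reference thresholds recovers AND and OR, from which XOR follows.

```python
import numpy as np

def inmemory_bitwise(row_a, row_b):
    """Behavioral model of two-row activation: each bit-line current is
    the sum of the two activated cells' contributions; sense amplifiers
    threshold the sum (>1 -> AND, >0 -> OR), and XOR = OR & ~AND."""
    total = row_a.astype(int) + row_b.astype(int)  # summed bit-line current
    and_ = total > 1
    or_ = total > 0
    xor = or_ & ~and_
    return and_, or_, xor

a = np.array([1, 1, 0, 0], dtype=bool)
b = np.array([1, 0, 1, 0], dtype=bool)
and_, or_, xor = inmemory_bitwise(a, b)
```

The same single array read yields all three results, which is why these PEs report multiple bitwise ops per activation.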
3. Performance, Energy, and Endurance Metrics
Characteristic metrics reported across the cited devices and architectures are summarized below:
| PE Type | Array Size | Operation | Peak Throughput | Energy/Op | Retention | Endurance |
|---|---|---|---|---|---|---|
| RRAM/PCM Crossbar | 128×128 | Analog MAC | ~10¹² MAC/s/tile | ~0.04 pJ (MAC) | ~yrs (PCM) | 10⁸–10⁹ cycles |
| STT-RAM PiC | 32 KB L₁ | Bitwise/add | 16× 32-bit ops/cycle | E_add ≈ 5.8 pJ | 75 μs (relaxed) | >10¹⁵ writes |
| FeFET (TD-nvIMC) | 16×8×32 | Bin. MAC (XOR/AND) | 232 GOPS | ~0.53 fJ (MAC) | >10 yrs | Not specified |
| PeFET STeP-CiM | 256×256 | Ternary MAC | > near-memory SRAM/PeFET | 15–91% lower (rel.) | >10 yrs | >10¹² cycles |
| 2D MoS₂ mem-Xbar | custom | 3-bit MAC | O(1) time/col. | 0.1–1 pJ (update) | ≥1,500 s @ 450K | >10 cycles* |
| 3D RRAM pillar | 4× (stacked) | 2–3 op. logic | — | 11 fJ (bit logic) | >10 yrs | >10⁹ (OxRAM) |
*Short retention unique to thermally-driven MoS₂ operation in mem mode.
Notable system-level results:
- PICNIC’s RRAM-PE achieves <0.5 pJ per MAC and, when integrated into a 3D cluster network for LLM inference, realizes a 3.95× speedup and 30–57× higher energy efficiency than Nvidia A100/H100 GPUs at similar throughput (Chong et al., 6 Nov 2025).
- STeP-CiM PeFET arrays report 91% lower MAC latency and 15–91% lower energy over near-memory SRAM/PeFET designs (Thakuria et al., 2022).
- Hierarchical STT-RAM systems: PiC_L₁ yields a 2.84× energy reduction and a 2.27× latency speedup vs. a CPU+STT-RAM cache baseline; PiM yields 5.4×–20× gains on bitwise, low-reuse workloads (Gajaria et al., 29 Jul 2024).
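A quick back-of-envelope check ties the table's crossbar figures together (illustrative arithmetic only; the per-tile numbers are the approximate values quoted above, not independent measurements):

```python
# Per-tile array compute power for a 128x128 RRAM/PCM crossbar,
# using the table's approximate throughput and energy figures.
macs_per_s = 1e12           # ~10^12 MAC/s per tile
energy_per_mac = 0.04e-12   # ~0.04 pJ per MAC
tile_power_w = macs_per_s * energy_per_mac   # ≈ 0.04 W (~40 mW) per tile

# Energy efficiency, counting 2 ops (multiply + add) per MAC:
tops_per_w = 2 * macs_per_s / tile_power_w / 1e12   # ≈ 50 TOPS/W
```

At ~40 mW of array compute per tile, a thousand-tile die spends tens of watts in the arrays alone, which is why peripheral ADC/DAC overhead (Section 5) dominates the practical power budget.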
4. Scalability, System Integration, and Programming Model
Scalability features and integration strategies include:
- Fine-grain concurrency: PiNVSM’s DPU array co-locates compute with every data segment; transformations broadcast by keyword are applied concurrently across thousands to millions of DPUs (Dubeyko, 2019).
- Tile Array and 3D Integration: Modern designs use hundreds to thousands of PE crossbar tiles per die, with 3D stacking (logic+memory+photonic/digital dies) and silicon-photonic interconnects for chiplet-level communication and system scaling (Chong et al., 6 Nov 2025).
- Hierarchical Embedded Compute: STT-RAM PEs are allocated to caches and main memory with programmable retention, supporting dynamic trade-offs between latency, energy, and persistence (Gajaria et al., 29 Jul 2024).
- Endurance and Variability Mitigation: Reliability schemes span ECC, wear-leveling controllers, program-and-verify, and device-aware mapping of slow-sense states to fast array paths (Song et al., 2022, Li et al., 2019).
- Programming Models:
- DPUs: Keyword-based addressing. Data structures are flattened to key–value records, and queries/updates are message-broadcast.
- Crossbars: Weight programming by array pulses; compute via analog MACs; algorithms retrained to tolerate non-ideality.
- PeFETs/TD-nvIMC: Controlled by high-level MAC or associative instructions; time-domain or polarization-based inputs map directly to logic operations.
Key system-level implications include radical shifts in storage-compute semantic binding, explicit mapping of neural nets onto physical PE arrays (including activation-slicing in LLMs), and the need for new runtime/library/ISA abstractions.
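The explicit mapping of neural-network layers onto fixed-size PE arrays can be sketched as follows (illustrative Python; the tiling scheme and function name are generic assumptions, not a specific cited compiler): a large weight matrix is partitioned into crossbar-sized blocks, each block computes a partial MAC on its input slice, and partial sums are accumulated digitally across tiles that share output rows.

```python
import numpy as np

def tile_matvec(W, x, tile=128):
    """Map a large weight matrix onto fixed-size crossbar tiles:
    W is partitioned into tile x tile blocks; each block performs a
    partial MAC on its slice of x, and partial sums are accumulated
    digitally across tiles covering the same output rows."""
    m, n = W.shape
    y = np.zeros(m)
    for r in range(0, m, tile):
        for c in range(0, n, tile):
            # one physical crossbar tile holds W[r:r+tile, c:c+tile]
            y[r:r+tile] += W[r:r+tile, c:c+tile] @ x[c:c+tile]
    return y

rng = np.random.default_rng(2)
W = rng.normal(size=(300, 260))   # layer larger than one 128x128 tile
x = rng.normal(size=260)
y = tile_matvec(W, x)
```

Activation slicing in LLM inference follows the same pattern along the input dimension, with the digital accumulation step absorbing the per-tile partial sums.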
5. Non-Idealities, Limitations, and Error Mitigation
The principal technical limitations and device non-idealities are:
- Device Variability and Drift: RRAM/PCM suffer cycle-to-cycle and device-to-device variations (up to 20%), drift in cell conductance, and non-linear pulse-to-weight response. Countermeasures: program-and-verify, mixed/differential encoding, retraining with noise models, and algorithmic compensation (Haensch et al., 2022, Li et al., 2019).
- IR-Drop, Sneak Paths, and Peripheral Overhead: Large crossbar arrays are limited by IR voltage drops across metal lines (attenuating signals), unwanted current leakage (sneak paths), and high area share of ADCs/DACs (up to 80% at high precision) (Haensch et al., 2022).
- Write-Endurance: PCM/ReRAM endure only 10⁸–10¹² writes; wear-leveling and usage scheduling are required.
- ADC/DAC Calibration: Continuous drift in analog periphery requires closed-loop calibration. In PICNIC, digital feedback corrects per-ADC offset, and analog peripheral macro calibrates DACs (Chong et al., 6 Nov 2025).
- Retention-Energy Tradeoff: In relaxed-retention STT-RAM PiC, lowering energy via free-layer scaling causes accelerated data decay, which must be countered by retention-aware scheduling and hardware-managed evictions (Gajaria et al., 29 Jul 2024).
- Programming and ISA Support: Required for handling new computation models (e.g., keyword-driven dataflow in PiNVSM, multi-op instructions in PiC/PiM) and exploiting array-aware data mapping.
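The program-and-verify countermeasure listed above can be sketched behaviorally (illustrative Python; the 5% multiplicative pulse noise and the function name are modelling assumptions, not measured device data): each write pulse lands near the target conductance with device noise, and the loop re-reads and re-pulses until the cell is within tolerance.

```python
import numpy as np

rng = np.random.default_rng(1)

def program_and_verify(target_g, tol=0.02, max_pulses=50, sigma=0.05):
    """Iterative program-and-verify: each pulse applies the ideal
    correction scaled by multiplicative device noise (~sigma), then the
    cell is re-read; the loop stops once conductance is within tol of
    the target or the pulse budget is exhausted."""
    g = 0.0
    for _ in range(max_pulses):
        step = target_g - g                  # ideal correction from read-back
        g += step * rng.normal(1.0, sigma)   # noisy pulse response
        if abs(g - target_g) <= tol * abs(target_g):
            break
    return g

g = program_and_verify(1e-5)   # program a cell to ~10 uS
```

Because the residual error shrinks by roughly the noise factor on every pulse, a handful of verify cycles suffices even at ~5% cycle-to-cycle variation; the same loop structure underlies multi-level programming of RRAM/PCM weights.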
6. Application Domains and Impact
Non-volatile in-memory PEs find broad applicability in data-intensive and AI domains:
- Neural Network Acceleration: CMOS+crossbar PE designs have demonstrated >98% software-equivalent inference accuracy on MNIST and CIFAR-10. RRAM and PCM crossbars are optimized for CNN/RNN MAC workloads, including real-time LLM inference (Haensch et al., 2022, Chong et al., 6 Nov 2025).
- Big Data Analytics: The data-centric PiNVSM model enables at-memory computation over decomposed data structures (e.g., graphs, B-trees), with DPU parallelism matching data footprint (Dubeyko, 2019).
- IoT and Edge Computing: NVM PEs support power-off persistence, instant-on wake-up, and always-on inference at device level (Dubeyko, 2019, Mattar et al., 4 Apr 2025).
- Associative Search, Database, Sparse Linear Algebra: CAM-FeFET and PeFET arrays with ternary or time-domain MAC are suited to search, match, and arithmetic over ternary/binary inputs at scale (Mattar et al., 4 Apr 2025, Thakuria et al., 2022).
- Hierarchical System Optimizations: PiC in STT-RAM L₁/L₂ caches is superior where data reuse is high or CPU stalls dominate, while PiM excels in embarrassingly parallel, low-control-flow kernels (Gajaria et al., 29 Jul 2024).
7. Future Directions and Open Research Challenges
Current research identifies several frontiers:
- Materials and Device Innovations: Promising directions include electrochemical RAM (ECRAM), HfO₂ FeFETs for tightly integrated multi-level gates, SOT-MRAM for independent read/write paths, and high-precision analog conductance (Haensch et al., 2022).
- System and Circuit Co-Design: 3D stacking, chiplet clustering with dynamic power gating, and photonic interconnects for further scaling and energy reduction (Chong et al., 6 Nov 2025).
- Compiler/ISA/Runtimes: Need for automated partitioning of code segments to PiC/PiM, multi-op ISA entries, and transparent data alignment, especially in hierarchical NVM architectures (Gajaria et al., 29 Jul 2024).
- Error Robust Training: Re-training neural networks with non-ideality/noise models native to RRAM/PCM devices, employing precision-adaptive algorithms, and device-aware fine-tuning (Haensch et al., 2022).
- Thermal and IR Management: 2D MoS₂ and 3D pillar arrays require attention to process-induced variation, low-thermal-budget BEOL integration, and runtime heat dissipation (Mallik et al., 2023, Ezzadeen et al., 2020).
- Integration with Conventional CMOS: Ensuring compatibility of 2D nanomaterials and monolithic stacks with logic-process constraints and back-end annealing budgets.
- Scaling and Area-Efficiency: Pillar-based 3D vertical RRAM achieves >70× area density over planar, but integration and variability control require further process development (Ezzadeen et al., 2020).
Continued progress depends on advances at every level: device materials, array and peripheral design, architecture/runtime co-optimization, and algorithmic robustness under device-specific constraints. Non-volatile in-memory PEs represent a strategic direction for breaking the latency and energy bottlenecks of data-intensive and AI-centric computation.