
Non-Volatile In-Memory Computing PEs

Updated 10 November 2025
  • Non-volatile in-memory computing PEs are advanced processing elements that integrate non-volatile memory with computation to eliminate traditional data movement bottlenecks.
  • They leverage diverse NVM technologies like ReRAM, PCM, STT-RAM, and FeFET to enable native in-place logic and efficient multiply-accumulate operations for AI and data-intensive tasks.
  • These PEs deliver significant gains in energy efficiency, latency, and scalability, making them essential for neural network acceleration, big data analytics, and edge computing.

Non-Volatile In-Memory-Computing Processing Elements (PEs) constitute a fundamental advance in computer architecture wherein memory elements based on non-volatile device technologies also serve as loci for computation, thereby fusing storage and logic to eliminate the traditional von Neumann data movement bottleneck. These PEs leverage a diversity of non-volatile memory (NVM) technologies—such as resistive RAM (ReRAM), phase-change memory (PCM), spin-transfer torque RAM (STT-RAM), ferroelectric FETs (FeFETs), and two-terminal oxide or 2D material-based devices—to realize data-centric, highly-parallel, energy-efficient, and scalable compute substrates. They enable native in-place logic, multiply-accumulate (MAC), search, and machine learning workloads, with architectural manifestations ranging from crossbar tiles for neural network acceleration to decentralized data processing units and hierarchical cache/memory organizations.

1. Architectural Principles and Core PE Structures

Non-volatile in-memory-computing PEs typically combine dense non-volatile memory cell arrays with peripheral circuits (sense amplifiers, decoders, drivers, local logic) and a lightweight controller for instruction sequencing ["An Overview of In-memory Processing with Emerging Non-volatile Memory for Data-intensive Applications" (Li et al., 2019)]. The following architectural motifs dominate:

  • Crossbar-based PE: M×N crossbar array, each cross-point hosting an NVM device (e.g., RRAM, PCM, MRAM), with Ohm’s and Kirchhoff’s laws exploited for analog MAC. Peripheral DACs drive rows; bit-line currents are sensed by ADCs (Haensch et al., 2022, Chong et al., 6 Nov 2025).
  • Data Processing Unit (DPU): Each DPU tightly couples NVM storage and a local ALU/neural engine. Registers and execution state are mapped within the NVM array, eradicating SRAM/DRAM hierarchy (Dubeyko, 2019).
  • Hierarchical In-Memory PEs: Processing capabilities embedded throughout memory hierarchy—from high-retention STT-RAM in main memory (processing-in-memory, PiM) to relaxed-retention STT-RAM within L₁/L₂ caches (processing-in-cache, PiC), each with local compute circuits and retention-state management (Gajaria et al., 29 Jul 2024).
  • Reconfigurable or Time-Domain PEs: FeFET-based content-addressable memory arrays modulate the delay of logic chains for MAC in the time domain, with precision controlled by device threshold and on-die calibration (Mattar et al., 4 Apr 2025).

Peripheral logic may include analog-to-digital converters, local accumulators, adder trees, content-address features (for associative operations), and calibration/control macros to counteract analog and device-level nonidealities.
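The crossbar motif above can be sketched numerically: with Ohm's and Kirchhoff's laws, each bit-line current is the conductance-weighted sum of the row voltages, so one analog step performs a full vector–matrix multiply. The conductance matrix and voltages below are illustrative values, not taken from any cited design.

```python
import numpy as np

# Hypothetical 4x3 crossbar: G[i, j] is the conductance (in siemens) of
# the NVM device at row i, column j; V[i] is the voltage on word line i.
G = np.array([[1.0e-6, 2.0e-6, 0.5e-6],
              [3.0e-6, 1.0e-6, 1.0e-6],
              [0.5e-6, 4.0e-6, 2.0e-6],
              [2.0e-6, 0.5e-6, 1.5e-6]])
V = np.array([0.2, 0.1, 0.3, 0.0])  # row (word-line) input voltages

# Kirchhoff's current law: each bit-line current is the sum of the
# per-device Ohmic currents, I_j = sum_i G_ij * V_i -- the whole
# vector-matrix product in a single analog step.
I = V @ G  # bit-line currents (A), digitized by per-column ADCs
```

In a physical tile the DACs, device nonidealities, and ADC quantization all perturb this ideal linear map; the sketch captures only the nominal computation.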

2. Device Physics and Supported Operations

The choice of non-volatile memory device fundamentally sets the operation modes, endurance, and precision of the in-memory PE:

  • Resistive devices (ReRAM/OxRAM, PCM): Multi-level cell programming supports analog weights for MAC. Ohmic linearity enables a single-step vector–matrix multiply, I_j = Σ_i G_ij · V_i, where G_ij is the conductance at cross-point (i, j) and V_i is the input voltage on row i (Haensch et al., 2022, Chong et al., 6 Nov 2025).
  • STT-RAM: Exploits parallel/antiparallel magnetic tunnel junction states for bitwise logic, with retention controlled by free-layer volume/barrier. PiM with long retention serves as persistent DRAM substitute; PiC with relaxed retention minimizes cache write energy (Gajaria et al., 29 Jul 2024).
  • FeFET and Piezoelectric FETs (PeFETs): Ferroelectric polarization encodes binary/multilevel weight, read non-destructively by exploiting polarization-dependent threshold shifts or, in PeFETs, via strain-induced bandgap modulation (PiER effect). This supports ternary or multilevel scalar multiplication in a single cell (Thakuria et al., 2022, Mattar et al., 4 Apr 2025).
  • 2D materials (e.g., MoS₂ mem-transistors): Gate-modulated ion migration at elevated temperatures enables multi-level analog conductance states for use as synaptic weights in crossbar arrays (Mallik et al., 2023).

Supported operations include:

  • Bitwise logic (AND/OR/XOR/NOR), typically via parallel row activation and sense/compare schemes
  • Integer add/accumulate (chained bitwise adders or PCM-specific accumulators)
  • Analog MAC for neural network workloads
  • Associative/tag search (CAM or FeFET arrays)
  • Ternary or multilevel scalar multiplication (via PeFETs or multi-level PCM/ReRAM)
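The bitwise-logic mode listed first can be sketched as follows: activating two stored rows simultaneously makes each sense amplifier see a current proportional to the number of conducting cells on its bit line, and thresholded sensing recovers AND/OR/XOR in a single array access. The words and thresholds below are illustrative, not tied to any particular cited array.

```python
import numpy as np

# Two stored words in the same column group (1 = low-resistance state).
row_a = np.array([1, 0, 1, 1, 0], dtype=int)
row_b = np.array([1, 1, 0, 1, 0], dtype=int)

# With both rows activated at once, each bit line carries a current
# proportional to how many of its cells conduct (0, 1, or 2).
bitline_level = row_a + row_b

# Thresholded sensing turns the summed current into bitwise logic:
AND = (bitline_level >= 2).astype(int)  # both cells conduct
OR  = (bitline_level >= 1).astype(int)  # at least one conducts
XOR = (bitline_level == 1).astype(int)  # exactly one conducts
```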

3. Performance, Energy, and Endurance Metrics

Representative metrics reported across device technologies and architectures:

| PE Type | Array Size | Operation | Peak Throughput | Energy/Op | Retention | Endurance |
|---|---|---|---|---|---|---|
| RRAM/PCM crossbar | 128×128 | Analog MAC | ~10¹² MAC/s/tile | ~0.04 pJ (MAC) | ~years (PCM) | 10⁸–10⁹ cycles |
| STT-RAM PiC | 32 KB L₁ | Bitwise/add | 16× 32b ops/cycle | E_add ≈ 5.8 pJ | 75 μs (relaxed) | >10¹⁵ writes |
| FeFET (TD-nvIMC) | 16×8×32 | Binary MAC (XOR/AND) | 232 GOPS | ~0.53 fJ (MAC) | >10 yrs | not specified |
| PeFET STeP-CiM | 256×256 | Ternary MAC | ≫ SRAM PeFET-NM | 15–91% lower (rel.) | >10 yrs | >10¹² cycles |
| 2D MoS₂ mem-crossbar | custom | 3-bit MAC | O(1) time/column | 0.1–1 pJ (update) | ≥1,500 s @ 450 K | >10 cycles* |
| 3D RRAM pillar | 4× (stacked) | 2–3 op. logic | — | 11 fJ (bit logic) | >10 yrs | >10⁹ (OxRAM) |

*Short retention unique to thermally-driven MoS₂ operation in mem mode.
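As a sanity check on the crossbar row of the table, the implied per-tile compute power is simply throughput times energy per operation. The arithmetic below counts one MAC as one OP; many papers count a MAC as two operations, which would double the TOPS/W figure.

```python
# Back-of-envelope from the RRAM/PCM crossbar row above.
throughput = 1e12           # ~10^12 MAC/s per tile
energy_per_mac = 0.04e-12   # ~0.04 pJ per MAC

power_w = throughput * energy_per_mac            # 0.04 W = 40 mW per tile
efficiency_tops_per_w = (throughput / 1e12) / power_w  # 25 TOPS/W (1 MAC = 1 OP)
```

A roughly 40 mW active tile is what makes packing hundreds to thousands of such tiles per die thermally plausible.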

Notable system-level results:

  • PICNIC’s RRAM-PE achieves <0.5 pJ per MAC and, when integrated into a 3D cluster network for LLM inference, realizes a 3.95× speedup and 30–57× better energy efficiency than Nvidia A100/H100 GPUs at similar throughput (Chong et al., 6 Nov 2025).
  • STeP-CiM PeFET arrays report 91% lower MAC latency and 15–91% lower energy over near-memory SRAM/PeFET designs (Thakuria et al., 2022).
  • Hierarchical STT-RAM systems: PiC_L₁ yields a 2.84× energy reduction and 2.27× latency speedup vs. a CPU+STT-RAM cache baseline; PiM yields up to 5.4×–20× gains on bitwise, low-reuse workloads (Gajaria et al., 29 Jul 2024).

4. Scalability, System Integration, and Programming Model

Scalability features and integration strategies include:

  • Fine-grain concurrency: PiNVSM’s DPU array co-locates compute with every data segment; transformations broadcast by keyword are applied concurrently across thousands to millions of DPUs (Dubeyko, 2019).
  • Tile Array and 3D Integration: Modern designs use hundreds to thousands of PE crossbar tiles per die, with 3D stacking (logic+memory+photonic/digital dies) and silicon-photonic interconnects for chiplet-level communication and system scaling (Chong et al., 6 Nov 2025).
  • Hierarchical Embedded Compute: STT-RAM PEs are allocated to caches and main memory with programmable retention, supporting dynamic trade-offs between latency, energy, and persistence (Gajaria et al., 29 Jul 2024).
  • Endurance and Variability Mitigation: Reliability schemes span ECC, wear-leveling controllers, program-and-verify, and device-aware mapping of slow-sense states to fast array paths (Song et al., 2022, Li et al., 2019).
  • Programming Models:
    • DPUs: Keyword-based addressing. Data structures are flattened to key–value records, and queries/updates are message-broadcast.
    • Crossbars: Weight programming by array pulses; compute via analog MACs; algorithms retrained to tolerate non-ideality.
    • PeFETs/TD-nvIMC: Controlled by high-level MAC or associative instructions; time-domain or polarization-based inputs map directly to logic operations.
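The keyword-based DPU model above can be sketched in a few lines. This is an illustrative software analogue, not PiNVSM's actual interface: each `DPU` object stands for one storage-coupled compute unit, and the broadcast loop stands for the message fabric; all names are hypothetical.

```python
# Illustrative model of keyword-addressed DPUs: each DPU holds key-value
# records resident in its NVM array; a broadcast applies a transform to
# every matching record locally, with no central data movement.
class DPU:
    def __init__(self, records):
        self.records = dict(records)  # key -> value, co-located with compute

    def on_broadcast(self, keyword, transform):
        # Only DPUs holding the keyword react; others ignore the message.
        if keyword in self.records:
            self.records[keyword] = transform(self.records[keyword])

dpus = [DPU({"temp": 21}), DPU({"temp": 25, "rpm": 900}), DPU({"rpm": 1200})]

# Broadcast "increment every 'temp' record": each matching DPU applies
# the transform concurrently in the real hardware model.
for d in dpus:
    d.on_broadcast("temp", lambda v: v + 1)
```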

Key system-level implications include radical shifts in storage-compute semantic binding, explicit mapping of neural nets onto physical PE arrays (including activation-slicing in LLMs), and the need for new runtime/library/ISA abstractions.

5. Non-Idealities, Limitations, and Error Mitigation

The principal technical limitations and device non-idealities are:

  • Device Variability and Drift: RRAM/PCM suffer cycle-to-cycle and device-to-device variations (up to 20%), drift in cell conductance, and non-linear pulse-to-weight response. Countermeasures: program-and-verify, mixed/differential encoding, retraining with noise models, and algorithmic compensation (Haensch et al., 2022, Li et al., 2019).
  • IR-Drop, Sneak Paths, and Peripheral Overhead: Large crossbar arrays are limited by IR voltage drops across metal lines (attenuating signals), unwanted current leakage (sneak paths), and high area share of ADCs/DACs (up to 80% at high precision) (Haensch et al., 2022).
  • Write-Endurance: PCM/ReRAM endure only 10⁸–10¹² writes; wear-leveling and usage scheduling are required.
  • ADC/DAC Calibration: Continuous drift in analog periphery requires closed-loop calibration. In PICNIC, digital feedback corrects per-ADC offset, and analog peripheral macro calibrates DACs (Chong et al., 6 Nov 2025).
  • Retention-Energy Tradeoff: In relaxed-retention STT-RAM PiC, lowering energy via free-layer scaling causes accelerated data decay, which must be countered by retention-aware scheduling and hardware-managed evictions (Gajaria et al., 29 Jul 2024).
  • Programming and ISA Support: Required for handling new computation models (e.g., keyword-driven dataflow in PiNVSM, multi-op instructions in PiC/PiM) and exploiting array-aware data mapping.
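The program-and-verify countermeasure listed above has a simple control-loop shape: pulse the cell, read it back, and stop once the conductance is within tolerance of the target. The device model below (each pulse closes a noisy 20–40% of the remaining gap) is purely illustrative, not a fit to any cited device.

```python
import random

def program_and_verify(target_g, tolerance=0.05, max_pulses=50, seed=0):
    """Iteratively pulse a simulated NVM cell toward a target conductance,
    re-reading after every pulse. Returns (final conductance, pulse count).
    The pulse-response model here is illustrative only."""
    rng = random.Random(seed)
    g = 0.0
    for pulses in range(1, max_pulses + 1):
        # Each pulse moves conductance a noisy fraction of the remaining gap.
        g += (target_g - g) * rng.uniform(0.2, 0.4)
        if abs(g - target_g) <= tolerance * target_g:  # verify read
            return g, pulses
    return g, max_pulses
```

The same loop structure underlies real write controllers, with the read-back step consuming part of the write-energy and endurance budget.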

6. Application Domains and Impact

Non-volatile in-memory PEs find broad applicability in data-intensive and AI domains:

  • Neural Network Acceleration: CMOS+crossbar PE designs have demonstrated >98% software-equivalent inference accuracy on MNIST, CIFAR10. RRAM and PCM crossbars are optimized for CNN/RNN MAC workloads, including real-time LLM inference (Haensch et al., 2022, Chong et al., 6 Nov 2025).
  • Big Data Analytics: The data-centric PiNVSM model enables at-memory computation over decomposed data structures (e.g., graphs, B-trees), with DPU parallelism matching data footprint (Dubeyko, 2019).
  • IoT and Edge Computing: NVM PEs support power-off persistence, instant-on wake-up, and always-on inference at device level (Dubeyko, 2019, Mattar et al., 4 Apr 2025).
  • Associative Search, Database, Sparse Linear Algebra: CAM-FeFET and PeFET arrays with ternary or time-domain MAC are suited to search, match, and arithmetic over ternary/binary inputs at scale (Mattar et al., 4 Apr 2025, Thakuria et al., 2022).
  • Hierarchical System Optimizations: PiC in STT-RAM L₁/L₂ caches is superior where data reuse is high or CPU stalls dominate, while PiM excels in embarrassingly parallel, low-control-flow kernels (Gajaria et al., 29 Jul 2024).

7. Future Directions and Open Research Challenges

Current research identifies several frontiers:

  • Materials and Device Innovations: Promising directions include electrochemical RAM (ECRAM), HfO₂ FeFETs for tightly integrated multi-level gates, SOT-MRAM for independent read/write paths, and high-precision analog conductance (Haensch et al., 2022).
  • System and Circuit Co-Design: 3D stacking, chiplet clustering with dynamic power gating, and photonic interconnects for further scaling and energy reduction (Chong et al., 6 Nov 2025).
  • Compiler/ISA/Runtimes: Need for automated partitioning of code segments to PiC/PiM, multi-op ISA entries, and transparent data alignment, especially in hierarchical NVM architectures (Gajaria et al., 29 Jul 2024).
  • Error-Robust Training: Re-training neural networks with non-ideality/noise models native to RRAM/PCM devices, employing precision-adaptive algorithms, and device-aware fine-tuning (Haensch et al., 2022).
  • Thermal and IR Management: For 2D MoS₂ or 3D pillar arrays, attention to process-induced variation, low-thermal-budget BEOL, and run-time heat-dissipation is necessary (Mallik et al., 2023, Ezzadeen et al., 2020).
  • Integration with Conventional CMOS: Compatibility of 2D nanomaterials, monolithic stacks with logic/process constraints, and back-end annealing processes.
  • Scaling and Area-Efficiency: Pillar-based 3D vertical RRAM achieves >70× area density over planar, but integration and variability control require further process development (Ezzadeen et al., 2020).

Continued progress depends on advances at every level: device materials, array and peripheral design, architecture/runtime co-optimization, and algorithmic robustness under device-specific constraints. Non-volatile in-memory PEs represent a strategic direction for breaking the latency and energy bottlenecks of data-intensive and AI-centric computation.
