3D NAND Flash Processing-in-Memory
- 3D NAND Flash PIM is a computational architecture that integrates processing capabilities directly within dense vertical flash arrays.
- It leverages advanced device engineering and analog circuit techniques to perform vector, Boolean, and neural network operations in-memory.
- The approach enables energy-efficient large language model inference and associative search while mitigating data movement challenges.
3D NAND Flash Processing-in-Memory (PIM) encompasses a class of computational architectures and circuit-level methodologies that leverage the inherent device and array structure of vertical 3D NAND flash memories to perform computation within the storage subsystem. By embedding vector operations, bulk Boolean logic, associative search, and even neural network primitives directly into dense flash arrays, the approach collapses the memory–compute dichotomy and addresses the severe data-movement and energy-efficiency bottlenecks that afflict traditional von Neumann memory hierarchies on data-intensive workloads such as databases, LLM inference, and neuromorphic tasks.
1. Device and Array Structures Enabling In-Flash Computation
At the core of 3D NAND-based PIM is the physical structure of the vertical NAND string: a stack of individually addressable floating-gate or charge-trap transistors connected in series between a bitline (BL) and the source line, with wordlines (WLs) stacked along the vertical axis. High-density stacking (e.g., 48–128+ layers) yields the area efficiency necessary to amortize control and sense circuitry overhead for in-memory compute (Jang et al., 17 Nov 2025, Park et al., 2022, Bavandpour et al., 2019).
Advanced flash-compatible device engineering is exemplified by the integration of amorphous InGaZnO (a-IGZO) channels, which overcome the variability and low mobility of poly-Si, achieving high ON current (127 μA/μm at L_CH=60 nm), sub-pA OFF current, and superior retention/endurance for monolithic 3D stacking and compute (Sun et al., 2021). This enables backend-of-line (BEOL) co-integration of functional accelerator tiers—such as nonvolatile ternary content-addressable memory (TCAM)—above standard flash arrays.
2. Bulk Bitwise and Boolean Operations: The Flash-Cosmos Substrate
The Flash-Cosmos architecture demonstrates the direct exploitation of the analog conduction properties of NAND strings for “one-shot” Boolean reductions. Flash-Cosmos extends off-the-shelf 3D NAND dies with a microcontroller and new command set to support:
- Multi-Wordline Sensing (MWS): Simultaneous activation of multiple WLs, either within a block (intra-block) for AND or across blocks (inter-block) for OR, leverages the analog series/parallel conduction of NAND strings:
- Intra-block: out = b₁ ∧ b₂ ∧ … ∧ bₙ; the string conducts, and the sense amplifier (SA) outputs "1", iff all selected cells conduct (bitwise AND).
- Inter-block: out = b₁ ∨ b₂ ∨ … ∨ bₘ; the BL conducts, and the SA outputs "1", if any selected cell conducts (bitwise OR).
- Enhancements: Built-in support for inversion (NAND/NOR) and for XOR/XNOR via digital logic and command sequencing.
A single MWS step can compute up to a 48-operand AND or a 4-block OR in 25 ns (only 3.3% higher latency than a single-page read), achieving massive bit-level parallelism and drastically reducing data movement (Park et al., 2022).
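The MWS semantics can be captured in a few lines of functional modeling. The NumPy sketch below is a behavioral model only (page contents as bit vectors, a "1" meaning the cell conducts on read); it reproduces what the sense amplifiers report under intra-block and inter-block MWS, with NAND/NOR following from inverting inputs or outputs:

```python
import numpy as np

def mws_and(pages: np.ndarray) -> np.ndarray:
    """Intra-block MWS: multiple WLs of the same block are activated, so a
    NAND string conducts only if every selected cell conducts. The sense
    amplifier therefore reads the bitwise AND of all operand pages."""
    return np.bitwise_and.reduce(pages, axis=0)

def mws_or(pages: np.ndarray) -> np.ndarray:
    """Inter-block MWS: one WL in each of several blocks sharing the same
    bitlines is activated, so a BL conducts if any string conducts. The
    sense amplifier therefore reads the bitwise OR of the operand pages."""
    return np.bitwise_or.reduce(pages, axis=0)

# Toy example: three 8-bit operand "pages" (1 = cell conducts on read).
pages = np.array([[1, 1, 0, 1, 0, 1, 1, 0],
                  [1, 0, 0, 1, 1, 1, 0, 0],
                  [1, 1, 0, 1, 0, 0, 1, 0]], dtype=np.uint8)

print(mws_and(pages))  # -> [1 0 0 1 0 0 0 0]
print(mws_or(pages))   # -> [1 1 0 1 1 1 1 0]
```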
3. Reliable In-Flash Computing: Programming and Error Mitigation
Flash cell nonidealities—chiefly threshold voltage (V_th) distribution overlap and retention/programming errors—constitute a main roadblock to PIM reliability. Flash-Cosmos’s Enhanced SLC-mode Programming (ESP) sharpens state separation by:
- Increasing the program voltage (V_PGM) and shrinking the incremental step size (ΔV_ISPP), maximizing the margin between programmed and erased V_th states and minimizing the width of each V_th distribution.
- Empirically, ESP yields zero observed bit errors over 4.8×10¹¹ tested bits (a statistical RBER of 2×10⁻¹²), at the cost of ~2× program latency.
- MWS operations (AND/OR) are validated error-free up to 32-way operand reduction and across typical retention/endurance regimes.
Randomization (scrambling) disables MWS (since bit ordering is not preserved), but ESP negates the resultant RBER increase, ensuring that in-place Boolean primitives meet or exceed SSD uncorrectable bit-error rate (UBER) requirements (Park et al., 2022).
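A toy simulation makes the ESP intuition concrete. In standard incremental step pulse programming (ISPP), a cell's final V_th overshoots the verify level by up to one step, so shrinking the step tightens the programmed distribution, while a higher starting amplitude keeps the pulse count manageable. The Python sketch below uses made-up voltages and variation (not the paper's characterization data) to reproduce that qualitative trade:

```python
import numpy as np

rng = np.random.default_rng(0)

def ispp_program(n_cells, v_start, v_step, v_verify):
    """Toy ISPP model (all parameters illustrative, not from the paper).

    Approximation: after a pulse of amplitude V, a cell's V_th settles at
    V minus a per-cell offset d (process variation). Pulses step up by
    v_step until the cell passes verify, so the final V_th lands in
    [v_verify, v_verify + v_step): a smaller step means a tighter
    distribution, and a higher start means fewer pulses."""
    d = rng.normal(12.0, 0.3, n_cells)            # per-cell offset (V)
    pulses = np.maximum(np.ceil((v_verify + d - v_start) / v_step), 0)
    vth = v_start + pulses * v_step - d
    print(f"avg pulses: {pulses.mean():4.1f}, "
          f"programmed V_th spread (std): {vth.std() * 1000:5.1f} mV")

ispp_program(100_000, v_start=14.0, v_step=0.50, v_verify=4.0)  # baseline SLC
ispp_program(100_000, v_start=15.0, v_step=0.10, v_verify=4.0)  # ESP-like
```

Consistent with the reported ~2× program latency, the ESP-like setting roughly doubles the average pulse count while cutting the programmed V_th spread several-fold.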
4. Analog/Mixed-Signal Arithmetic and AI: Time-Domain VMM and System Organization
Dense time-domain-encoded vector-by-matrix-multiply (VMM) circuits using 3D NAND flash (commercial 64-layer, 55-nm) enable mixed-signal dot-product operations suitable for neuromorphic and AI inference workloads (a functional sketch follows the list):
- Time-domain encoding: Each digital input x_i is mapped by a digital-to-time converter (DTC) to a pulse of duration T_i ∝ x_i.
- Weight storage: Each floating-gate cell's programmed state encodes a weight w_ij as a cell current I_ij; summing cell currents onto the BL load capacitor during the integration phase accumulates a charge Q_j = Σ_i I_ij·T_i that sets the output voltage.
- Output conversion: A constant current then ramps the BL voltage to a comparator threshold; the time-to-threshold is digitized by a time-to-digital converter (TDC), and the resulting code encodes the dot product.
- Design point: 4-bit computing accuracy with nA-scale cell currents, ns-scale pulse and conversion timing, fJ-scale energy per operation, and high aggregate throughput (Bavandpour et al., 2019).
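The end-to-end dataflow is easy to check numerically. The Python sketch below chains functional DTC, integration, and TDC stages; every constant (T_UNIT, I_UNIT, C_BL, I_RAMP, T_TDC) is an illustrative placeholder rather than a published design value, and the fixed comparator reference is approximated by the largest integrated voltage:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical constants for illustration (not the published design point).
T_UNIT = 1e-9      # DTC: pulse width per input LSB (s)
I_UNIT = 1e-9      # cell current per weight LSB (A)
C_BL   = 20e-15    # bitline load capacitance (F)
I_RAMP = 1e-9      # ramp current for output conversion (A)
T_TDC  = 1e-9      # TDC resolution (s)

def time_domain_vmm(x, W):
    """Functional model of one time-domain VMM array.

    DTC:        input x_i -> pulse of width T_i = x_i * T_UNIT.
    Integrate:  cell (i, j) sources I_ij = W[i, j] * I_UNIT onto the BL-j
                capacitor while its pulse is high, so V_j ~ (x . W[:, j]).
    Convert:    a constant current ramps V_j to a reference; the TDC
                digitizes the time-to-threshold, which encodes the result
                (larger code = smaller dot product, i.e. inverted)."""
    t_in = x * T_UNIT                            # DTC outputs
    v_bl = (t_in @ (W * I_UNIT)) / C_BL          # integrated BL voltages
    v_ref = v_bl.max()                           # stand-in for a fixed reference
    t_out = (v_ref - v_bl) * C_BL / I_RAMP       # time to threshold
    return np.round(t_out / T_TDC).astype(int)   # TDC codes

x = rng.integers(0, 16, size=8)        # 4-bit inputs
W = rng.integers(0, 16, size=(8, 4))   # 4-bit weights
print("digital dot products:", x @ W)
print("TDC codes (inverted):", time_domain_vmm(x, W))
```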
System-level organizations (e.g., 3D-aCortex) compose large arrays of VMM processing elements (PEs) with time-shared or amortized periphery, eDRAM for intermediate storage, and dedicated DMA/operation controllers. Such organizations yield storage efficiency of up to 30.7 MB/mm² and energy efficiency of up to 70.43 TOps/J.
5. Architectural Specialization: LLM Inference and Hierarchical Array Integration
Processing-in-memory in 3D NAND has been architected to address the bandwidth and capacity demands of LLM inference:
- Array organization: 256 rows × 2048 columns × 128 vertical stacks per plane; planes grouped into 8 channels, 4 ways, and 8 dies per way.
- On-die H-tree network: Reconfigurable processing units (RPUs) provide pipelined accumulation and streaming, allowing on-the-fly reduction across planes.
- Operation tiling: Static MVM (sMVM) over LLM weights is mapped to QLC planes; dynamic MVM (dMVM) over key-value (KV) cache data is mapped to SLC planes.
- Performance: End-to-end single-batch token generation achieves a 2.4× speedup over four RTX4090 GPUs (vLLM) and only 4.9% overhead relative to four A100 GPUs, with the added PIM circuitry occupying under 4.98 mm² per die (Jang et al., 17 Nov 2025).
This specialization demonstrates that properly engineered 3D NAND PIM can host entire LLM pipelines, reducing memory-wall bottlenecks by co-locating compute with persistently stored models, and by leveraging QLC/SLC trade-offs for non-volatile parameter storage vs. fast, low-error cache.
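A dataflow-level sketch clarifies how the sMVM/dMVM split and H-tree reduction compose. The Python model below uses hypothetical sizes and an 8-plane gang; it captures only the tiling and log-depth accumulation, not QLC/SLC device behavior or the paper's exact pipeline (static weights stand in for QLC-resident operands, the key cache for SLC-resident ones):

```python
import numpy as np

rng = np.random.default_rng(2)

def htree_reduce(partials):
    """Pairwise, log-depth combination of per-plane partial sums,
    mimicking pipelined accumulation in the on-die H-tree rather than
    a flat sum at a single node."""
    while len(partials) > 1:
        nxt = [partials[i] + partials[i + 1]
               for i in range(0, len(partials) - 1, 2)]
        if len(partials) % 2:
            nxt.append(partials[-1])
        partials = nxt
    return partials[0]

def pim_mvm(x, W, n_planes=8):
    """Row-tile W across planes: each plane holds a slice of the matrix,
    multiplies it by the matching input slice, and the H-tree combines
    the partial products."""
    tiles = np.array_split(np.arange(W.shape[0]), n_planes)
    return htree_reduce([x[rows] @ W[rows] for rows in tiles])

d_model, seq = 512, 64
x  = rng.standard_normal(d_model)
Wq = rng.standard_normal((d_model, d_model))  # static weights ("QLC planes")
K  = rng.standard_normal((d_model, seq))      # KV cache      ("SLC planes")

q      = pim_mvm(x, Wq)   # sMVM: projection against stored weights
scores = pim_mvm(q, K)    # dMVM: attention scores against cached keys
assert np.allclose(scores, (x @ Wq) @ K)      # matches the reference result
```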
6. Associative Search and Nonvolatile TCAM with 3D-NAND-Compatible Devices
Amorphous-IGZO floating-gate transistors enable 3D NAND-compatible nonvolatile two-transistor ternary CAM (TCAM) cells for associative memory:
- Cell: Two FG transistors (T₀, T₁) encode "0," "1," and "X" via V_TH programming; match-line sensing is robust to noise and variation.
- Experimental metrics: endurance of 1,000 P/E cycles and a 10-year retention window >0.9 V at 80 °C.
- Array-level simulations: Search energy ~6 fJ/bit (∼2.7× better than CMOS/ReRAM/FeFET TCAMs), cell area ~2F², and ~240× higher array-size scalability.
- BEOL integration allows these TCAM layers to be vertically stacked above logic/data arrays, achieving ultra-high aggregate capacity per footprint (Sun et al., 2021).
Resulting architectures support in-place pattern search, hyperdimensional computing, and AI primitives with 100×–1000× lower energy/latency than off-chip approaches.
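The ternary match rule itself is simple enough to state as a behavioral model. The Python sketch below models a NOR-type match line with two FG transistors per cell, T₀ gated by the search line and T₁ by its complement; this gating convention is one common choice for illustration, not necessarily the paper's exact biasing:

```python
# Per-cell V_th encoding: True = low V_th (conducts when its gate is high).
# A conducting transistor pulls the precharged match line low (mismatch).
ENCODE = {
    "0": (True, False),   # T0 conducts on query 1 -> mismatch path
    "1": (False, True),   # T1 conducts on query 0 -> mismatch path
    "X": (False, False),  # neither conducts       -> matches anything
}

def tcam_match(stored_word, query_bits):
    """Match line stays high (True) iff no cell discharges it."""
    for sym, q in zip(stored_word, query_bits):
        t0_on, t1_on = ENCODE[sym]
        # SL = q gates T0; /SL = not q gates T1.
        if (q == 1 and t0_on) or (q == 0 and t1_on):
            return False
    return True

words = ["10X1", "0XX0", "1111"]
query = [1, 0, 1, 1]
print([w for w in words if tcam_match(w, query)])  # -> ['10X1']
```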
7. Limitations, Trade-offs, and Scaling Directions
Despite substantial gains, several intrinsic limitations and trade-offs persist:
- Operand locality: Bitwise operands for MWS must reside within power-constrained die boundaries or blocks; encryption and scrambling interfere with MWS patterning, requiring adaptation of encryption schemes or physical reordering.
- Programming overhead: ESP and SLC modes double storage cost and increase PIM-target page program time, though regular MLC/TLC flash is unaffected for capacity storage (Park et al., 2022).
- Voltage and endurance: a-IGZO TCAM employs relatively large program/erase voltages (±4–6 V) over ms-scale pulses; endurance (~10³ cycles) and retention degrade with temperature, demanding further dielectric/material innovation (Sun et al., 2021).
- Peripheral scaling: At advanced nodes, physical limitations (DIBL, capacitive coupling, parasitics) are mitigated by time-domain encoding and periphery sharing; scaling projections indicate further improvements in energy and density (Bavandpour et al., 2019).
- Workload mapping: PIM acceleration benefits require tailored data layout and compute offloading; LLM inference systems, for example, must partition static/dynamic operations and exploit array-level tiling for performance gains (Jang et al., 17 Nov 2025).
The aggregate effect of these design paradigms is the demonstration that, with moderate command/control extensions and operation-aware mapping, commodity 3D NAND can serve not only as high-density storage but also as a robust, energy-efficient compute substrate for diverse in-memory acceleration workloads, including Boolean logic, linear algebra, search, and AI inference.