In-Memory and Crossbar Architectures
- In-memory and crossbar-based architectures are hardware designs that integrate memory storage and computation, using crossbar arrays to perform matrix–vector multiplication and logic operations in situ.
- They leverage device technologies like ReRAM, PCM, and memristors to bypass the von Neumann bottleneck by computing directly within memory, reducing data movement and enhancing speed.
- Key challenges include managing sneak-path currents, parasitic resistances, and device variability, while optimizing ADC/DAC overhead to improve system efficiency.
In-memory and crossbar-based architectures constitute a rapidly expanding class of hardware platforms that collapse the physical separation between computation and memory, with the crossbar array—an orthogonal mesh of word- and bit-lines hosting memory or memristive devices at each intersection—serving as both storage medium and computational substrate. By leveraging direct physical laws (Ohm’s and Kirchhoff’s), these architectures can execute matrix–vector multiplication (MVM), logic, and reduction primitives at the site of data, dramatically reducing the classical memory bottleneck of von Neumann systems. Crossbar-based platforms span resistive RAM (ReRAM), phase change memory (PCM), memristors, and SRAM technologies, and are now integral to domain-specific accelerators for deep neural networks, sparse attention, cryptography, and emerging applications in secure/FHE and mixed-mode logic.
1. Architectural Foundations: Device, Array, and Peripheral Structure
Crossbar-based in-memory architectures organize two-terminal (or, in emerging work, four-terminal) memory devices at the intersection of word-lines (typically rows) and bit-lines (columns) (Lai et al., 12 Sep 2025, Kolinko et al., 2024, Esmanhotto et al., 2022, Du et al., 23 Jun 2025, Zhang et al., 21 Jan 2026). Each cell encodes a local synaptic weight, logic value, or multi-level conductance. Cells are implemented using technologies tailored for both storage and analog computation:
- ReRAM/PCM/memristor: Cells programmed into discrete resistance/conductance levels, e.g., 2-bit-per-cell ReRAM with four levels (Lai et al., 12 Sep 2025), phase-change voxels (Noori et al., 2021).
- 1T1R, 2T1R topologies: Selector transistors mitigate sneak paths and allow fine access granularity (Esmanhotto et al., 2022, Kolinko et al., 2024).
- Array granularity: 64×64, 128×128, up to 256×256 crossbars tiled into core-level compute fabrics; hybrid compositions support both digital and analog operation (Lai et al., 12 Sep 2025, Dong et al., 27 Nov 2025).
- Peripheral circuits: High-efficiency word-line drivers, multi-level DACs (input voltage generation), current-to-digital ADCs (6–8 bit, often Flash or SAR), sense-amps, and popcount/threshold detectors. Power and area budgets are often dominated by ADCs (Ibrayev et al., 2024, Dong et al., 27 Nov 2025).
Emerging architectures add four-terminal ion-intercalation devices (Zhang et al., 21 Jan 2026), structurally separating read and write paths to enable full-parallel writes and eliminate sneak-path current by construction, and novel cell types (mixed-mode BiFeO₃) supporting resistive or voltage-driven operation within each cycle (Du et al., 23 Jun 2025).
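The multi-level encoding above (e.g., the 2-bit-per-cell, four-level ReRAM) can be sketched in a few lines; the conductance range and level spacing below are illustrative assumptions, not values from the cited devices.

```python
# Hypothetical sketch: map floating-point weights onto the four discrete
# conductance levels of a 2-bit-per-cell device. All values are illustrative.
G_MIN, G_MAX = 1e-6, 100e-6  # assumed conductance range in siemens
LEVELS = 4                   # 2 bits per cell -> four levels

def quantize_weight(w, w_min=-1.0, w_max=1.0):
    """Clip w to [w_min, w_max] and snap it to the nearest of the 4 levels."""
    w = max(w_min, min(w_max, w))
    # Normalize to [0, 1], pick the nearest level index, rescale to conductance.
    level = round((w - w_min) / (w_max - w_min) * (LEVELS - 1))
    return G_MIN + level * (G_MAX - G_MIN) / (LEVELS - 1)

levels = [quantize_weight(w) for w in (-1.0, -0.2, 0.3, 1.0)]
```

In a real flow this snap-to-level step would be followed by iterative program-and-verify pulses to land each cell inside its target conductance window.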
2. Physical Model: In-Memory Computation Primitives
The crossbar natively computes vector–matrix products via analog physical summation:

$$I_j = \sum_i G_{ij} V_i$$

where the input vector $V$ is applied as voltages to the word-lines, $G$ is the crossbar conductance matrix, and the output currents $I_j$ on the bit-lines realize the dot products. Extensions include:
- Multi-level and binary operation: Devices programmed to multiple states for fixed-point, ternary, or binary logic (Lai et al., 12 Sep 2025, Esmanhotto et al., 2022).
- Logic primitives: Scouting Logic and mixed-mode cycles realize symmetric Boolean functions (NAND, NOR, XOR) and multi-bit addition in the analog domain, with reliability up to 16 parallel operands using drift-compensated programming (Esmanhotto et al., 2022, Du et al., 23 Jun 2025).
- Polynomial and cryptographic kernels: Convolutional and transform-based polynomial modular multiplication mapped to Toeplitz/circulant VMM in crossbars, with bit-plane mappings achieving high throughput and compactness (Li et al., 2023).
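The circulant mapping in the last bullet rests on a simple identity: multiplying polynomials modulo $x^n - 1$ is exactly a matrix–vector product with a circulant matrix, which a crossbar can hold as conductances. A minimal pure-Python check of that identity:

```python
def circulant(a):
    """Circulant matrix C with C[i][j] = a[(i-j) mod n]; C @ b then computes
    polynomial multiplication a(x)*b(x) mod (x^n - 1)."""
    n = len(a)
    return [[a[(i - j) % n] for j in range(n)] for i in range(n)]

def vmm(M, v):
    """Plain vector-matrix multiply, the operation a crossbar performs in situ."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def polymul_mod(a, b):
    """Reference schoolbook polynomial multiplication mod (x^n - 1)."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            c[(i + j) % n] += a[i] * b[j]
    return c

a, b = [1, 2, 3, 4], [5, 6, 7, 8]
assert vmm(circulant(a), b) == polymul_mod(a, b)
```

In the cited crypto accelerators the coefficients are further split into bit-planes so each plane fits the cells' limited precision; that refinement is omitted here.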
Peripheral ADCs digitize outputs for accumulation, routing, or subsequent logic; their precision and energy dominate system-level efficiency in large arrays (Ibrayev et al., 2024, Dong et al., 27 Nov 2025).
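Putting the pieces together, a toy model of the analog MVM plus a uniform ADC (the device values, full-scale current, and 6-bit precision are all illustrative assumptions):

```python
def crossbar_mvm(G, V):
    """Ideal crossbar MVM: bit-line current I_j = sum_i G[i][j] * V[i]."""
    rows, cols = len(G), len(G[0])
    return [sum(G[i][j] * V[i] for i in range(rows)) for j in range(cols)]

def adc(current, i_max, bits=6):
    """Uniform ADC: quantize a bit-line current into a bits-wide digital code."""
    code = round(current / i_max * (2**bits - 1))
    return max(0, min(2**bits - 1, code))

G = [[1e-6, 2e-6],        # conductances in siemens (assumed values)
     [3e-6, 4e-6]]
V = [0.2, 0.1]            # word-line read voltages in volts (assumed)
I = crossbar_mvm(G, V)    # analog summation on the bit-lines
codes = [adc(i, i_max=1e-6) for i in I]   # digitization step
```

The `adc` helper is where the precision/energy trade-off discussed above bites: every extra bit roughly doubles comparator cost in a Flash ADC, which is why sparsity-aware schemes try to shrink `bits`.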
3. Performance-Limiting Nonidealities and Mitigation
Physical nonidealities, especially at array and device scale, fundamentally limit the reliability, precision, and energy-efficiency of in-memory crossbar computing:
- Sneak-path current: Parasitic routes through half-selected or unselected devices cause unwanted leakage, degrade noise margins, and ultimately constrain the maximum crossbar size. Closed-form analytical models capture the exponential dependence of leakage on array size, device ON/OFF ratio, and read bias; validated to <10.9% error against SPICE, they run 4784× faster, enabling design-time assessment (Riam et al., 26 Nov 2025). Four-terminal designs eliminate sneak paths by architecturally separating read and write lines (Zhang et al., 21 Jan 2026).
- Parasitic line resistance: Row/column resistance causes IR-drop, attenuating and biasing the ideal output (Zhang et al., 2019, Kolinko et al., 2024). Mitigation leverages device selection, tuning, conversion-plus-calibration flows, and optimal voltage swing.
- Device-level variability and drift: ReRAM and PCM exhibit both device-to-device and cycle-to-cycle stochasticity, including power-law conductance drift and 1/f noise (Petropoulos et al., 2020, Esmanhotto et al., 2022). Experimentally, smart programming (FC-SP) and careful bit-level encoding hold error below 1% after one hour of drift and sustained endurance cycling (Esmanhotto et al., 2022), while system-level emulators accurately replicate drift/noise evolution across thousand-device arrays (Petropoulos et al., 2020).
- ADC/DAC overhead: Digitization of analog outputs is a leading source of area and energy. D·U·B pruning strategies (Discrete, Unstructured, Balanced) induce hardware-efficient sparsity, substantially lowering ADC precision requirements (up to 7.13× energy savings) (Ibrayev et al., 2024). Techniques such as dynamic ADC gating adapt the enabled comparator set at runtime to the expected operation (e.g., single-row READ vs. MAC) (Lai et al., 12 Sep 2025).
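As a concrete illustration of the variability bullet, the sketch below perturbs an ideal conductance matrix with a multiplicative lognormal model (an assumed distribution, not calibrated to any cited device) and measures the relative output error:

```python
import math
import random

def crossbar_mvm(G, V):
    """Ideal bit-line currents I_j = sum_i G[i][j] * V[i]."""
    return [sum(G[i][j] * V[i] for i in range(len(G))) for j in range(len(G[0]))]

def perturb(G, sigma=0.1, seed=0):
    """Apply multiplicative lognormal device-to-device variability (assumed model)."""
    rng = random.Random(seed)
    return [[g * math.exp(rng.gauss(0.0, sigma)) for g in row] for row in G]

G = [[1e-6, 4e-6],
     [2e-6, 3e-6]]          # illustrative conductances (S)
V = [0.2, 0.2]              # illustrative read voltages (V)

ideal = crossbar_mvm(G, V)
noisy = crossbar_mvm(perturb(G, sigma=0.1), V)
rel_err = [abs(n - i) / i for n, i in zip(noisy, ideal)]
```

Sweeping `sigma` in such a model is a cheap way to estimate how much device spread a given workload tolerates before ADC codes start flipping.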
4. Architectures and Algorithms for System-Level Efficiency
State-of-the-art in-memory crossbar systems optimize workload mapping, dataflow, and redundancy to maximize effective utilization and minimize system overhead:
- Embedding reduction & dataflow: ReCross, for DLRM workloads, employs correlation-aware grouping, frequency-based duplication, and dynamic ADC switching for embedding reduction, reaching 3.97× execution-time and 6.1× energy improvement over state-of-the-art (Lai et al., 12 Sep 2025).
- Load balancing and streaming: Block-wise array allocation, latency-proportional distribution, and pull-based streaming decouple sub-array synchronization, raising utilization from ~20% to >60% and providing up to 7.5× throughput over naive fixed-weight allocation (Crafton et al., 2020).
- Sparse attention kernels: Architectures such as CPSAA combine PIM-based mask computation, in-memory pruning (ReCAM scheduling), and optimized SDDMM/SpMM mapping to realize 89.6× performance and 755.6× energy improvement over GPU baselines for transformer models (Li et al., 2022).
- CADC (dendritic convolution): Embedding a ReLU or similar non-linearity into in-memory psum pipelines increases zero sparsity (up to 80%), allowing zero-compression, zero-skipping, and 11×–18× speedup plus energy efficiency gains (Dong et al., 27 Nov 2025).
Compiler frameworks (e.g., COMPASS) optimize DNN partitioning, weight assignment, and memory footprint under crossbar capacity constraints, extending accelerator reach to larger networks (e.g., VGG16) and providing 1.78× throughput and 1.28× EDP gains relative to greedy schemes (Park et al., 12 Jan 2025).
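The core partitioning step such compilers perform can be reduced to tiling a layer's weight matrix across fixed-size arrays; the helper below is a hypothetical sketch using a 128×128 array size (the cited frameworks additionally optimize which core each tile lands on):

```python
def tile_matrix(rows, cols, xbar=128):
    """Partition a rows x cols weight matrix into xbar x xbar crossbar tiles.
    Returns (row_start, col_start, height, width) for each tile; edge tiles
    are smaller and leave part of their array unused."""
    tiles = []
    for r in range(0, rows, xbar):
        for c in range(0, cols, xbar):
            tiles.append((r, c, min(xbar, rows - r), min(xbar, cols - c)))
    return tiles

# A 300x520 layer on 128x128 arrays needs ceil(300/128) * ceil(520/128) = 3 * 5 tiles.
tiles = tile_matrix(300, 520, xbar=128)
```

Partial-sum currents from tiles sharing a row range must still be accumulated digitally after the ADCs, which is exactly the inter-tile dataflow the mapping frameworks above schedule.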
5. Design Automation and Hardware–Algorithm Co-Optimization
Automating design space exploration is imperative due to the vast combinatorial space of crossbar architectures, device technologies, array sizes, peripheral precision, and workload mappings:
- LLM-driven pipelines: LIMCA introduces a no-human-in-loop methodology leveraging a curated IMC-dataset, LLM-powered query/ranking, and automated SPICE validation, reducing DSE time by 11.5×–49.7× over manual methods; supports retrieval, ranking, and netlist generation for MRAM, RRAM, PCM, and CBRAM crossbar scenarios (Vungarala et al., 17 Mar 2025).
- Technology mapping and co-design: Crossbar-constrained mapping flows handle explicit device/array constraints, synthesizing logic or arithmetic networks into feasible crossbar schedules and device allocations, with area- or delay-optimal tradeoffs (Bhattacharjee et al., 2018). Optimization must consider wordline/bitline sharing, ESOP complexity, and the superlinear impact of array width/depth.
- Material-device-algorithm co-design: The physical/algorithmic co-optimization paradigm integrates device-level variability, peripheral design, and workload tolerance, employing noise-aware training, quantization-aware models, and device characterization to maximize system fidelity and efficiency (Haensch et al., 2022).
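The noise-aware training idea can be made concrete with a noisy forward pass: averaging gradients over such perturbed evaluations pushes the model toward weights that tolerate device variability. A minimal sketch, where additive Gaussian weight noise is an assumed stand-in for conductance spread:

```python
import random

def forward(w, x, noise=0.0, rng=None):
    """Linear score with optional additive per-weight noise (assumed Gaussian)."""
    rng = rng or random.Random(0)
    return sum((wi + rng.gauss(0.0, noise)) * xi for wi, xi in zip(w, x))

w = [0.5, -0.25, 1.0]
x = [1.0, 2.0, 0.5]
clean = forward(w, x)              # 0.5 - 0.5 + 0.5 = 0.5
noisy = forward(w, x, noise=0.05)  # sampled perturbation of the same score
```

During noise-aware training the same perturbation would be drawn fresh on every forward pass so the loss reflects the expected, not the nominal, hardware behavior.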
6. Security, Mixed-Mode Logic, and Emerging Directions
Crossbar-based platforms are serving as the foundation for native cryptographic operations, secure inference, and mixed-mode logic primitives:
- Encrypted in-memory inference: Secure BNN architectures utilize PUF-derived keys to transform and store weights in crossbars; inference is conducted directly on encrypted weights with <1% energy overhead and accuracy collapse (<15%) in the absence of a key (Rajendran et al., 27 Oct 2025).
- Stochastic hyperdimensional cryptography: Architectures such as HYPERLOCK employ memristor variability as entropy, mapping inputs to binary hypervectors for cryptographic transformation within the same crossbar used for ML, recoverable by neural decoders—demonstrating high resilience to noise/non-idealities (Cai et al., 2022).
- Mixed-mode computing: Crossbars operating in both resistive and voltage modes within a single device cycle enable dense logic mapping, eliminate the need for sense amplifiers, and minimize area-delay product (A-D-P); SAT-based co-design tools synthesize minimal cell/cycle mappings for arithmetic and cryptographic primitives (Du et al., 23 Jun 2025).
- Full-parallel write architectures: New four-terminal ion-intercalation memristors structurally decouple read and write paths, allowing full-array parallel programming without sneak paths and delivering substantial gains in write bandwidth and energy efficiency over 2T crossbars (Zhang et al., 21 Jan 2026).
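The encrypted-weight idea in the first bullet above can be illustrated with a toy XOR key stream standing in for a PUF response (the cited architecture's actual weight transform may differ):

```python
import random

def encrypt(bits, key):
    """XOR a binary weight vector with a key stream; XOR is its own inverse."""
    return [b ^ k for b, k in zip(bits, key)]

rng = random.Random(42)
w   = [rng.randint(0, 1) for _ in range(64)]   # plaintext binary weights
key = [rng.randint(0, 1) for _ in range(64)]   # hypothetical PUF-derived key
enc = encrypt(w, key)                          # what is actually stored in the array

assert encrypt(enc, key) == w                  # correct key recovers the weights
assert encrypt(enc, [1 - k for k in key]) != w # wrong key: weights stay scrambled
```

An attacker reading the array without the PUF sees only `enc`, which is why accuracy collapses when the key is absent: the decrypted weights are effectively random.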
7. Outlook and Design Guidelines
Crossbar-based in-memory architectures have established a foundational model for efficient, parallel computation at memory, yet future scalability and adoption depend on:
- Mitigating nonidealities: Device selection (e.g. high ON/OFF PCM, optimized RRAM/ECRAM), improved selectors, and array-level biasing or grounding strategies (Riam et al., 26 Nov 2025, Noori et al., 2021).
- Balancing ADC precision and sparsity: Exploiting data-driven pruning, dynamic mode switching, and flexible peripheral allocation to extract maximum system efficiency (Ibrayev et al., 2024, Lai et al., 12 Sep 2025).
- Compiler and mapping co-design: Integrating device, array, and workload constraints into hardware-aware compilers and synthesis flows (Park et al., 12 Jan 2025, Bhattacharjee et al., 2018).
- Enabling security and adaptivity: Embedding cryptographic primitives (PUF-based or noise-based) and rapid, parallel in-array weight updates for dynamic, secure, and reprogrammable computation (Rajendran et al., 27 Oct 2025, Cai et al., 2022, Zhang et al., 21 Jan 2026).
- Exploring mixed-signal and hybrid integration: Mixed resistance/voltage modes, analog/digital partitioning, and hybrid 4T–2T stacks to address functional, throughput, and reliability objectives (Du et al., 23 Jun 2025, Zhang et al., 21 Jan 2026).
In summary, the in-memory and crossbar-based architecture field is defined by the interplay of device physics, analog compute, workload-aware mapping, and robust system software. Its ongoing evolution is characterized by the drive toward minimizing data movement, maximally exploiting array-level parallelism, and algorithm/device co-design to realize efficient, scalable, and robust specialized accelerators (Lai et al., 12 Sep 2025, Esmanhotto et al., 2022, Li et al., 2023, Kolinko et al., 2024, Vungarala et al., 17 Mar 2025, Zhang et al., 21 Jan 2026).