Crossbar Arrays & Memory-Compute Co-location

Updated 3 April 2026

Crossbar arrays are nanoscale architectures that integrate memory storage and computation on intersecting wordlines and bitlines to perform in-place multiply-accumulate operations.
Memory-compute co-location leverages peripheral circuits like DACs and ADCs to execute analog or digital computations directly within the memory array, significantly reducing data transfer overhead.
Advanced mapping and optimization techniques enhance utilization and mitigate non-idealities, yielding substantial improvements in energy efficiency and throughput for neural network inference and general computing tasks.

Crossbar arrays are nanoscale or microscale structures in which programmable memory devices sit at the intersections of perpendicular lines (“wordlines” and “bitlines”), enabling both high-density data storage and in-place analog or digital computation. Memory-compute co-location refers to architectural strategies that perform most computational operations directly within the memory array, particularly multiply-accumulate (MAC) operations carried out through Ohm’s law and Kirchhoff’s current law, so that logic and memory collapse into a unified physical fabric. This mitigates the von Neumann bottleneck by avoiding energy-expensive weight transfers between memory and ALU, and is reported to provide orders-of-magnitude improvements in energy per MAC, latency, and on-chip storage density for neural network inference and other tasks (Haensch et al., 2022, Haensch, 2024, Lai et al., 12 Sep 2025).

1. Physical Structure and Computational Model of Crossbar Arrays

Each crossbar cell consists of a memory device (e.g., RRAM, FeFET, PCM, MRAM, SRAM, or ferroelectric diode) whose conductance $G_{ij}$ encodes a weight or logical state at the intersection of row $i$ and column $j$. Input voltages $V_i$ are applied to wordlines, and the resulting cell currents $I_{ij} = G_{ij} V_i$ are summed along bitlines, so that column $j$ carries $I_j = \sum_i G_{ij} V_i$. The array thus performs an analog matrix-vector multiplication (MVM) in one step: $\vec{I}_{\text{col}} = G^{\top} \vec{V}_{\text{row}}$ (Haensch et al., 2022, Haensch, 2024, Wang et al., 2023). For signed weights, a differential encoding scheme with a pair of devices per entry is used: $w_{ij} \propto G_{ij}^+ - G_{ij}^-$ (Yousuf et al., 11 Jan 2026, Petropoulos et al., 2020).
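The computational model above can be sketched in a few lines of plain Python. This is an idealized illustration only: it ignores wire resistance, device noise, and converters, and the function names (`crossbar_mvm`, `encode_differential`, `signed_mvm`) as well as the conductance range `g_min`/`g_max` are assumptions made for the example, not values from the cited works.

```python
def crossbar_mvm(G, v_row):
    """Ideal bitline currents: I_j = sum_i G[i][j] * v_row[i]."""
    n_rows, n_cols = len(G), len(G[0])
    assert len(v_row) == n_rows, "one input voltage per wordline"
    return [sum(G[i][j] * v_row[i] for i in range(n_rows)) for j in range(n_cols)]


def encode_differential(W, g_min, g_max):
    """Map signed weights to a device pair so that w is proportional to G+ - G-."""
    w_max = max(abs(w) for row in W for w in row) or 1.0
    scale = (g_max - g_min) / w_max  # siemens per unit weight
    G_plus = [[g_min + scale * max(w, 0.0) for w in row] for row in W]
    G_minus = [[g_min + scale * max(-w, 0.0) for w in row] for row in W]
    return G_plus, G_minus, scale


def signed_mvm(W, v_row, g_min=1e-6, g_max=1e-4):
    """Signed MVM from two crossbar reads; the g_min offset cancels in the difference."""
    G_plus, G_minus, scale = encode_differential(W, g_min, g_max)
    i_plus = crossbar_mvm(G_plus, v_row)
    i_minus = crossbar_mvm(G_minus, v_row)
    return [(p - m) / scale for p, m in zip(i_plus, i_minus)]


# Hypothetical 3x2 weight matrix (rows = wordlines, columns = bitlines).
W = [[0.5, -1.0], [-0.25, 0.75], [1.0, 0.0]]
v = [0.2, 0.1, 0.3]
result = signed_mvm(W, v)
```

Because both devices of a pair share the same `g_min` offset, subtracting the two column currents removes it, which is one reason differential pairs are convenient for signed weights.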

Key peripheral circuits include:

Input DACs to generate analog voltages from digital activations.
Output ADCs to digitize bitline sums.
Row/column selectors, drivers, and local control logic for address decoding and experiment orchestration.

The co-location of memory (weight storage) and compute (MAC) physically in the same array tile is fundamental. All multiply-accumulate operations occur “where the data reside,” without transferring weights off-array (Haensch et al., 2022, Haensch, 2024).
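To make the role of the peripheral circuits concrete, the following sketch chains an input DAC, the in-array current summation, and an output ADC. It is a simplified model under stated assumptions: uniform unsigned converters, a full-scale current set by the worst case of all rows at maximum conductance, and hypothetical names and bit widths (`dac`, `adc`, `tile_mvm`, 4-bit inputs, 8-bit outputs).

```python
def dac(code, bits, v_ref):
    """Convert an unsigned digital code to a wordline voltage in [0, v_ref]."""
    levels = (1 << bits) - 1
    return v_ref * code / levels


def adc(current, bits, i_full_scale):
    """Quantize a bitline current in [0, i_full_scale] to an unsigned code, with clipping."""
    levels = (1 << bits) - 1
    clipped = min(max(current, 0.0), i_full_scale)
    return round(clipped / i_full_scale * levels)


def tile_mvm(G, codes, dac_bits=4, adc_bits=8, v_ref=0.2):
    """Digital codes in, digital codes out; the weights G never leave the tile."""
    n_rows, n_cols = len(G), len(G[0])
    voltages = [dac(c, dac_bits, v_ref) for c in codes]
    currents = [sum(G[i][j] * voltages[i] for i in range(n_rows)) for j in range(n_cols)]
    g_max = max(max(row) for row in G)
    i_full_scale = n_rows * g_max * v_ref  # assumed worst-case bitline current
    return [adc(i, adc_bits, i_full_scale) for i in currents]


# Hypothetical 2x2 conductance matrix in siemens.
G = [[1e-4, 5e-5], [1e-4, 0.0]]
out_codes = tile_mvm(G, [15, 15])
```

Only activations cross the tile boundary, as digital codes; the ADC resolution and full-scale choice determine how much of the analog sum survives digitization.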

2. Memory-Compute Co-location: Architectures and System Design

Memory-compute co-location is instantiated at various spatial scales:

Array and Tile Level Co-location: Each crossbar tile integrates memory cells, periphery for conversion (DAC/ADC), and minimal logic. Matrix-vector multiplications for DNN inference proceed within the tile, without off-tile data movement (Haensch, 2024, Xie et al., 2023).
Hierarchical Integration: Large arrays are partitioned into tiles or “M-Cores”; each can be dynamically re-configured as memory, arithmetic unit, or neuromorphic processor (Zidan et al., 2016).
Peripheral Resource Optimization: Sharing ADC/DACs, customized bit-slicing, and dynamic switching strategies are essential peripheral co-design elements for high utilization and optimal energy/area trade-offs (Haensch, 2024, Lai et al., 12 Sep 2025).
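Bit-slicing, mentioned above as a peripheral co-design element, can be illustrated as follows. The sketch assumes unsigned integer weights split into low-precision digits, each digit held in a separate array slice, with partial results recombined by digital shift-and-add. The names `slice_weight` and `bit_sliced_mvm` and the 2-bit-per-cell choice are assumptions for the example.

```python
def slice_weight(w, cell_bits, n_slices):
    """Split an unsigned integer weight into n_slices digits of cell_bits each (LSB first)."""
    mask = (1 << cell_bits) - 1
    return [(w >> (k * cell_bits)) & mask for k in range(n_slices)]


def bit_sliced_mvm(W, x, cell_bits=2, n_slices=4):
    """Compute sum_i W[i][j] * x[i] using only cell_bits-wide weight digits per slice."""
    n_rows, n_cols = len(W), len(W[0])
    out = [0] * n_cols
    for k in range(n_slices):
        # Slice k plays the role of one crossbar holding low-precision digits.
        digits = [[slice_weight(W[i][j], cell_bits, n_slices)[k] for j in range(n_cols)]
                  for i in range(n_rows)]
        partial = [sum(digits[i][j] * x[i] for i in range(n_rows)) for j in range(n_cols)]
        # Shift-and-add in the digital periphery restores the digit's significance.
        out = [o + (p << (k * cell_bits)) for o, p in zip(out, partial)]
    return out


# Hypothetical 8-bit weights stored as four 2-bit digits.
W8 = [[200, 3], [17, 255]]
x = [2, 5]
sliced = bit_sliced_mvm(W8, x)
direct = [sum(W8[i][j] * x[i] for i in range(2)) for j in range(2)]
```

Each slice needs its own conversion step, so fewer bits per cell trade device precision requirements against ADC activity, which is part of the energy/area trade-off described above.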

In digital and analog modes, crossbar tiles support population-count operations, logic (e.g., MAGIC NOR/NOT gates), and MVMs (Zidan et al., 2016, Bhattacharjee et al., 2020).
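Since NOR is functionally complete, a crossbar that natively provides NOR (with NOT as a single-input NOR) can in principle realize any Boolean function. The sketch below shows this at the logic level only; it does not model memristor switching or the voltage scheme of MAGIC gates, and the helper names are illustrative.

```python
def nor(*inputs):
    """Native gate: output is 1 only when every input is 0."""
    return int(not any(inputs))


def not_(a):
    return nor(a)  # single-input NOR


def or_(a, b):
    return nor(nor(a, b))


def and_(a, b):
    return nor(nor(a), nor(b))


def xor(a, b):
    n1 = nor(a, b)
    xnor = nor(nor(a, n1), nor(b, n1))  # four NORs give XNOR
    return nor(xnor)


def half_adder(a, b):
    """Return (sum, carry) built exclusively from NOR evaluations."""
    return xor(a, b), and_(a, b)


truth_table = {(a, b): half_adder(a, b) for a in (0, 1) for b in (0, 1)}
```

Mapping tools for in-memory logic essentially decide where each such NOR evaluation is placed in the array and in which order the evaluations run.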

The following table summarizes representative device types and co-location features:

Memory Device	Native Mode	Suitability for Co-location
ReRAM (1T-1R)	Analog/digital	High density, multilevel, moderate variation
FeFET	Analog/digital	Small cell, robust scaling, high $R_{\text{HRS}}/R_{\text{ON}}$
PCM	Analog	Fast, multilevel, drift/variation challenges
MRAM	Digital	High endurance, binary, limited scaling
SRAM (8T)	Digital	High ON/OFF, large cell, sensitive to IR drop

(Wang et al., 2023, Victor et al., 2024, Haensch et al., 2022)

3. Technology and Circuit Non-Idealities: Scaling, Robustness, and Remedies

While memory-compute co-location realizes massive data-movement reduction, system-level performance and accuracy hinge on a nuanced balance among device, circuit, and architecture design:

Wiring Non-idealities: Line resistance (high in scaled nodes) induces IR drops, distorting MAC linearity. Techniques such as row-bit agglomeration (SWANN) and Partial Wordline Activation (PWA) mitigate these errors and restore DNN accuracy (Victor et al., 2024, Victor et al., 2024).
Device Variability and Noise: Stochastic and systematic variations (e.g., conductance drift in PCM, resistive state dispersion in ReRAM, switching noise in FeFET) degrade computational fidelity. Model-based compensation (MLP emulation, hardware-aware training) and controlled refresh are necessary (Petropoulos et al., 2020, Yousuf et al., 11 Jan 2026).
Peripheral Overhead: ADC/DAC design dominates area and energy for high-precision (e.g., 8+ bits). Dynamic and shared architectures (dynamic-switch ADCs (Lai et al., 12 Sep 2025), custom reference thresholds (Victor et al., 2024)) are employed to optimize accuracy/energy under sparsity.
Selector-Free or Selector-Based Designs: Advanced devices (e.g., AlScN FE diodes) exploit rectification and exponential I–V to suppress sneak paths without selectors, supporting high scaling and temperature robustness (Han et al., 5 Jun 2025).
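The general idea behind activating only a subset of wordlines at a time can be sketched functionally. The example below is not the scheme from the cited work and does not simulate IR drop itself; it only shows, under the assumption that partial sums are accumulated digitally after conversion, that grouping rows leaves the ideal result unchanged while lowering the peak current any bitline carries in one step. The names `column_currents` and `pwa_mvm` are hypothetical.

```python
def column_currents(G, v, active_rows):
    """Ideal bitline currents when only the rows in active_rows are driven."""
    n_cols = len(G[0])
    return [sum(G[i][j] * v[i] for i in active_rows) for j in range(n_cols)]


def pwa_mvm(G, v, group_size):
    """Accumulate an MVM over row groups; return (totals, peak bitline current per step)."""
    n_rows, n_cols = len(G), len(G[0])
    totals = [0.0] * n_cols
    peak = 0.0
    for start in range(0, n_rows, group_size):
        rows = range(start, min(start + group_size, n_rows))
        partial = column_currents(G, v, rows)
        peak = max(peak, max(partial))
        totals = [t + p for t, p in zip(totals, partial)]  # assumed digital accumulation
    return totals, peak


# Hypothetical 4x2 array with identical conductances and inputs.
G_demo = [[1e-4, 1e-4] for _ in range(4)]
v_demo = [0.1] * 4
full_sum, peak_full = pwa_mvm(G_demo, v_demo, group_size=4)
part_sum, peak_part = pwa_mvm(G_demo, v_demo, group_size=2)
```

Since the voltage lost along a resistive line grows with the current through it, a lower peak current per step reduces the distortion, at the cost of more read steps and conversions.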

4. Mapping, Packing, and Utilization: Algorithmic and Architectural Strategies

Efficient mapping of neural networks or logic onto crossbar arrays is critical for maximizing utilization and minimizing area/latency.

Bin Packing and Fragmentation: Neural-network layers are split into sub-blocks matched to tile capacities. Packing algorithms (greedy, integer programming) account for tile aspect ratio, fragmentation, and peripheral scaling. Minimum-tile count and minimum-area optima do not always coincide due to inter-tile and periphery trade-offs (Haensch, 2024).
Pipeline vs. Dense Packing: Pipeline packing enables low-latency inference at the cost of increased area; dense packing minimizes area but may elongate critical paths (Haensch, 2024).
Correlation-Aware Mapping: Application-specific dataflow (e.g., in DLRM embedding reduction) exploits co-occurrence graphs and access statistics for grouping, duplication, and adaptive crossbar usage, boosting effective utilization (>85% in ReRAM crossbars with ReCross) (Lai et al., 12 Sep 2025).
Block Allocation and Synchronization: For parallel fabrics, dataflow-aware allocation (e.g., block-based array assignment for ResNet-18) sustains utilization >90%, yielding 7.47× speedup over naïve allocation (Crafton et al., 2020).
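The effect of tile size on tile count and utilization can be worked through with simple arithmetic. The sketch below assumes the most naive mapping, in which each layer is cut into tile-sized blocks and no tile is shared between layers; real packing algorithms do better. The layer shapes and the helper names `tiles_needed` and `map_network` are assumptions for illustration.

```python
import math


def tiles_needed(rows, cols, tile_rows, tile_cols):
    """Number of tiles when a rows x cols weight matrix is cut into tile-sized blocks."""
    return math.ceil(rows / tile_rows) * math.ceil(cols / tile_cols)


def map_network(layers, tile_rows, tile_cols):
    """Return (total tiles, fraction of allocated cells that actually hold weights)."""
    used_cells = sum(r * c for r, c in layers)
    tiles = sum(tiles_needed(r, c, tile_rows, tile_cols) for r, c in layers)
    return tiles, used_cells / (tiles * tile_rows * tile_cols)


# Hypothetical fully connected network: (input rows, output columns) per layer.
layers = [(784, 256), (256, 128), (128, 10)]
tiles_128, util_128 = map_network(layers, 128, 128)
tiles_256, util_256 = map_network(layers, 256, 256)
```

In this example the larger tile needs fewer tiles but leaves more cells empty, which illustrates why minimum tile count and minimum area need not coincide.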

5. Quantitative Gains: Energy, Throughput, and System-Level Impact

Empirical and modeled results demonstrate the profound impact of crossbar-based memory-compute co-location:

Energy per MAC: In analog mode, per-MAC energy is reduced by up to 100× versus digital SRAM, subject to peripheral overheads and sharing (Haensch, 2024, Haensch et al., 2022).
Throughput and Latency: Single-cycle MVM gives O(1) array time per input vector, independent of the matrix dimension N; practical designs achieve up to 10× faster inference than digital systolic arrays for large N (Petropoulos et al., 2020, Li et al., 2023).
Utilization and Mapping Efficiency:

| Optimization | Utilization | Performance Gain |
|--------------|-------------|------------------|
| Naïve mapping | 25–40% | Baseline |
| Correlation-aware (ReCross) | >85% | 3.97× faster, 6.1× energy efficiency |
| SWANN + PWA (128×128 SRAM) | 88.8% | Accuracy from 47.8% → 88.8% |

(Lai et al., 12 Sep 2025, Victor et al., 2024, Crafton et al., 2020)

Selector-Free Arrays: 128×128 AlScN FE-diode crossbars achieve 2500 bits/mm², energy per MAC ~1 nJ for 128-term VMM, and robust operation up to 600 °C (Han et al., 5 Jun 2025).

6. Application Domains and Extension to General Computing

Beyond DNN inference:

In-Memory Reduction and Sparse Operations: ReRAM crossbars in ReCross eliminate DRAM-CPU bottlenecks for large recommendation models by processing reductions in situ, with adaptive ADC switching (Lai et al., 12 Sep 2025).
Polynomial Modular Multiplication: X-Poly maps Conv1D directly into crossbars, achieving up to 200× speedup over CPUs for cryptographic PMM (Li et al., 2023).
Digital and Analog Logic: MAGIC-style logic (e.g., NOR, NOT) can be implemented in memristive crossbars, with mapping tools such as CONTRA optimizing area-delay product for Boolean circuits (Bhattacharjee et al., 2020).
Hybrid Electrical-Optical Computing: Plasmonic nonvolatile crossbars enable dual electrical-optical VMM, sub-10 fJ/MAC energy, and ultrahigh throughput, representing a path towards three-dimensional, mixed-signal neuromorphic hardware (Gosciniak, 2021).

7. Co-Design and Future Prospects

A convergence of device, circuit, and system-level co-design is necessary. Key principles include:

Co-optimizing array size, aspect ratio, and peripheral circuit scaling for each workload (Haensch, 2024, Victor et al., 2024).
Developing robust device models, calibration protocols, and system-level emulation tools (e.g., XBTorch) to bridge device non-idealities and algorithmic requirements (Yousuf et al., 11 Jan 2026, Petropoulos et al., 2020).
Advancing 3D-integration strategies to realize higher density, peripherally efficient architectures (Haensch et al., 2022, Zidan et al., 2016).
Algorithm-device co-optimization (training with noise/drift awareness, error correction, quantization strategies) to tolerate and even exploit hardware non-idealities (Yousuf et al., 11 Jan 2026).
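A first step toward the noise-aware co-optimization listed above is to measure how weight perturbations propagate to MVM outputs. The sketch below uses multiplicative Gaussian noise on each weight as an assumed stand-in for conductance variation; it is not a device model from the cited works and does not use any emulation library. The names `noisy_weights`, `mvm`, and `mean_abs_error` are hypothetical.

```python
import random


def mvm(W, x):
    """Ideal output: y_j = sum_i W[i][j] * x[i]."""
    n_rows, n_cols = len(W), len(W[0])
    return [sum(W[i][j] * x[i] for i in range(n_rows)) for j in range(n_cols)]


def noisy_weights(W, sigma, rng):
    """Perturb each weight by a relative Gaussian error (assumed noise model)."""
    return [[w * (1.0 + rng.gauss(0.0, sigma)) for w in row] for row in W]


def mean_abs_error(W, x, sigma, trials=200, seed=0):
    """Average absolute output deviation over repeated noisy programmings of W."""
    rng = random.Random(seed)
    ideal = mvm(W, x)
    total = 0.0
    for _ in range(trials):
        noisy = mvm(noisy_weights(W, sigma, rng), x)
        total += sum(abs(a - b) for a, b in zip(noisy, ideal)) / len(ideal)
    return total / trials


# Hypothetical 3x2 weight matrix and input vector.
W_demo = [[0.5, -1.0], [-0.25, 0.75], [1.0, 0.0]]
x_demo = [0.2, 0.1, 0.3]
err_low = mean_abs_error(W_demo, x_demo, sigma=0.01)
err_high = mean_abs_error(W_demo, x_demo, sigma=0.10)
```

Injecting the same kind of perturbation into the forward pass during training is one common way to make a network tolerate such errors; the error measured here is the quantity that kind of training aims to absorb.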

A major direction is extending in-memory compute beyond neural inference to general-purpose workloads, as demonstrated in reconfigurable fabrics (FPCA) and in-memory logic frameworks (CONTRA), suggesting increasing scope for memory-compute co-location across the computational stack (Zidan et al., 2016, Bhattacharjee et al., 2020).

References: