Computing-in-Memory Accelerators
- Computing-in-Memory Accelerators are hardware architectures that merge storage and computation using nonvolatile memory arrays to perform analog matrix–vector multiplications efficiently.
- They eliminate the von Neumann bottleneck by computing directly in memory, enabling dramatic improvements in energy efficiency, parallelism, and throughput for deep neural network workloads.
- Practical implementations face challenges from device nonidealities, peripheral limitations, and robust training requirements, addressed through hardware-software co-optimization.
Computing-in-memory (CIM) accelerators, also known as in-memory computing (IMC) or analog/mixed-signal neural accelerators, are a class of hardware architectures that physically merge storage and computation—typically by exploiting nonvolatile memory arrays (e.g., PCM, RRAM, MRAM, FeFET, SRAM)—to perform core linear algebra primitives such as matrix–vector multiplications (MVMs) entirely within the memory array. By eliminating the von Neumann bottleneck of repeated weight and data shuttling between memory and digital processing units, CIM enables orders-of-magnitude improvements in energy efficiency, parallelism, and throughput for deep neural network (DNN) workloads. However, the practical realization of CIM accelerators involves a confluence of device, circuit, architectural, and algorithmic challenges, including device nonidealities (variation, noise, asymmetry), peripheral overheads, robust training/deployment strategies, and compilation/scheduling for diverse architectures.
1. Physical Principles and Array-Level Operations
CIM accelerators center on memory arrays where each cross-point encodes a weight as a programmable conductance. For nonvolatile devices such as PCM, RRAM, or MRAM, the conductance $G_{ij}$ at cross-point $(i,j)$ encodes the neural weight $W_{ij}$. In analog CIM, input activations are applied as row voltages $V_i$ via DACs, and the resulting bit-line currents $I_j$—summing via Ohm's and Kirchhoff's laws—are digitized by ADCs or further processed in analog or digital logic (Wu et al., 2024, Yan et al., 2022, Yu et al., 2019).
The core array operation is the analog multiply–accumulate (MAC), physically realized as

$$I_j = \sum_i G_{ij} V_i,$$
computed in parallel across a full row or block. For binary or quantized networks, bit-serial/bit-parallel schemes or extremely quantized training (e.g., partial-sum quantization) push array outputs to binary/ternary levels, enabling further hardware simplification (Negi et al., 2024, Yan et al., 2023, Yan et al., 2022).
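To make the array-level readout concrete, the differential-pair mapping and Kirchhoff summation can be sketched in a few lines of NumPy. This is a minimal illustration, not any specific chip's scheme: the two-column signed-weight encoding and the `g_max` conductance scale are illustrative assumptions.

```python
import numpy as np

def crossbar_mvm(weights, x, g_max=1.0):
    """Analog MVM on a differential crossbar pair (illustrative model).

    Each signed weight W_ij is mapped onto a pair of conductances
    (G+, G-) so that W_ij is proportional to G+_ij - G-_ij. Applying
    the input vector x as row voltages, each bit-line current is
    I_j = sum_i G_ij * V_i (Ohm's + Kirchhoff's laws); the differential
    readout recovers y = x @ W up to the conductance scale factor.
    """
    scale = g_max / np.abs(weights).max()        # conductance per weight unit
    g_pos = np.clip(weights, 0, None) * scale    # positive-weight array
    g_neg = np.clip(-weights, 0, None) * scale   # negative-weight array
    i_pos = x @ g_pos                            # column currents, G+ array
    i_neg = x @ g_neg                            # column currents, G- array
    return (i_pos - i_neg) / scale               # differential current -> MVM
```

In an ideal (noise-free) array this reproduces the digital MVM exactly; the device and peripheral nonidealities discussed below are what separate real hardware from this model.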
Analog in-memory training is feasible by exploiting the rank-1 structure of outer-product gradient updates, where weight updates are delivered via simultaneous pulses along the two crossbar dimensions, with the device's physical state updated directly via voltage/current pulses. However, device nonidealities such as asymmetric conductance changes, stochastic write noise, and nonlinearity must be explicitly modeled, e.g. as

$$\Delta W = -\eta\, g \,\bigl(1 - \beta\, W \cdot \mathrm{sign}(g)\bigr),$$

where $\eta$ is the learning rate, $g$ is the gradient, and $\beta$ is a device-specific symmetry parameter (Wu et al., 2024).
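A minimal sketch of such an asymmetric outer-product update follows. The state-dependent factor used here is one common phenomenological model of update asymmetry; the specific functional form and the `beta` parameterization are illustrative assumptions, not the exact device model of any cited work.

```python
import numpy as np

def asymmetric_outer_update(W, x, delta, eta=0.01, beta=0.1):
    """Rank-1 outer-product weight update with a simple asymmetry model.

    Ideal analog training would apply W -= eta * outer(x, delta) via
    coincident pulses on rows and columns. Real devices update
    asymmetrically: the effective step size depends on the current
    state, modeled here by a factor (1 - beta * W * sign(g)) applied
    to the ideal step.
    """
    g = np.outer(x, delta)                               # rank-1 gradient
    ideal_step = -eta * g                                # ideal SGD step
    device_step = ideal_step * (1.0 - beta * W * np.sign(g))
    return W + device_step
```

For `beta = 0` (or `W = 0`) this reduces to the ideal rank-1 SGD step; nonzero `beta` biases updates toward one conductance direction, which is what training schemes for analog hardware must compensate.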
2. Device Nonidealities and Robustness Techniques
CIM accelerators are fundamentally impacted by device-level nonidealities, notably:
- Programming and read noise: Modeled as additive Gaussian perturbations (e.g. $\tilde{W}_{ij} = W_{ij} + \Delta W_{ij}$ with $\Delta W_{ij} \sim \mathcal{N}(0, \sigma^2)$), these shift stored weights, causing inference accuracy loss of up to 10% under realistic noise levels $\sigma$ (Yan et al., 2021, Yan et al., 2022).
- Asymmetric/nonlinear updates: The incremental conductance change per program pulse is state-dependent and generally biased (e.g., $\Delta G^{+}(G) \neq |\Delta G^{-}(G)|$) (Wu et al., 2024).
- Device-to-device and cycle-to-cycle variations, retention drift, and stuck-at faults: Impact robustness, especially for large or deep networks (Yan et al., 2022, Yu et al., 2019).
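A first-order simulation of programming/read noise under the additive-Gaussian model is straightforward; scaling the noise standard deviation relative to the maximum weight magnitude is a common convention but an illustrative choice here.

```python
import numpy as np

def program_with_noise(weights, sigma=0.05, seed=None):
    """Model programming/read noise as additive Gaussian perturbations.

    Each stored weight is shifted by N(0, (sigma * w_max)^2) — the
    first-order i.i.d. model used in most robustness studies. Real
    arrays additionally exhibit correlated, state-dependent, and
    temporally drifting errors (see Section 7).
    """
    rng = np.random.default_rng(seed)
    w_max = np.abs(weights).max()
    return weights + rng.normal(0.0, sigma * w_max, size=weights.shape)
```

Sweeping `sigma` over realistic values and re-evaluating inference accuracy is the standard way such 10%-scale degradations are measured.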
Algorithmic and system-level countermeasures include:
- Variation-aware/noise-injection training: Randomly perturb weights during training, thereby “robustifying” learned parameters to device noise (Yan et al., 2021, Yan et al., 2022).
- Negative feedback or variational training: Directly penalize directions in weight space that amplify noise-induced loss, as in Oriented Variational Forward (OVF) training (Qin et al., 17 Aug 2025).
- Selective or universal write-verify: Use second-derivative sensitivity metrics to identify only those weights whose errors actually impact performance, reducing programming time by up to 10× while maintaining nominal accuracy (Yan et al., 2022, Yan et al., 2023).
- Post-training optimization: Adjust column-wise conductance and DAC input ranges post-training to optimally position activation and weight ranges, repairing mapping-induced degradation without additional retraining (Lammie et al., 2024).
- Worst-case analysis: For safety-critical systems, formulate and solve the worst-case adversarial perturbation problem over bounded balls in weight space to certify robustness (Yan et al., 2022).
- Device/circuit/architecture co-exploration: Jointly search over device selection, quantization, circuit topology, and neural architecture to maximize accuracy and energy efficiency under real-world variation (Jiang et al., 2019).
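The first of these countermeasures, noise-injection training, can be sketched on a toy least-squares problem. The relative noise scaling and hyperparameters are illustrative assumptions; the point is the structure: perturb weights for the forward/backward pass, apply the update to the clean copy.

```python
import numpy as np

def train_noise_aware(X, y, sigma=0.05, eta=0.1, steps=500, seed=0):
    """Least-squares regression trained with weight-noise injection.

    On each step, Gaussian noise (stdev sigma relative to the current
    weight scale) is injected into the weights used for the gradient
    computation, while the update is applied to the clean copy. This
    mirrors variation-aware training: the optimizer is steered toward
    flat, noise-tolerant regions of the loss landscape, so that
    programming noise at deployment replays the training perturbation.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        noise_scale = sigma * (np.abs(w).max() + 1e-8)
        w_noisy = w + rng.normal(0.0, noise_scale, size=w.shape)
        grad = X.T @ (X @ w_noisy - y) / len(y)   # gradient at noisy weights
        w -= eta * grad                           # update the clean weights
    return w
```

The same pattern generalizes to DNN training loops, where the injected distribution is matched to the measured device noise.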
3. Peripheral and System Architecture
Peripheral circuits—DACs and especially ADCs—are often the dominant bottleneck in CIM systems. For conventional analog MAC arrays, ADCs account for >60% of total power and area (Negi et al., 2024, Ghosh et al., 2024). Multiple approaches mitigate this:
- Low-precision/quantized peripheral design: Use quantization-aware training to tolerate extreme low-precision (binary/ternary) PSQ outputs, obviating the need for expensive ADCs (Negi et al., 2024).
- Hybrid analog-digital architectures: HCiM couples analog crossbars with bit-serial digital in-memory adders to process low-precision outputs, achieving up to 28× energy reduction over 7-bit ADC baselines, with only 1–2% top-1 accuracy loss (Negi et al., 2024).
- Stochastic conversion: Spin-orbit torque MTJ converters transform analog partial sums to stochastic bitstreams, eliminating the ADC entirely and achieving area, energy, and EDP reductions of up to 142×, 22×, and 666×, respectively, relative to full-precision ADC architectures (Rogers et al., 2024).
- Approximate/relaxed ADCs: Peripheral-aware design and variation-aware training tolerate INL/DNL and process variation, removing per-column calibration logic and enabling operation at reduced VDD (Ghosh et al., 2024).
- Floating-point analog domain: Adaptive FP-ADC and FP-DAC architectures support FP8 (E2M5) arithmetic in RRAM crossbars, yielding >19 TFLOPS/W efficiency while dynamically adapting conversion ranges for precision and power (Liu et al., 2024).
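The cost/accuracy trade-off these peripheral designs navigate can be illustrated with a simple uniform ADC model for the analog column outputs. The range-calibration to the observed min/max is an illustrative simplification; real designs fix the conversion range by calibration or, as above, adapt it dynamically.

```python
import numpy as np

def adc_quantize(partial_sums, bits=3):
    """Uniform ADC model for analog partial sums (illustrative).

    The analog column outputs are mapped onto 2^bits uniformly spaced
    levels across their range. Lowering `bits` shrinks ADC energy and
    area — the dominant peripheral cost — at the price of quantization
    error that quantization-aware training must absorb.
    """
    lo, hi = partial_sums.min(), partial_sums.max()
    levels = 2 ** bits - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((partial_sums - lo) / step)   # digital output codes
    return lo + codes * step                       # reconstructed values
```

At `bits=1` this degenerates to the binary partial-sum outputs targeted by the extreme-quantization schemes above.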
4. Parallelism, Scalability, and Programming Methodologies
Scaling CIM to large networks requires exploiting parallelism without incurring prohibitive weight copying:
- Model and pipeline parallelism: Analog CIM prohibits classic data-parallel training due to the high cost of weight copying. Synchronous and asynchronous pipeline parallelism enable layer-wise model partitioning across tiles, maintaining device utilization near unity with convergence guarantees, and amortizing latency over multiple micro-batches (Wu et al., 2024).
- Flexible compilation and mapping: The CIM-MLC compilation stack abstracts diverse architectures (device type, crossbar size/precision, compute modes) and jointly optimizes operator-to-core/crossbar scheduling, attaining 2–4× throughput improvement over prior toolchains (Qu et al., 2024).
- Efficient weight and activation packing: By carefully packing layer weights, overlapping macro usage in time-multiplexing, and maximizing parallel compute unit occupancy, mapping algorithms can minimize weight-loading overheads (<20% area increase) and close the gap to the theoretical 10–100× EDP advantage of CIM (Houshmand et al., 2024).
- Spatial mapping with mixed precision: Joint RL and ILP search for layer replication and bitwidth allocation yields up to 9× latency and 19× throughput improvements at iso-accuracy, by matching tile provision to the computational and memory requirements of each DNN layer (Nallathambi et al., 2023).
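The basic mapping problem these compilers solve starts from tiling: a layer larger than one physical array is split into crossbar-sized blocks, with partial sums along the input dimension accumulated digitally. A minimal sketch (the tile size and sequential loop are illustrative; a real scheduler dispatches the blocks to parallel tiles):

```python
import numpy as np

def tiled_mvm(W, x, tile=128):
    """MVM with a weight matrix partitioned across crossbar tiles.

    W is split into tile x tile blocks; each block computes a partial
    MVM (one physical crossbar operation), and partial sums along the
    input (row) dimension are accumulated digitally. Replication and
    scheduling choices determine how many such blocks run concurrently.
    """
    rows, cols = W.shape
    y = np.zeros(cols)
    for r in range(0, rows, tile):           # input-dimension blocks
        for c in range(0, cols, tile):       # output-dimension blocks
            y[c:c + tile] += x[r:r + tile] @ W[r:r + tile, c:c + tile]
    return y
```

Mixed-precision mapping layers a second choice on top of this: which tiles get replicated and at what bitwidth, which is what the RL/ILP searches above optimize.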
5. DNN Robustness, Architecture Co-Search, and Deployment
CIM accelerators require co-optimization between DNN topology and hardware to ensure robust, efficient inference:
- Robust architecture search: RL-based or cross-layer NAS extensions search for DNN topologies that maximize mean and worst-case accuracy under device variation. Incorporating device-aware metrics into the training and evaluation loop enables architectures that lose ≤0.5% accuracy under realistic noise, versus catastrophic collapse in vanilla NAS (Yan et al., 2021, Jiang et al., 2019).
- Deployment and sustainability: New methods, e.g., OVF and selective write-verify, drastically reduce the need for on-chip weight reprogramming and calibration, lowering deployment energy and latency by orders of magnitude—essential for practical edge deployment (Qin et al., 17 Aug 2025, Yan et al., 2022, Yan et al., 2023).
- Post-training adaptation: Lightweight calibration techniques recover accuracy lost to quantization and analog nonidealities at minimal cost, supporting plug-and-play deployment even for large transformer models (Lammie et al., 2024).
- Safety-critical applications: Average-case optimization is insufficient for certification. Formal worst-case verification approaches must be integrated into both hardware and neural design stacks; otherwise, small variation can cause total accuracy collapse (Yan et al., 2022).
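For a single linear layer, the worst-case perturbation problem over a bounded $\ell_\infty$ ball is solvable in closed form, which makes the contrast with deep networks clear; this sketch is an illustration of the problem formulation, not any cited verification method.

```python
import numpy as np

def worst_case_linear(W, x, eps):
    """Exact worst-case output shift of a linear layer under an
    l-infinity weight perturbation of radius eps (illustrative).

    For y_j = sum_i (W_ij + d_ij) * x_i with |d_ij| <= eps, the output
    deviation is maximized by d_ij = eps * sign(x_i), giving a tight
    bound of eps * ||x||_1 per output. For multi-layer networks this
    becomes a nonconvex search over coupled perturbations, which is
    why formal verification tooling is needed.
    """
    bound = eps * np.abs(x).sum()                 # per-output deviation bound
    d = eps * np.sign(x)[:, None] * np.ones_like(W)   # worst-case perturbation
    y_worst = x @ (W + d)
    return y_worst, bound
```

Average-case accuracy under sampled noise says nothing about this bound, which is the core argument for integrating worst-case analysis into the design stack.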
6. Device-Specific Platforms and Prototypes
A variety of device platforms underpin CIM accelerator designs:
- Memristive crossbars (RRAM, PCM, MRAM, FeFET): Support dense array-level storage and analog MAC operations, with models and prototypes demonstrating >10× improvements in energy, throughput, and area (Yu et al., 2019, Sun et al., 2018).
- Charge-domain CMOS-SRAM: Hybrid analog charge accumulation in bitcells, supporting fully configurable bit-parallel/bit-serial MAC, 152–297 1b-TOPS/W, and full software programmability (Jia et al., 2018).
- 3D XPoint (PCM + OTS): Demonstrated thresholded matrix-vector-multiply within stacked arrays for binary NNs, with close assessment of parasitic-induced limits on subarray scalability (Zabihi et al., 2021).
- MRAM-based PIM: Integrates >40 MB STT-MRAM for weight storage in CNN accelerators, enabling multi-model instantaneous switching and 9.9 TOPS/W at <300 mW for mobile/IoT (Sun et al., 2018).
7. Limitations, Open Problems, and Future Directions
While CIM accelerators unlock major energy and performance gains for DNN workloads, several areas remain open for research and improvement:
- Deeper modeling of correlated and non-Gaussian device errors: Current frameworks often presume i.i.d. Gaussian device noise, but real arrays exhibit complex spatial/temporal correlations (Yan et al., 2021, Yan et al., 2022).
- Scalable on-chip verification and in-situ diagnostics: Especially for safety-critical deployment, hardware support for efficient worst-case accuracy certification is needed (Yan et al., 2022).
- Hierarchical and mixed-precision architecture support: Upcoming large models (e.g., transformers) impose greater accumulator depth and more diverse scale-factor requirements, pressuring both analog front-ends and digital back-ends (Negi et al., 2024, Nallathambi et al., 2023).
- Integration of sparse, pruned, or structured networks: Mapping and scheduling stacks must natively support and exploit sparsity to maximize utilization and throughput (Qu et al., 2024).
- Algorithmic co-design: Combining architectural search, quantization/mixed-precision/training noise, error-correcting codes, and mapping methodologies in a single framework remains an open challenge (Jiang et al., 2019).
CIM accelerators are at the vanguard of post-von Neumann AI hardware, combining innovative device physics, circuit design, system architecture, and algorithm-hardware co-optimization. The multi-disciplinary advances surveyed here collectively underpin current and next-generation energy-efficient, robust, and scalable AI systems (Wu et al., 2024, Qin et al., 17 Aug 2025, Negi et al., 2024, Rogers et al., 2024, Qu et al., 2024, Yan et al., 2022).