ACIM Accelerator: Analog Computing-in-Memory

Updated 13 December 2025
  • ACIM accelerators are analog computing-in-memory systems that use nonvolatile devices in crossbar arrays to perform parallel multiply-accumulate operations.
  • They integrate ReRAM, PCM, MRAM, or SRAM with peripheral ADC/DAC circuits to achieve in-situ deep neural network processing, reducing energy and latency.
  • Advanced design automation and ADC-less architectures optimize trade-offs in energy, area, and accuracy, enabling scalable, on-chip training and high throughput.

Analog Computing-in-Memory (ACIM) Accelerator

Analog Computing-in-Memory (ACIM) accelerators integrate analog crossbar arrays of non-volatile memory devices—such as ReRAM, PCM, MRAM, or advanced SRAM bitcells—tightly with peripheral mixed-signal and digital logic to realize massively parallel multiply-accumulate (MAC) operations directly inside memory. This architectural paradigm enables in-situ execution of deep neural network (DNN) workloads, simultaneously reducing energy, increasing throughput, and breaking the memory bandwidth bottleneck inherent to von Neumann architectures. Recent generations of ACIM accelerators combine precise analog weight transfer, long retention, on-chip training, and rich circuit-level design automation. The following sections detail foundational principles, device/circuit architecture, algorithmic integration, performance metrics, design trade-offs, and emerging automation frameworks.

1. Crossbar Array Architecture and Basic Operation

ACIM cores implement MAC, the dominant primitive in DNNs, via Ohm’s law in resistive crossbar arrays. Each cell at row $i$, column $j$ contains a programmable nonvolatile conductance $G_{ij}$; voltage inputs $V_i$ are applied to rows, yielding column currents $I_j = \sum_{i} G_{ij} V_{i}$ (Falcone et al., 6 Feb 2025). In advanced arrays, memory cells are constructed using vertically stacked conductive-metal-oxide/HfOx filamentary ReRAM above nMOS selector transistors, enabling >32 stable multi-bit analog states and ultra-low programming noise ($\sigma_\mathrm{prog} \lesssim 0.1\,\mu$S). Programming is achieved through closed-loop identical-pulse schemes, with set/reset operations molding the oxygen-vacancy filament via field-driven migration, supporting voltage amplitudes below 1.5 V, fully compatible with deep-submicron CMOS back-end-of-line integration.
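
A minimal numerical sketch of this Ohm's-law matrix-vector product is shown below; the array size, conductance range, and read voltages are illustrative placeholders, not values from the cited devices.

```python
import numpy as np

rng = np.random.default_rng(0)

rows, cols = 64, 64                          # illustrative crossbar dimensions
G = rng.uniform(1e-6, 40e-6, (rows, cols))   # conductances G_ij in siemens (placeholder range)
V = rng.uniform(0.0, 0.2, rows)              # read voltages V_i applied to the rows

# Column currents accumulate along each bit line: I_j = sum_i G_ij * V_i
I = V @ G                                    # shape (cols,), in amperes
print(I[:4])
```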

Matrix weight mapping is typically linear, allowing trained digital weights $w_{ij}$ to be encoded as conductances $G_{ij}$ via $G_{ij} = G_{\min} + (w_{ij}-w_{\min})(G_{\max}-G_{\min})/(w_{\max}-w_{\min})$, subject to device and process constraints (Amin et al., 2023). For advanced nonlinearity or device variability, higher-order fit or lookup schemes are supported.
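
A short sketch of this linear weight-to-conductance mapping follows; the conductance bounds are placeholders, and clipping out-of-range values is an assumption for robustness, not a requirement from the cited work.

```python
import numpy as np

def weights_to_conductance(w, g_min=1e-6, g_max=40e-6):
    """Linear map: G_ij = G_min + (w_ij - w_min)(G_max - G_min)/(w_max - w_min)."""
    w = np.asarray(w, dtype=float)
    w_min, w_max = w.min(), w.max()
    g = g_min + (w - w_min) * (g_max - g_min) / (w_max - w_min)
    return np.clip(g, g_min, g_max)  # guard against numerical overshoot

# Example: map a small trained weight matrix onto the device window
print(weights_to_conductance([[-0.5, 0.0], [0.3, 0.9]]))
```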

2. Device Characterization and Peripheral Circuit Integration

Device-level precision in ACIM platforms is characterized by the number of stable conductance states ($N_\mathrm{states}$), programming noise ($\sigma_\mathrm{prog}$), long-term retention, and relaxation drift ($\Delta G(t) \sim \alpha\,\log_{10} t$). The CMO/HfOx ReRAM achieves $N_\mathrm{states} > 32$ and drift on the order of $-0.04\,\mu$S per decade (Falcone et al., 6 Feb 2025).
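
For intuition, the logarithmic drift model can be evaluated directly as sketched below; the drift coefficient is the cited $-0.04\,\mu$S per decade, while the initial conductance and time points are assumed for illustration.

```python
import numpy as np

alpha = -0.04e-6   # drift coefficient, siemens per decade (from the cited figure)
g0 = 20e-6         # assumed programmed conductance at t = 1 s (placeholder)

for t in (1.0, 1e3, 3.2e8):          # 1 s, ~17 min, ~10 years
    g = g0 + alpha * np.log10(t)     # G(t) = G(1 s) + alpha * log10(t)
    print(f"t = {t:8.1e} s  ->  G = {g * 1e6:5.2f} uS")
```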

Peripheral circuits are equally crucial. Each crossbar row utilizes a digital-to-analog converter (DAC) to apply quantized input voltages (typically 4–8 bits), while column currents are digitized via analog-to-digital converters (ADCs). The architecture and number of ADCs, together with their resolution ($N_\mathrm{bits}$), sample rate ($f_s$), area, and energy per conversion, substantially impact system performance. Energy per ADC conversion is modeled as $E_\mathrm{ADC}(f_s, N_\mathrm{bits}, \mathrm{Tech})$, governed by technology scaling and throughput requirements, with energy scaling exponentially in $N_\mathrm{bits}$ and area linearly in the number of ADCs (Andrulis et al., 9 Apr 2024). Shared ADC configurations balance area against latency, while analog accumulation depth determines the required ADC dynamic range.
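
A toy parametric version of these cost terms is sketched below, assuming a Walden-style figure of merit; the FoM, area constant, and sharing factor are placeholder assumptions, not the calibrated estimates of Andrulis et al.

```python
def adc_energy_per_conversion(n_bits, fom_j_per_step=50e-15):
    """Walden-style estimate: energy per conversion grows exponentially with
    resolution, E ~ FoM * 2**n_bits; power additionally scales with f_s."""
    return fom_j_per_step * 2 ** n_bits

def total_adc_area(n_adcs, area_per_adc_mm2=2e-3):
    """Total ADC area scales linearly with the number of ADC instances."""
    return n_adcs * area_per_adc_mm2

# Example: one shared 8-bit ADC per 8 columns of a 64-column tile
print(adc_energy_per_conversion(8), total_adc_area(64 // 8))
```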

Device non-idealities at scale—IR drop, parasitic resistance/capacitance, drift, and noise—are mitigated using hierarchical partitioning, repeaters, and compensation algorithms. SPICE-level simulation frameworks such as IMAC-Sim allow precise characterization of device/circuit-level trade-offs, enabling rigorous Pareto optimization for energy vs. accuracy vs. latency (Amin et al., 2023).

3. Compute Model: MAC, Training, and Nonlinear Activation

The mathematical kernel of ACIM is the analog matrix-vector multiplication (MVM), performed in constant time as $I_{\mathrm{out},i} = \sum_j G_{ij} V_{\mathrm{in},j}$. Peripheral SAR/flash ADCs convert integrated current/voltage to digital codes; quantization and noise models are intricately coupled: $y_A = y_D + Q_\mathrm{ADC} + \gamma$, with $\gamma$ denoting thermal/mismatch noise (Yoshioka et al., 9 Nov 2024).
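
A minimal sketch of this quantization-plus-noise readout model is given below; the ADC full scale, resolution, and noise level are assumed values chosen only to make the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(1)

def digitized_mvm(G, V, adc_bits=6, full_scale=1e-3, noise_sigma=2e-6):
    """y_A = y_D + Q_ADC + gamma: ideal MVM plus quantization and analog noise."""
    y_d = V @ G                                               # ideal column currents (y_D)
    y_noisy = y_d + rng.normal(0.0, noise_sigma, y_d.shape)   # gamma: thermal/mismatch noise
    lsb = full_scale / 2 ** adc_bits                          # ADC quantization step
    codes = np.clip(np.round(y_noisy / lsb), 0, 2 ** adc_bits - 1)
    return codes * lsb                                        # digitized estimate y_A

G = rng.uniform(1e-6, 40e-6, (64, 16))
V = rng.uniform(0.0, 0.2, 64)
print(digitized_mvm(G, V)[:4])
```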

On-chip training of DNNs is realized via algorithms such as Tiki-Taka v4/AGAD, mapping gradient-descent updates ($\Delta W = -\eta\,\partial L/\partial W$) into stochastic pulse trains capable of outer-product weight modulation using coincident events on WLs/BLs. Pulse-based conductance change introduces asymmetry and noise ($\mathrm{NSR} \approx 90\%$), necessitating digital correction and adaptive compensation (Falcone et al., 6 Feb 2025).
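
A highly simplified sketch of a pulse-coincidence outer-product update in this spirit is shown below; it is not the Tiki-Taka v4/AGAD algorithm itself, and the pulse count, step size, and probability encoding are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def stochastic_outer_product_update(G, x, delta, lr=0.1, n_pulses=16, dg_step=0.5e-6):
    """Approximate G += -lr * outer(x, delta) via coincident WL/BL pulse trains:
    a cell changes by one conductance step only when its row and column fire together."""
    p_row = np.clip(lr * np.abs(x), 0.0, 1.0)       # WL pulse probabilities
    p_col = np.clip(np.abs(delta), 0.0, 1.0)        # BL pulse probabilities
    sign = -np.outer(np.sign(x), np.sign(delta))    # update direction per cell
    for _ in range(n_pulses):
        wl = rng.random(x.shape) < p_row            # rows firing this cycle
        bl = rng.random(delta.shape) < p_col        # columns firing this cycle
        G = G + dg_step * sign * np.outer(wl, bl)   # coincidence-driven increments
    return G
```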

Nonlinear activation functions, including sigmoid and softmax, can be approximated in hardware either via programmable analog content-addressable memory (CAM) structures (e.g., Compute-ACAM) or stochastic analog front-ends exploiting device thermal noise (Dang et al., 27 Dec 2024, Zhao et al., 2023). These approaches eliminate explicit activation logic and ADC/DAC overhead, enabling direct in-memory nonlinear function evaluation, with area/energy reductions approaching 40% and energy efficiency up to 150 TOPS/W for ReRAM-based implementations using intrinsic noise-driven stochastic binarization.
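
The noise-driven idea can be pictured with the conceptual sketch below, where additive device noise turns a deterministic comparator into a probabilistic, sigmoid-shaped binarizer; the noise amplitude and trial count are assumptions, and this is not the specific circuit of the cited works.

```python
import numpy as np

rng = np.random.default_rng(3)

def stochastic_binarize(x, noise_sigma=0.5, n_trials=256):
    """Noisy comparator: each trial outputs 1 when x + noise > 0, so the
    trial-averaged output approximates a sigmoid-shaped activation of x."""
    x = np.asarray(x, dtype=float)
    noise = rng.normal(0.0, noise_sigma, (n_trials,) + x.shape)
    return (x + noise > 0).mean(axis=0)   # empirical firing probability in [0, 1]

print(stochastic_binarize(np.array([-1.0, 0.0, 1.0])))
```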

4. Design Space Exploration and Automation

Exploring the optimal parameters for ACIM—array size ($R$, $C$), ADC bits, encoding scheme, partitioning level—is a multidimensional problem with strong trade-offs. Models such as Andrulis et al.'s architecture-level ADC energy/area estimates (Andrulis et al., 9 Apr 2024), and multi-objective genetic algorithms (MOGA) as implemented in EasyACIM (Zhang et al., 12 Apr 2024), frame the design space in terms of throughput (TOPS), energy per MAC (fJ–pJ), area efficiency (GOPS/mm²), and SNR. Parameters such as local group size ($L$) and ADC bits ($B_\mathrm{ADC}$) are "knobs" affecting parallelism vs. area/energy, and joint optimization yields rich Pareto frontiers.
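
As a toy illustration of such exploration, the brute-force sweep below enumerates (column count, ADC bits) points and extracts the Pareto-optimal set; the cost expressions are placeholder assumptions for illustration, not the models used by the cited frameworks, which employ calibrated estimates and genetic search.

```python
import itertools

def pareto_front(points):
    """Keep designs not dominated in (energy, area, error); all three are minimized."""
    return [p for p in points
            if not any(all(q[k] <= p[k] for k in range(3)) and q[:3] != p[:3]
                       for q in points)]

designs = []
for cols, adc_bits in itertools.product((64, 128, 256), (4, 6, 8)):
    energy = cols * 50e-15 * 2 ** adc_bits    # placeholder energy-per-MVM model
    area = cols * 1e-4 + (cols / 8) * 2e-3    # placeholder array + shared-ADC area
    error = 2.0 ** -adc_bits + 1e-5 * cols    # placeholder accuracy-loss proxy
    designs.append((energy, area, error, cols, adc_bits))

for energy, area, error, cols, adc_bits in pareto_front(designs):
    print(f"cols={cols:3d}  adc_bits={adc_bits}  E={energy:.2e} J  "
          f"A={area:.3f} mm^2  err={error:.4f}")
```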

Automated LLM-driven frameworks (e.g., LIMCA (Vungarala et al., 17 Mar 2025)) generate, evaluate, and calibrate SPICE netlists for new crossbar designs against user-specified constraints (energy, area, accuracy) in closed-loop pipelines. These approaches leverage extensive datasets and in-context learning to bypass human-in-the-loop bottlenecks, rapidly identifying hardware-constrained solutions.

Hybrid ACIM architectures, employing dynamic digital-analog splits, saliency-aware boundary selection, or majority-voting on MSB cycles, balance efficiency and accuracy for deep networks, especially in high-precision and transformer models (Yoshioka et al., 9 Nov 2024, Negi et al., 20 Mar 2024, Moradifirouzabadi et al., 8 Sep 2024). Quantization-aware and noise-aware training is vital to maintaining accuracy amidst device/circuit-level variability.

5. Performance Metrics, Benchmarks, and Comparative Analysis

Core ACIM metrics include:

| Metric | Exemplary Value (CMO/HfOx ReRAM) | SRAM ACIM (charge-based) | YOCO (hybrid) |
|---|---|---|---|
| Energy efficiency | 100–150 TOPS/W | ∼100 TOPS/W | 123.8 TOPS/W |
| Inference RMSE | 0.06 (1 s) / 0.20 (10 y) | <0.79% VMM error | <0.79% |
| Precision | >32 states (5 bits/cell) | 6–10 bits (ADC) | 8 bits |
| Area efficiency | — | 372 GOPS/mm² | — |
| Programming noise | 10–100 nS | — | — |
| Accuracy loss | <1.4% (MNIST) | <1% | <0.4% (ResNet18) |
| Throughput (per tile) | ∼10–100 ns per MVM | — | 26.2 TOPS |

YOCO's charge-domain ACIM achieves ∼123.8 TOPS/W and <0.79% error (8-bit), substantially better than prior charge-SRAM or ReRAM-only baselines (Xuan et al., 2023). Hybrid analog-digital accelerators for transformer attention kernels report 14.8 TOPS/W, and charge-domain analog cores deliver 976.6 GOPS/mm² (Moradifirouzabadi et al., 8 Sep 2024).

Eliminating high-precision ADCs via comparator-only digitization or in-memory digital acceleration reduces energy by up to 28× (7-bit ADC → comparator) (Negi et al., 20 Mar 2024). Advances such as the nonlinear-in-memory (NLIM) ADC support direct analog approximation of activation functions (sigmoid, tanh, etc.) with <1 LSB error; combined measured macros and digital PEs yield 92.0% on-chip accuracy in LSTM keyword spotting with up to 245 GOPS/mm² system-level area efficiency (Yang et al., 6 Dec 2025).

For kernel/attention approximation, mixed-signal ACIM with per-column affine compensation demonstrates <1% accuracy drop in kernelized ridge regression and transformer benchmarks, at roughly 10× the energy efficiency of digital accelerators (Büchel et al., 5 Nov 2024).

6. Key Challenges, Mitigations, and Future Directions

ACIM faces challenges including device asymmetry, retention drift, IR drop, inter-device variability, and peripheral circuit scaling. Noise-to-signal ratios up to 90% and device skew >60% are mitigated via algorithmic compensation and background retraining (Falcone et al., 6 Feb 2025). For large arrays, hierarchical partitioning and 3D-NAND-style segmentation address IR drop and parasitic-induced errors.

ADC/quantization overhead is a dominant bottleneck; compute-aware SNR (CSNR) models allow ADC precision reduction by 3 bits with 6 dB CSNR gain over SQNR-optimal choices, translating to 40–64× ADC energy savings (Kavishwar et al., 13 Jul 2025). Comparator-based designs or ADC-less architectures generalize this efficiency via co-design.
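
For intuition on the magnitude of these savings: under a thermal-noise-limited model where $E_\mathrm{ADC} \propto 2^{2 N_\mathrm{bits}}$, trimming 3 bits of resolution corresponds to roughly $2^{6} = 64\times$ lower conversion energy, whereas a Walden-style $E_\mathrm{ADC} \propto 2^{N_\mathrm{bits}}$ model gives about $8\times$; where a design lands within the reported 40–64× range depends on the ADC architecture and operating regime (this scaling argument is an illustrative assumption, not a derivation from the cited work).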

Accurate system-level deployment demands tight algorithm-hardware co-design, with robust quantization/noise-aware training, careful array sizing, dynamic boundary tuning, and layout-aware tile clustering for minimizing crossbar parasitics and maximizing utilization.

ACIM design automation now leverages LLMs, genetic algorithms, and open-source SPICE-driven simulation tools (IMAC-Sim, EasyACIM, LIMCA) that permit end-to-end, constraint-driven hardware-software co-design for rapid iteration and optimal deployment to edge AI (Amin et al., 2023, Zhang et al., 12 Apr 2024, Vungarala et al., 17 Mar 2025).

Continued progress in ACIM is being shaped by circuit innovations, scalable design methodology, integration of nonlinear analog primitives, energy-optimized ADC architecture, hybrid digital-analog partitioning, and automated hardware-aware design frameworks. These approaches collectively advance ACIM toward practical, energy-efficient, and high-accuracy analog deep learning accelerators for real-world DNN applications.
