
Compute In Memory (CIM) Overview

Updated 16 January 2026
  • Compute-In-Memory (CIM) is a paradigm that integrates computation with memory storage to perform in-memory arithmetic and logic operations, reducing data movement.
  • CIM encompasses both analog and digital modalities, where analog CIM (ACIM) delivers ultra-low-energy MAC operations and digital CIM (DCIM) provides higher precision through in-array logic.
  • CIM architectures leverage optimized dataflows, adaptive quantization, and memory hierarchy insertion to achieve significant gains in DNN acceleration and edge AI applications.

Compute-In-Memory (CIM) is a paradigm that unifies computation and memory storage within the same physical array, enabling direct in-memory arithmetic and logic operations, predominantly vector-matrix multiplication (VMM) and multiply-accumulate (MAC) operations. By breaking the traditional von Neumann separation between memory and processing elements, CIM addresses the memory wall and data-movement energy bottlenecks inherent in conventional architectures, especially for machine learning inference and data-intensive workloads (Yoshioka et al., 2024, Wolters et al., 2024).

1. Fundamental Principles and Device Technologies

CIM architectures implement the core arithmetic—typically dot-products—within the memory array, eliminating the need for repeated, high-energy data transfers between off-array memory banks and external processing units (Yoshioka et al., 2024, Wolters et al., 2024). Two canonical compute modalities dominate:

  • Analog CIM (ACIM): Utilizes nonvolatile memory (NVM) or SRAM cells programmed to variable conductance levels. When inputs are applied as analog voltages on wordlines, per-cell multiplication (via Ohm's law, $I = G\,V$) and intrinsic accumulation (via Kirchhoff's current law, $\sum_i G_i V_i$) are realized; a minimal numerical sketch follows this list. Device types include resistive RAM (ReRAM), phase-change memory (PCM), FeFETs, and MRAM. ACIM offers sub-femtojoule energy per MAC and cell areas on the order of $10^2\,F^2$, but is limited in precision (typically 3–8 bits) by device variability, nonlinearity, IR drop, and the need for high-precision ADCs (Yoshioka et al., 2024, Zhou et al., 2023).
  • Digital CIM (DCIM): Augments traditional SRAM or emerging memory arrays with in-array logic gates (e.g., AND, NAND, full adders) to perform bitwise multiplication and accumulation. High precision (8–16 bits, BF16, FP8) is attainable, with process-scaling benefits, but area and energy overheads are substantially higher than ACIM's in medium-precision regimes (Yoshioka et al., 2024, Yi et al., 11 Feb 2025).
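
As referenced in the ACIM bullet above, the following is a minimal numerical sketch of an idealized analog crossbar MAC. It assumes ideal devices (no IR drop, noise, or nonlinearity), an unsigned weight mapping, and illustrative bit widths and conductance values; real designs use differential columns, ADCs, and calibration.

```python
# Minimal sketch of an idealized analog crossbar MAC: weights are stored as
# quantized conductances G[i][j], inputs arrive as wordline voltages V[i],
# and column currents accumulate per Kirchhoff's current law.
import numpy as np

def ideal_crossbar_mac(weights, activations, g_bits=4, g_max=1e-6):
    """Map weights to quantized conductances and return column currents (A)."""
    levels = 2 ** g_bits - 1
    w_max = np.abs(weights).max()
    # Quantize |weights| onto the available conductance levels (unsigned here;
    # signed weights would typically use differential column pairs).
    g = np.round(np.abs(weights) / w_max * levels) / levels * g_max
    v = activations              # wordline voltages, one per row
    i_out = v @ g                # Kirchhoff summation: sum_i G[i,j] * V[i]
    return i_out

# Example: a 4x3 weight tile and a 4-element input vector (volts).
w = np.array([[0.2, 0.5, 0.1],
              [0.7, 0.3, 0.9],
              [0.4, 0.8, 0.6],
              [0.1, 0.2, 0.3]])
x = np.array([0.5, 1.0, 0.25, 0.0])
print(ideal_crossbar_mac(w, x))  # column currents before ADC digitization
```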

Hybrid-domain CIM architectures have also emerged, splitting MACs into digital (MSB) and analog (LSB) sub-operations, yielding significant energy savings at nearly full digital accuracy (Yi et al., 11 Feb 2025, Yoshioka et al., 2024).
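
A toy numerical sketch of this MSB/LSB split is shown below. The 4+4-bit split, the Gaussian noise stand-in for the analog path, and all data values are assumptions for illustration, not a reproduction of the cited designs.

```python
# Toy hybrid-domain MAC: process the high-order weight bits exactly ("digital")
# and the low-order bits through a noisy path (emulating analog computation),
# then recombine with the appropriate power-of-two weighting.
import numpy as np

rng = np.random.default_rng(0)

def hybrid_mac(weights_u8, activations, lsb_bits=4, analog_sigma=0.5):
    msbs = weights_u8 >> lsb_bits                 # high-order bit-planes
    lsbs = weights_u8 & ((1 << lsb_bits) - 1)     # low-order bit-planes
    digital_part = activations @ msbs             # exact integer partial sums
    analog_part = activations @ lsbs              # ideal analog partial sums...
    analog_part = analog_part + rng.normal(0, analog_sigma, analog_part.shape)
    return digital_part.astype(float) * (1 << lsb_bits) + analog_part

w = rng.integers(0, 256, size=(8, 4))   # unsigned 8-bit weights
x = rng.integers(0, 16, size=8)         # small unsigned activations
exact = x @ w
approx = hybrid_mac(w, x)
print(np.max(np.abs(exact - approx)))   # error comes only from the LSB (analog) path
```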

2. Circuit Implementations and Core Operations

CIM cell design directly shapes area, energy, precision, and temperature/voltage resiliency:

  • SRAM-based CIM: The canonical 6T-SRAM bit-cell can be repurposed for in-memory AND (multiply) by activating wordlines and treating bitlines as data inputs. Multi-bit operations are constructed through bit-serial accumulation, with DCIM adders aggregating partial sums on-array; a bit-serial sketch follows this list (Yoshioka et al., 2024).
  • NVM-based Crossbars: In ReRAM or FeFET-based crossbars, each memory cell encodes a synaptic weight as conductance $G$. Applying input activations as analog voltages across the array, and summing the resulting currents, enables direct implementation of $I_{\mathrm{out},j} = \sum_i G_{i,j} V_i$. Current is digitized by peripheral ADCs. Notably, subthreshold-operated FeFETs with 2T–1FeFET feedback can mitigate exponential current drift with temperature, maintaining stable MAC output across wide temperature ranges (Zhou et al., 2023).
  • Ternary and Ultra-Low-Precision Cells: For signed-ternary DNNs, SiTe-CiM uses two-bit differential encoding and cross-coupling to enable in-place ternary dot-products. Voltage-mode and current-mode readout schemes (SiTe-CiM I/II) enable system throughput and energy improvements up to 7× and 2.5× respectively over near-memory baselines, with area overheads ranging from 6–34% (Thakuria et al., 2024).
  • Hybrid Analog/Digital Cell Enhancements: In FP8 hybrid-domain CIM, pseudo-AND/XOR gates enable both energy-efficient analog multiplication and robust digital addition. Switched-capacitor charge pooling and 3-bit flash ADCs minimize the number of conversion operations, supporting high-precision DNN inference and training with system errors under 1.5% and 1.53× energy improvement over digital baselines (Yi et al., 11 Feb 2025).
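
As referenced in the SRAM-based CIM bullet above, a minimal sketch of a bit-serial multi-bit MAC built from 1-bit in-array ANDs is shown below. Unsigned weights and binary activations are assumed for brevity; multi-bit activations would add a second bit-serial loop over activation bit-planes.

```python
# Bit-serial MAC from 1-bit in-array ANDs: each weight bit-plane is ANDed with
# the binary input vector, and the popcounts are accumulated with the
# appropriate power-of-two shift.

def bit_serial_mac(weights, activation_bits, w_bits=4):
    """weights: list of unsigned ints; activation_bits: list of 0/1 inputs."""
    assert len(weights) == len(activation_bits)
    total = 0
    for b in range(w_bits):                        # iterate over weight bit-planes
        plane = [(w >> b) & 1 for w in weights]    # one stored bit per cell
        ands = [p & a for p, a in zip(plane, activation_bits)]  # in-array AND
        total += sum(ands) << b                    # shift-and-accumulate partial sum
    return total

# Example: dot product of 4-bit weights with a binary activation vector.
w = [3, 7, 1, 12]
x = [1, 0, 1, 1]
print(bit_serial_mac(w, x), sum(wi * xi for wi, xi in zip(w, x)))  # both print 16
```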

3. Architecture, Dataflow, and System Integration

CIM integration is governed by macro/array tiling, buffer hierarchy, dataflow, and computational scheduling, directly impacting overall throughput, energy efficiency, and scalability:

  • Memory Hierarchy Insertion: CIM macros can be placed at the RF (register file), L1, SMem (shared memory), or L3 level of the memory hierarchy. RF-level placement boosts peak throughput via replication and bandwidth; SMem-level placement maximizes data reuse and energy reduction through larger capacity (Sharma et al., 2023, Gao et al., 2019). Empirical results show 3.4× to 8.3× system energy improvement (RF-level analog macros) and up to 15.6× throughput gains for high-intensity GEMMs (e.g., BERT FC layers).
  • Dataflow Optimization: For convolutional architectures, recent advances in CIM dataflow (e.g., ConvDK) use kernel duplication, intra-tile cyclic shifts, and group convolution mapping to attain 77–87% buffer-traffic reduction and 10–18% aggregate energy savings in depthwise convolutional layers over MobileNet/EfficientNet baselines (Song et al., 20 Aug 2025). Block-wise allocation and cross-layer scheduling can dramatically improve PE utilization (21–29× speedup, 30% utilization) in RRAM-tiled arrays (Crafton et al., 2020, Pelke et al., 2024).
  • Adaptive Precision and Reconfiguration: Macro architectures supporting per-bit and per-shape operand resolution (e.g., FlexSpIM for SNNs) enable hybrid weight/output stationarity, reducing system-level data movement and enabling 2× bit-normalized energy efficiency with resolution reconfiguration granularity of [1, 512] bits (Chauvaux et al., 2024).
  • Dual-Mode and Hybrid Compilation: Compilers such as CMSwitch treat each CIM array as dynamically reconfigurable, selecting between compute and memory modes per DNN segment (via dynamic programming and MIP), enabling up to 1.3× average speedup and 2× on large LLMs over prior CIM-specific compilers (Zhao et al., 24 Feb 2025).
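
To illustrate the flavor of per-segment mode selection, the toy dynamic program below chooses a compute or memory mode per layer of a chain, given hypothetical per-layer costs and a reconfiguration penalty between adjacent layers in different modes. It is only a sketch of the idea, not the CMSwitch cost model or algorithm.

```python
# Toy DP over a layer chain: pick a mode per layer minimizing total cost,
# where switching modes between adjacent layers incurs a fixed penalty.

def select_modes(costs, switch_penalty=2.0):
    """Return (min_cost, mode_per_layer). costs: list of {mode: layer cost}."""
    modes = list(costs[0].keys())
    # dp[m] = (best cost so far ending in mode m, chosen modes so far)
    dp = {m: (costs[0][m], [m]) for m in modes}
    for layer_cost in costs[1:]:
        new_dp = {}
        for m in modes:
            best_prev = min(
                (dp[p][0] + (switch_penalty if p != m else 0.0), dp[p][1])
                for p in modes
            )
            new_dp[m] = (best_prev[0] + layer_cost[m], best_prev[1] + [m])
        dp = new_dp
    return min(dp.values())

# Hypothetical per-layer costs in arbitrary latency units.
cost, plan = select_modes([
    {"compute": 5.0, "memory": 7.0},
    {"compute": 6.0, "memory": 4.0},
    {"compute": 3.0, "memory": 3.5},
])
print(cost, plan)
```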

4. Performance Metrics, Variability, and Quantization

Performance, robustness, and energy are dictated by the interaction of circuit properties, array architecture, quantization granularity, and algorithmic characteristics:

| Implementation | Precision | Energy/MAC (fJ) | Efficiency (TOPS/W) | Area Efficiency (TOPS/mm²) or Area Overhead | Key Notes |
|---|---|---|---|---|---|
| DCIM (BF16) | BF16 | ~12 | 73–163 | 33–91 TOPS/mm² | Scaling benefits, high precision (Yoshioka et al., 2024) |
| ACIM (charge-domain) | 4–8 bits | ~1–5 | 800–4,000 | – | MOM-cap based, medium-precision sweet spot (Yoshioka et al., 2024) |
| FeFET-CIM | 8 bits | 3.14 | 2,866 | – | Temperature-stable, ultra-low energy (Zhou et al., 2023) |
| SiTe-CiM I/II | Ternary | – | – | 18–34% area overhead | 2.5× energy, 7× throughput vs. near-memory (Thakuria et al., 2024) |

Column-wise quantization of weights and partial sums, with per-column scale factors, reduces ADC overhead up to 16× while incurring less than 1% accuracy loss, and improves variation robustness in noisy or variable NVM cells (Kim et al., 11 Feb 2025). Adaptive quantization-aware training (QAT) coupled with one-stage per-column scaling aligns hardware and algorithmic properties, facilitating robust, efficient large DNN deployment.
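
A minimal sketch of per-column weight quantization with per-column scale factors is shown below; the bit width, data, and symmetric rounding scheme are illustrative assumptions rather than the cited method's exact procedure.

```python
# Per-column quantization: each output column of the weight matrix gets its own
# scale, so columns with small dynamic range are not forced onto the range of
# the largest column (and per-column ADC ranges can be tightened accordingly).
import numpy as np

def quantize_per_column(w, bits=4):
    """Return (int codes, per-column scales) such that w ≈ codes * scales."""
    qmax = 2 ** (bits - 1) - 1
    scales = np.abs(w).max(axis=0) / qmax          # one scale per output column
    scales = np.where(scales == 0, 1.0, scales)    # avoid divide-by-zero
    codes = np.clip(np.round(w / scales), -qmax - 1, qmax).astype(np.int8)
    return codes, scales

rng = np.random.default_rng(1)
w = rng.normal(scale=[0.1, 1.0, 5.0], size=(64, 3))  # columns with very different ranges
codes, scales = quantize_per_column(w)
err = np.abs(w - codes * scales).max(axis=0)
print(scales, err)  # per-column scales track per-column dynamic range
```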

Device and circuit noise, PVT sensitivity, and ADC/DAC overhead remain key constraints. For state-of-the-art ACIM, compute-SNR values of 15–30 dB (4–6 effective bits) are sufficient for CNNs but potentially inadequate for transformers or large LLM inference (Yoshioka et al., 2024, Wolters et al., 2024).

5. Modeling, Co-Design, and Tool Support

System-level evaluation frameworks and modeling tools are critical for co-design and optimization across device, circuit, array, and software levels:

  • CiMLoop allows flexible, data-value-dependent energy modeling of arbitrary CIM circuits and architectures, leveraging operand distributions and mapping constraints. Statistical models yield energy estimates within 3–7% of per-value simulators and deliver >1,000× speedups, paramount for early-stage architectural exploration (Andrulis et al., 2024).
  • Eva-CiM combines SPICE, DESTINY, GEM5, and McPAT to provide device-to-architecture performance/energy modeling, evaluate system-level gains, and guide design trade-offs (e.g., array size vs. integration level vs. device technology); it is validated to within 24% of lower-level models (Gao et al., 2019).
  • CiMNet and similar frameworks enable joint optimization of DNN micro-architecture and CIM hardware configuration, using Pareto-optimal neural/hardware search (e.g., NSGA-II), routinely achieving >3× acceleration at iso-accuracy relative to models optimized for the network alone (Kundu et al., 2024).
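
As a simplified illustration of the joint network/hardware search in the last bullet above, the sketch below extracts a Pareto front over candidate configurations. The candidate list and both metrics are fabricated placeholders; real frameworks rely on trained accuracy/latency predictors and NSGA-II-style evolutionary search rather than this exhaustive check.

```python
# Keep only (network, hardware) configurations that are not dominated in the
# (higher accuracy, lower latency) sense by some other candidate.

def pareto_front(candidates):
    front = []
    for name, acc, lat in candidates:
        dominated = any(a >= acc and l <= lat and (a > acc or l < lat)
                        for _, a, l in candidates)
        if not dominated:
            front.append((name, acc, lat))
    return front

# Hypothetical candidates: (config name, accuracy, latency in ms).
configs = [
    ("net-A / 4-bit / 64x64 array",   0.74, 1.8),
    ("net-A / 8-bit / 64x64 array",   0.76, 2.9),
    ("net-B / 4-bit / 128x128 array", 0.75, 1.5),
    ("net-B / 8-bit / 128x128 array", 0.77, 3.4),
    ("net-C / 8-bit / 64x64 array",   0.73, 2.2),  # dominated by net-B / 4-bit
]
print(pareto_front(configs))
```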

Sophisticated algorithm-hardware co-design is required to exploit sparsity, mixed-precision, error tolerance, and dynamic resource allocation. Compiler and mapping algorithms increasingly manage mode switching, weight-stationarity, tiling, and data reuse for end-to-end optimization (Zhao et al., 24 Feb 2025, Pelke et al., 2024).

6. Applications, Use Cases, and Challenges

CIM architectures are deployed in diverse domains:

  • Edge AI and IoT: Ultra-low-power, temperature-stable arrays (e.g., subthreshold FeFET CIM achieving 2866 TOPS/W) enable always-on ambient intelligence and sensor fusion in battery-constrained or harvested-power environments (Zhou et al., 2023).
  • DNN Acceleration: Mainstream DNN workloads, including ResNet, VGG, MobileNet, and ViT families, benefit from in-situ MACs, quantization-aware mapping, and hybrid digital/analog schemes (Yi et al., 11 Feb 2025, Kim et al., 11 Feb 2025, Kundu et al., 2024).
  • 3D Point Cloud and Sparse Convolution: Dedicated CIM accelerators (Voxel-CIM) combine weight mapping strategies and O(N) off-chip search to achieve 4.5–8× speed/energy gains over GPU/ASIC alternatives in point cloud perception (Lin et al., 2024).
  • Transformers and LLMs: Analog crossbars and hybrid macros efficiently map transformer FC, attention, and feed-forward blocks, achieving up to 39× speedup and 643× higher energy efficiency than high-end GPUs (Wolters et al., 2024, Zhu et al., 1 Mar 2025, Kim et al., 17 Feb 2025).
  • Uncertainty and Edge Bayesian Inference: CIM hardware integrating Monte Carlo dropout, in-memory RNG masking, and compute-reuse optimizations provides low-power Bayesian DNN execution with calibrated prediction uncertainty (Shukla et al., 2021).
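
A minimal Monte Carlo dropout sketch in the spirit of the Bayesian-inference use case above is shown below. The tiny two-layer network, dropout rate, and sample count are illustrative; on CIM hardware the per-pass dropout masks would come from the in-memory RNG mentioned in the text.

```python
# Monte Carlo dropout: keep dropout active at inference and average many
# stochastic forward passes; the spread across passes approximates predictive
# uncertainty.
import numpy as np

rng = np.random.default_rng(3)
W1 = rng.normal(size=(16, 32))   # placeholder weights for a tiny 2-layer MLP
W2 = rng.normal(size=(32, 4))

def mc_forward(x, p_drop=0.2):
    """One stochastic forward pass; dropout stays active at inference time."""
    h = np.maximum(x @ W1, 0.0)              # ReLU hidden layer
    mask = rng.random(h.shape) > p_drop      # fresh dropout mask per pass
    h = h * mask / (1.0 - p_drop)
    return h @ W2

x = rng.normal(size=(1, 16))
samples = np.stack([mc_forward(x) for _ in range(50)])  # 50 stochastic passes
mean_pred = samples.mean(axis=0)
uncertainty = samples.std(axis=0)            # per-output predictive spread
print(mean_pred, uncertainty)
```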

Challenges remain in device nonidealities and endurance, peripheral ADC/DAC cost, variation, large-scale mapping, and analog precision limitations. Mitigation involves close algorithm–circuit co-design (CSNR-aware quantization, hardware-aware training), architectural innovations (hybrid blocks, 3D integration), and continued development of analysis and co-optimization tools (Yoshioka et al., 2024, Andrulis et al., 2024, Kundu et al., 2024).

7. Future Directions and Open Problems

Current research and system evaluations identify several open challenges and active research frontiers:

  • Precision and Noise: Pushing analog CIM to support low/medium-precision transformers and LLMs requires new error-resilient training algorithms and hybrid analog-digital schemes; practical accuracy targets are <1% loss vs. baseline for large models (Yoshioka et al., 2024, Wolters et al., 2024).
  • Scalability: Large-scale models exceed on-chip array capacities; folding, streaming, and CIM-friendly off-chip caching must retain the in-memory advantage (Yoshioka et al., 2024, Sharma et al., 2023).
  • Mixed-Precision and Dynamic Adaptation: Hardware that selects precision and sparsity granularity on a per-block or per-layer basis, exploiting architectural heterogeneity for further efficiency (Yoshioka et al., 2024).
  • 3D/Monolithic Integration and Emerging Devices: Next-generation NVM and ferroelectric device integration, and vertical stacking of compute and memory, are needed for ultra-high-capacity CIM (Yoshioka et al., 2024, Zhou et al., 2023).
  • Compiler and Toolchain Support: Generalized CIM-aware compilers capable of mode-switching, adaptive resource allocation, weight mapping, and cross-layer scheduling are essential to translate modern DNN graphs to optimal hardware instructions (Zhao et al., 24 Feb 2025, Pelke et al., 2024).
  • Evaluation Standardization: Open-source, high-accuracy/full-stack modeling tools (e.g., CiMLoop) are critical to standardize the evaluation of new CIM proposals across circuits, architectures, workloads, and mapping methodologies (Andrulis et al., 2024, Gao et al., 2019).

The trajectory of CIM research indicates continued convergence of in-memory compute, device-circuit co-design, mapping/compiler innovation, and application-driven optimization, facilitating robust, scalable, and energy-proportional acceleration of neural computation across deployment domains.
