Compute-In-Memory (CIM) Architecture

Updated 6 March 2026

Compute-In-Memory (CIM) is an innovative architecture that performs arithmetic operations directly within memory arrays, reducing data movement and energy costs.
It leverages digital, analog, and hybrid implementations to execute MAC operations in situ, achieving significant gains in parallelism and efficiency.
CIM is key for scaling deep neural networks and other AI systems, with applications ranging from edge AI to large language model inference.

Compute-In-Memory (CIM) is an architectural approach in which fundamental arithmetic operations, most commonly multiply-accumulate (MAC), are performed directly within memory arrays rather than in discrete compute units. Its explicit aim is to mitigate the “memory wall”: the high energy and latency cost of data transfers in von Neumann architectures. By co-locating storage and computation, CIM enables high parallelism, reduces on-chip and off-chip traffic, and achieves significant gains in area and energy efficiency, which is crucial for scaling deep neural networks (DNNs), large language models (LLMs), and a broad spectrum of data-intensive AI workloads.

1. Core Principles of Compute-In-Memory

CIM fundamentally departs from traditional processor-memory separation by embedding computation (e.g., MAC, bitwise logic) within memory arrays, such as SRAM, ReRAM, PCM, or FeFET crossbars. In a canonical analog CIM operation, input activations are encoded as voltages on word lines, memory cell conductances encode weights, and the physical summation of currents along the bit lines implements a vector-matrix multiply via Ohm’s and Kirchhoff’s laws. The outputs are digitized through peripheral analog-to-digital converters (ADCs) for subsequent processing or accumulation (Wolters et al., 2024).

This approach yields:

Substantial reduction in data movement, often an order of magnitude decrease in energy and latency compared to moving data to separate processing units (Song et al., 20 Aug 2025, Sharma et al., 2023).
Very high parallelism, as entire rows or columns of the memory array perform computation concurrently.
Architectural flexibility: CIM can be integrated at various levels of the memory hierarchy, including L1/L2 SRAM and high-density embedded nonvolatile memory (eNVM), or as embedded functional blocks in domain-specific accelerators (Sharma et al., 2023, Gao et al., 2019).
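
The canonical analog operation described in this section (voltages on word lines, conductances as weights, currents summed on bit lines, then digitized) can be sketched in a few lines of Python. This is a toy, noise-free model under stated assumptions: the function names `crossbar_vmm` and `adc`, the array size, the conductance values, and the ADC full-scale range are all illustrative and are not taken from the cited works.

```python
# Toy, idealized model of an analog CIM crossbar (illustrative values only).

def crossbar_vmm(voltages, conductances):
    """Bit-line currents I_j = sum_i V_i * G[i][j].

    Ohm's law gives the per-cell current V_i * G[i][j]; Kirchhoff's
    current law sums the cell currents that share a bit line.
    """
    n_cols = len(conductances[0])
    return [
        sum(v * row[j] for v, row in zip(voltages, conductances))
        for j in range(n_cols)
    ]

def adc(current, full_scale, bits):
    """Uniform ADC: map a current in [0, full_scale] to an integer code."""
    levels = (1 << bits) - 1
    clipped = min(max(current, 0.0), full_scale)
    return round(clipped / full_scale * levels)

# 3 word lines (inputs) x 2 bit lines (outputs)
V = [1.0, 0.5, 0.0]            # word-line voltages in volts
G = [[1e-6, 2e-6],
     [2e-6, 4e-6],
     [3e-6, 1e-6]]             # cell conductances in siemens

I = crossbar_vmm(V, G)         # bit-line currents in amperes
codes = [adc(i, full_scale=5e-6, bits=4) for i in I]
print(I, codes)
```

All columns are evaluated from the same applied voltages, which is where the row/column parallelism comes from; the ADC step is where resolution is lost.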

2. CIM Circuit and Device Implementations

CIM can be realized in several device and circuit flavors:

Digital CIM (DCIM): Each cell implements digital logic (e.g., AND, XNOR), often with periphery adder trees. Bit-serial and bit-parallel approaches enable scalable precision. DCIM offers high accuracy and scaling to advanced technology nodes but incurs area/power cost in the adder tree (Yoshioka et al., 2024).
Analog CIM (ACIM): Both multiplication and accumulation are performed in the analog domain—using current-mode (I=GV), charge-mode (Q=CV), or time-mode (using delays) integrators. ACIM delivers unmatched energy/area efficiency at medium precision (3–8b) due to analog summing but faces precision and variability limitations (Yoshioka et al., 2024).
Hybrid CIM: Recent architectures combine DCIM (for MSBs) and ACIM (for LSBs/data-driven partitioning) to optimize the accuracy–energy trade-off. Hybrid cells can dynamically switch roles depending on data saliency or desired precision (Yoshioka et al., 2024, Yi et al., 11 Feb 2025).
Device Technologies: Implementations span mature 6T/8T SRAM for digital CIM, high-density ReRAM/PCM/FeFET for analog and mixed-signal CIM, with temperature- and process-variation-resilient designs now demonstrated for subthreshold operation (e.g., FeFET arrays stable over 0–85°C (Zhou et al., 2023)).
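
To make the digital variant concrete, the following is a minimal sketch of a bit-serial MAC: each cycle applies one input bit-plane, every cell gates its stored weight with that bit (an AND), an adder tree sums the column, and a shift-accumulator applies the bit's significance. It assumes unsigned operands; the function name `bit_serial_mac` and the 4-bit input width are illustrative choices, not a specific published macro.

```python
def bit_serial_mac(inputs, weights, in_bits=4):
    """Dot product of unsigned integers, one input bit-plane per cycle."""
    acc = 0
    for b in range(in_bits):                         # LSB first
        bit_plane = [(x >> b) & 1 for x in inputs]   # one bit of every input
        # Cell logic: AND the input bit with the stored weight.
        products = [w if bit else 0 for bit, w in zip(bit_plane, weights)]
        partial = sum(products)                      # peripheral adder tree
        acc += partial << b                          # shift-and-accumulate
    return acc

inputs = [3, 5, 7, 2]      # must fit in in_bits bits
weights = [1, 4, 2, 6]
result = bit_serial_mac(inputs, weights)
print(result)              # 49, the same as the ordinary dot product
```

Precision scales by adding cycles (more input bits) rather than by changing the cell, which is one reason bit-serial schemes are described as enabling scalable precision.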

Table: Key CIM Memory Technologies (Wolters et al., 2024)

Technology	CIM Mode	Multi-Level	Cell Area	Peak TOPS/W	Comments
SRAM (6T)	Digital	No	160 F²	~32	Process scalable
ReRAM	Analog	Yes	16 F²	~400–4,000	eNVM, high density
PCM	Analog	Yes	16 F²	~30	Slower writes
FeFET	Analog	Yes	16 F²	~3,000	CMOS compatible
MRAM	Digital	No	30–80 F²	>30	Low leakage, no MLC

3. Quantization, Precision, and Error Management

Low-precision operation is central to CIM’s efficiency, but imposes notable accuracy challenges:

Quantization Schemes: Fine-grained quantization of both weights and partial sums directly at the array level is essential, especially given cell-level bit limitations and low ADC resolutions. Recent advances, such as column-wise quantization (assigning independent scale factors per column for weights and partial sums), ensure that rounding errors are adaptively minimized, maximizing the use of the representable dynamic range and boosting overall model accuracy. Experimental results show 1–2% top-1 accuracy improvements and enhanced robustness to device variation (Kim et al., 11 Feb 2025).
Adaptive Granularity: Aligning quantization granularity (e.g., per-column or per-group) enables joint optimization of the noise introduced by weight and partial-sum quantization, removing the need for two-stage quantization-aware training and simplifying implementation (Kim et al., 11 Feb 2025).
Device Variation Compensation: Architectural and algorithmic solutions exploit device physics (feedback compensation in subthreshold FeFETs (Zhou et al., 2023)), as well as error-aware scheduling and programmable trade-off between error and throughput (as in the “Counting Cards” algorithm, which allocates wordline parallelism per sub-operation to maximize performance within given error bounds) (Crafton et al., 2020).
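
The benefit of per-column scale factors can be illustrated with a small sketch that compares one shared scale for a whole weight matrix against an independent scale per column. It is a simplified illustration under stated assumptions: symmetric uniform quantization, max-abs scale selection, weights only (partial sums could be treated analogously), and made-up weight values. It is not the training procedure of the cited work.

```python
def quantize(values, scale, bits):
    """Symmetric uniform quantization to signed integers, then dequantize."""
    qmax = (1 << (bits - 1)) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def max_abs(values):
    return max(abs(v) for v in values)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

bits = 4
qmax = (1 << (bits - 1)) - 1
# Two columns with very different dynamic ranges (illustrative values).
W = [[0.9, 0.01],
     [-0.7, -0.02],
     [0.5, 0.03]]
cols = [list(c) for c in zip(*W)]

# Per-tensor: one scale shared by every column.
tensor_scale = max_abs([w for row in W for w in row]) / qmax

err_tensor = 0.0
err_column = 0.0
for col in cols:
    err_tensor += mse(col, quantize(col, tensor_scale, bits))
    col_scale = max_abs(col) / qmax          # per-column scale factor
    err_column += mse(col, quantize(col, col_scale, bits))

print(err_tensor, err_column)
```

With the shared scale, the small-valued column rounds entirely to zero; with its own scale it uses the full representable range, so the summed error drops.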

4. CIM System Architecture and Dataflows

CIM is architected as large, tiled crossbar arrays tightly coupled to a memory/computation hierarchy:

Hierarchical Integration: CIM macros can be deployed in register files, shared memory, or as L1/L2 cache compute engines, with analytical and empirical studies demonstrating up to 15.6× throughput and 8.3× energy improvements (analog macros) for GEMM workloads versus baseline tensor-cores (Sharma et al., 2023).
Dataflows: Weight-stationary, output-stationary, and input-stationary mappings determine which operand remains local, minimizing memory movement. Recent “kernel duplication” dataflows for depthwise convolution maximize spatial reuse and reduce buffer traffic by up to 87% compared to conventional weight-stationary dataflows across MobileNet and EfficientNet models (Song et al., 20 Aug 2025).
Group and Bit-Split Convolution: Array-level tiling, group convolutions, and efficient handling of multi-cell per-bit weights map high-dimensional DNN kernels onto arrays without expensive data flattening (Kim et al., 11 Feb 2025).
Flexible Operand Precision and Shape: Architectures supporting arbitrary operand bitwidth and hybrid stationarity further enhance utilization and energy efficiency (see FlexSpIM for SNNs (Chauvaux et al., 2024)).
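
As a rough illustration of tiled, weight-stationary mapping, the sketch below splits a weight matrix into fixed-size tiles (one per array), keeps each tile in place, streams in only the matching slice of the input vector, and accumulates the per-tile partial sums. The function name `tiled_matvec` and the 2×2 array size are hypothetical and chosen only to keep the example small.

```python
def tiled_matvec(x, W, rows=2, cols=2):
    """Compute y = x @ W with W split into rows x cols tiles.

    Weight-stationary: each tile of W stays in its array; the input slice
    is streamed to it and partial sums are accumulated outside the array.
    Returns the output vector and the number of tiles (arrays) used.
    """
    n_in, n_out = len(W), len(W[0])
    y = [0] * n_out
    tiles_used = 0
    for r0 in range(0, n_in, rows):
        for c0 in range(0, n_out, cols):
            tiles_used += 1
            for j in range(c0, min(c0 + cols, n_out)):
                # Partial sum produced by this tile for output column j.
                y[j] += sum(x[i] * W[i][j]
                            for i in range(r0, min(r0 + rows, n_in)))
    return y, tiles_used

x = [1, 2, 3]
W = [[1, 0, 2],
     [0, 1, 1],
     [3, 1, 0]]
y, tiles = tiled_matvec(x, W)
print(y, tiles)   # [10, 5, 4] computed with 4 tiles
```

Edge tiles here are only partly filled (a 3×3 matrix on 2×2 arrays), which is the kind of under-utilization that mapping and allocation strategies try to reduce.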

5. Modeling, Co-Design Tools, and Compiler Integration

CIM system design mandates end-to-end modeling tools and cross-layer co-optimization:

Architecture Modeling: Tools such as Eva-CiM and CiMLoop allow multi-level (device → array → architecture → system) energy and performance estimation, integrating detailed SPICE/array modeling with ISA-level offloading analysis (Gao et al., 2019, Andrulis et al., 2024). Data-value-aware energy models capture the influence of operand distributions on total energy, and statistical abstractors yield >1,000× speed-up over detailed simulators while maintaining <7% energy estimation error (Andrulis et al., 2024).
Mapping Strategies: Loop-nest mapping and area-constraint-aware tiling algorithms, such as those in WWW (Sharma et al., 2023), identify optimal integration points (e.g., register file versus shared SRAM) and when/where to deploy CIM for throughput or energy optimization.
Compilation and Dual-Mode Switching: Advanced compilers such as CMSwitch explicitly exploit CIM arrays’ capability to switch between compute and memory modes dynamically, enabling optimal resource reallocation between MAC compute and buffer (KV cache) capacity on a per-segment basis for DNN and LLM workloads. This achieves a mean 1.31× speedup over prior CIM accelerators (Zhao et al., 24 Feb 2025).
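
The trade-off behind dual-mode allocation can be illustrated with a deliberately simple cost model. The sketch below is hypothetical and is not CMSwitch's algorithm: it assumes a fixed pool of arrays, each either computing or buffering, that latency is the larger of compute time and off-chip spill time, and that all parameter names and numbers are invented for illustration.

```python
def best_split(n_arrays, macs, data_units, macs_per_array_per_s,
               units_per_array, offchip_units_per_s):
    """Brute-force the number of compute-mode arrays that minimizes latency.

    Arrays not used for compute act as on-chip buffer; data that does not
    fit in that buffer must be fetched off-chip (toy model, see lead-in).
    """
    best = None
    for k in range(1, n_arrays + 1):                  # k arrays compute
        compute_s = macs / (k * macs_per_array_per_s)
        buffer_capacity = (n_arrays - k) * units_per_array
        spill = max(0, data_units - buffer_capacity)  # off-chip traffic
        memory_s = spill / offchip_units_per_s
        latency = max(compute_s, memory_s)
        if best is None or latency < best[1]:
            best = (k, latency)
    return best

k, latency = best_split(n_arrays=8, macs=8000, data_units=6000,
                        macs_per_array_per_s=1000, units_per_array=1000,
                        offchip_units_per_s=1000)
print(k, latency)   # 4 compute arrays balance compute and memory time
```

A workload segment with a different compute-to-data ratio would favor a different split, which is the intuition behind choosing the mode assignment per segment.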

6. Performance Metrics, Efficiency Gains, and Application Domains

Key performance indicators for CIM include throughput (TOPS), energy efficiency (TOPS/W), energy per operation (fJ–pJ/MAC), area efficiency (TOPS/mm²), and application-level accuracy. Representative findings:

Energy efficiency for SRAM-based DCIM exceeds 32 TOPS/W at INT8–INT12, while charge-based ACIM can reach >800 TOPS/W at moderate precision (Yoshioka et al., 2024).
CIM architectures for 3D pointcloud networks (Voxel-CIM) achieve 4.5–7.0× energy efficiency and 2.4–8.1× speedup compared to standard accelerators on detection/segmentation tasks (Lin et al., 2024).
Analog–digital hybrid FP8 CIM macros demonstrate 1.53× energy reduction with <1% model accuracy loss versus digital baselines (Yi et al., 11 Feb 2025).
System-level speedup reaches up to 7.47× (block-wise allocation versus naive layer-wise allocation) for ResNet-18, with nearly 90% array utilization (Crafton et al., 2020).
Per-column quantization retains robustness to process variation (e.g., ±10% device variation) (Kim et al., 11 Feb 2025).
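
Efficiency figures quoted in TOPS/W convert directly to energy per operation, since 1 TOPS/W is 10^12 operations per joule, i.e. 1 pJ per operation. The helper below does that arithmetic for the two efficiency figures quoted above. Counting one MAC as two operations (a multiply and an add) is a common convention but an assumption here; the cited works may count differently.

```python
def energy_per_op_fj(tops_per_watt):
    """Energy per operation in femtojoules: 1 TOPS/W = 1e12 ops per joule."""
    return 1000.0 / tops_per_watt

def energy_per_mac_fj(tops_per_watt, ops_per_mac=2):
    """Energy per MAC, assuming ops_per_mac operations per MAC (assumption)."""
    return ops_per_mac * energy_per_op_fj(tops_per_watt)

print(energy_per_op_fj(32))     # 31.25 fJ per op at 32 TOPS/W
print(energy_per_op_fj(800))    # 1.25 fJ per op at 800 TOPS/W
print(energy_per_mac_fj(32))    # 62.5 fJ per MAC under the 2-ops convention
```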

CIM is now deployed in edge AI, LLM inference, event-based SNN processing, Bayesian inference (via MC-dropout in-memory sampling (Shukla et al., 2021)), and even in cryogenic control where topologically protected memory states enable single-cycle, ultra-low-energy Boolean logic (CryoCiM) (Alam et al., 2021).

7. Challenges, Trade-Offs, and Future Directions

Persistent research challenges include:

Precision/Scalability: Achieving >10-bit accuracy with analog CIM is limited by device mismatch, PVT variation, and ADC resolution. Hybrid and error-aware schemes offer some relief (Yoshioka et al., 2024).
Peripheral Overheads: ADCs remain the dominant area/energy consumer in most analog CIM designs. Solutions include collaborative memory-immersed SAR digitization, which reduces ADC area by up to 51×, enabling more arrays per die and overall higher parallelism (Nasrin et al., 2023).
Variation and Reliability: Adaptive calibration, feedback circuits, and fine-grained scale factor learning mitigate, but do not eliminate, the effects of process and environmental change.
Mapping and Architecture Flexibility: Dual-mode, shape-reconfigurable, and bitwise-precision-adaptive CIM macros enable dynamic co-optimization with evolving DNN workloads, but increase control complexity (Chauvaux et al., 2024, Zhao et al., 24 Feb 2025).
Integration with Established Accelerators: Drop-in CIM MXUs for TPUv4i replace conventional systolic arrays and yield >27× energy reduction for memory-bound workloads with no software ecosystem changes (Zhu et al., 1 Mar 2025).
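
One way to see why ADC resolution is a bottleneck is to count the bits needed to digitize a column sum without any loss. The sketch below uses a generic worst-case counting argument for unsigned operands; it is an illustration rather than a result from the cited works, and the function name `full_precision_adc_bits` is hypothetical.

```python
def full_precision_adc_bits(rows, in_bits=1, w_bits=1):
    """Bits needed to represent every possible column sum exactly.

    Worst case for unsigned operands: all rows active with maximum input
    and maximum weight, so max_sum = rows * (2^in_bits - 1) * (2^w_bits - 1).
    """
    max_sum = rows * ((1 << in_bits) - 1) * ((1 << w_bits) - 1)
    return max_sum.bit_length()

print(full_precision_adc_bits(64))          # 7 bits for 64 rows of 1b x 1b
print(full_precision_adc_bits(256))         # 9 bits for 256 rows of 1b x 1b
print(full_precision_adc_bits(256, 4, 4))   # 16 bits for 256 rows of 4b x 4b
```

Activating more rows at once or using multi-bit cells raises the required resolution, so practical designs either accept quantized partial sums or limit the number of rows activated in parallel.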

Open research directions include algorithm–hardware co-design that directly incorporates CIM non-idealities and quantization constraints, 3D integration of CIM with logic layers, runtime/compilation frameworks that dynamically remap compute and memory modes, and robust implementations in unreliable or variable environments.

By collapsing compute and memory boundaries, CIM stands as a foundational shift for post-von Neumann computing, offering sustainable architectural scaling in AI systems as data and model sizes continue to outstrip advances in conventional memory and compute technology.

References:

(Kim et al., 11 Feb 2025, Yoshioka et al., 2024, Sharma et al., 2023, Thakuria et al., 2024, Gao et al., 2019, Lin et al., 2024, Zhu et al., 1 Mar 2025, Andrulis et al., 2024, Zhao et al., 24 Feb 2025, Nasrin et al., 2023, Zhou et al., 2023, Song et al., 20 Aug 2025, Crafton et al., 2020, Alam et al., 2021, Chauvaux et al., 2024, Yi et al., 11 Feb 2025, Shukla et al., 2021, Crafton et al., 2020, Malhotra et al., 2022, Wolters et al., 2024)