Compute-In-Memory (CiM) Overview
- Compute-In-Memory (CiM) is a paradigm that integrates arithmetic and logic within memory arrays, reducing data movement and overcoming the memory wall.
- CiM architectures—digital, analog, and hybrid—enable massively parallel, energy-efficient computations for applications like DNN inference and associative search.
- Recent advances focus on full-stack modeling, algorithm-hardware co-optimization, and robust mapping strategies to enhance performance and scalability.
Compute-In-Memory (CiM) is a device- and architecture-level paradigm that integrates arithmetic and logic operations directly within the memory array, addressing the “memory wall” bottleneck endemic to conventional von Neumann systems. By collapsing computation and storage, CiM architectures minimize data movement and enable massively parallel, low-energy operation for critical workloads such as deep neural network (DNN) inference, numerical kernels, and associative search. The following sections synthesize the structural, algorithmic, and system-level advances in CiM, drawing on recent primary research and comprehensive modeling efforts.
1. Definition and Architectural Principles
CiM architectures physically embed multiply-accumulate (MAC) or logical operations in or immediately adjacent to memory arrays, unifying processor and storage in high-density structures. In SRAM-based digital CiM, standard bitcells are repurposed as logic units that perform bit-wise operations (AND, XNOR, etc.), with digital accumulation in local adder trees or bit-serial datapaths. Analog CiM leverages nonvolatile memory crossbars (e.g., ReRAM, FeFET, PCM) in which input voltages applied to word lines interact with stored conductances, so that each column accumulates an analog sum governed by Ohm’s and Kirchhoff’s laws. This native parallelism enables a dot-product or bitwise logic operation across an entire crossbar or array slice in one or a few cycles (Yoshioka et al., 2024, Wolters et al., 2024).
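The analog accumulation described above can be illustrated with a minimal sketch of an ideal crossbar, assuming no wire resistance, noise, or ADC effects. The function name `crossbar_mvm` and the conductance and voltage values are hypothetical and chosen only for illustration.

```python
def crossbar_mvm(voltages, conductances):
    """Ideal crossbar read: column current I_j = sum_i V_i * G[i][j].

    voltages     -- one input voltage per word line (row), in volts
    conductances -- matrix G[row][col] of stored conductances, in siemens
    """
    n_rows = len(conductances)
    n_cols = len(conductances[0])
    if len(voltages) != n_rows:
        raise ValueError("one input voltage is needed per word line (row)")
    currents = [0.0] * n_cols
    for i, v in enumerate(voltages):        # each word line
        for j in range(n_cols):             # each bit line
            # Ohm's law gives the cell current; Kirchhoff's current law sums it
            currents[j] += v * conductances[i][j]
    return currents


# Hypothetical 3x2 array: conductances in siemens, inputs in volts.
G = [[1e-6, 2e-6],
     [3e-6, 4e-6],
     [5e-6, 6e-6]]
V = [0.1, 0.2, 0.0]
I = crossbar_mvm(V, G)   # one current per column, in amperes
```

In hardware all columns settle concurrently, so the nested loop here stands in for what the array performs in a single read.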
The architectural motivation is the overwhelming energy and latency cost of shuttling operands between memory and computing units in classical designs, especially as DNN models and data sets grow. By making weights stationary and steering only input vectors, CiM arrays can execute high-intensity operations—such as GEMMs in DNN inference—with orders-of-magnitude greater efficiency than separated designs (Sharma et al., 2023, Kundu et al., 2024, Andrulis et al., 2024).
2. Taxonomy and Physical Implementations
CiM arrays are broadly categorized as digital (DCIM), analog (ACIM), or hybrid.
Digital CiM (DCIM)
- Implemented within standard CMOS memories (SRAM, DRAM, MRAM).
- Bitcells are augmented to execute logic functions or bit-serial MACs, with accumulation in digital adder trees (DAT) or local registers.
- Supports high arithmetic precision (8–16 bit fixed-point, BF16, etc.), and benefits from lithography scaling (Yoshioka et al., 2024, Gao et al., 2019).
- Example: In TPU architectures, digital CiM cores replace conventional systolic MXUs, preserving weight-stationary and output-stationary dataflows while delivering improvements of up to 2× in area, 9–27× in energy, and up to 44% in throughput over standard designs (Zhu et al., 1 Mar 2025).
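The bit-serial MAC with adder-tree accumulation mentioned in the list above can be sketched as follows. This is an illustrative model for unsigned operands only; the function name `bit_serial_mac` and the default bit widths are assumptions, not taken from any cited design.

```python
def bit_serial_mac(inputs, weights, input_bits=8, weight_bits=8):
    """Dot product of unsigned integers computed one bit-plane at a time.

    Each (input bit, weight bit) pair is combined with a 1-bit AND in every
    row; the per-row results are summed (the adder tree) and then weighted
    by the bit significance with a shift before accumulation.
    """
    acc = 0
    for ib in range(input_bits):            # input bits are streamed serially
        for wb in range(weight_bits):       # one bitcell column per weight bit
            popcount = sum(((x >> ib) & 1) & ((w >> wb) & 1)
                           for x, w in zip(inputs, weights))
            acc += popcount << (ib + wb)    # shift-and-add accumulation
    return acc


# Hypothetical operands; each must fit in the stated bit widths.
xs = [3, 5, 250]
ws = [7, 0, 13]
result = bit_serial_mac(xs, ws)
```

Because every step is exact integer logic, the result equals the ordinary dot product, which is the property that lets digital CiM support higher precisions.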
Analog CiM (ACIM)
- Built from crossbar arrays employing emerging NVMs: ReRAM, FeFET, PCM.
- Arithmetic is realized by mapping weights to conductances and applying input voltages; the resulting output currents or charges are summed and digitized by ADCs.
- Multiple modalities:
  - Current-domain: Passive current summation on bitlines, high energy efficiency for up to 6-bit precision.
  - Charge-domain: Charge redistribution across MOM capacitors, with ADC readout, enabling up to 12-bit precision.
  - Time-domain: Encoding multiplication as pulse delays, summed in time before digitization (Yoshioka et al., 2024).
- Recent subthreshold FeFET arrays exploit extremely low supply voltages but require robust design to ameliorate temperature drift (Zhou et al., 2023, Zhou et al., 2 Jan 2025).
- Example: Subthreshold-FeFET CiM achieves 2866 TOPS/W at 14 nm, sustaining 8-bit VGG/CIFAR-10 inference accuracy over 0–85°C (Zhou et al., 2023). TReCiM extends this to multibit storage, achieving 91.31% CIFAR-10 accuracy and 48 TOPS/W (Zhou et al., 2 Jan 2025).
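The effect of ADC readout on an analog column sum can be sketched with a simple uniform-quantizer model. This assumes binary word-line activations, unsigned weights stored as conductance levels, and an ADC whose full scale equals the largest possible column sum; the names `adc_quantize` and `analog_column_mac` are hypothetical.

```python
def adc_quantize(value, full_scale, bits):
    """Uniform ADC model: clip to [0, full_scale], round to a 'bits'-wide code."""
    levels = (1 << bits) - 1
    clipped = min(max(value, 0.0), full_scale)
    return round(clipped / full_scale * levels)


def analog_column_mac(inputs, weights, weight_bits, adc_bits):
    """Return (ADC code, reconstructed sum) for one crossbar column.

    inputs  -- 0/1 word-line activations
    weights -- unsigned integers representing stored conductance levels
    """
    max_w = (1 << weight_bits) - 1
    analog_sum = sum(x * w for x, w in zip(inputs, weights))  # ideal column sum
    full_scale = len(weights) * max_w                         # largest possible sum
    code = adc_quantize(analog_sum, full_scale, adc_bits)
    estimate = code * full_scale / ((1 << adc_bits) - 1)      # digital reconstruction
    return code, estimate


# Hypothetical column with four rows and 3-bit weights; the exact sum is 10.
acts = [1, 0, 1, 1]
wts = [3, 7, 5, 2]
code4, est4 = analog_column_mac(acts, wts, weight_bits=3, adc_bits=4)
code8, est8 = analog_column_mac(acts, wts, weight_bits=3, adc_bits=8)
```

Comparing `est4` and `est8` against the exact sum shows how a lower-resolution ADC trades readout accuracy for the area and energy savings that motivate analog designs.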
Hybrid CiM
- Combines DCIM for high-contribution MSBs (zero-error) with ACIM for LSBs (energy-efficient but tolerant of analog noise).
- Example: 8-bit MAC macros with digital MSBs and charge-domain analog LSBs reduce total energy by 2–4× versus fully digital designs (Yoshioka et al., 2024, Yi et al., 11 Feb 2025).
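The MSB/LSB partitioning described above can be sketched as follows. This is a simplified model that assumes unsigned 8-bit weights, an exact digital path for the upper bits, and additive Gaussian noise on the analog lower-bit partial sum; the split point, the noise model, and the name `hybrid_mac` are illustrative assumptions rather than details of the cited macros.

```python
import random


def hybrid_mac(inputs, weights, split=4, lsb_noise_sigma=0.5, seed=0):
    """Dot product with the weight MSBs computed exactly and the LSBs noisily.

    split           -- number of low-order weight bits sent to the analog path
    lsb_noise_sigma -- std. dev. of the assumed Gaussian noise on the LSB sum
    """
    rng = random.Random(seed)
    mask = (1 << split) - 1
    # Digital path: upper weight bits, no error.
    msb_sum = sum(x * (w >> split) for x, w in zip(inputs, weights))
    # Analog path: lower weight bits, perturbed by noise.
    lsb_sum = sum(x * (w & mask) for x, w in zip(inputs, weights))
    lsb_noisy = lsb_sum + rng.gauss(0.0, lsb_noise_sigma)
    # Recombine: only the MSB partial sum carries the 2**split weighting.
    return (msb_sum << split) + lsb_noisy


# Hypothetical operands; the exact dot product is 522.
xs = [1, 2, 3]
ws = [200, 17, 96]
ideal = hybrid_mac(xs, ws, lsb_noise_sigma=0.0)
noisy = hybrid_mac(xs, ws, lsb_noise_sigma=0.5)
```

The analog noise enters the result unscaled, whereas the same noise on the MSB partial sum would be multiplied by `2**split`, which is the rationale for keeping the high-order bits digital.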
3. Quantitative Performance, Energy, and Design Exploration
CiM device and system modeling frameworks enable comprehensive, rapid exploration of energy–performance–area tradeoffs across technology nodes, array organizations, and mapping strategies.
Full-stack Modeling and Evaluation
- CiMLoop is an open-source full-stack CiM model supporting devices, circuits, architectures, workloads, and mapping in a unified container-hierarchy language (Andrulis et al., 2024).
  - Flexible YAML-based specification allows arbitrary topologies (e.g., digital/analog macros, hierarchical interconnects).
  - Data-value-dependent and statistical energy models capture operand-dependent behaviors, circuit non-idealities, and execution counts.
  - Integration with Timeloop provides systematic DNN mapping and spatial/temporal tiling sweep.
  - Demonstrated 3% mean energy error versus NeuroSim with 1000× speedup.
Key Performance Metrics and Comparative Results
| Parameter | Digital CiM | Analog CiM | Hybrid CiM |
|---|---|---|---|
| Precision | 8–16 bit, FP | 4–12 bit | 10–12 bit |
| Energy Efficiency | 10–100 TOPS/W | 100–4000 TOPS/W | 100–2000 TOPS/W |
| Area Overhead | High (DAT) | Low–Moderate | Medium |
| MAC Latency | ~1 ns | 1–10 ns | ~1–5 ns |
- Experimental studies confirm CiM can yield 1.3–8× energy savings and 1.2–15.6× throughput gains on DNN workloads, even in commercial nodes (14 nm–40 nm) (Sharma et al., 2023, Zhu et al., 1 Mar 2025, Lin et al., 2024).
- Device and circuit innovations (e.g., SiTe CiM for ternary DNNs (Thakuria et al., 2024), cryogenic CiM via QAHE (Alam et al., 2021), clamp-based, multi-level FeFET for temperature resilience (Zhou et al., 2 Jan 2025)) further extend the efficiency and robustness of CiM fabrics.
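To relate the TOPS/W ranges in the table above to per-operation energy, note that 1 TOPS/W corresponds to 10^12 operations per joule, that is, 1 pJ per operation. The helper below is a small illustrative conversion; the name `femtojoules_per_op` is hypothetical, and what counts as one "operation" (for example, whether a MAC is counted as one or two operations) varies between papers.

```python
def femtojoules_per_op(tops_per_watt):
    """Convert an efficiency in TOPS/W to energy per operation in femtojoules.

    1 TOPS/W = 1e12 ops per joule = 1 pJ/op = 1000 fJ/op.
    """
    if tops_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    return 1000.0 / tops_per_watt


# End points of the ranges listed in the comparison table.
digital_range = (femtojoules_per_op(10), femtojoules_per_op(100))
analog_range = (femtojoules_per_op(100), femtojoules_per_op(4000))
hybrid_range = (femtojoules_per_op(100), femtojoules_per_op(2000))
```

Under this conversion, the digital range of 10–100 TOPS/W corresponds to 100–10 fJ per operation, and the upper end of the analog range to 0.25 fJ per operation.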
4. Mapping, Dataflow, and Compiler Methodologies
Robust mapping and compiler infrastructures are essential to realize the system-level benefits of CiM.
- Weight-Stationary and Output-Stationary Flows: Optimizing data reuse by keeping weights or outputs fixed in the array and streaming the complementary operand.
- Hybrid Weight/Output Stationarity: Layer-wise selection between stationary operands to minimize on-chip and DRAM traffic, e.g., FlexSpIM achieves up to 90% system energy savings for event-based SNNs (Chauvaux et al., 2024).
- Column-wise Quantization: Matching the quantization granularity of weights and partial sums at the column level improves DNN accuracy and robustness while enabling low-precision ADCs and simplified dequantization (Kim et al., 11 Feb 2025).
- Block- and Data-driven Mapping: Block-wise allocation and dataflow optimize array utilization and relieve synchronization barriers, yielding up to 7.47× throughput improvement in eNVM-CiM (Crafton et al., 2020).
- Co-Design for DNNs and Hardware: Frameworks such as CiMNet perform joint optimization of sub-networks and hardware parameters, achieving up to 3.6× execution speedup at matched accuracy (Kundu et al., 2024).
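The motivation for column-level quantization granularity can be illustrated with a sketch that compares one scale per weight column against a single scale for the whole matrix. This covers only symmetric weight quantization and does not reproduce the cited method, which also aligns partial-sum quantization; the function names and example weights are hypothetical.

```python
def quantize_tensor(weights, bits):
    """Symmetric quantization with a single scale for the whole matrix."""
    qmax = (1 << (bits - 1)) - 1
    t_max = max(abs(w) for row in weights for w in row)
    scale = t_max / qmax if t_max > 0 else 1.0
    return [[round(w / scale) for w in row] for row in weights], scale


def quantize_columns(weights, bits):
    """Symmetric quantization with one scale per column (per crossbar bit line)."""
    qmax = (1 << (bits - 1)) - 1
    n_rows, n_cols = len(weights), len(weights[0])
    scales = []
    for j in range(n_cols):
        col_max = max(abs(weights[i][j]) for i in range(n_rows))
        scales.append(col_max / qmax if col_max > 0 else 1.0)
    q = [[round(weights[i][j] / scales[j]) for j in range(n_cols)]
         for i in range(n_rows)]
    return q, scales


# Hypothetical weights: column 0 is much smaller in magnitude than column 1.
W = [[0.02, 1.0],
     [-0.006, -0.4]]
q_tensor, s_tensor = quantize_tensor(W, bits=4)
q_cols, s_cols = quantize_columns(W, bits=4)
```

With a single scale, the small-magnitude column collapses to zeros, while per-column scales preserve it. Because each column's partial sum shares one scale, dequantization reduces to one multiplication per column output.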
5. Domain-specific Architectures and Applications
CiM is proliferating across a wide array of domains, including:
- Voxel-based 3D Perception: Voxel-CIM employs depth-encoding output-major search and sub-matrix weight mapping to achieve 4.5–7× higher energy efficiency and up to 8.1× speedup on 3D point cloud neural networks (Lin et al., 2024).
- LLM Inference: Memory-centric CiM accelerators (ReRAM, FeFET, DRAM+NVM) deliver 10×–1000× energy reduction and up to 137× speedup versus GPU LLM baselines by mapping matrix-vector multiplications and attention ops into analog crossbars (Wolters et al., 2024).
- Neuro-Symbolic and Associative Workloads: Charge-domain 1FeFET-1C CiM arrays realize both MAC and XNOR-based associative search (CAM) natively, providing >1000× energy reduction and dual-mode flexibility for neuro-symbolic AI (Yin et al., 2024).
- Spiking Neural Networks (SNNs): FlexSpIM’s unified storage and hybrid-stationary dataflows deliver microsecond-scale inference latency and state-of-the-art 95.8% accuracy on the IBM DVS Gesture dataset (Chauvaux et al., 2024).
- Bayesian DNNs and Stochastic Inference: MC-CIM embeds in-memory random number generation and compute-reuse for probabilistic DNN inference, reducing the energy of MC-Dropout by 43% (Shukla et al., 2021).
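The XNOR-based associative search mentioned in the list above can be sketched functionally: each stored row is compared with the query bit by bit, and the row with the most matching bits wins. This is a behavioral model only, not a description of the 1FeFET-1C circuit; the function names and bit patterns are hypothetical.

```python
def xnor_match_score(query, stored, width):
    """Count bit positions where query and stored agree: popcount of XNOR."""
    mask = (1 << width) - 1
    return bin(~(query ^ stored) & mask).count("1")


def associative_search(query, rows, width):
    """Return (index of best-matching row, list of per-row match scores)."""
    scores = [xnor_match_score(query, r, width) for r in rows]
    best = max(range(len(rows)), key=scores.__getitem__)
    return best, scores


# Hypothetical 8-bit content-addressable memory with three stored words.
stored_rows = [0b10110010, 0b10110110, 0b01001001]
query_word = 0b10110110
best_row, match_scores = associative_search(query_word, stored_rows, width=8)
```

In an array, all rows are scored concurrently; the loop here only emulates that parallel comparison. The same XNOR-and-accumulate structure is what allows one array to serve both MAC and search modes.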
6. Open Challenges, Trade-offs, and Research Directions
Key technical challenges remain at the intersection of devices, circuits, mapping, and systems:
- Precision–Efficiency Trade-off: Analog CiM excels at low/medium precision; digital/hybrid approaches are needed for 8–16 bit and floating-point tasks (Yoshioka et al., 2024, Yi et al., 11 Feb 2025).
- Device Non-idealities: Mitigation of programming noise, temperature drift, and process variation is critical—approaches include feedback circuits (e.g., 2T-1FeFET), algorithm-level retraining, and data-driven quantization (Zhou et al., 2023, Zhou et al., 2 Jan 2025, Kim et al., 11 Feb 2025).
- Array Size and Peripheral Overhead: ADC/DAC area and energy can dominate in ACIM; approaches such as column-wise quantization and group-convolution can relax ADC requirements (Kim et al., 11 Feb 2025).
- Endurance and Scaling: NVM endurance still limits frequent reprogramming (particularly for dynamic DNNs); endurance-aware mapping and hybrid analog/digital architectures offer partial solutions (Wolters et al., 2024).
- Compiler and Run-time Systems: Dynamic mode-switching (CMSwitch) and per-segment resource allocation substantially increase end-to-end performance (1.31× over fixed-mode compilers) by exploiting workload heterogeneity (Zhao et al., 24 Feb 2025).
Continued algorithm–hardware co-optimization, holistic toolchains (full-stack modeling, loop mapping, EDA), and device advances (3D NVM, scalable FeFET, cryogenic fabrics (Alam et al., 2021)) remain active research directions.
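The array-size and peripheral-overhead challenge listed above can be made concrete with a small calculation of the ADC resolution needed to read a column sum without loss. The sketch assumes unsigned inputs and weights and a worst case in which every row contributes its maximum product; the name `lossless_adc_bits` is hypothetical.

```python
def lossless_adc_bits(rows, input_bits, weight_bits):
    """Bits needed to represent the largest possible column sum exactly.

    Worst case: every row holds the maximum weight and receives the
    maximum input, so the sum is rows * (2**a - 1) * (2**w - 1).
    """
    max_sum = rows * ((1 << input_bits) - 1) * ((1 << weight_bits) - 1)
    return max_sum.bit_length()


# Hypothetical array configurations.
bits_binary_256 = lossless_adc_bits(rows=256, input_bits=1, weight_bits=1)
bits_4b_weights = lossless_adc_bits(rows=128, input_bits=1, weight_bits=4)
```

The required resolution grows with the logarithm of the row count and linearly with operand precision, which is why techniques that tolerate lower-precision ADCs matter for large analog arrays.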
7. Summary Table of Representative CiM Implementations
| Approach | Key Device | Precision | Energy Efficiency (TOPS/W unless noted) | Notable Feature | Reference |
|---|---|---|---|---|---|
| Digital SRAM CiM | 6T/8T SRAM | 8–16 bit | 10–100 | High-precision, scalable | (Yoshioka et al., 2024, Zhu et al., 1 Mar 2025) |
| Analog FeFET (2T/1T, MLC) | FeFET | 1–8 bit | 48–2866 | Temperature resilience, subthreshold | (Zhou et al., 2023, Zhou et al., 2 Jan 2025) |
| ReRAM-based Analog CiM | 1T1R ReRAM | 1–8 bit | 26.7 – >1000 | Multi-level conductance, DNN mapping | (Wolters et al., 2024) |
| Hybrid FP8 SRAM Macro | 6T SRAM + pseudo-logic | FP8 | ~50 | Digital ADD, analog MUL, 3-bit ADC | (Yi et al., 11 Feb 2025) |
| SiTe Ternary SRAM CiM | 8T SRAM/3T eDRAM | 2 bit (±1,0) | Up to 7× NM/TiM | Area/speed trade-off, ultra-low precision DNNs | (Thakuria et al., 2024) |
| Dual-mode FeFET-DRAM CAM | 1FeFET–1C | 1 bit | >1000× vs GPU | MAC + associative search, DRAM compatibility | (Yin et al., 2024) |
| Cryogenic QAHE CiM | QAHE graphene | 1 bit | ~0.01 fJ/bit | Single-cycle logic, 4 K operation | (Alam et al., 2021) |
This body of work reflects the technical depth and diversity of current CiM research, with clear progress toward scalable, efficient, and robust in-memory computation for a broad range of AI and edge workloads. Continued integration of device, circuit, architecture, mapping, and compiler research remains central to future advances in the field.