Compute-in-Memory Architectures
- Compute-in-Memory (CiM) architectures integrate computation directly within memory arrays, performing in-situ arithmetic and logic operations that reduce data-transfer energy and latency.
- They employ digital, analog, and hybrid approaches using technologies like SRAM, ReRAM, and PCM, each offering specific trade-offs in area, precision, and energy efficiency.
- CiM architectures substantially boost AI performance; DNN inference implementations reach energy efficiencies around 7.3 TOPS/W and deliver large throughput improvements over conventional systems.
Compute-in-Memory (CiM) architectures physically integrate computation capabilities within or proximate to memory arrays, enabling bulk arithmetic or logic operations where operands are stored. This approach mitigates the classic von Neumann bottleneck caused by energy-intensive and latency-bound data transfers between separate compute and memory subsystems. CiM spans several technology domains, from digital SRAM and analog/NVM crossbars to emerging two-terminal memory devices, each with distinct design trade-offs regarding area, energy, throughput, and computational precision. Recent research has established CiM’s potential to transform AI acceleration, particularly for matrix-centric workloads such as deep neural network inference, improving energy and area efficiency by up to an order of magnitude over conventional digital platforms (Zhu et al., 1 Mar 2025, Yoshioka et al., 2024, Wolters et al., 2024).
1. Taxonomy of CiM Devices and Core Operational Principles
CiM architectures are principally classified along two axes: device technology (e.g., SRAM, DRAM, ReRAM, PCM, FeFET, Ferroelectric Diode, MRAM, QAHE) and compute style (digital, analog, or hybrid) (Yoshioka et al., 2024, Wolters et al., 2024, Liu et al., 2022, Alam et al., 2021).
- Digital CiM (DCIM): Employs digital logic (e.g., AND/XNOR, bit-serial or bit-parallel adder trees) within or near standard SRAM (6T or 8T) or NVM macros. Enables up to BF16 precision and is compatible with advanced CMOS scaling (Zhu et al., 1 Mar 2025, Yoshioka et al., 2024).
- Analog CiM (ACIM): Utilizes analog accumulation in bitlines, typically via current-mode (Ohm’s law), charge-sharing, or time-domain encoding, followed by ADC-based quantization. Provides much higher area and energy efficiency at low to moderate precisions (3–8b), but is challenged by noise and process/voltage/temperature (PVT) variation (Kim et al., 2022, Yoshioka et al., 2024, Liu et al., 2024).
- Hybrid CiM: Co-locates digital and analog compute within the same macro. Often computes MSBs in digital (for accuracy), LSBs in analog (for efficiency), or dynamically assigns precision via workload-driven heuristics (Yoshioka et al., 2024, Yi et al., 11 Feb 2025).
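The digital/analog division of labor can be made concrete with a small numerical sketch. The snippet below is a hypothetical NumPy model, not taken from any cited macro: the most-significant activation bits are accumulated exactly, the least-significant bits pass through a noisy analog path, and the two partial sums are recombined, showing why analog error on the LSB path perturbs the result only slightly.

```python
import numpy as np

def hybrid_mac(x, w, total_bits=8, msb_bits=4, analog_noise_sigma=0.5, rng=None):
    """Toy hybrid CiM MAC: the MSB slice of the activations is accumulated digitally
    (exact), the LSB slice through a noisy 'analog' path, then the two are recombined.
    Bit widths and noise level are illustrative assumptions."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=np.int64)              # unsigned activations, total_bits wide
    w = np.asarray(w, dtype=np.int64)              # signed weights
    lsb_bits = total_bits - msb_bits
    x_msb = x >> lsb_bits                          # high-significance slice
    x_lsb = x & ((1 << lsb_bits) - 1)              # low-significance slice
    digital_part = int(np.dot(x_msb, w))           # exact digital accumulation
    analog_part = float(np.dot(x_lsb, w)) + rng.normal(0.0, analog_noise_sigma)
    return (digital_part << lsb_bits) + analog_part

rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=64)
w = rng.integers(-8, 8, size=64)
exact = float(np.dot(x, w))
approx = hybrid_mac(x, w, rng=rng)
print(f"exact={exact:.1f}  hybrid={approx:.1f}  abs.err={abs(approx - exact):.2f}")
```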
Typical crossbar arrays (ReRAM, PCM, FeFET) store weights as multi-level conductances, apply input voltages, and exploit Ohm’s law for multiplication together with Kirchhoff’s current law for summation to perform in-memory matrix–vector multiplication (MVM). SRAM-based DCIM and ACIM exploit wordline/bitline activations for in-situ accumulation (Zhu et al., 1 Mar 2025, Kim et al., 2022). Two-terminal ferroelectric diode (FeD) crossbars achieve reconfigurability across TCAM, dense non-volatile storage, and analog MVM operations (Liu et al., 2022).
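A minimal behavioral model of such a crossbar MVM, written as an assumption-laden NumPy sketch rather than a circuit-accurate simulation, maps weights onto differential conductance pairs, drives rows with read voltages, sums column currents, and quantizes the result with a finite-resolution ADC:

```python
import numpy as np

def crossbar_mvm(x, W, g_max=1e-4, v_read=0.2, adc_bits=6):
    """Idealized analog crossbar MVM: weights become differential conductances,
    inputs become read voltages, column currents sum per Kirchhoff's current law,
    and an ADC quantizes the result. All device values are illustrative."""
    w_max = np.abs(W).max() or 1.0
    G_pos = np.clip(W, 0, None) / w_max * g_max     # positive-weight conductances
    G_neg = np.clip(-W, 0, None) / w_max * g_max    # negative-weight conductances
    v = x / np.abs(x).max() * v_read                # DAC: activations -> row voltages
    i_col = v @ G_pos - v @ G_neg                   # Ohm's law + column current summation
    i_full = np.abs(i_col).max() or 1.0             # ADC full-scale from observed range
    levels = 2 ** (adc_bits - 1) - 1
    i_q = np.round(i_col / i_full * levels) / levels * i_full   # ADC quantization
    # Rescale back to the numeric domain of x @ W for comparison
    return i_q * w_max * np.abs(x).max() / (g_max * v_read)

rng = np.random.default_rng(1)
x = rng.standard_normal(128)
W = rng.standard_normal((128, 16))
y_ref = x @ W
y_cim = crossbar_mvm(x, W)
print("max abs error vs. exact MVM:", np.abs(y_cim - y_ref).max())
```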
2. Architectural Implementations and Circuit Techniques
Digital SRAM-Based CiM
Advanced architectures implement bit-serial accumulation and local adder trees within the SRAM periphery, replacing the energy- and area-intensive multiply–accumulate (MAC) units of, for example, digital systolic arrays (Zhu et al., 1 Mar 2025). A CiM-enabled TPU replaces conventional 128×128 MAC arrays with a grid of CiM MXU tiles, each leveraging 6T-SRAM cells integrated with popcount trees for in-situ, bit-serial MAC (Zhu et al., 1 Mar 2025). State-of-the-art digital CiM achieves 7.3 TOPS/W and a 2× area reduction at throughput identical to classic digital arrays.
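The bit-serial, popcount-based accumulation can be illustrated with the following hedged sketch; it is a functional model of the arithmetic only (bit-plane AND followed by popcount and shift-accumulate), not a description of any specific macro’s circuitry, and it assumes unsigned operands for simplicity.

```python
import numpy as np

def bit_serial_mac(x, w, x_bits=8, w_bits=8):
    """Bit-serial MAC as used conceptually in digital SRAM CiM: stream activation
    bit-planes, AND them with stored weight bit-planes, popcount each column,
    and shift-accumulate the partial sums."""
    x = np.asarray(x, dtype=np.uint32)
    w = np.asarray(w, dtype=np.uint32)
    acc = 0
    for i in range(x_bits):                      # activation bit-plane i
        x_plane = (x >> i) & 1
        for j in range(w_bits):                  # stored weight bit-plane j
            w_plane = (w >> j) & 1
            popcount = int(np.sum(x_plane & w_plane))  # in-array AND + adder tree
            acc += popcount << (i + j)           # digital shift-accumulate
    return acc

rng = np.random.default_rng(2)
x = rng.integers(0, 256, size=32)
w = rng.integers(0, 256, size=32)
assert bit_serial_mac(x, w) == int(np.dot(x.astype(np.int64), w.astype(np.int64)))
print("bit-serial MAC matches the integer dot product exactly")
```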
Analog and Charge-Domain ACIM
Analog-domain CiM employs current, charge, or time as computation carriers. Charge-domain macros (e.g., P-8T SRAM with charge-sharing DAC/ADC) enable variation-tolerant, highly linear MAC outputs, supporting high-accuracy DNN inference with 50 TOPS/W at 0.6 V (Kim et al., 2022, Yoshioka et al., 2024). Current-mode implementations in NVMs and ACIM exploit Ohmic summation in crossbar columns, frequently yielding 300+ TOPS/W at moderate bit-depth (Wolters et al., 2024).
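Why charge-domain accumulation is comparatively linear and variation-tolerant can be seen in a toy model: each 1-bit product conditionally deposits charge on a unit capacitor, and shorting the capacitors together averages the charge, so random capacitor mismatch largely cancels. The snippet below is an illustrative assumption, not the P-8T circuit itself.

```python
import numpy as np

def charge_sharing_mac(x_bits, w_bits, c_unit=1.0, c_mismatch=0.01, rng=None):
    """Toy charge-domain MAC: each 1-bit product conditionally charges a unit
    capacitor; shorting all capacitors together averages the charge, so the
    shared voltage is proportional to the popcount. Mismatch models PVT variation."""
    rng = np.random.default_rng() if rng is None else rng
    caps = c_unit * (1.0 + rng.normal(0.0, c_mismatch, size=x_bits.shape))
    products = x_bits & w_bits                     # 1-bit products (bitline AND)
    q = np.sum(products * caps)                    # total deposited charge
    v_share = q / np.sum(caps)                     # charge sharing -> average voltage
    return v_share * x_bits.size                   # rescale to a popcount estimate

rng = np.random.default_rng(3)
x_bits = rng.integers(0, 2, size=256)
w_bits = rng.integers(0, 2, size=256)
est = charge_sharing_mac(x_bits, w_bits, rng=rng)
print("true popcount:", int(np.sum(x_bits & w_bits)), " charge-domain estimate:", round(est, 2))
```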
Floating-Point and Hybrid CiM
Recent work in floating-point CiM splits the multiplicative datapath into high-significance addends (digital) and low-significance products (analog, e.g., switched-capacitor MAC in SRAM), preserving end-to-end DNN accuracy with only 1.5× area/energy cost over pure digital (Yi et al., 11 Feb 2025). Analog-domain floating-point CiM (FP8 E2M5, RRAM-based) leverages dynamic-range-adaptive FP-ADCs to encode high-SNR MAC outputs, delivering 19.89 TFLOPS/W and 1474.56 GOPS at 8b precision (Liu et al., 2024).
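One way to see how floating-point data can reach an integer-style CiM array is shared-exponent (block) alignment: extract per-element exponents, align mantissas to the block maximum, run an integer MAC, and reapply the exponents afterwards. The sketch below illustrates this general idea under simplifying assumptions; it is not the specific datapath of the cited hybrid or RRAM macros.

```python
import numpy as np

def block_fp_dot(x, w, mant_bits=7):
    """Reduce an FP dot product to a shared-exponent integer MAC, the kind of
    operand conditioning that lets floating-point data use an integer CiM array.
    Illustrative only; real macros differ in where the analog path sits."""
    def to_block_int(v):
        exps = np.frexp(v)[1]                          # per-element binary exponents
        block_exp = int(exps.max())                    # shared (block) exponent
        scale = 2.0 ** (mant_bits - block_exp)
        mants = np.round(v * scale).astype(np.int64)   # aligned integer mantissas
        return mants, block_exp
    xm, xe = to_block_int(np.asarray(x, dtype=np.float64))
    wm, we = to_block_int(np.asarray(w, dtype=np.float64))
    acc = int(np.dot(xm, wm))                          # integer MAC (CiM-friendly)
    return acc * 2.0 ** (xe + we - 2 * mant_bits)      # reapply shared exponents

rng = np.random.default_rng(4)
x = rng.standard_normal(64).astype(np.float32)
w = rng.standard_normal(64).astype(np.float32)
print("float dot:", float(np.dot(x, w)), " block-FP CiM dot:", block_fp_dot(x, w))
```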
Multi-Functionality and Emerging Devices
Transistor-free architectures (e.g., AlScN Ferroelectric Diode crossbars) realize storage, search (TCAM), and neural operations in the same array, achieving <0.12 μm²/cell, non-volatility, and competitive (<10 fJ/MAC) energy (Liu et al., 2022). Cryogenic CiM based on QAHE elements demonstrates sub-10 fJ/bit and robust single-cycle binary logic at 4 K, promising paths toward post-CMOS, ultra-low-power compute (Alam et al., 2021).
3. System-Level Integration and Dataflow Patterns
Dataflow, Scheduling, and Integration
CiM architectures adopt output-stationary or weight-stationary dataflows wholly within memory arrays to maximize weight/data reuse and minimize external memory accesses (Zhu et al., 1 Mar 2025, Sharma et al., 2023, Yoshioka et al., 2024). Systolic or tiled grids of CiM cores implement matrix–vector and matrix–matrix operations required for transformer or CNN workloads, with standard tiling, double buffering, and pipeline scheduling (Zhu et al., 1 Mar 2025, Pelke et al., 2024, Qu et al., 2024). Application-aware mapping—guided by workload arithmetic intensity (AI) and resource bottlenecks—yields up to 3–8× acceleration and commensurate energy reduction over both naive CiM allocation and conventional compute (Sharma et al., 2023, Pelke et al., 2024, Crafton et al., 2020).
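A weight-stationary mapping can be sketched as follows: each tile of the weight matrix stays resident in one CiM array while activation tiles stream past it and partial sums accumulate in an output buffer. Tile sizes, the tiling order, and the buffer model are assumptions chosen for illustration.

```python
import numpy as np

def weight_stationary_matmul(X, W, tile_rows=128, tile_cols=128):
    """Weight-stationary mapping: each (tile_rows x tile_cols) block of W is pinned
    to one CiM array; activation tiles stream past it and partial sums accumulate
    in the output buffer."""
    M, K = X.shape
    K2, N = W.shape
    assert K == K2
    Y = np.zeros((M, N))
    for k0 in range(0, K, tile_rows):            # reduction-dimension tiles
        for n0 in range(0, N, tile_cols):        # output-column tiles
            w_tile = W[k0:k0 + tile_rows, n0:n0 + tile_cols]   # stays resident ("stationary")
            x_tile = X[:, k0:k0 + tile_rows]                   # activations stream in
            Y[:, n0:n0 + tile_cols] += x_tile @ w_tile         # in-array MVMs + accumulation
    return Y

rng = np.random.default_rng(5)
X = rng.standard_normal((4, 512))
W = rng.standard_normal((512, 384))
assert np.allclose(weight_stationary_matmul(X, W), X @ W)
print("tiled weight-stationary result matches the dense matmul")
```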
Crossbar-Scale Workload Partitioning
Fine-grained blockwise scheduling, e.g., partitioning convolutional layers into activation- and data-dependent blocks, dynamically maximizes array utilization (>92%) and throughput (>7.4× over naive) in eNVM fabrics (Crafton et al., 2020). Compiler stacks (e.g., CIM-MLC, CIMFlow) further optimize mapping by multi-level abstraction—core/crossbar/wordline decomposition—enabling flexible deployment across diverse CiM architectures and improving DNN throughput by 2–4× (Qu et al., 2024, Qi et al., 2 May 2025).
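The utilization problem these mappers address can be quantified with a simple back-of-the-envelope model: cutting a layer’s weight matrix into fixed-size crossbar blocks leaves edge blocks partly idle. The function and layer dimensions below are hypothetical.

```python
import math

def crossbar_utilization(rows_needed, cols_needed, xbar_rows=256, xbar_cols=256):
    """Cells actually used when a weight matrix is cut into fixed-size crossbar
    blocks; edge blocks leave cells idle. Real mappers also duplicate hot blocks
    and pack multiple small layers per array, which this toy model ignores."""
    n_row_blocks = math.ceil(rows_needed / xbar_rows)
    n_col_blocks = math.ceil(cols_needed / xbar_cols)
    n_xbars = n_row_blocks * n_col_blocks
    utilization = (rows_needed * cols_needed) / (n_xbars * xbar_rows * xbar_cols)
    return n_xbars, utilization

# Hypothetical 3x3 conv layer with 100 input and 120 output channels, im2col-flattened
n, util = crossbar_utilization(rows_needed=3 * 3 * 100, cols_needed=120)
print(f"crossbars needed: {n}, average cell utilization: {util:.1%}")
```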
Support for Dual-Mode and Adaptive Reconfiguration
Dual-mode CiM architectures expose arrays’ reconfigurability between compute (in-situ MAC) and scratchpad storage; compiler frameworks dynamically allocate arrays to maximize throughput or memory capacity per inference segment, closely tracking workload AI and phase (Zhao et al., 24 Feb 2025). This reconfiguration can yield 1.3× mean acceleration for large DNNs, with the benefit further modulated by the granularity of mode-switch control.
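A toy allocation policy conveys the idea: segments whose arithmetic intensity exceeds the machine balance point receive most arrays in compute mode, while memory-bound segments convert arrays into scratchpad. The threshold, segment names, and proportional rule below are illustrative assumptions, not the policy of the cited compiler.

```python
def allocate_arrays(segments, total_arrays=64, balance_ai=8.0):
    """Toy per-segment dual-mode allocator: compute-bound segments (arithmetic
    intensity above the machine balance point) get most arrays in compute mode,
    memory-bound segments get most arrays as scratchpad. All numbers are
    illustrative assumptions."""
    plan = {}
    for name, ai in segments:
        compute_share = min(ai / (ai + balance_ai), 0.9)   # more AI -> more compute arrays
        n_compute = max(1, round(total_arrays * compute_share))
        plan[name] = {"compute": n_compute, "scratchpad": total_arrays - n_compute}
    return plan

# Hypothetical inference segments with (name, arithmetic intensity) pairs
segments = [("attention_qkv", 24.0), ("kv_cache_readout", 1.5), ("ffn", 40.0)]
for seg, alloc in allocate_arrays(segments).items():
    print(f"{seg:>16}: {alloc['compute']} compute arrays, {alloc['scratchpad']} scratchpad arrays")
```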
4. Quantitative Performance, Energy, and Accuracy Metrics
Area, Power, and Throughput
Digital CiM macro area is typically 50% that of equivalent digital MAC arrays, while energy per MAC improves by 9–27× (7.26 TOPS/W vs. 0.77 TOPS/W in MXUs, TPU context) (Zhu et al., 1 Mar 2025). Analog/charge-based macros deliver 22–50 TOPS/W at 28 nm, and state-of-the-art ACIM and hybrid macros exceed 1000 TOPS/W in the binary/ternary domain (Kim et al., 2022, Yoshioka et al., 2024, Zhang et al., 2024). Floating-point analog CiM (RRAM, E2M5) achieves up to 19.89 TFLOPS/W and 4.1–5.4× higher efficiency than digital FP8 or BF16 accelerators (Liu et al., 2024).
Accuracy
Proper charge-sharing and low-resolution ADC design can approach software baseline accuracy with <1% loss (e.g., 91.46% on CIFAR-10/ResNet-20 with 4b P-8T SRAM CiM) (Kim et al., 2022). For hybrid analog/digital FP CiM, worst-case error is ≤1.6% aggregated (sub-ADD digital, sub-MUL analog), with <1.3% drop in top-line accuracy on ImageNet and SQuAD tasks (Yi et al., 11 Feb 2025). Sparsity-centric architectures deploying approximate probabilistic MACs further reduce cycles and memory accesses by 50–81% with sub-1% average accuracy degradation (Zhang et al., 2024).
DNN/LLM Acceleration
CiM MXUs in TPUs yield up to 44.2% and 33.8% throughput improvement on GPT-3 (30B) and DiT-XL/2 respectively, with an order-of-magnitude reduction in MXU power (Zhu et al., 1 Mar 2025). In transformer LLM inference, both DRAM-PIM and ReRAM-based CiM achieve 20–1000× energy and 10–100× throughput benefits vs. NVIDIA T4 GPU baselines at 4–16b precision (Wolters et al., 2024).
The table below summarizes representative metrics from state-of-the-art implementations.
| Architecture (Ref) | Device | Precision | Area Efficiency (TOPS/mm² unless noted) | Energy Efficiency (TOPS/W unless noted) |
|---|---|---|---|---|
| CiM MXU (TPU) (Zhu et al., 1 Mar 2025) | 6T-SRAM | INT8/BF16 | 1.31 (2.0× digital) | 7.26 (9.4× digital) |
| P-8T SRAM CiM (Kim et al., 2022) | 8T-SRAM | 4b×8b | 45.5 GOPS/2KB | 22.2–50.1 |
| Hybrid FP-CiM (Yi et al., 11 Feb 2025) | SRAM+Analog | FP16/FP8 | +42.8% area (over 6T) | 19.5 (1.5× digital FP) |
| AFPR-CIM (Liu et al., 2024) | RRAM | FP8/E2M5 | – | 19.89 TFLOPS/W |
| PACiM (Zhang et al., 2024) | Hybrid | 8b | – | 14.63 (8b/8b) |
5. Design Challenges, Reliability, and System-Level Co-Optimization
Nonidealities and Precision Limitations
Device variability (write noise, retention drift, IR-drop) and limited ADC precision impose SNR constraints; transformers typically require a compute SNR (CSNR) of ≥ 30 dB, while CNNs tolerate down to ≈ 15 dB (Yoshioka et al., 2024, Wolters et al., 2024). Hybrid and digital–analog split approaches mitigate these limitations by computing MSBs digitally and LSBs in the analog domain, with saliency-driven tuning of the bit boundary (Yoshioka et al., 2024).
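How CSNR depends on ADC resolution and analog noise can be estimated with a small Monte-Carlo model; the snippet below treats one MAC column as an ideal popcount corrupted by additive noise and uniform quantization. All parameters (column length, noise expressed in ADC LSBs, full-scale mapping) are illustrative assumptions, so the absolute dB values should not be read as hardware measurements.

```python
import numpy as np

def column_csnr_db(n_inputs=256, adc_bits=6, analog_noise_lsb=0.5, trials=2000, rng=None):
    """Monte-Carlo estimate of compute SNR (CSNR) for one analog MAC column:
    signal variance of the ideal popcount divided by the variance of the error
    introduced by additive analog noise plus ADC quantization."""
    rng = np.random.default_rng() if rng is None else rng
    levels = 2 ** adc_bits
    ideal = rng.binomial(n_inputs, 0.5, size=trials).astype(float)   # ideal popcounts
    lsb = n_inputs / levels                                          # full-scale mapped onto ADC range
    noisy = ideal + rng.normal(0.0, analog_noise_lsb * lsb, size=trials)
    quantized = np.clip(np.round(noisy / lsb), 0, levels - 1) * lsb
    err = quantized - ideal
    return 10 * np.log10(np.var(ideal) / np.var(err))

for bits in (4, 6, 8):
    print(f"ADC {bits}b: CSNR ≈ {column_csnr_db(adc_bits=bits, rng=np.random.default_rng(6)):.1f} dB")
```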
Reliability and Error-Correction
High-precision (FP16+) CiM is acutely vulnerable to soft errors in exponent bits: exponent bit-flips induce catastrophic model collapse, while mantissa flips are benign (Li et al., 2 Jun 2025). Algorithm-hardware co-design leveraging block-exponent alignment and lightweight block-level ECC (One4N, optimal N=8) can recover model accuracy at BER=10⁻⁶ with <9% logic and <1.5% power overhead (Li et al., 2 Jun 2025).
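The asymmetry between exponent and mantissa soft errors is easy to demonstrate numerically: flipping a low-order mantissa bit of an FP16 weight perturbs it by a fraction of a percent, whereas flipping a high exponent bit rescales it by orders of magnitude, which is the failure mode block-exponent alignment and block-level ECC are designed to catch. The helper below is a generic illustration, not code from the cited work.

```python
import numpy as np

def flip_bit(value, bit, dtype=np.float16):
    """Flip one bit in the IEEE-754 binary16 representation of a scalar.
    Bit 15 is the sign, bits 14-10 the exponent, bits 9-0 the mantissa."""
    raw = np.array([value], dtype=dtype).view(np.uint16)
    raw ^= np.uint16(1 << bit)
    return float(raw.view(dtype)[0])

w = 0.0374  # a typical small DNN weight
print("original value:      ", w)
print("mantissa bit 2 flip: ", flip_bit(w, 2))    # sub-percent perturbation
print("exponent bit 14 flip:", flip_bit(w, 14))   # value explodes by roughly 2^16
```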
Compiler, Mapping, and DNN Co-Design
Performance/energy co-optimization requires hierarchy-aware mapping and scheduling. Recent frameworks (e.g., CIM-MLC, CiMNet, CIMNAS) simultaneously optimize hardware configuration (array size, bandwidth, memory size), DNN architecture (layer types, width/depth), and quantization (per-layer bit-width) (Kundu et al., 2024, Qu et al., 2024, Krestinskaya et al., 30 Sep 2025). Evolutionary HW-NAS delivers 90–800× reductions in EDAP (energy-delay-area product) while preserving accuracy (<1% loss) (Krestinskaya et al., 30 Sep 2025).
Approximate and sparsity-centric methods, such as PACiM, further compress required memory access volume and bit-serial compute cycles by 50–81% while maintaining near-baseline accuracy (Zhang et al., 2024).
6. Application Domains and Future Directions
Deep Neural and Generative Model Acceleration
CiM architectures are now evaluated as first-class AI accelerators for transformer-based LLM inference, diffusion models, and large-scale vision networks (Zhu et al., 1 Mar 2025, Wolters et al., 2024). The intrinsic matrix-centric execution of transformers (attention, FFNs) maps efficiently to both digital and analog CiM macros, while peripheral nonlinear ops (softmax, LayerNorm) are handled digitally or in hybrid analog/digital periphery.
Advanced Devices, 3D Integration, and Cryogenic Compute
Emerging research in ferroelectric diodes, QAHE-based cells, and MRAM offers prospects for lowering area even further, supporting parallel search and non-volatile storage, and extending CiM applications to new domains (e.g., cryogenic or quantum controllers) (Liu et al., 2022, Alam et al., 2021). Hybrid 3D stacking (e.g., DRAM+CiM) and on-chip training support, as well as adaptive-mixed-precision and structured error-correction, remain open directions (Zhu et al., 1 Mar 2025, Yoshioka et al., 2024, Li et al., 2 Jun 2025). Compiler frameworks capable of runtime mode adaptation (compute/memory) unlock utilization during dynamic and heterogeneous workloads (Zhao et al., 24 Feb 2025, Qu et al., 2024).
7. Summary and Prospects
Compute-in-Memory architectures—spanning digital, analog, and hybrid implementations across SRAM, NVM, and emerging devices—achieve substantial improvements in area and energy efficiency, directly address the von Neumann bottleneck, and present strong performance–cost trade-offs for accelerating state-of-the-art deep learning workloads. Optimization of array structure, dataflow, precision adaptation, reliability techniques, and system-level co-design is essential for full realization of CiM’s potential. Robust compilation and joint hardware–software optimization frameworks will remain central to deploying CiM at scale across next-generation AI applications (Zhu et al., 1 Mar 2025, Yoshioka et al., 2024, Krestinskaya et al., 30 Sep 2025).