Compute-in-Memory (CiM) Platforms
- Compute-in-Memory (CiM) platforms are architectures that merge storage and computation in the same substrate, reducing data movement and energy consumption.
- They leverage diverse device technologies like SRAM, DRAM, FeFET, RRAM, and PCM to enable both digital and analog operations for high-throughput, energy-efficient processing.
- Algorithm–hardware co-design in CiM systems enables native vector–matrix computations, optimizing performance in AI, combinatorial optimization, and scientific computing tasks.
Compute-in-memory (CiM) platforms perform arithmetic and logic operations directly within memory arrays—eliminating or substantially reducing data movement between storage and processing units typical of von Neumann architectures. By collapsing computation and storage into the same physical substrate, CiM systems aim to break the "memory wall," enabling high-throughput and energy-efficient acceleration of data-intensive workloads in artificial intelligence, combinatorial optimization, scientific computing, and edge inference (Qian et al., 19 Dec 2025, Thakuria et al., 2024, Zhao et al., 24 Feb 2025, Gao et al., 2019, Yoshioka et al., 2024).
1. Hardware Design Primitives and Device Technologies
Modern CiM platforms utilize a diverse range of device technologies and circuit organizations to implement in-memory computation:
- SRAM- and DRAM-Based Platforms: Digital CiM employs modified 6T/8T SRAM cells or 1FeFET-1C DRAM structures, with in-cell logic or peripheral gating to enable Boolean and arithmetic operations (Yoshioka et al., 2024, Yin et al., 2024). Charge- or current-domain summation enables analog CiM in SRAM structures.
- Nonvolatile Memories and Emerging NVMs: FeFETs, RRAM, PCM, MRAM, and field-programmable ferroelectric diodes (FeDs) are used to realize analog dot-product via conductance-modulated current flow, or selector-free multi-mode arrays for search/memory/MAC (Qian et al., 19 Dec 2025, Zhou et al., 2 Jan 2025, Liu et al., 2022).
- Integration Hierarchy: CiM arrays are hierarchically organized as cores (chip-level), arrays (core-level), and crossbars (array-level). Peripheral circuits include custom DAC/ADC pipelines, sense amplifiers, full-adders, and control logic for compute-memory dual-mode support (Zhao et al., 24 Feb 2025, Qu et al., 2024).
- Technology-Specific Flows: CMOS-compatible BEOL integration (e.g., metal–ferroelectric–metal FeFET stacks), selector-free diode crossbars, high-endurance ferroelectric/oxide stacks, and custom annealing steps for reliability (Qian et al., 19 Dec 2025, Liu et al., 2022).
Device physics—such as threshold voltage programming in FeFETs, quantized conductance in RRAM/PCM, and topologically protected states in quantum anomalous Hall effect (QAHE) devices—directly determines operational precision, nonvolatility, retention, and scalability (Alam et al., 2021, Yin et al., 2024).
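These device-level constraints can be made concrete with a toy model: mapping a weight matrix onto a small set of programmable conductance states and perturbing each cell with programming variation. A minimal numpy sketch, in which the level count, conductance range, and 3% noise figure are illustrative assumptions rather than parameters from any cited device:

```python
import numpy as np

def weights_to_conductance(w, n_levels=16, g_min=1e-6, g_max=1e-4,
                           sigma=0.03, rng=None):
    """Quantize real-valued weights onto a finite set of conductance
    levels and add multiplicative programming noise. Level count and
    noise magnitude are illustrative, not measured device data."""
    rng = np.random.default_rng() if rng is None else rng
    w_norm = (w - w.min()) / (w.max() - w.min())           # rescale to [0, 1]
    levels = np.linspace(g_min, g_max, n_levels)           # programmable states (S)
    idx = np.round(w_norm * (n_levels - 1)).astype(int)    # nearest level index
    return levels[idx] * rng.normal(1.0, sigma, w.shape)   # device-to-device spread

w = np.random.default_rng(0).normal(size=(64, 64))
G = weights_to_conductance(w)       # conductance matrix for a 64x64 crossbar
```

With more levels the quantization error falls but the noise margin between adjacent states shrinks, which is the precision/reliability tension referenced above.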
2. Compute Primitives and Algorithmic Mapping
CiM arrays natively support a range of vector–matrix and matrix–matrix primitives essential for AI, optimization, and scientific applications:
- Vector–Matrix Multiplication (VMM): Analog summation in crossbar arrays exploits Ohm’s law for per-cell multiplication and Kirchhoff’s current law for bitline accumulation, yielding high-throughput multiply–accumulate (MAC) operations (Qian et al., 19 Dec 2025, Chauvaux et al., 2024); see the sketch after this list.
- Vector–Matrix–Vector (VMV) and Quadratic Forms: Simultaneous input application on dual wordlines/source-lines allows dot-product evaluation of quadratic energy functions, central to Ising models and combinatorial solvers.
- Associative Search and CAM: Modal voltage or charge protocols interrogate arrays for exact or nearest-neighbor matches, enabling symbolic AI and high-dimensional search (Yin et al., 2024, Liu et al., 2022).
- Flexible Precision and Ternary/Binary Operations: Signed ternary (–1, 0, +1) MACs in SiTe-CiM, bit-serial/parallel processing in digital CiM, and adaptive precision for deep networks (Thakuria et al., 2024, Chauvaux et al., 2024).
- Hybrid Analog–Digital Processing: Splitting high-order and low-order bits across DCIM/ACIM, or offloading LSB computation to approximate/analog domains for energy/performance trade-offs (Yoshioka et al., 2024, Zhang et al., 2024).
In-memory implementations reduce effective computational complexity: e.g., from O(N³) to O(N²) via native analog VMV for attention-inspired initialization in Ising machines (Qian et al., 19 Dec 2025). Modal and quantization-aware co-designs facilitate ternary/binary DNNs, probabilistic and approximate MACs, and direct mapping of LLM or SNN workloads.
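As a concrete illustration of the VMM and VMV primitives above, the following idealized numpy sketch (ignoring IR-drop, device noise, and ADC quantization) shows the linear algebra a crossbar evaluates in a single array access; the 8-spin Ising instance is a made-up toy:

```python
import numpy as np

def crossbar_vmm(G, v):
    """Idealized analog VMM: cell (i, j) contributes I = G[i, j] * v[i]
    (Ohm's law); each bitline sums its column's currents (Kirchhoff)."""
    return v @ G                      # one MAC result per bitline

def crossbar_vmv(J, s):
    """Quadratic form s^T J s, obtained by driving wordlines and
    source-lines simultaneously -- the core of in-memory Ising energy
    evaluation and attention-inspired initialization."""
    return s @ J @ s

rng = np.random.default_rng(0)
J = rng.normal(size=(8, 8)); J = (J + J.T) / 2   # symmetric spin couplings (toy)
s = rng.choice([-1.0, 1.0], size=8)              # candidate spin configuration
print(crossbar_vmm(J, s))                        # local fields, one per column
print(-0.5 * crossbar_vmv(J, s))                 # Ising energy of s
```

In hardware, both calls cost one array access each regardless of N, which is where the O(N³) to O(N²) reduction cited above comes from.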
3. Algorithm–Hardware Co-Design and Workflow Integration
Efficient CiM execution requires tight algorithm–hardware co-design:
- Attention-Inspired Initialization and Lightweight Solvers: For NP-hard optimization, topology-aware seeding dramatically reduces iterative cost (up to 80% iteration reduction in Ising MaxCut) (Qian et al., 19 Dec 2025).
- Quantization-Aware Training and Scale-Factor Fusion: ADC-less analog CiM achieves robust accuracy by training DNNs end-to-end with scale-factor co-quantization, managed by digital in-memory adder/subtractors (Negi et al., 2024).
- Probabilistic and Approximate Computation: PACiM minimizes energy and bit-serial cycles by encoding MAC results as stochastic or sparsity-reduced approximations with sub-1% RMSE at scale (Zhang et al., 2024); a toy sampling-based analogue appears after this list.
- Monte-Carlo and Bayesian Inference: MC-CIM embeds dropout and random mask generation directly in the memory array, enabling uncertainty-aware edge inference with <1 pJ/sample energy (Shukla et al., 2021).
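The probabilistic-MAC idea can be mimicked numerically; this is a sampling-based stand-in, not PACiM's actual bit-serial encoding. Estimate a long binary dot product from a random subset of positions and rescale, trading read cycles for a small, controllable error:

```python
import numpy as np

def sampled_mac(x_bits, w_bits, n_samples, rng):
    """Unbiased Monte-Carlo estimate of sum(x_bits * w_bits) from a
    random subset of bit positions (toy analogue of approximate MAC)."""
    n = x_bits.size
    idx = rng.integers(0, n, size=n_samples)
    return n * np.mean(x_bits[idx] * w_bits[idx])

rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=4096)
w = rng.integers(0, 2, size=4096)
exact = int(np.sum(x * w))
est = sampled_mac(x, w, n_samples=512, rng=rng)   # ~8x fewer reads than exact
print(exact, round(est))                          # error shrinks as samples grow
```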
Compiler infrastructures (e.g., CMSwitch, CIM-MLC, CINM-Cinnamon) enable mapping of DNN computational graphs onto heterogeneous CiM arrays, dynamically optimizing mode-switching (compute/memory), resource allocation (core, array, crossbar), and pipeline depth (Zhao et al., 24 Feb 2025, Qu et al., 2024, Khan et al., 2022). Fine-grained cross-layer scheduling (CLSA-CIM) and weight-duplication deliver up to 29× inference speedup and up to 17.9× PE utilization in tiled RRAM accelerators (Pelke et al., 2024).
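A back-of-envelope view of the mapping problem these compilers solve: tiling an im2col-unrolled convolution onto fixed-size crossbars. The layer shapes and 256×256 array size below are hypothetical, chosen only to show why early layers leave tiles idle unless their weights are duplicated:

```python
import math

def conv_tiles(in_ch, out_ch, k, xbar_rows, xbar_cols):
    """Number of crossbar tiles needed to hold one conv layer when the
    unrolled kernel (k*k*in_ch rows by out_ch columns) is partitioned
    onto xbar_rows x xbar_cols arrays (im2col-style mapping)."""
    rows = k * k * in_ch
    return math.ceil(rows / xbar_rows) * math.ceil(out_ch / xbar_cols)

# Early conv layers have few weights but many activations, so compilers
# duplicate their weights across spare tiles to balance pipeline stages.
tiles_l1 = conv_tiles(in_ch=3,   out_ch=64,  k=3, xbar_rows=256, xbar_cols=256)
tiles_l5 = conv_tiles(in_ch=256, out_ch=256, k=3, xbar_rows=256, xbar_cols=256)
print(tiles_l1, tiles_l5)   # 1 tile vs. 9 tiles for this toy example
```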
4. Performance, Energy Efficiency, and Comparative Metrics
Extensive benchmarking across CiM platforms demonstrates substantial energy and latency improvements over von Neumann and near-memory alternatives:
| Platform/Tech | Precision | Throughput | Energy Efficiency | Notable Results / Benchmarks |
|---|---|---|---|---|
| BEOL-FeFET Ising (Qian et al., 19 Dec 2025) | 1b (FeFET) | ~10³ MAC/cyc | Up to 175.9× vs. GPU SB | 100 k-node Max-Cut, superior solution quality |
| SiTe-CiM (Thakuria et al., 2024) | Ternary (–1,0,+1) | 4096 MAC/cyc | 2.5×–7× vs. baseline | +18% array area, 6.74× system throughput |
| TReCiM (Zhou et al., 2 Jan 2025) | 1b,2b (FeFET MLC) | – | Up to 48 TOPS/W | 91.3% VGG-8 accuracy, 0–85 °C resilience |
| PACiM (Zhang et al., 2024) | 8b (hybrid/PAC) | – | 14.6 TOPS/W (8b/8b) | Bit-serial cycles cut 81%, ≤1% acc. drop |
| FlexSpIM (Chauvaux et al., 2024) | up to 512b (digital) | 2.5 GSOP/s | Up to 90% system savings | DVS Gesture 95.8% accuracy, 2× bit-norm. EE over SOTA DCIM |
| AFPR-CIM (Liu et al., 2024) | FP8 (E₂M₅) | 1474.56 GFLOPS | 19.89 TFLOPS/W | 4.1×–5.4× more efficient than digital FP-CiM |
| Voxel-CIM (Lin et al., 2024) | 8–12b (SRAM-CiM) | 27.8 TOPS | 10.8 TOPS/W | Up to 8.1× segmentation speedup over 2080 Ti |
| CLSA-CIM (Pelke et al., 2024) | 4b,8b (RRAM-CiM) | – | – | Speedup 29.2×, Utilization 17.9× baseline |
The majority of energy savings (>90%) come from internalizing memory access—either through native in-array computation or pipeline fusion—with further gains from analog-domain summation and nonvolatility in FeFET, RRAM, or FeD arrays (Qian et al., 19 Dec 2025, Thakuria et al., 2024, Gao et al., 2019). Bit-normalized efficiency and raw TOPS/W figures exceed digital SRAM baselines by 10–40× at equivalent or slightly reduced area (Chauvaux et al., 2024, Zhang et al., 2024).
Key performance trade-offs include quantization (bitwidth), area/energy overhead of ADCs and cross-coupling, row/column parallelism, array size scalability, and sensitivity to process variation or temperature (Zhou et al., 2 Jan 2025, Negi et al., 2024, Qian et al., 19 Dec 2025).
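Because reported TOPS/W figures mix operand widths, such comparisons are often bit-normalized. One common convention counts a b_in × b_w MAC as b_in · b_w binary operations; conventions differ across papers, so the helper below only makes one choice explicit (the 8-bit figure is PACiM's from the table; the 1-bit macro figure is a made-up reference point):

```python
def bit_normalized_tops_w(tops_w, in_bits, w_bits):
    """Convert raw TOPS/W into 1b-equivalent TOPS/W by counting a
    b_in x b_w MAC as b_in * b_w binary operations (one common
    normalization convention; not universal across papers)."""
    return tops_w * in_bits * w_bits

print(bit_normalized_tops_w(14.6, 8, 8))    # 8b x 8b macro -> 934.4 1b-TOPS/W
print(bit_normalized_tops_w(400.0, 1, 1))   # hypothetical 1b macro -> 400.0
```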
5. Applications and Implementation Domains
CiM platforms enable acceleration for workloads that combine high parallelism, locality, and data reuse:
- Large-Scale Combinatorial Optimization: Ising machines based on FeFET crossbars achieve fast convergence and speedup in Max-Cut and QUBO problems via co-designed Light-SB and attention initialization (Qian et al., 19 Dec 2025).
- Deep Neural Networks (DNNs): Ternary, binary, and sparse-quantized DNN inference/training mapped to SiTe-CiM, PACiM, and HCiM realize fast, low-power AI for edge and cloud (Thakuria et al., 2024, Zhang et al., 2024, Negi et al., 2024).
- Spiking Neural Networks (SNNs): Digital in-memory macros supporting arbitrary operand width and hybrid stationarity enable low-jitter, reconfigurable SNN inference (Chauvaux et al., 2024).
- Graph Analytics and Scientific Computing: Native vector–matrix and quadratic-form mapping in FeFET/RRAM crossbars accelerate PageRank, linear solvers, and graph power methods (Qian et al., 19 Dec 2025).
- High-Dimensional Computing: Charge- and current-domain CiM arrays (FeFET-1C, FeDs) enable ultra-parallel associative matching, symbolic search, and neuro-symbolic AI (Yin et al., 2024, Liu et al., 2022); see the associative-match sketch after this list.
- Point Cloud and 3D Perception: Specialized architectures (Voxel-CIM, PC2IM) exploit in-situ map search, sparse convolutions, and feature aggregation for real-time 3D AI (Lin et al., 2024, Wang et al., 22 Mar 2026).
- Cryogenic/Quantum Computing: QAHE-based CryoCiM enables sub-fJ, single-cycle logic for read/write/NAND/NOR/XOR at 4 K for quantum control and SFQ circuits (Alam et al., 2021).
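For illustration of the associative-search primitive (array sizes and noise level here are arbitrary): a CAM-style CiM array returns, in one parallel access, the stored row nearest to a query. Simulated serially below, the match survives substantial bit noise, which is what makes hyperdimensional representations attractive for in-memory search:

```python
import numpy as np

def associative_match(cam, query):
    """Nearest-neighbor search over stored binary hypervectors by
    Hamming distance -- the operation a CAM/CiM array evaluates in
    parallel across all rows (simulated serially here)."""
    dists = np.count_nonzero(cam != query, axis=1)   # per-row Hamming distance
    return int(np.argmin(dists)), int(dists.min())

rng = np.random.default_rng(2)
cam = rng.integers(0, 2, size=(1024, 10000), dtype=np.int8)  # stored vectors
query = cam[42].copy()
flip = rng.choice(10000, size=500, replace=False)
query[flip] ^= 1                                     # noisy query, 5% bits flipped
print(associative_match(cam, query))                 # -> (42, 500)
```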
Programmability advances include RISC-V ISA extensions (CIM_CONV, CIM_READ/WRITE), end-to-end full-stack flows, and compilers supporting multi-modal mapping, code generation, resource allocation, and crossbar-pipelined scheduling (and et al., 28 Mar 2025, Qu et al., 2024, Khan et al., 2022).
6. Challenges, Open Problems, and Future Directions
Despite substantial progress, CiM research faces several key challenges:
- Precision vs. Efficiency Trade-off: Scaling analog CiM (ACIM) beyond roughly 10-bit precision remains difficult due to noise, nonlinearity, and ADC/DAC energy (Yoshioka et al., 2024). Hybrid and dynamic partitioning (e.g., saliency-aware OSA-HCIM) is a promising avenue.
- Scalability and Integration: Larger arrays encounter line resistance, IR-drop, and increased variability; solutions include time-multiplexing, multi-bank partitioning, and hierarchical analog/digital integration (Qian et al., 19 Dec 2025, Liu et al., 2024). A toy IR-drop model after this list illustrates how the error scales with array size.
- Device Reliability and Process Variation: Robust architectures (e.g., 2FeFET-1T clamp in subthreshold region, charge-domain 1FeFET-1C, FeDs with field-programmable gating) are developed to increase margin and endurance (Zhou et al., 2 Jan 2025, Yin et al., 2024, Liu et al., 2022).
- ADC/DAC Overhead: Eliminating or minimizing ADCs (e.g., HCiM binary/ternary PSQ; charge-based self-digitization) is central to scaling throughput and reducing overhead (Negi et al., 2024).
- Software and Compiler Ecosystem: As architectures diversify, frameworks for hardware abstraction, tiered compilation, mapping, and end-to-end code generation (CMSwitch, CIM-MLC, CINM) are essential for support of complex models and heterogeneous fabrics (Zhao et al., 24 Feb 2025, Qu et al., 2024, Khan et al., 2022).
- System-Level Co-Design: End-to-end scheduling, pipeline fusion (CIM layer/weight fusion), and cross-layer tile scheduling (CLSA-CIM) are needed to maximize utilization and minimize idle energy (and et al., 28 Mar 2025, Pelke et al., 2024).
- Cryogenic and Specialty Computing: Extensions into QAHE, topologically protected, or neuromorphic regimes present novel opportunities but require new device/process integration and logic (Alam et al., 2021).
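The array-size limit from line resistance can be illustrated with a first-order IR-drop model: a toy fixed-point calculation in arbitrary units, not a SPICE-accurate simulation, with invented wire resistance and conductance ranges. The relative MAC error grows with array dimension, motivating the partitioning strategies above:

```python
import numpy as np

def vmm_ideal(G, v):
    """Ideal analog VMM: per-column bitline currents, no parasitics."""
    return v @ G

def vmm_with_ir_drop(G, v, r_wire=0.002, iters=25):
    """First-order wordline IR-drop model (toy, not SPICE-accurate):
    cell (i, j) sees v[i] minus the drop across the wire segments
    before it, each segment carrying the current of all downstream
    cells on the row. Solved by simple fixed-point iteration."""
    m, n = G.shape
    v_eff = np.broadcast_to(v[:, None], (m, n)).copy()
    for _ in range(iters):
        I = G * v_eff                                         # per-cell currents
        seg = np.cumsum(I[:, ::-1], axis=1)[:, ::-1]          # current per segment
        v_eff = v[:, None] - r_wire * np.cumsum(seg, axis=1)  # accumulated drop
    return I.sum(axis=0)                                      # degraded currents

rng = np.random.default_rng(3)
for n in (64, 256, 1024):                       # error grows with array size
    G = rng.uniform(1e-5, 1e-4, size=(n, n))    # conductances (arbitrary units)
    v = rng.uniform(0.0, 1.0, size=n)           # input voltages
    rel = np.abs(vmm_with_ir_drop(G, v) - vmm_ideal(G, v)) / vmm_ideal(G, v)
    print(n, f"{rel.mean():.4f}")               # mean relative MAC error
```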
A plausible implication is that future advances will tightly integrate device/circuit/algorithm/compiler co-design, emphasizing hybrid analog-digital dataflow, adaptive precision, variation-aware operation, and large-scale, reconfigurable in-memory compute architectures.
7. Comparative Analysis and Broader Context
Relative to traditional and near-memory computing:
- SRAM/FeFET/DRAM CiM: Shrinks data-movement and logic overhead, with nonvolatility enabling persistent storage, especially in FeFET- and FeD-based platforms (Qian et al., 19 Dec 2025, Yin et al., 2024, Liu et al., 2022).
- RRAM/PCM: Offers dense analog MAC but is limited by conductance variation and endurance (Lin et al., 2024, Liu et al., 2024).
- Programming Models: Multi-tiered compilation, dual-mode execution, and explicit ISA extensions are increasingly common and required for efficient, scalable use (Qu et al., 2024, Khan et al., 2022).
- Applicability: Highest advantage is for regular, data-parallel, locality-rich workloads—matrix-based AI, optimization, search, and signal processing (Qian et al., 19 Dec 2025, Thakuria et al., 2024, Yoshioka et al., 2024).
Continued improvements in cell design, crossbar/core architecture, algorithmic mapping, and system-level software are anticipated to further increase practical deployment of CiM in both edge and large-scale AI workloads.