SRAM-Based CiM: In-Memory Computation
- SRAM-based CiM is a computing paradigm that integrates arithmetic operations directly into high-density SRAM arrays, minimizing data movement bottlenecks.
- It employs both analog and digital multiply-accumulate techniques—such as charge redistribution and sense-amplifier logic—to enable efficient in-situ processing.
- Research spans from variation-resilient BNN analog engines to hybrid digital-analog accelerators, demonstrating high energy efficiency and throughput in diverse applications.
Static random-access memory (SRAM)-based computation-in-memory (CiM) is a paradigm that merges digital or mixed-signal processing directly with embedded high-density memory arrays, most commonly 6T SRAM, to minimize data-movement bottlenecks in deep learning, privacy computing, database processing, and edge inference. In this approach, the bitwise multiply-accumulate (MAC) operations that dominate neural, signal-processing, and cryptographic workloads are mapped directly to in-situ arithmetic in SRAM arrays via analog signal summation, logic-in-sense-amplifier, charge redistribution, or time-domain conversion. Published SRAM-based CiM architectures span highly parallel analog MAC in the array, ultra-efficient binarized neural networks, fully programmable digital MAC dataflows, and hybrid semi-digital pipelines optimized for data movement. The field is anchored by innovations ranging from variation-resilient analog MAC to converged digital–analog hybrids, open-source co-design tooling, and large-scale wafer-level integration, with reported energy efficiencies of several thousand TOPS/W for selected digital accelerators.
1. Operating Principles and SRAM CiM Cell Architectures
The foundational principle of SRAM-based CiM is the direct execution of dot-product or bitwise logic operations inside high-density SRAM arrays, using either digital or analog mechanisms (Yoshioka et al., 2024). The two dominant styles are:
- Analog CIM (ACIM): Transforms stored bitline charges, cell currents, or row-shared capacitive domains into analog MAC accumulation. Examples include current-mode (summed cell currents per bitline), time-domain (delay/oscillator-based pulse summing), and charge-domain (charge redistribution on MOM caps or bitline parasitics) schemes (Yin et al., 2022). Analog MAC enables large vector parallelism but is sensitive to process, voltage, and temperature (PVT) variation and to device mismatch.
- Digital CIM (DCIM): Implements bitwise AND/XOR inside the memory via simultaneous wordline activation plus digital sense-amplifier logic, optionally including local adder trees. The resulting partial sums are combined digitally either within or near the array (Yoshioka et al., 2024). DCIM offers high precision and error-free operation, but at higher energy per operation.
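The digital style can be illustrated with a minimal bit-serial sketch. It assumes unsigned operands and models the in-array logic as a per-row AND followed by a popcount (standing in for the adder tree) and a shift-add. The function name `dcim_dot` and the 4-bit defaults are illustrative assumptions, not taken from any cited macro.

```python
def dcim_dot(inputs, weights, in_bits=4, w_bits=4):
    """Bit-serial unsigned dot product using only AND, popcount, and shift-add."""
    total = 0
    for i in range(in_bits):        # one input bit-plane per cycle (wordline activation)
        for j in range(w_bits):     # one stored weight bit-plane per column group
            # AND the input bit with the stored weight bit in every row, then count ones
            ones = sum(((x >> i) & 1) & ((w >> j) & 1)
                       for x, w in zip(inputs, weights))
            total += ones << (i + j)  # weight the popcount by its binary significance
    return total


inputs = [3, 7, 1, 15]
weights = [2, 5, 9, 4]
# The bit-serial result matches the ordinary integer dot product exactly.
assert dcim_dot(inputs, weights) == sum(x * w for x, w in zip(inputs, weights))
```

Because every step is a Boolean operation or an integer add, the result is exact, which is the property the DCIM bullet refers to as error-free operation.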
Hybrid architectures split the MAC precision, performing high-order bits digitally (for accuracy) and low-order bits via compact analog accumulators (for energy efficiency), often with dynamic per-operation partitioning (Konno et al., 25 Aug 2025, Chen et al., 2023).
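A rough sketch of the precision split follows. It assumes a static split point on the weight bits and models the analog path as the exact low-order sum plus Gaussian noise, rounded by an ideal converter. The cited designs may partition dynamically per operation, which this sketch does not attempt; `hybrid_mac`, `split`, and `sigma` are hypothetical names.

```python
import random


def hybrid_mac(inputs, weights, split=4, sigma=0.5, seed=0):
    """Dot product with high-order weight bits exact and low-order bits noisy."""
    rng = random.Random(seed)
    mask = (1 << split) - 1
    # High-order weight bits: exact digital accumulation
    hi = sum(x * (w >> split) for x, w in zip(inputs, weights))
    # Low-order weight bits: analog accumulation, modeled as exact sum plus noise
    lo_exact = sum(x * (w & mask) for x, w in zip(inputs, weights))
    lo_analog = round(lo_exact + rng.gauss(0.0, sigma))
    # Recombine: the digital part is shifted back to its original significance
    return (hi << split) + lo_analog


x = [3, 7, 1, 15]
w = [200, 17, 129, 64]          # 8-bit weights
exact = sum(a * b for a, b in zip(x, w))
error = hybrid_mac(x, w) - exact  # confined to the low-order (analog) contribution
```

The design point this illustrates is that analog error only enters through the low-significance term, so its magnitude does not scale with the high-order bits.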
The canonical SRAM CiM cell is the 6T bitcell. Some designs extend it to 8T, 9T, or 10T cells that provide read isolation, extra compute ports, or the integrated charge domains required for analog charge injection or multi-port access (Kim et al., 2022, Wang et al., 2023, Chen et al., 2022).
2. Array-Level Dataflow and Peripheral Integration
In a typical SRAM CiM macro, the memory array is organized as a two-dimensional crossbar, with weights stored in cell states and inputs mapped to wordline drivers (Le et al., 2021, Kuo et al., 2022). Energy- and throughput-optimized dataflows usually follow a weight-stationary strategy—keeping weights resident and streaming activations from input buffers. Feature maps and intermediate sums are staged in adjacent SRAMs with ping-pong banks or pipelined update mechanisms (Kuo et al., 2022).
Peripheral circuits are heavily co-optimized:
- Sense Amplifiers (SAs)/Current Sense: For analog CIM, SAs sense the net current difference across column bitlines, directly binarizing outputs (for BNNs) or digitizing multi-bit accumulated analog signals.
- Analog-to-Digital Converters (ADCs): Multi-bit analog MACs are converted into digital partial sums. Optimization of ADC resolution and in-situ reference generation (e.g., utilizing dedicated charge-sharing columns) directly affects area, linearity, and achieved TOPS/W (Kim et al., 2022, Wang et al., 2023).
- Pooling/Activation Writeback: Some architectures implement convolution and pooling fusions, activation clamping (ReLU), or dynamic quantization directly at the sense/ADC periphery for performance and latency minimization (Kuo et al., 2022, Yin et al., 2022).
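The effect of ADC resolution on a column partial sum can be sketched with an ideal uniform converter. This is a simplified model under stated assumptions: a full-scale range equal to the maximum column sum, no nonlinearity, and hypothetical helper names `adc_quantize` and `adc_reconstruct`.

```python
def adc_quantize(value, full_scale, bits):
    """Ideal uniform ADC: map a value in [0, full_scale] to an integer code."""
    levels = (1 << bits) - 1
    code = round(value / full_scale * levels)
    return max(0, min(levels, code))   # clip out-of-range inputs


def adc_reconstruct(code, full_scale, bits):
    """Map an ADC code back to the partial-sum domain."""
    return code * full_scale / ((1 << bits) - 1)


# A 64-row column of 1-bit products has partial sums in 0..64.
rows, bits = 64, 4
half_lsb = rows / ((1 << bits) - 1) / 2
worst = max(abs(adc_reconstruct(adc_quantize(v, rows, bits), rows, bits) - v)
            for v in range(rows + 1))
# The worst-case quantization error stays within half an LSB.
assert worst <= half_lsb + 1e-9
```

With 4 bits over 64 rows, one LSB spans more than four partial-sum counts, which is why ADC resolution trades directly against accuracy, area, and energy.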
Table: Example array and peripheral configurations
| Macro Type | Cell Array | Core Peripherals | Noted Features |
|---|---|---|---|
| PR-CIM (Le et al., 2021) | 6T, M×N crossbar | CSA+Bias, Input splitting | Variation-aware binarized MAC |
| PSCNN (Kuo et al., 2022) | 10T, 1M cells | SAs, Ping-pong SRAM, Pooling | Single large macro, 1D binary |
| P-8T CD-CiM (Kim et al., 2022) | 8T/16×5 banks | Charge-shared DAC, 4b ADC | All-in-SRAM DAC+ADC |
| Hybrid D/A (Konno et al., 25 Aug 2025) | 6T w/ 2D MOM caps | Digital popcount, SAR-ADC | Complex MAC, real/imag split |
3. Process Variation, Precision, and Robust Training
A central technical problem in analog SRAM-CiM is process variation, which affects cell current distributions and thus MAC accuracy. The PR-CIM framework for BNNs (Le et al., 2021) provides a concrete model: under 65 nm process variation, SRAM-based analog CiM can see BNN classification accuracies collapse from near-software baseline (~90%) to <20%. PR-CIM addresses this by:
- Statistical modeling of cell variation (cell currents as log-normal distributions).
- Injecting Monte Carlo (MC)-derived variation directly into forward propagation during BNN training, using stochastic weight re-polarization and activation binarizer perturbation.
- Hardware-aware mapping and calibration, including peripheral voltage optimization (VWL, VBL) to maximize single-cell current margin IM while avoiding read disturb.
This closed-loop variation-aware training recovers nearly all baseline accuracy (88.7% → 87.76% for VGG-9 under worst-case variation). Similar variation-resilient or hybrid training schemes have been generalized to multi-bit charge-domain and hybrid digital/analog macros (Chen et al., 2024, Chen et al., 2023), often informed by circuit simulations (Monte Carlo, SNM distributions).
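The idea of injecting variation into the forward pass can be sketched as follows. This is a simplified stand-in, not the PR-CIM procedure: each cell current is drawn from a log-normal distribution with median 1 (an assumption), the sense amplifier compares the summed currents on two bitlines, and `sigma` is a hypothetical variation parameter.

```python
import random


def noisy_binary_mac(x, w, sigma, rng):
    """Binarized MAC where each cell's current is a log-normal sample."""
    pos = neg = 0.0
    for xi, wi in zip(x, w):                     # xi, wi in {-1, +1}
        i_cell = rng.lognormvariate(0.0, sigma)  # median-1 cell current (assumption)
        if xi * wi > 0:
            pos += i_cell                        # matching cells pull one bitline
        else:
            neg += i_cell                        # mismatching cells pull the other
    return 1 if pos >= neg else -1               # sense amplifier binarizes the difference


def flip_rate(n=64, sigma=0.3, trials=1000, seed=0):
    """Fraction of random MACs whose binarized output differs from the ideal sign."""
    rng = random.Random(seed)
    flips = 0
    for _ in range(trials):
        x = [rng.choice((-1, 1)) for _ in range(n)]
        w = [rng.choice((-1, 1)) for _ in range(n)]
        ideal = 1 if sum(a * b for a, b in zip(x, w)) >= 0 else -1
        flips += noisy_binary_mac(x, w, sigma, rng) != ideal
    return flips / trials
```

In a variation-aware training loop, a function like `noisy_binary_mac` would replace the ideal binarized MAC in the forward pass so that the learned weights tolerate the sampled current spread; `flip_rate` gives a quick estimate of how often a given `sigma` corrupts the output sign.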
4. System Mapping, Scheduling, and Compiler Automation
The proliferation of SRAM-CiM architectural primitives (macro size, precision, storage:compute ratios, buffer sizes, dataflows) necessitates automated system-level mapping. CIM-Tuner (Chen et al., 26 Jan 2026) models the hardware–mapping co-exploration as a Pareto optimization over:
- Macro counts (MR, MC)
- Storage-Compute Ratio (SCR)
- Buffer sizes (IS_SIZE, OS_SIZE)
- Per-layer scheduling: weight-stationary (NR) or input-stationary (R), and macro-level tiling (accumulate-first vs. parallel-first)
This tool abstracts all macro types (digital, analog, hybrid) into accumulation-length, parallelism, and bandwidth parameters and uses simulated annealing plus cycle-accurate simulators to maximize throughput and TOPS/W under area and bandwidth constraints. It achieves up to 2.11× throughput and 1.58× energy efficiency versus prior mapping solutions under fixed area budgets. The methodology is validated with silicon measurement on 28 nm hardware (Chen et al., 26 Jan 2026).
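A toy version of such a search is sketched below. Everything numeric here is a placeholder: the `area` and `throughput` functions are invented analytic stand-ins for the cycle-accurate simulator, the parameter ranges and `AREA_BUDGET` are arbitrary, and only the parameter names (MR, MC, SCR, IS_SIZE, OS_SIZE) come from the text. It is not CIM-Tuner's algorithm or cost model.

```python
import math
import random

# Hypothetical discrete design space
CHOICES = {
    "MR": [1, 2, 4, 8],
    "MC": [1, 2, 4, 8],
    "SCR": [1, 2, 4, 8],
    "IS_SIZE": [8, 16, 32, 64],   # KB
    "OS_SIZE": [8, 16, 32, 64],   # KB
}
AREA_BUDGET = 40.0  # arbitrary units


def area(cfg):
    """Toy area model: macros grow with SCR, buffers add a linear term."""
    macros = cfg["MR"] * cfg["MC"]
    return macros * (1.0 + 0.25 * cfg["SCR"]) + 0.05 * (cfg["IS_SIZE"] + cfg["OS_SIZE"])


def throughput(cfg):
    """Toy throughput: limited by compute parallelism or by buffer bandwidth."""
    compute = cfg["MR"] * cfg["MC"]
    bandwidth = 0.5 * min(cfg["IS_SIZE"], cfg["OS_SIZE"])
    reload_factor = 1.0 / (1.0 + 1.0 / cfg["SCR"])  # more storage, fewer weight reloads
    return min(compute, bandwidth) * reload_factor


def score(cfg):
    """Throughput if the area constraint holds, otherwise infeasible."""
    return throughput(cfg) if area(cfg) <= AREA_BUDGET else -math.inf


def anneal(steps=2000, t0=2.0, seed=0):
    rng = random.Random(seed)
    cur = {k: v[0] for k, v in CHOICES.items()}  # smallest config, feasible here
    best = dict(cur)
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-6  # linear cooling schedule
        cand = dict(cur)
        key = rng.choice(list(CHOICES))
        cand[key] = rng.choice(CHOICES[key])     # perturb one parameter
        delta = score(cand) - score(cur)
        # Accept improvements always, regressions with Boltzmann probability
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            cur = cand
        if score(cur) > score(best):
            best = dict(cur)
    return best
```

Calling `anneal()` returns the best feasible configuration seen during the walk. Infeasible candidates score negative infinity, so their acceptance probability is zero and the search never leaves the area budget.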
Programmable architectures such as CIMR-V (and et al., 28 Mar 2025) and PSCNN (Kuo et al., 2022) employ RISC-V extensions or a custom ISA to express CiM operations, facilitate pipeline fusion (conv, pool, weight prefetch), and enable compiler-intrinsic integration from high-level ML frameworks (PyTorch/TensorFlow) to hardware-mapped instruction streams.
Approximate execution and error tolerance are now also exposed to design automation. OpenACM (Zhou et al., 16 Jan 2026) integrates accuracy-aware approximate multiplier libraries, hardware variation yield analysis, and an open-source flow (OpenROAD/FreePDK45), enabling the designer to trade precision for energy reductions of up to 64% without significant loss in top-1 accuracy.
5. Performance Benchmarks and Experimental Results
SRAM-CiM accelerators have reached competitive and sometimes record figures for energy and area efficiency on neural and cryptographic workloads:
| Architecture | Precision | Throughput (TOPS) | Efficiency (TOPS/W) | Notable Result |
|---|---|---|---|---|
| CIMR-V (and et al., 28 Mar 2025) | 1×1 | 26.2 | 3707.8 | End-to-end inference, 85% latency↓ |
| PSCNN (Kuo et al., 2022) | binary | 0.15 | 885.9 | Flexible ISA for 1D binary CNNs |
| Hybrid D/A (Konno et al., 25 Aug 2025) | 8×8 cplx | ~0.73 | 35.0 | 1.80 Mb/mm², 0.435% MAC error |
| 4×4 MAC w/ 9-bit ADC (Wang et al., 2023) | 4×4 | ~0.5 | 137.5 | Cell-embedded ADC, ~0.6% MAC error |
| P-8T CD-CiM (Kim et al., 2022) | 4×8 | – | 50.07 | All-in-SRAM charge DAC/ADC, 91.5% acc. |
| FAST (Chen et al., 2022) | 8-bit rowop | – | ~4.4× baseline | High-concurrency shift-ALU, >22× speedup |
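The efficiency column can be read as energy per operation with a one-line conversion, since 1 TOPS/W equals 10^12 operations per joule. The sketch below also includes a commonly used, rough normalization that scales TOPS/W by the product of input and weight bit widths to compare entries of different precision. Both helpers are hypothetical, and the derived figures are plain arithmetic on the table values, not numbers reported by the cited papers.

```python
def energy_per_op_fj(tops_per_watt):
    """Energy per operation in femtojoules: 1 / (TOPS/W) pJ = 1000 / (TOPS/W) fJ."""
    return 1000.0 / tops_per_watt


def one_bit_normalized(tops_per_watt, in_bits, w_bits):
    """Rough 1b-equivalent TOPS/W, assuming cost scales with in_bits * w_bits."""
    return tops_per_watt * in_bits * w_bits


cimr_v_fj = energy_per_op_fj(3707.8)          # roughly 0.27 fJ per 1-bit operation
hybrid_fj = energy_per_op_fj(35.0)            # roughly 28.6 fJ per 8x8 operation
adc_macro = one_bit_normalized(137.5, 4, 4)   # 4x4 entry on a 1b-equivalent scale
```

The normalization is only a first-order comparison aid: it ignores differences in process node, supply voltage, workload sparsity, and what each paper counts as one operation.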
Variation-resilient BNNs on PR-CIM (Le et al., 2021) recover >87.7% CIFAR-10 accuracy under strong process variation; charge-domain macros with end-to-end calibration and analog error compensation achieve 90.7%/66.2% accuracy (CIFAR-10/100), with sub-percent loss relative to floating-point (Chen et al., 2024).
Large-scale system design is exemplified by Ouroboros (Liu et al., 3 Mar 2026), a wafer-scale, all-SRAM digital CIM architecture. It retains all weights, activations, and KV caches in 54 GB of SRAM, enabling energy efficiency gains of up to 17× and throughput gains of up to 9.1× over A100/TPUv4 LLM inference systems.
6. Application Domains and Emerging Innovations
SRAM-based CiM accelerators now span:
- Deep neural network inference and training: Binary/multi-bit quantized CNNs, RNNs, transformers, with hybrid or variation-resilient training (Le et al., 2021, Kuo et al., 2022, and et al., 28 Mar 2025, Yoshioka et al., 2024).
- Privacy computing: Modular arithmetic, Barrett multiplication for homomorphic encryption and ZKP implemented with workload-partitioned, high-parallel macros (Li et al., 5 Nov 2025).
- Multi-port and SNN processing: Multiport SRAM cell designs for high-performance spiking neural networks (SNN) (Huijbregts et al., 2024), in-place membrane update/decay for O(1) latency (Shang et al., 13 Mar 2026).
- Data-intensive scientific and transactional workloads: Fully-concurrent batched shift-adds in databases or GNNs (Chen et al., 2022).
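For the privacy-computing item, Barrett multiplication replaces the division in a modular product with multiplies and shifts, which is what makes it amenable to MAC-oriented hardware. The sketch below shows the textbook algorithm only, not the workload partitioning of the cited macro; the modulus 12289 is a commonly used NTT-friendly prime chosen here purely for illustration.

```python
def barrett_setup(q):
    """Precompute the shift amount k and the constant mu = floor(4^k / q)."""
    k = q.bit_length()
    mu = (1 << (2 * k)) // q
    return k, mu


def barrett_mulmod(a, b, q, k, mu):
    """Compute (a * b) mod q for 0 <= a, b < q without a runtime division."""
    z = a * b
    t = (z * mu) >> (2 * k)   # estimate of z // q, never an overestimate
    r = z - t * q
    while r >= q:             # a small number of conditional subtractions
        r -= q
    return r


q = 12289
k, mu = barrett_setup(q)
assert barrett_mulmod(1234, 5678, q, k, mu) == (1234 * 5678) % q
```

The precomputation is done once per modulus, so each subsequent modular product costs two wide multiplications, a shift, and a short correction step.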
Emerging directions include hybrid digital/analog boundaries dynamically chosen based on MAC saliency (Chen et al., 2023), wafer-scale hierarchical mapping/fault tolerance for large models (Liu et al., 3 Mar 2026), and increasing use of open-source toolchains for physical design and approximate computing exploration (Zhou et al., 16 Jan 2026).
7. Challenges, Trade-Offs, and Design Guidelines
Key challenges include:
- Variation, noise, and PVT resilience: Extensive Monte Carlo and hardware-aware training, often using high-order circuit statistical models, are needed for robust operation (Le et al., 2021, Yoshioka et al., 2024, Chen et al., 2024).
- ADC/DAC design: Resolution, linearity, and power must be carefully managed. Innovations include cell-embedded ADCs, reference sharing, and time-domain replacements for conventional ADCs (Wang et al., 2023, Challagundla et al., 1 Jan 2026).
- Scalability of mapping and scheduling: Hardware–mapping co-exploration, operator-wise tuning, and multi-level accelerator templates have proven synergistic (Chen et al., 26 Jan 2026).
- Area/energy/precision trade-offs: Analog and hybrid designs are best for 3–8 bit precision; digital dominates at >10 bits or for stringent SNR, with increasing attention to error-aware and approximate computing for AI (Zhou et al., 16 Jan 2026, Chen et al., 2023).
Guidelines for designers include: abstraction of macros into vector/parallel/bandwidth models for mapping; per-layer variation-aware training and calibration; peripheral integration of non-linearities and ADCs to minimize data movement; and tuning of macro size and buffer depth for workload-optimized throughput.
SRAM-based CiM has evolved into a mature research area with extensive technological diversity, guided by multidisciplinary codesign of device, circuit, architecture, and learning/training protocol. The evolution from simple BNN analog engines to programmable, system-scale, and variation-tolerant dataflow accelerators exemplifies the rapid pace and technical breadth of this field. The cited works together underpin the ongoing migration of memory-compute fusion from prototype macros to end-to-end, system-level acceleration platforms.