Inner-Product CIM (IP-CIM)
- Inner-Product CIM (IP-CIM) is a compute-in-memory approach that performs dot product operations directly within memory arrays using digital, analog, or hybrid implementations.
- Hybrid designs combine capacitor arrays with digital logic to achieve high energy efficiency and throughput, with metrics such as 35 TOPS/W and 28 fJ/MAC.
- IP-CIM supports complex number operations and flexible operand widths, enabling efficient inference in deep neural networks and signal processing workloads.
Inner-Product Compute-in-Memory (IP-CIM) refers to a class of circuit-level and architectural techniques that perform multiply-accumulate (MAC) vector inner products directly within memory arrays, bypassing the data-movement bottleneck of traditional von Neumann designs. IP-CIM serves as the core primitive for high-throughput, energy-efficient inference in deep neural networks and signal-processing workloads. The defining feature is the realization of the inner product $y = \sum_i w_i x_i$ (dot product) in situ in the memory, supporting digital, analog, or hybrid (digital+analog) implementations, often in SRAM or RRAM technology. Recent IP-CIM macros integrate hybrid digital-analog operations, bit-parallel analog-domain MACs, advanced digital encoding, and system-level operand/dataflow flexibility. The following sections characterize the principal IP-CIM schemes, focusing on architecture, MAC realization, complex number support, efficiency, and comparative metrics.
1. Hybrid SRAM-Based IP-CIM with 2D-Weighted Capacitor Arrays
Recent work on 6T-SRAM-based CIM macros achieves high-density, low-energy IP-CIM by hybridizing a 2D-weighted capacitor array with split digital/analog MAC domains (Konno et al., 25 Aug 2025). Each macro typically consists of multiple complex-CIM units, each containing a digital SRAM array supplemented by analog computation layers:
- Architecture: Each 64-word × 16-bit complex-CIM unit uses standard 6T SRAM cells with double word-lines (which suppress read disturb), overlaid with a 2D-weighted capacitor array and a 7-bit split-CDAC SAR-ADC. The three most significant bit-groups are routed to dedicated digital MAC logic (digital CIM, DCIM), and the lower-order bits are processed via charge-domain analog CIM (ACIM).
- 2D-Weighted Capacitor Array for Analog MAC: The analog path realizes bit-wise multiplication between input bits and weight bits as charge gating events in a dense unit-capacitor (UC, 48 aF) grid. Each UC is selectively gated according to the conjunction of input and weight bits. The accumulated charge is digitized via the SAR-ADC.
Key relations:
- Hybrid Fusion and Output Correction: Digital and analog partial sums are realigned and summed post-conversion:
$Y = Y_{\mathrm{D}} + Y_{\mathrm{A}}$,
where $Y_{\mathrm{D}}$ is the digital MSB sum and $Y_{\mathrm{A}}$ is the analog result mapped to the digital dynamic range.
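- Complex Number Support: Both real and imaginary weights are co-located in the memory, enabling simultaneous computation of the real and imaginary parts of the complex inner product,
$\mathrm{Re}(y) = \sum_i \left( w_i^{R} x_i^{R} - w_i^{I} x_i^{I} \right)$ and $\mathrm{Im}(y) = \sum_i \left( w_i^{R} x_i^{I} + w_i^{I} x_i^{R} \right)$,
via dual analog charge paths and sign-toggled voltage references.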
- Metrics: This configuration achieves 1.80 Mb/mm² memory density, 0.435% RMS error, single-conversion complex MAC latency, and 35 TOPS/W energy efficiency (28 fJ/MAC).
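The MSB/LSB split described in this section can be illustrated with a small behavioural model. This is only a sketch under stated assumptions, not the circuit from the cited work: operands are unsigned, a "bit-group" is taken to mean all bit-level partial products sharing the same significance `j + k`, the ADC is an ideal uniform quantizer, and the function name `hybrid_mac` and its parameters are hypothetical.

```python
import numpy as np

def hybrid_mac(x, w, bits=4, n_digital=3, adc_bits=7):
    """Toy model: exact MSB groups plus ADC-quantized LSB groups (unsigned operands)."""
    x = np.asarray(x, dtype=np.int64)
    w = np.asarray(w, dtype=np.int64)
    max_sig = 2 * (bits - 1)             # largest bit significance j + k
    digital = 0                          # exact sum of the top n_digital groups
    analog = 0                           # ideal (pre-ADC) sum of the remaining groups
    full_scale = 0                       # largest value the analog path can produce
    for j in range(bits):                # input bit index
        for k in range(bits):            # weight bit index
            # AND of input bit j and weight bit k, counted over the whole vector
            count = int(np.sum(((x >> j) & 1) & ((w >> k) & 1)))
            if j + k > max_sig - n_digital:
                digital += count << (j + k)
            else:
                analog += count << (j + k)
                full_scale += len(x) << (j + k)
    levels = (1 << adc_bits) - 1
    lsb = full_scale / levels if full_scale else 1.0
    code = round(analog / lsb)           # ideal uniform ADC, no noise or mismatch
    return digital + code * lsb, lsb

rng = np.random.default_rng(0)
x = rng.integers(0, 16, size=64)
w = rng.integers(0, 16, size=64)
estimate, lsb = hybrid_mac(x, w)
print(estimate, int(np.dot(x, w)), lsb)
```

In this model the quantization error is bounded by half an ADC step of the analog path alone, because the high-significance groups never pass through the ADC. That is the intuition behind routing MSBs digitally; real designs also have to account for capacitor mismatch and noise, which the sketch ignores.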
2. Analog Charge-Domain Bit-Parallel IP-CIM in PICO-RAM
A distinct analog IP-CIM realization is found in PICO-RAM, featuring thin-cell, charge-domain bit-parallel MACs compatible with standard 6T SRAM cells (Chen et al., 2024):
- Bit-Parallel Inner Product (MVM): Each cell block contains a B-bit weight and accepts a B-bit input as an analog voltage from an in-situ capacitive DAC. The MAC is realized by gating unit capacitors based on the input and weight, then summing charge across the row, so that the shared voltage is proportional to $\sum_i x_i w_i$.
After in-place charge sharing, the analog sum is digitized by a dual-threshold, 8.5-bit time-domain ADC.
- In-Array Shift-and-Add: Multi-bit MAC is performed by stacking results from different bit-slices and merging them via charge-ratioed interconnection, ensuring high linearity and parallel accumulation.
- PVT-Insensitive Design: Every computing block reuses MOM capacitors; purely capacitive switching assures robustness to process, voltage, and temperature variations.
- Metrics: Achieves 559 Kb/mm² storage density in 65 nm, <0.6 LSB end-to-end error over wide PVT corners, and energy savings via aggressive analog block gating.
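A rough behavioural model helps show why charge sharing yields a voltage proportional to the inner product. The sketch below is an idealization under assumptions that are not taken from the cited work: each element owns `2**bits - 1` unit capacitors, `w_i` of them are charged to the DAC voltage for `x_i` while the rest stay at ground, and the ADC is a plain 256-level uniform quantizer standing in for the time-domain converter. The name `charge_domain_mac` is hypothetical.

```python
import numpy as np

def charge_domain_mac(x, w, bits=4, v_ref=1.0, adc_levels=256):
    """Idealized charge-sharing MAC; capacitances are in units of one unit capacitor."""
    x = np.asarray(x, dtype=np.float64)
    w = np.asarray(w, dtype=np.float64)
    full = (1 << bits) - 1
    v_in = v_ref * x / full              # capacitive DAC: input code -> voltage
    charge = w * v_in                    # w_i unit caps charged to v_in, the rest grounded
    c_total = len(x) * full              # every unit cap joins the shared node
    v_out = float(charge.sum() / c_total)
    code = round(v_out / v_ref * (adc_levels - 1))   # ideal uniform ADC
    # Map the ADC code back to an estimate of sum(x_i * w_i)
    estimate = code / (adc_levels - 1) * len(x) * full * full
    return v_out, code, estimate

v_out, code, estimate = charge_domain_mac([3, 15, 0, 7], [15, 1, 9, 4])
print(v_out, code, estimate)
```

Because the result depends only on capacitor ratios and a reference voltage, the model has no transistor-dependent term, which is consistent with the PVT-insensitivity argument made for purely capacitive switching.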
3. Digital IP-CIM in Large-Scale LLM Accelerators
Operator-fused, digital IP-CIM macros are key for transformer models, as in FusionCIM (Xuan et al., 28 Apr 2026):
- Bit-Serial Digital MAC: Each CIM unit stores 8-bit weights in SRAM, multiplies with 1-bit input slices (bit-serial), and accumulates results in an adder tree over 8 cycles, outputting 16-bit inner products directly to downstream softmax stages.
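Schematic computation (for unsigned inputs): $y = \sum_{b=0}^{7} 2^{b} \sum_i w_i \, x_i[b]$, where $x_i[b]$ is bit $b$ of input $x_i$.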
- Q-stationary Dataflow: Enables high data reuse by keeping query vectors local in SRAM and streaming only key vectors. The pipelined architecture aligns with downstream exponential/nonlinear units.
- Metrics: Delivers 1.64 TOPS (INT8) in 0.46 mm² per macro at 42 mW, scaling to 29.4 TOPS/W at the system level. By fusing the $QK^{\top}$ matrix multiply and softmax, the cumulative data-movement and energy overheads are substantially reduced compared to non-fused digital CIM (Xuan et al., 28 Apr 2026).
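The bit-serial scheme in this section can be sketched in a few lines. This is a generic shift-and-add model, assuming unsigned 8-bit inputs and integer weights; it does not model the 16-bit output width or the pipelining of the cited macro, and the name `bit_serial_mac` is hypothetical.

```python
def bit_serial_mac(weights, inputs, input_bits=8):
    """Inner product computed one input bit-plane per cycle (unsigned inputs)."""
    acc = 0
    for b in range(input_bits):                      # one cycle per input bit, LSB first
        bit_plane = [(x >> b) & 1 for x in inputs]   # 1-bit slice of every input
        # Adder tree: sum the weights whose input bit is set in this slice
        partial = sum(w for w, s in zip(weights, bit_plane) if s)
        acc += partial << b                          # weight the slice by 2**b
    return acc

weights = [3, -5, 7, 100]
inputs = [255, 0, 17, 2]
print(bit_serial_mac(weights, inputs))
```

Each cycle needs only AND-style gating and an adder tree, so the multiplier disappears from the array; the cost is one cycle per input bit.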
4. Low-Power Digital IP-CIM with Advanced Input and Weight Encoding
Digital IP-CIM designs introducing optimized binary encodings further enhance energy efficiency and minimize partial product count (Xiao et al., 2021):
- Modified Radix-4 (M-Rd4) Input Encoding: Input vectors are pre-encoded in a radix-4 Booth scheme, reducing the number of nonzero partial products per MAC. Hardware implements a single-pass sliding window transformation followed by Radix-4 recoding, reducing 2 events by up to 80%.
- Modified Canonical Signed Digit (M-CSD) Weights: Weights are ternarized (−1, 0, +1) with run-length encoding, minimizing the number of toggled wordlines and required current-driving events.
- Analog Integration and Low-Power ADC: Per-digit MAC is performed in a differential RRAM crossbar with passive charge integrators and a shared, low-power SAR-ADC per column group. Digital accumulation across digits completes the inner product digitally in-controller.
- Performance: Demonstrated 60.68 TOPS/W at 8-bit precision, cutting array-level energy relative to prior digital/analog CIMs by up to 99% with less than 0.5% accuracy penalty (Xiao et al., 2021).
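To make the encoding idea concrete, the sketch below implements the textbook forms of the two schemes: standard radix-4 Booth recoding and standard canonical signed digit (non-adjacent form) conversion. The modified variants (M-Rd4, M-CSD), the sliding-window transformation, and the run-length encoding from the cited work are not reproduced here; the function names are hypothetical.

```python
def booth_radix4(x, bits=8):
    """Textbook radix-4 Booth digits (LSB first, each in -2..2) of a two's complement value."""
    assert bits % 2 == 0 and -(1 << (bits - 1)) <= x < (1 << (bits - 1))
    u = x & ((1 << bits) - 1)            # two's complement bit pattern
    digits, prev = [], 0                 # prev is the bit just below the current pair
    for i in range(0, bits, 2):
        b0 = (u >> i) & 1
        b1 = (u >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits

def csd(x):
    """Textbook canonical signed digit form (LSB first, digits in -1, 0, +1)."""
    digits = []
    while x != 0:
        if x & 1:
            d = 2 - (x & 3)              # +1 if x % 4 == 1, -1 if x % 4 == 3
            x -= d
        else:
            d = 0
        digits.append(d)
        x >>= 1
    return digits

print(booth_radix4(7))   # [-1, 2, 0, 0], i.e. -1 + 2*4
print(csd(7))            # [-1, 0, 0, 1], i.e. -1 + 8
```

The value 7 has three nonzero bits in plain binary but only two nonzero digits in either recoding, which is the effect these schemes exploit: fewer nonzero digits means fewer partial products or wordline activations per MAC.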
5. Flexible Digital IP-CIM for Spiking Neural Networks
Recent fully digital IP-CIM macros such as FlexSpIM target spiking and event-driven inference with configurable operand width and dataflow (Chauvaux et al., 2024):
- Unified CIM Storage: Both weights and neuron membrane potentials are stored in SRAM; inner-product MAC is realized directly in the array using bit-serial multiplication (AND on dual wordlines) and per-column, per-cycle adder trees.
- Arbitrary Operand Resolution: Operand widths from 1–512 bits (weights) and 1–256 bits (activations) are supported, configurable per layer. Multi-bit operands are mapped across adjacent columns using flexible carry propagation and full-adder chaining.
- Reconfigurable Stationary Dataflows: Hybrid weight-stationary and output-stationary modes, selected per layer to maximize data reuse and minimize movement.
- Measured Results: Achieves >2× energy efficiency (44.5–56.3 fJ/SOP/bit3) over prior fixed-precision digital CIM macros, with adaptive operand shaping yielding further energy savings (Chauvaux et al., 2024).
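As a rough illustration of spike-gated accumulation with configurable operand widths, the sketch below updates a layer of membrane potentials for one time step. It is a behavioural model under assumptions not taken from the cited work: two's complement wraparound on overflow, a simple threshold-and-reset neuron, and hypothetical names (`wrap_signed`, `snn_step`, `w_bits`, `v_bits`).

```python
def wrap_signed(v, bits):
    """Interpret v modulo 2**bits as a two's complement number."""
    v &= (1 << bits) - 1
    return v - (1 << bits) if v >> (bits - 1) else v

def snn_step(spikes, weights, v_mem, threshold, w_bits=4, v_bits=8):
    """One time step; weights[i][j] connects input i to neuron j."""
    out, new_v, sops = [], [], 0
    for j, v in enumerate(v_mem):
        for i, s in enumerate(spikes):
            if s:                                    # binary spike gates the weight (AND)
                w = wrap_signed(weights[i][j], w_bits)
                v = wrap_signed(v + w, v_bits)       # accumulate at the chosen width
                sops += 1                            # one synaptic operation
        fired = v >= threshold
        out.append(int(fired))
        new_v.append(0 if fired else v)              # reset the neuron if it fired
    return out, new_v, sops

out, v_mem, sops = snn_step([1, 0, 1], [[1, -2], [3, 3], [2, 1]], [0, 0], threshold=3)
print(out, v_mem, sops)
```

Only inputs that actually spike contribute work, so the synaptic-operation count scales with activity, and narrower `w_bits` or `v_bits` shrink the number of columns an operand occupies in a bit-serial array.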
6. Comparison of IP-CIM Approaches: Metrics and Trade-Offs
| Macro/Approach | Technology | Area Density (Mb/mm²) | Energy (fJ/MAC) | Error/ENOB | Unique Features |
|---|---|---|---|---|---|
| Hybrid SRAM + 2D Cap (Konno et al., 25 Aug 2025) | 28 nm | 1.80 | 28 | 0.435% RMS | Hybrid analog/digital, complex MAC |
| PICO-RAM (Chen et al., 2024) | 65 nm | 0.559 | N/A | <0.6 LSB, linearity | Pure analog, full PVT insensitivity |
| FusionCIM (Xuan et al., 28 Apr 2026) | ? | 2.03 TOPS/mm² | 29.4 TOPS/W | INT8 digital | Operator fusion, Q-stationary |
| M-Rd4+M-CSD (Xiao et al., 2021) | 45 nm | N/A | 16.5 (2.0 pJ/MAC) | 7.4 ENOB | Digital encoding, RRAM crossbar |
| FlexSpIM (Chauvaux et al., 2024) | 40 nm | N/A | 44.5–56.3 fJ/SOP/bit6 | Digital | Arbitrary operand width, hybrid dataflow |
All designs leverage in-memory MAC to minimize data movement and avoid energy bottlenecks. Hybrid digital/analog splitting improves accuracy; analog charge-domain approaches yield higher cell density and energy savings at the expense of additional calibration or digital correction logic. Digital-only approaches facilitate flexible operand shaping and advanced dataflows, at the potential cost of increased area or per-cell complexity.
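Because the table mixes TOPS/W and fJ figures, a small conversion helper makes the entries easier to compare. The sketch assumes energy per operation is simply the reciprocal of throughput per watt; whether a MAC counts as one operation or two differs between papers, so treat the outputs as order-of-magnitude checks. The function name `fj_per_op` is hypothetical.

```python
def fj_per_op(tops_per_watt):
    """Energy per operation in femtojoules for a given TOPS/W figure."""
    # 1 TOPS/W = 1e12 operations per joule, so one operation costs 1e-12 / (TOPS/W) joules,
    # which is 1000 / (TOPS/W) femtojoules.
    return 1000.0 / tops_per_watt

for name, tops_w in [("Hybrid SRAM + 2D Cap", 35.0),
                     ("FusionCIM (system)", 29.4),
                     ("M-Rd4+M-CSD", 60.68)]:
    print(f"{name}: {fj_per_op(tops_w):.1f} fJ/op")
```

Under this convention 35 TOPS/W corresponds to roughly 28.6 fJ per operation and 60.68 TOPS/W to roughly 16.5 fJ, in line with the energy figures quoted in the table for those two rows.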
7. Principal Design Insights and Future Directions
- Hybridization: Routing MSBs digitally and LSBs in analog balances area, accuracy, and energy.
- Dataflow Co-Design: Q-stationary and weight/output-stationary mappings are critical to amortize input/output cost and maximize in-array compute ratio.
- Encoding and Operand Flexibility: Pre-encoding inputs/weights (e.g., M-Rd4, M-CSD) and programmable operand width unlock further efficiency.
- Circuit-Level Innovations: Novel capacitor array designs, passive analog integrators, and low-power ADC schemes are central to high-density analog IP-CIM.
- Complex MAC Support: Explicit hardware support for complex inner products enables single-conversion realization of both real and imaginary outputs.
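The arithmetic behind complex MAC support is the decomposition of one complex inner product into four real ones, with a sign flip on one term. The sketch below shows only that arithmetic, assuming integer operands and no conjugation; it says nothing about how charge paths or voltage references implement it, and the name `complex_inner_product` is hypothetical.

```python
import numpy as np

def complex_inner_product(x_re, x_im, w_re, w_im):
    """Real and imaginary parts of sum(w * x) built from four real inner products."""
    rr = np.dot(w_re, x_re)
    ii = np.dot(w_im, x_im)
    ri = np.dot(w_re, x_im)
    ir = np.dot(w_im, x_re)
    return rr - ii, ri + ir              # the minus sign is the "sign-toggled" term

x_re, x_im = np.array([1, 2, 3]), np.array([0, -1, 4])
w_re, w_im = np.array([2, 0, -1]), np.array([5, 1, 1])
re, im = complex_inner_product(x_re, x_im, w_re, w_im)
print(re, im)
```

Storing the real and imaginary weights side by side lets all four real products share the same input drive, which is why co-location in the array matters for latency.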
A plausible implication is that future IP-CIM architectures will converge towards tightly-coupled hybrid analog/digital macros with integrated operand and dataflow programmability, supported by a co-optimized system stack minimizing memory access and maximizing per-area throughput and accuracy (Konno et al., 25 Aug 2025, Chen et al., 2024, Xuan et al., 28 Apr 2026, Chauvaux et al., 2024, Xiao et al., 2021).