
Inner-Product CIM (IP-CIM)

Updated 16 May 2026
  • Inner-Product CIM (IP-CIM) is a compute-in-memory approach that performs dot product operations directly within memory arrays using digital, analog, or hybrid implementations.
  • Hybrid designs combine capacitor arrays with digital logic to achieve high energy efficiency and throughput, with metrics such as 35 TOPS/W and 28 fJ/MAC.
  • IP-CIM supports complex number operations and flexible operand widths, enabling efficient inference in deep neural networks and signal processing workloads.

Inner-Product Compute-in-Memory (IP-CIM) refers to a class of circuit-level and architectural techniques enabling multiply-accumulate (MAC) vector inner products to be performed directly within memory arrays, bypassing traditional von Neumann data movement bottlenecks. IP-CIM serves as the core primitive for high-throughput, energy-efficient inference in deep neural networks and signal-processing workloads. The defining feature is the realization of the dot product $\sum_k x_k w_k$ in situ in the memory, supporting digital, analog, or hybrid (digital+analog) implementations, often in SRAM or RRAM technology. Recent IP-CIM macros integrate hybrid digital-analog operations, bit-parallel analog domain MACs, advanced digital encoding, and full system-level operand/dataflow flexibility. The following sections systematically characterize the principal IP-CIM schemes, focusing on architecture, MAC realization, complex number support, efficiency, and comparative metrics.

1. Hybrid SRAM-Based IP-CIM with 2D-Weighted Capacitor Arrays

Recent work on 6T-SRAM-based CIM macros achieves high-density, low-energy IP-CIM by hybridizing a 2D-weighted capacitor array with split digital/analog MAC domains (Konno et al., 25 Aug 2025). Each macro typically consists of multiple complex-CIM units, each containing a digital SRAM array supplemented by analog computation layers:

  • Architecture: Each 64-word × 16-bit complex-CIM unit uses standard 6T SRAM cells with double word-lines (suppressing read-disturb), overlaid with a 2D-weighted capacitor array and a 7-bit split-CDAC SAR-ADC. Three most significant bit-groups are routed to dedicated digital MAC logic (digital CIM, DCIM), and the lower-order bits are processed via charge-domain analog CIM (ACIM).
  • 2D-Weighted Capacitor Array for Analog MAC: The analog path realizes bit-wise multiplication between input bits and weight bits as charge gating events in a dense unit-capacitor (UC, 48 aF) grid. Each UC is selectively gated according to the conjunction of input and weight bits. The accumulated charge is digitized via the SAR-ADC.

Key relations:

$$Q = C_u \cdot \sum_{i,b} x_{i,b}\, w_{i,b}\, V_{REFSR}, \qquad V_{out} = \frac{Q}{C_{tot}}$$

  • Hybrid Fusion and Output Correction: Digital and analog partial sums are realigned and summed post-conversion:

$$IP = D + A_{\text{shifted}}$$

where $D$ is the digital MSB sum, and $A_{\text{shifted}}$ is the analog result mapped to the digital dynamic range.

  • Complex Number Support: Both real and imaginary weights are co-located in the memory, enabling simultaneous computation of

$$(a_k c_k - b_k d_k) + j\,(a_k d_k + b_k c_k)$$

via dual analog charge paths and sign-toggled voltage references.

  • Metrics: This configuration achieves 1.80 Mb/mm$^2$ memory density, 0.435% RMS error, single-conversion complex MAC latency, and 35 TOPS/W energy efficiency (28 fJ/MAC).
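The hybrid fusion $IP = D + A_{\text{shifted}}$ can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the split is applied to weight bits alone, the analog path is an ideal charge sum followed by a uniform quantizer, and the function name `hybrid_inner_product` and its parameters are hypothetical. The defaults of 3 digital MSBs and a 7-bit ADC are borrowed from the description above; the 4-bit input width is an arbitrary choice.

```python
import numpy as np

def hybrid_inner_product(x, w, x_bits=4, w_bits=8, msb_bits=3, adc_bits=7):
    """Behavioral sketch: exact digital MSB sum plus ADC-quantized analog LSB sum.

    Assumes unsigned operands and msb_bits < w_bits. Returns (fused, adc_step).
    """
    x = np.asarray(x, dtype=np.int64)
    w = np.asarray(w, dtype=np.int64)
    lsb_bits = w_bits - msb_bits
    w_msb = w >> lsb_bits                 # upper weight bits -> digital path
    w_lsb = w & ((1 << lsb_bits) - 1)     # lower weight bits -> analog path

    # Digital path (D): exact partial sum, shifted back to its bit position.
    d = int(np.dot(x, w_msb)) << lsb_bits

    # Analog path: ideal charge sum, then uniform quantization by the ADC.
    a_exact = int(np.dot(x, w_lsb))
    full_scale = x.size * ((1 << x_bits) - 1) * ((1 << lsb_bits) - 1)
    step = full_scale / ((1 << adc_bits) - 1)
    a_shifted = round(a_exact / step) * step   # A_shifted in digital units

    return d + a_shifted, step

rng = np.random.default_rng(0)
x = rng.integers(0, 16, size=64)     # 64 words, matching the unit depth above
w = rng.integers(0, 256, size=64)
fused, step = hybrid_inner_product(x, w)
exact = int(np.dot(x, w))
print(exact, fused, abs(fused - exact) <= step / 2)
```

In this model only the low-order partial sum is exposed to quantization, so the worst-case error is half an ADC step of the analog range rather than of the full inner-product range.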

2. Analog Charge-Domain Bit-Parallel IP-CIM in PICO-RAM

A distinct analog IP-CIM realization is found in PICO-RAM, featuring thin-cell, charge-domain bit-parallel MACs compatible with standard 6T SRAM cells (Chen et al., 2024):

  • Bit-Parallel Inner Product (MVM): Each cell block contains a B-bit weight and accepts a B-bit input as an analog voltage from an in-situ capacitive DAC. MAC is realized by gating unit capacitors based on the input and weight, then summing charge across the row:

$$Q_{\mathrm{tot}} = C_{\mathrm{unit}} \sum_{i=1}^{N} w_i\, V_{\mathrm{DAC},i}$$

After in-place charge sharing, the analog sum is digitized by a dual-threshold, 8.5-bit time-domain ADC.

  • In-Array Shift-and-Add: Multi-bit MAC is performed by stacking results from different bit-slices and merging them via charge-ratioed interconnection, ensuring high linearity and parallel accumulation.
  • PVT-Insensitive Design: Every computing block reuses MOM capacitors; purely capacitive switching assures robustness to process, voltage, and temperature variations.
  • Metrics: Achieves 559 Kb/mm$^2$ storage density in 65 nm, <0.6 LSB end-to-end error over wide PVT corners, and energy savings via aggressive analog block gating.
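The charge relation $Q_{\mathrm{tot}} = C_{\mathrm{unit}} \sum_i w_i V_{\mathrm{DAC},i}$ can be checked numerically with an idealized model. The sketch below assumes a linear DAC, a bank of $2^B - 1$ unit capacitors per cell block of which $w_i$ are charged, and lossless charge sharing across all capacitors. These circuit details, the 1 fF unit capacitance, and the helper names are illustrative assumptions, not values taken from the PICO-RAM design.

```python
import numpy as np

def charge_domain_mac(x, w, bits=4, v_ref=1.0, c_unit=1e-15):
    """Idealized charge-domain MAC. Returns (q_tot, v_out)."""
    x = np.asarray(x, dtype=np.int64)
    w = np.asarray(w, dtype=np.int64)
    levels = (1 << bits) - 1
    v_dac = v_ref * x / levels             # DAC voltage for each B-bit input
    q_tot = c_unit * np.sum(w * v_dac)     # Q_tot = C_unit * sum_i w_i * V_DAC,i
    c_tot = c_unit * levels * x.size       # every unit cap joins the sharing (assumed)
    return q_tot, q_tot / c_tot

def voltage_to_dot(v_out, n, bits=4, v_ref=1.0):
    """Undo the DAC and charge-sharing scale factors to recover sum_i x_i * w_i."""
    levels = (1 << bits) - 1
    return v_out * n * levels * levels / v_ref

x = [3, 0, 15, 7]
w = [1, 2, 3, 4]
q_tot, v_out = charge_domain_mac(x, w)
print(q_tot, v_out, voltage_to_dot(v_out, len(x)))
```

Because the output voltage depends only on capacitor ratios in this model, changing `c_unit` leaves `v_out` unchanged, which is one way to see why a purely capacitive scheme is comparatively insensitive to absolute process variation.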

3. Digital IP-CIM in Large-Scale LLM Accelerators

Operator-fused, digital IP-CIM macros are key for transformer models, as in FusionCIM (Xuan et al., 28 Apr 2026):

  • Bit-Serial Digital MAC: Each CIM unit stores 8-bit weights in SRAM, multiplies with 1-bit input slices (bit-serial), and accumulates results in an adder tree over 8 cycles, outputting 16-bit inner products directly to downstream softmax stages.

Schematic computation:

$$\text{Score}[i] = \sum_{b=0}^{7} \left( \sum_{j=0}^{127} k_{j,b}\, Q[i,j] \right) \ll b$$

  • Q-stationary Dataflow: Enables high data reuse by keeping query vectors local in SRAM and streaming only key vectors. The pipelined architecture aligns with downstream exponential/nonlinear units.
  • Metrics: Delivers 1.64 TOPS (INT8) in 0.46 mm$^2$ per macro at 42 mW, scaling to 29.4 TOPS/W at the system level. By fusing the $QK^T$ matrix multiply and softmax, the cumulative data-movement and energy overheads are substantially reduced compared to non-fused digital CIM (Xuan et al., 28 Apr 2026).
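The bit-serial score computation above can be written as a short reference model. This sketch assumes the streamed key elements are unsigned 8-bit values and uses 64-bit accumulation, so it does not model the 16-bit output width; the function name `bit_serial_scores` is hypothetical.

```python
import numpy as np

def bit_serial_scores(q, k, bits=8):
    """Score[i] = sum_b (sum_j k[j,b] * Q[i,j]) << b, one bit slice per cycle.

    q: (rows, n) stationary operand held in the array.
    k: (n,) streamed operand, assumed unsigned and `bits` wide.
    """
    q = np.asarray(q, dtype=np.int64)
    k = np.asarray(k, dtype=np.int64)
    score = np.zeros(q.shape[0], dtype=np.int64)
    for b in range(bits):                  # one cycle per input bit
        k_b = (k >> b) & 1                 # 1-bit slice k_{j,b}
        partial = q @ k_b                  # adder tree: sum_j k_{j,b} * Q[i,j]
        score += partial * (1 << b)        # equivalent to shifting left by b
    return score

rng = np.random.default_rng(0)
q = rng.integers(-128, 128, size=(4, 128))
k = rng.integers(0, 256, size=128)
print(np.array_equal(bit_serial_scores(q, k), q @ k))
```

Each cycle needs only AND-style gating of stored words by a single input bit followed by addition, which is why the multiplier disappears from the array in this style of design.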

4. Low-Power Digital IP-CIM with Advanced Input and Weight Encoding

Digital IP-CIM designs introducing optimized binary encodings further enhance energy efficiency and minimize partial product count (Xiao et al., 2021):

  • Modified Radix-4 (M-Rd4) Input Encoding: Input vectors are pre-encoded in a radix-4 Booth scheme, reducing the number of nonzero partial products per MAC. Hardware implements a single-pass sliding window transformation followed by Radix-4 recoding, reducing the number of such events by up to 80%.
  • Modified Canonical Signed Digit (M-CSD) Weights: Weights are ternarized (−1, 0, +1) with run-length encoding, minimizing the number of toggled wordlines and required current-driving events.
  • Analog Integration and Low-Power ADC: Per-digit MAC is performed in a differential RRAM crossbar with passive charge integrators and a shared, low-power SAR-ADC per column group. Digital accumulation across digits completes the inner product digitally in-controller.
  • Performance: Demonstrated 60.68 TOPS/W at 8-bit precision, cutting array-level energy relative to prior digital/analog CIMs by up to 99% with less than 0.5% accuracy penalty (Xiao et al., 2021).
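To show why signed-digit encodings reduce the number of partial products, the sketch below implements the textbook radix-4 Booth recoding and the textbook canonical signed digit (non-adjacent) form. These are the standard versions, not the modified M-Rd4 and M-CSD schemes of the cited work, whose details are not given here; the function names are hypothetical.

```python
def booth_radix4(value, bits=8):
    """Standard radix-4 Booth digits in {-2,...,2}, least significant first."""
    assert bits % 2 == 0 and -(1 << (bits - 1)) <= value < (1 << (bits - 1))
    u = value & ((1 << bits) - 1)          # two's-complement bit pattern
    digits, prev = [], 0                   # prev plays the role of b_{-1} = 0
    for i in range(0, bits, 2):
        b0 = (u >> i) & 1
        b1 = (u >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)  # d_i = b_{2i} + b_{2i-1} - 2*b_{2i+1}
        prev = b1
    return digits

def csd_digits(value):
    """Canonical signed digits in {-1, 0, 1}, least significant first."""
    digits, n = [], value
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)                # +1 if n % 4 == 1, -1 if n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

def nonzero(digits):
    """Count digits that would trigger a partial product."""
    return sum(1 for d in digits if d != 0)

v = 119                                    # binary 01110111, six nonzero bits
print(bin(v).count("1"), nonzero(booth_radix4(v)), nonzero(csd_digits(v)))
```

Every zero digit corresponds to a partial product that never has to be generated, which is the effect the encodings above exploit to cut wordline toggling and current-driving events.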

5. Flexible Digital IP-CIM for Spiking Neural Networks

Recent fully digital IP-CIM macros such as FlexSpIM target spiking and event-driven inference with fine-grained control over operand width and dataflow (Chauvaux et al., 2024):

  • Unified CIM Storage: Both weights and neuron membrane potentials are stored in SRAM; inner-product MAC is realized directly in the array using bit-serial multiplication (AND on dual wordlines) and per-column, per-cycle adder trees.
  • Arbitrary Operand Resolution: Operand widths from 1–512 bits (weights) and 1–256 bits (activations) are supported, configurable per layer. Multi-bit operands are mapped across adjacent columns using flexible carry propagation and full-adder chaining.
  • Reconfigurable Stationary Dataflows: Hybrid weight-stationary and output-stationary modes are selected per layer to maximize data reuse and minimize movement.
  • Measured Results: Achieves >2× energy efficiency (44.5–56.3 fJ/SOP/bit) over prior fixed-precision digital CIM macros, with adaptive operand shaping yielding further energy savings (Chauvaux et al., 2024).
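The combination of binary spikes, AND-based bit-serial multiplication, and configurable weight width can be modeled as bit-plane accumulation into the membrane potential. The sketch below is a behavioral illustration only: it assumes two's-complement weights, treats the most significant bit-plane as negatively weighted, and uses hypothetical names. It does not model the column mapping or carry chaining of the actual macro.

```python
import numpy as np

def spike_accumulate(v_mem, weights, spikes, w_bits):
    """Add sum_j W[i, j] * s[j] to each membrane potential, one bit-plane at a time.

    weights: (out, in) signed integers representable in w_bits two's complement.
    spikes:  (in,) array of 0/1 events.
    """
    w = np.asarray(weights, dtype=np.int64)
    s = np.asarray(spikes, dtype=np.int64)
    u = w & ((1 << w_bits) - 1)                # two's-complement bit pattern
    acc = np.zeros(w.shape[0], dtype=np.int64)
    for b in range(w_bits):
        plane = (u >> b) & 1                   # bit-plane b of every weight
        row_sum = (plane & s).sum(axis=1)      # AND with spikes, then adder tree
        sign = -1 if b == w_bits - 1 else 1    # MSB carries negative weight
        acc += sign * row_sum * (1 << b)
    return np.asarray(v_mem, dtype=np.int64) + acc

rng = np.random.default_rng(0)
for w_bits in (2, 5, 12):                      # operand width chosen per layer
    lo, hi = -(1 << (w_bits - 1)), 1 << (w_bits - 1)
    w = rng.integers(lo, hi, size=(3, 16))
    s = rng.integers(0, 2, size=16)
    v = rng.integers(-50, 50, size=3)
    print(w_bits, np.array_equal(spike_accumulate(v, w, s, w_bits), v + w @ s))
```

The loop length equals the weight width, so in this model a narrower operand directly shortens the computation, which is the intuition behind per-layer operand shaping.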

6. Comparison of IP-CIM Approaches: Metrics and Trade-Offs

| Macro/Approach | Technology | Area Density (Mb/mm$^2$) | Energy (fJ/MAC) | Error/ENOB | Unique Features |
|---|---|---|---|---|---|
| Hybrid SRAM + 2D Cap (Konno et al., 25 Aug 2025) | 28 nm | 1.80 | 28 | 0.435% RMS | Hybrid analog/digital, complex MAC |
| PICO-RAM (Chen et al., 2024) | 65 nm | 0.559 | N/A | <0.6 LSB, linearity | Pure analog, full PVT insensitivity |
| FusionCIM (Xuan et al., 28 Apr 2026) | ? | 2.03 TOPS/mm$^2$ | 29.4 TOPS/W | INT8 digital | Operator fusion, Q-stationary |
| M-Rd4+M-CSD (Xiao et al., 2021) | 45 nm | N/A | 16.5 (2.0 pJ/MAC) | 7.4 ENOB | Digital encoding, RRAM crossbar |
| FlexSpIM (Chauvaux et al., 2024) | 40 nm | N/A | 44.5–56.3 fJ/SOP/bit | Digital | Arbitrary operand width, hybrid dataflow |

All designs leverage in-memory MAC to minimize data movement and avoid energy bottlenecks. Hybrid digital/analog splitting improves accuracy; analog charge-domain approaches yield higher cell density and energy savings at the expense of additional calibration or digital correction logic. Digital-only approaches facilitate flexible operand shaping and advanced dataflows, at the potential cost of increased area or per-cell complexity.
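The efficiency figures quoted in this comparison mix TOPS/W and fJ per operation, which are reciprocals of each other up to a unit factor. The helper below makes that conversion explicit. It assumes each source counts an "operation" consistently; whether a MAC is counted as one or two operations is not resolved here, and the function names are hypothetical.

```python
def tops_per_watt_to_fj_per_op(tops_per_watt):
    """1 TOPS/W is 1e12 operations per joule, so energy per op is 1e3 / (TOPS/W) fJ."""
    return 1e3 / tops_per_watt

def tops_per_watt(tops, milliwatts):
    """Efficiency implied by a throughput (TOPS) and a power draw (mW)."""
    return tops / (milliwatts * 1e-3)

# Figures quoted in the sections above.
print(tops_per_watt_to_fj_per_op(35.0))     # hybrid SRAM macro, quoted as 28 fJ/MAC
print(tops_per_watt_to_fj_per_op(60.68))    # M-Rd4 + M-CSD design
print(tops_per_watt_to_fj_per_op(29.4))     # FusionCIM, system level
print(tops_per_watt(1.64, 42.0))            # FusionCIM macro: 1.64 TOPS at 42 mW
```

Under this convention, 35 TOPS/W corresponds to roughly 28.6 fJ per operation and 60.68 TOPS/W to roughly 16.5 fJ per operation, in line with the 28 and 16.5 entries in the comparison.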

7. Principal Design Insights and Future Directions

  • Hybridization: Routing MSBs through digital logic and LSBs through the analog path balances area, accuracy, and energy, since quantization error is confined to the low-order partial sum.
  • Dataflow Co-Design: Q-stationary and weight/output-stationary mappings are critical to amortize input/output cost and maximize in-array compute ratio.
  • Encoding and Operand Flexibility: Pre-encoding inputs/weights (e.g., M-Rd4, M-CSD) and programmable operand width unlock further efficiency.
  • Circuit-Level Innovations: Novel capacitor array designs, passive analog integrators, and low-power ADC schemes are central to high-density analog IP-CIM.
  • Complex MAC Support: Explicit hardware support for complex inner products enables both real and imaginary outputs to be produced in a single conversion.
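The complex MAC in the last point reduces to four real inner products, two of which are combined with a sign flip. The sketch below checks this decomposition against NumPy's complex arithmetic. It is a purely numerical illustration with a hypothetical function name, and it does not model the dual charge paths or the sign-toggled references.

```python
import numpy as np

def complex_inner_product(a, b, c, d):
    """sum_k (a_k + j*b_k) * (c_k + j*d_k) built from four real inner products."""
    real = np.dot(a, c) - np.dot(b, d)     # sum_k (a_k*c_k - b_k*d_k)
    imag = np.dot(a, d) + np.dot(b, c)     # sum_k (a_k*d_k + b_k*c_k)
    return complex(real, imag)

rng = np.random.default_rng(0)
a, b, c, d = (rng.integers(-8, 8, size=16) for _ in range(4))
reference = np.sum((a + 1j * b) * (c + 1j * d))
print(complex_inner_product(a, b, c, d) == reference)
```

Storing the real and imaginary weights side by side lets all four real products share the same input activation, which is what allows both outputs to be formed together.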

A plausible implication is that future IP-CIM architectures will converge towards tightly-coupled hybrid analog/digital macros with integrated operand and dataflow programmability, supported by a co-optimized system stack minimizing memory access and maximizing per-area throughput and accuracy (Konno et al., 25 Aug 2025, Chen et al., 2024, Xuan et al., 28 Apr 2026, Chauvaux et al., 2024, Xiao et al., 2021).
