RCW-CIM: A Digital CIM-based LLM Accelerator with Read-Compute/Write

Published 30 Apr 2026 in cs.AR | (2604.27384v1)

Abstract: Digital computing-in-memory (DCIM) has emerged as a promising solution for LLM acceleration by minimizing data transfers between external DRAM and on-chip accelerators while maintaining high precision for superior accuracy. However, existing CIM architectures often overlook weight update latency, which becomes critical as LLM weights are far larger than a single CIM macro capacity. To address this issue, this paper proposes a read-compute/write (RCW) architecture that effectively minimizes weight update latency, along with a nonlinear operator fusion that further mitigates dependency-induced latency. The proposed RCW reduces decoding computing latency by 21.59% on the Llama2-7B model. In addition, the nonlinear operator fusion mechanism achieves a 69.17% latency reduction through efficient partial accumulation and group-based approximation. Furthermore, a weight-stationary and output column stationary (WS-OCS) dataflow is introduced to reduce both external DRAM access and internal CIM weight updates by 51.6% and 87.6% respectively during the prefill phase of 1024 tokens, leading to an overall 49.76% latency reduction. Fabricated using TSMC 22 nm CMOS technology and operating at 100 MHz, the proposed RCW-CIM achieves 3.28 TOPS and 42.3 TOPS/W, enabling 4.2 ms prefill latency and 26.87 decoded tokens per second for the INT4-weight Llama2 model with dual DDR5-6400 memory.


Authors (3)

Summary

The paper presents a digital CIM architecture featuring a two-phase RCW mechanism that minimizes weight update latency and reduces energy costs.
It introduces a WS-OCS dataflow that cuts external DRAM access by 51.6% and internal weight updates by 87.6%, enabling efficient scaling for LLMs.
The design incorporates nonlinear operator fusion for FP16 softmax and RMSNorm, achieving a 69.17% latency reduction through partial accumulation and group-based approximation.

RCW-CIM: Architectural Innovations for CIM-based LLM Acceleration

Overview and Motivation

The rapid growth of LLMs in both parameter count and operational complexity has aggravated memory bandwidth and weight update bottlenecks in traditional accelerator architectures. Digital compute-in-memory (DCIM) is increasingly leveraged to mitigate data movement demands, preserve precision, and support on-chip high-throughput computation. Despite these advances, existing CIM-based solutions often neglect the impact of weight update latency and global nonlinear function dependencies, both of which can be pronounced for LLM workloads, given weight sizes that far exceed a single CIM macro’s storage capacity. The RCW-CIM architecture addresses these limitations through a suite of hardware and dataflow innovations expressly tailored for LLM execution.

Architectural Contributions

RCW-CIM features a tightly integrated system comprising eight CIM clusters, each with four energy-efficient digital SRAM-based CIM cores, 64 KB input buffers, and partial sum storage, all coordinated by a novel weight-stationary output-column-stationary (WS-OCS) scheduling unit. The macro supports FP16 precision for nonlinear operations and dual-mode INT4/INT8 computing. Each CIM bank executes 32 MAC operations in parallel, enabling scalable throughput.

Read-Compute/Write (RCW) Operation: The architecture introduces a two-phase RCW mechanism at the CIM macro level. Phase one concurrently reads and latches weights into the adder tree; phase two executes parallel MAC operations while simultaneously updating the weights. By keeping the wordline active throughout both phases, the overhead from frequent weight update cycles is effectively overlapped and amortized with computation, reducing overall latency and associated precharge energy costs.
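The benefit of overlapping weight writes with computation can be illustrated with a toy cycle-count model. This is a minimal sketch, not the paper's timing model: the function names, the cycle counts, and the assumption that a write overlaps perfectly with the MACs of the previously latched tile are all hypothetical.

```python
def serial_cycles(num_tiles: int, write_cycles: int, compute_cycles: int) -> int:
    """Baseline: each weight tile is fully written before its MACs start."""
    return num_tiles * (write_cycles + compute_cycles)


def rcw_cycles(num_tiles: int, write_cycles: int, compute_cycles: int) -> int:
    """Overlapped: while tile i is computed from latched weights, tile i+1 is written.

    Only the first write is exposed. Each later step costs the longer of the
    two overlapped activities, and the last tile still needs its compute phase.
    """
    if num_tiles == 0:
        return 0
    overlapped = (num_tiles - 1) * max(write_cycles, compute_cycles)
    return write_cycles + overlapped + compute_cycles


def latency_reduction(num_tiles: int, write_cycles: int, compute_cycles: int) -> float:
    """Fractional latency saved by overlapping, relative to the serial baseline."""
    base = serial_cycles(num_tiles, write_cycles, compute_cycles)
    return 1.0 - rcw_cycles(num_tiles, write_cycles, compute_cycles) / base


if __name__ == "__main__":
    # Hypothetical numbers: 4 tiles, 3-cycle write, 5-cycle compute.
    print(serial_cycles(4, 3, 5), rcw_cycles(4, 3, 5), latency_reduction(4, 3, 5))
```

In this model the saving grows with the number of tiles and is largest when write and compute times are similar; when one of them dominates, the overlap hides only the shorter activity.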

Dataflow: WS-OCS Optimization

Traditional dataflows like input-stationary output-stationary (IS-OS) and weight-stationary output-stationary (WS-OS) are ill-suited for CIM architectures due to their numerous, often redundant, weight update operations, which are exacerbated at LLM scales. The proposed WS-OCS dataflow maps weight blocks column-wise to minimize weight buffer turnover. Intermediate partial sums are efficiently accumulated per-column and stored locally, thus reducing writebacks and avoiding unnecessary DRAM accesses.

This reconfiguration results in a 51.6% reduction in external DRAM access and an 87.6% reduction in internal CIM weight updates during the prefill phase for 1024 tokens (Llama2-7B). By decreasing the frequency and volume of weight mapping, WS-OCS also allows a smaller weight buffer, permitting higher CIM macro utilization for large LLM deployments.
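Why loop order changes the number of weight updates can be seen in a small tiled matrix multiply. The sketch below is an illustration under stated assumptions, not the paper's scheduler: the function names, the tile size, and the two simplified loop orders are hypothetical, and the counts are not meant to reproduce the reported 87.6% figure.

```python
import numpy as np


def output_stationary(x, w, tile):
    """Finish one output tile at a time; weight tiles are rewritten per token tile."""
    m, k = x.shape
    n = w.shape[1]
    out = np.zeros((m, n), dtype=np.int64)
    weight_loads = 0
    for i in range(0, m, tile):              # token tiles
        for j in range(0, n, tile):          # output column blocks
            for p in range(0, k, tile):      # reduction tiles
                weight_loads += 1            # w tile written into the macro again
                out[i:i + tile, j:j + tile] += x[i:i + tile, p:p + tile] @ w[p:p + tile, j:j + tile]
    return out, weight_loads


def weight_column_stationary(x, w, tile):
    """Hold each weight tile while every token tile streams through it."""
    m, k = x.shape
    n = w.shape[1]
    out = np.zeros((m, n), dtype=np.int64)   # per-column partial sums kept locally
    weight_loads = 0
    for j in range(0, n, tile):              # one output column block at a time
        for p in range(0, k, tile):
            weight_loads += 1                # written once, reused for all tokens
            w_tile = w[p:p + tile, j:j + tile]
            for i in range(0, m, tile):
                out[i:i + tile, j:j + tile] += x[i:i + tile, p:p + tile] @ w_tile
    return out, weight_loads


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.integers(-8, 8, size=(8, 4), dtype=np.int64)   # 8 tokens, hidden size 4
    w = rng.integers(-8, 8, size=(4, 6), dtype=np.int64)
    _, loads_os = output_stationary(x, w, tile=2)
    _, loads_ws = weight_column_stationary(x, w, tile=2)
    print(loads_os, loads_ws)
```

Both orders compute the same product. In this toy setup the weight-stationary order writes each weight tile once regardless of the number of tokens, at the cost of holding a column's partial sums for all tokens until the reduction finishes.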

Nonlinear Operator Fusion

The RCW-CIM incorporates a nonlinear operator fusion controller specialized for high-precision (FP16) group Softmax and RMSNorm, both of which exhibit strong cross-row dependencies in LLM Transformers. Unlike previous implementations that execute full accumulations serially and only at INT8 or lower precision, this fusion engine leverages a 64-segment LUT-based approximation for exponentials and provides both full and partial accumulations in parallel. This approach minimizes global reduction dependencies, prevents numerical instability, and further increases macro efficiency for intricate activation and normalization tasks.

For RMSNorm, synchronization with the global RMS value is fused with scaling, further reducing dependency chains and boosting throughput.
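The idea of combining partial (per-group) and full accumulations with a segmented exponential table can be sketched in software. This is a hedged illustration only: the piecewise-linear table, the clamp range `X_MIN`, the group size, and all function names are assumptions rather than details from the paper, and the sketch uses float64 instead of FP16.

```python
import numpy as np

SEGMENTS = 64      # segment count mentioned in the text
X_MIN = -16.0      # assumed clamp range for exp inputs (hypothetical)

# Piecewise-linear table: exp(x) ~ slope[s] * x + intercept[s] on segment s.
_edges = np.linspace(X_MIN, 0.0, SEGMENTS + 1)
_slope = (np.exp(_edges[1:]) - np.exp(_edges[:-1])) / (_edges[1:] - _edges[:-1])
_intercept = np.exp(_edges[:-1]) - _slope * _edges[:-1]


def lut_exp(x):
    """Approximate exp(x) for x <= 0 using the segmented table."""
    x = np.clip(x, X_MIN, 0.0)
    seg = ((x - X_MIN) / (0.0 - X_MIN) * SEGMENTS).astype(int)
    seg = np.minimum(seg, SEGMENTS - 1)       # x == 0 belongs to the last segment
    return _slope[seg] * x + _intercept[seg]


def group_softmax(scores, group_size):
    """Softmax over a 1-D vector assembled from per-group partial results."""
    groups = scores.reshape(-1, group_size)
    local_max = groups.max(axis=1, keepdims=True)       # per-group maximum
    local_exp = lut_exp(groups - local_max)             # per-group exponentials
    local_sum = local_exp.sum(axis=1, keepdims=True)    # partial accumulation
    scale = lut_exp(local_max - local_max.max())        # rescale groups to the global max
    total = (local_sum * scale).sum()                   # full accumulation
    return (local_exp * scale / total).reshape(-1)


def group_rmsnorm(x, gain, group_size, eps=1e-6):
    """RMSNorm from per-group partial sums of squares."""
    partial = (x.reshape(-1, group_size) ** 2).sum(axis=1)
    inv_rms = 1.0 / np.sqrt(partial.sum() / x.size + eps)
    return x * (inv_rms * gain)   # normalization and gain folded into one multiply


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    s = rng.normal(size=16) * 3.0
    print(group_softmax(s, group_size=4).sum())
```

Because each group needs only its own maximum and partial sum before the final rescale, the groups can be processed independently and merged once, which is the dependency-shortening effect the text describes.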

Empirical Performance and Comparative Analysis

Fabricated in TSMC 22 nm CMOS and validated on Llama2-7B (INT4 weights, INT8 activations, FP16 nonlinear operations), RCW-CIM achieves the following:

Throughput: 3.28 TOPS and 42.3 TOPS/W energy efficiency at 100 MHz
Prefill Latency: 4.2 ms for 1024 tokens
Decoding Rate: 26.87 tokens/s with dual DDR5-6400 memory
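A few back-of-the-envelope quantities can be derived from these headline numbers. The calculation below is a rough sketch under explicit assumptions that are not stated in the source: one MAC counts as two operations, throughput and efficiency refer to the same operating point, "dual DDR5-6400" means two 64-bit channels, the model has about 7e9 parameters at 4 bits each, and decoding streams every weight once per token while ignoring KV-cache and activation traffic.

```python
FREQ_HZ = 100e6            # clock frequency from the text
PEAK_TOPS = 3.28           # reported throughput
TOPS_PER_W = 42.3          # reported energy efficiency
REPORTED_TOKENS_PER_S = 26.87

OPS_PER_MAC = 2            # assumption: multiply + add counted separately
CHANNELS = 2               # assumption: "dual" = two 64-bit channels
TRANSFERS_PER_S = 6400e6   # DDR5-6400 transfer rate per channel
BYTES_PER_TRANSFER = 8     # assumption: 64-bit data bus per channel
PARAMS = 7e9               # assumption: nominal Llama2-7B parameter count
BYTES_PER_PARAM = 0.5      # INT4 weights

# Implied parallelism and power at the reported operating point.
macs_per_cycle = PEAK_TOPS * 1e12 / (OPS_PER_MAC * FREQ_HZ)
power_mw = PEAK_TOPS / TOPS_PER_W * 1e3

# Bandwidth-bound decode rate if all weights are read once per token.
bandwidth_bytes_per_s = CHANNELS * TRANSFERS_PER_S * BYTES_PER_TRANSFER
weight_bytes = PARAMS * BYTES_PER_PARAM
max_tokens_per_s = bandwidth_bytes_per_s / weight_bytes
fraction_of_bound = REPORTED_TOKENS_PER_S / max_tokens_per_s

if __name__ == "__main__":
    print(f"MACs per cycle: {macs_per_cycle:.0f}")
    print(f"Implied power: {power_mw:.1f} mW")
    print(f"Bandwidth bound: {max_tokens_per_s:.2f} tokens/s")
    print(f"Reported rate / bound: {fraction_of_bound:.2f}")
```

Under these assumptions the bandwidth bound comes out near 29 tokens/s, so the reported decoding rate would sit at roughly 92% of it, which is consistent with decoding being limited mainly by external memory bandwidth rather than by compute.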

The paper also reports the following latency reductions:

Decoding computation latency is reduced by 21.59% via RCW operation.
Nonlinear operator fusion achieves a 69.17% latency reduction through partial accumulation and group-based approximation.
Overall prefill latency is reduced by 49.76% compared to WS-OS dataflow baselines.

Compared to prior digital CIM work ([5]), which lacks nonlinear operator fusion and focuses on INT8 support with higher precision loss, RCW-CIM sustains high accuracy via FP16 and reduces both internal and external memory bottlenecks. Architectures that do not optimize weight updates or fuse nonlinear operators are unlikely to match the latency and efficiency gains reported here.

Implications and Future Directions

The RCW-CIM platform underscores the need for holistic hardware/software co-design in LLM accelerators, treating weight update latency and nonlinear operator handling as first-class considerations. In practical deployments, the latency and bandwidth reductions allow larger LLMs to be served with modest increases in hardware resources, narrowing the gap for on-premise or edge inference.

Theoretically, the vectorized WS-OCS dataflow and parallel nonlinear operator computation establish a reference for future LLM architecture research. Emerging LLMs with greater sparsity, dynamic structures, or hybrid quantization will benefit from adaptable CIM schedules and further operator fusion advances. Integration with algorithm-aware scheduling and in-memory compute logic for attention or MoE blocks could provide further system-level performance scaling.

Conclusion

RCW-CIM provides a hardware solution to two major inefficiencies in CIM-based LLM acceleration: weight update latency and nonlinear function bottlenecks. With its WS-OCS dataflow, RCW operation, and nonlinear operator fusion, the architecture attains substantial reductions in latency and memory access while preserving high accuracy, as shown by the evaluation on Llama2-7B. The work advances DCIM for LLMs and sets the stage for further exploration of heterogeneous and operator-specialized accelerators for large-scale language models.
