- The paper presents a digital CIM architecture featuring a two-phase RCW mechanism that minimizes weight update latency and reduces energy costs.
- It introduces a WS-OCS dataflow that cuts external DRAM access by 51.6% and internal weight updates by 87.6%, enabling efficient scaling for LLMs.
- The design incorporates nonlinear operator fusion for FP16 softmax and RMSNorm, achieving up to a 69.17% reduction in decoding latency.
RCW-CIM: Architectural Innovations for CIM-based LLM Acceleration
Overview and Motivation
The rapid growth of LLMs in both parameter count and operational complexity has aggravated memory bandwidth and weight update bottlenecks in traditional accelerator architectures. Digital compute-in-memory (DCIM) is increasingly leveraged to mitigate data movement demands, preserve precision, and support on-chip high-throughput computation. Despite these advances, existing CIM-based solutions often neglect the impact of weight update latency and global nonlinear function dependencies, both of which can be pronounced for LLM workloads, given weight sizes that far exceed a single CIM macro’s storage capacity. The RCW-CIM architecture addresses these limitations through a suite of hardware and dataflow innovations expressly tailored for LLM execution.
Architectural Contributions
RCW-CIM is a tightly integrated system comprising eight CIM clusters, each with four energy-efficient digital SRAM-based CIM cores, 64 KB input buffers, and partial-sum storage. The clusters are coordinated by a weight-stationary output-column-stationary (WS-OCS) scheduling unit. The macro supports dual-mode INT4/INT8 computing for MAC operations and FP16 precision for nonlinear operations. Each CIM bank executes 32 MAC operations in parallel.
Read-Compute/Write (RCW) Operation: The architecture introduces a two-phase RCW mechanism at the CIM macro level. Phase one concurrently reads and latches weights into the adder tree; phase two executes parallel MAC operations while simultaneously updating the weights. By keeping the wordline active throughout both phases, the overhead from frequent weight update cycles is effectively overlapped and amortized with computation, reducing overall latency and associated precharge energy costs.
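The benefit of overlapping weight writes with computation can be illustrated with a toy cycle-count model. This is a minimal sketch, not the paper's timing model: the function names and all cycle counts (`latch_cycles`, `write_cycles`, `compute_cycles`) are hypothetical, and it assumes the first tile's write cannot be hidden while every later write runs alongside the MACs of the previously latched tile.

```python
def serial_cycles(n_tiles, write_cycles, compute_cycles):
    """Baseline: each weight tile is written, then used for MACs, with no overlap."""
    return n_tiles * (write_cycles + compute_cycles)


def rcw_cycles(n_tiles, latch_cycles, write_cycles, compute_cycles):
    """Two-phase sketch: phase one latches the current weights, phase two runs
    MACs on the latched copy while the array is rewritten with the next tile."""
    total = write_cycles  # assumed: the very first write has nothing to hide behind
    for i in range(n_tiles):
        is_last = i == n_tiles - 1
        phase_one = latch_cycles
        # The next tile's write overlaps with compute; the last tile has no next write.
        phase_two = compute_cycles if is_last else max(compute_cycles, write_cycles)
        total += phase_one + phase_two
    return total


# Hypothetical numbers chosen only for illustration.
baseline = serial_cycles(n_tiles=4, write_cycles=8, compute_cycles=8)
overlapped = rcw_cycles(n_tiles=4, latch_cycles=1, write_cycles=8, compute_cycles=8)
print(baseline, overlapped)
```

In this model the saving grows with the number of tiles, because only the first write stays on the critical path.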
Dataflow: WS-OCS Optimization
Traditional dataflows like input-stationary output-stationary (IS-OS) and weight-stationary output-stationary (WS-OS) are ill-suited for CIM architectures due to their numerous, often redundant, weight update operations, which are exacerbated at LLM scales. The proposed WS-OCS dataflow maps weight blocks column-wise to minimize weight buffer turnover. Intermediate partial sums are efficiently accumulated per-column and stored locally, thus reducing writebacks and avoiding unnecessary DRAM accesses.
This reconfiguration yields a 51.6% reduction in external DRAM access and an 87.6% reduction in internal CIM weight updates during the prefill phase for 1024 tokens (Llama2-7B). Because weights are remapped less often and in smaller volume, WS-OCS also gets by with a smaller weight buffer, permitting higher CIM macro utilization for large LLM deployments.
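Why loop order changes the number of weight updates can be seen in a small tiled matrix multiply. The sketch below is a generic illustration of weight-stationary reuse with a column-wise walk over weight tiles, not the paper's exact WS-OCS schedule. It assumes the array holds one weight tile at a time, and the function names and tile sizes are hypothetical.

```python
def matmul_ref(X, W):
    """Plain reference matmul on nested lists."""
    T, K, N = len(X), len(W), len(W[0])
    return [[sum(X[t][k] * W[k][n] for k in range(K)) for n in range(N)]
            for t in range(T)]


def tiled_matmul(X, W, t_blk, k_tile, n_tile, weight_stationary):
    """Return (Y, weight_tile_loads) for two loop orders over the same tiles."""
    T, K, N = len(X), len(W), len(W[0])
    Y = [[0] * N for _ in range(T)]
    loads = 0
    t_blocks = range(0, T, t_blk)
    # Column-wise walk: all K tiles of one output-column block, then the next.
    tiles = [(n0, k0) for n0 in range(0, N, n_tile) for k0 in range(0, K, k_tile)]

    def mac(t0, k0, n0):
        # Accumulate partial sums for one token block against one weight tile.
        for t in range(t0, min(t0 + t_blk, T)):
            for n in range(n0, min(n0 + n_tile, N)):
                Y[t][n] += sum(X[t][k] * W[k][n]
                               for k in range(k0, min(k0 + k_tile, K)))

    if weight_stationary:
        for n0, k0 in tiles:
            loads += 1              # tile written into the array once
            for t0 in t_blocks:     # then reused by every token block
                mac(t0, k0, n0)
    else:
        for t0 in t_blocks:
            for n0, k0 in tiles:
                loads += 1          # tile rewritten for every token block
                mac(t0, k0, n0)
    return Y, loads


X = [[t + k for k in range(4)] for t in range(8)]
W = [[k - n for n in range(4)] for k in range(4)]
Y_ws, loads_ws = tiled_matmul(X, W, 2, 2, 2, weight_stationary=True)
Y_os, loads_os = tiled_matmul(X, W, 2, 2, 2, weight_stationary=False)
print(loads_ws, loads_os)
```

Both orders produce the same output. In this toy setup the token-outer order reloads each tile once per token block, so the gap widens as the prefill length grows.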
Nonlinear Operator Fusion
The RCW-CIM incorporates a nonlinear operator fusion controller specialized for high-precision (FP16) group Softmax and RMSNorm, both of which exhibit strong cross-row dependencies in LLM Transformers. Unlike previous implementations that execute full accumulations serially and only at INT8 or lower precision, this fusion engine leverages a 64-segment LUT-based approximation for exponentials and provides both full and partial accumulations in parallel. This approach minimizes global reduction dependencies, prevents numerical instability, and further increases macro efficiency for intricate activation and normalization tasks.
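One way to picture a segmented-LUT softmax with partial accumulations is sketched below. The source states only that the exponential uses a 64-segment LUT and that full and partial accumulations run in parallel. The clamp range `X_MIN`, the chord-based linear interpolation, the group size, and the rescale-and-merge step are assumptions made for illustration, and the per-group loop stands in for hardware parallelism.

```python
import math

SEGMENTS = 64
X_MIN = -8.0                 # assumed clamp range; inputs are <= 0 after max subtraction
STEP = -X_MIN / SEGMENTS

# Per-segment (slope, intercept) of the chord through exp at the segment ends.
LUT = []
for i in range(SEGMENTS):
    x0 = X_MIN + i * STEP
    x1 = x0 + STEP
    slope = (math.exp(x1) - math.exp(x0)) / STEP
    LUT.append((slope, math.exp(x0) - slope * x0))


def lut_exp(x):
    """Piecewise-linear exp(x) for x <= 0; values below X_MIN flush to zero."""
    if x <= X_MIN:
        return 0.0
    x = min(x, 0.0)
    i = min(int((x - X_MIN) / STEP), SEGMENTS - 1)
    slope, intercept = LUT[i]
    return slope * x + intercept


def group_softmax(row, group_size):
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    # Partial accumulation: each group keeps its own max and exp values.
    partial = []
    for g in groups:
        m = max(g)
        exps = [lut_exp(v - m) for v in g]
        partial.append((m, exps, sum(exps)))
    # Full accumulation: rescale each partial sum to the global max and add.
    m_all = max(m for m, _, _ in partial)
    scales = [lut_exp(m - m_all) for m, _, _ in partial]
    total = sum(s * scale for (_, _, s), scale in zip(partial, scales))
    out = []
    for (_, exps, _), scale in zip(partial, scales):
        out.extend(e * scale / total for e in exps)
    return out


row = [0.3, -1.1, 2.2, 0.0, 1.7, -0.4, 2.9, 1.05]
probs = group_softmax(row, group_size=4)
print(sum(probs))
```

Each group needs only its local maximum to start accumulating, so the global reduction shrinks to one rescale per group instead of a second pass over every element.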
For RMSNorm, synchronization with the global RMS value is fused with scaling, further reducing dependency chains and boosting throughput.
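A minimal sketch of what fusing the global RMS with scaling could look like follows. It assumes the standard RMSNorm definition, y_i = gamma_i * x_i / sqrt(mean(x^2) + eps). The grouping into partial sums of squares and the folding of the inverse RMS into the per-element gain are one interpretation, not a description of the actual controller, and the function name is hypothetical.

```python
import math


def rmsnorm_fused(x, gamma, group_size, eps=1e-6):
    """RMSNorm with per-group partial sums and a single fused multiply per element."""
    # Partial sums of squares, one per group (these could be computed independently).
    partial = [sum(v * v for v in x[i:i + group_size])
               for i in range(0, len(x), group_size)]
    # One global reduction over the partial sums gives the inverse RMS.
    inv_rms = 1.0 / math.sqrt(sum(partial) / len(x) + eps)
    # Normalization and learned scaling collapse into one multiply per element.
    return [v * (g * inv_rms) for v, g in zip(x, gamma)]


y = rmsnorm_fused([3.0, 4.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], group_size=2)
print(y)
```

Here the mean square is 6.25, so the RMS is 2.5 and the first two outputs are close to 1.2 and 1.6.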
Fabricated in TSMC 22 nm CMOS and validated on Llama2-7B (INT4 weights, INT8 activations, FP16 nonlinear operations), RCW-CIM achieves the following:
- Throughput and energy efficiency: 3.28 TOPS and 42.3 TOPS/W at 100 MHz
- Prefill Latency: 4.2 ms for 1024 tokens
- Decoding Rate: 26.87 tokens/s with dual DDR5-6400 memory
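Two back-of-the-envelope checks put these figures in context. They are rough estimates under stated assumptions, not numbers from the paper. The implied power assumes the TOPS and TOPS/W figures refer to the same operating point. The decoding bound assumes two 64-bit DDR5-6400 channels, a nominal 7e9 INT4 weights streamed once per token, and no other memory traffic such as KV-cache reads.

```python
# Reported figures from the list above.
tops = 3.28
tops_per_watt = 42.3
reported_tokens_per_s = 26.87

# Implied power, assuming both figures share one operating point.
power_w = tops / tops_per_watt

# Assumed memory configuration: two channels, 8 bytes per transfer each.
channels = 2
transfers_per_s = 6400e6
bytes_per_transfer = 8
bandwidth_bytes_per_s = channels * transfers_per_s * bytes_per_transfer

# Assumed model footprint: nominal 7e9 parameters at 4 bits each.
weight_bytes = 7e9 * 0.5

# Upper bound on decoding rate if every weight is read once per token.
bound_tokens_per_s = bandwidth_bytes_per_s / weight_bytes

print(f"implied power: {power_w * 1e3:.1f} mW")
print(f"bandwidth bound: {bound_tokens_per_s:.2f} tokens/s")
print(f"reported / bound: {reported_tokens_per_s / bound_tokens_per_s:.2f}")
```

Under these assumptions the implied power is roughly 78 mW, and the reported decoding rate sits a little below the weight-streaming bound of about 29 tokens/s. That is consistent with decoding being limited mainly by external memory bandwidth.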
The evaluation also reports the following latency reductions:
- Decoding computation latency is reduced by 21.59% via RCW operation.
- Nonlinear operator fusion achieves an additional 69.17% reduction in decoding latency.
- Overall prefill latency is reduced by 49.76% compared to WS-OS dataflow baselines.
Compared to prior digital CIM work ([5]), which lacks nonlinear operator fusion and focuses on INT8 support with higher precision loss, RCW-CIM sustains high accuracy via FP16 and reduces both internal and external memory bottlenecks. Architectures that leave weight updates unoptimized or run nonlinear operators unfused forgo the latency and efficiency gains reported here.
Implications and Future Directions
The RCW-CIM platform underscores the need for holistic hardware/software co-design in LLM accelerators, one that treats weight update latency and nonlinear operator support as first-class considerations. In practical deployments, the latency and bandwidth reductions allow larger LLMs to be served with modest increases in hardware resources, narrowing the gap for on-premise or edge inference.
On the research side, the vectorized WS-OCS dataflow and parallel nonlinear operator computation offer a reference point for future LLM accelerator architectures. Emerging LLMs with greater sparsity, dynamic structures, or hybrid quantization could benefit from adaptable CIM schedules and further advances in operator fusion. Integration with algorithm-aware scheduling and in-memory compute logic for attention or MoE blocks could provide further system-level performance scaling.
Conclusion
RCW-CIM targets two major inefficiencies in CIM-based LLM acceleration: weight update latency and nonlinear function bottlenecks. With its WS-OCS dataflow, RCW operation, and nonlinear operator fusion, the architecture attains substantial reductions in latency and memory access while preserving high accuracy, as shown by the evaluation on Llama2-7B. The work extends DCIM to LLM-scale workloads and sets the stage for further exploration of heterogeneous and operator-specialized accelerators in large-scale NLP.