CIMple: Standard-cell SRAM-based CIM with LUT-based split softmax for attention acceleration

Published 17 Apr 2026 in cs.AR | (2604.15944v2)

Abstract: LLMs such as LLaMA and DeepSeek are built on transformer architectures, which have become a standard model for achieving state-of-the-art performance in natural language processing tasks. Recently, there has been growing interest in deploying LLMs on edge devices. Although smaller LLMs are being proposed, they often still contain billions of parameters. Because edge devices have limited resources, this poses a significant challenge for edge deployment. Compute-in-memory (CIM) is a promising architecture that addresses this by reducing data movement through the integration of computational logic directly into memory. However, existing CIM architectures support only static Multiply-Accumulate (MAC) operations, which limits their configurability in supporting nonlinear operations and various types of transformer models. This paper presents CIMple, a fully digital standard-cell SRAM-based CIM accelerator for self-attention inside transformer models, designed to overcome these limitations. The key contributions of CIMple are: 1) A novel dual-banked CIM-based fully digital self-attention accelerator using 8-bit parallel weight feeding. 2) A look-up-table (LUT) based fixed-point softmax implementation that reduces latency with minimal accuracy degradation. 3) A performance evaluation of a 32kb CIM-based self-attention accelerator implemented in 28nm, which achieves 26.1 TOPS/W at 0.85V and 2.31 TOPS/mm$^2$ at 1.2V, both with INT8 precision.


Authors (6)

Summary

The paper introduces a digital SRAM-based CIM macro with a novel LUT-based split softmax to accelerate self-attention while supporting INT8 quantization.
The architecture leverages standard-cell synthesis for scalable design and flexible mapping across encoder, decoder, and encoder-decoder transformer models.
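Evaluation demonstrates a 33% reduction in activation-to-activation latency from the split softmax and an energy efficiency of up to 57.9 TOPS/W when the global buffer is excluded (26.1 TOPS/W with it included), targeting resource-constrained edge devices.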

CIMple: Standard-cell SRAM-based CIM with LUT-based Split Softmax for Attention Acceleration

Overview and Motivation

The CIMple architecture addresses critical bottlenecks in deploying transformer-based LLMs on resource-constrained edge devices. Because transformer models carry large parameter sets and self-attention scales quadratically with sequence length, data movement and nonlinear computations, particularly softmax, dominate energy and latency costs. Traditional Compute-in-Memory (CIM) schemes, especially analog variants, are limited in both configurability and support for nonlinear operations. They therefore rely on off-chip computation or separate processing cores for softmax, which undermines CIM’s inherent efficiency. CIMple introduces a fully digital, standard-cell SRAM-based CIM macro with a native LUT-based split softmax, providing efficient self-attention acceleration with support for INT8 quantization. Notably, CIMple treats SRAM cells as standard cells, which allows synthesis in current digital flows and easy porting to newer technology nodes.

Figure 1: High-level view of the CIMple accelerator, highlighting the CIM core, LUT for softmax, quantization unit, intermediate and global buffers, and computation flow for diverse transformer mapping.

Architectural Innovations

CIMple’s core is a dual-banked 32kb CIM macro that supports 8-bit parallel weight feeding and efficient MAC operations. The design adds an intermediate buffer, a quantization unit that narrows values from 32b to 8b, and tightly coupled LUTs implementing a split fixed-point softmax function. The CIM uses 8-T SRAM bitcells with OR-AND-Invert (OAI) gates serving as both multipliers and selectors, which lowers signaling overhead and allows a dense macro with scalable placement.
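The 32b-to-8b narrowing step can be illustrated with a small sketch. The summary does not state which scaling scheme the quantization unit uses, so the power-of-two shift, the round-half-up behaviour, and the helper names `dot_int8` and `requantize_int32_to_int8` are assumptions made only for illustration.

```python
def dot_int8(a, b):
    """Multiply-accumulate two INT8 vectors into a wide accumulator."""
    return sum(x * y for x, y in zip(a, b))

def requantize_int32_to_int8(acc, shift):
    """Narrow a wide accumulator to INT8 (hypothetical scheme).

    Assumes a power-of-two scale: divide by 2**shift with rounding,
    then saturate to the signed 8-bit range [-128, 127].
    """
    if shift > 0:
        acc = (acc + (1 << (shift - 1))) >> shift  # round half up, arithmetic shift
    return max(-128, min(127, acc))

# One row of Q against one column of K^T, both INT8.
q_row = [12, -7, 100, 3]
k_col = [5, 9, -20, 127]
acc = dot_int8(q_row, k_col)                      # wide accumulator value
score = requantize_int32_to_int8(acc, shift=4)    # 8-bit value passed downstream
saturated = requantize_int32_to_int8(100000, shift=4)
```

Saturating instead of wrapping keeps a large accumulator from turning into a small value of the opposite sign, which matters because these scores feed the softmax directly.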

Figure 2: CIMple’s architecture with CIM core, buffer, quantization, and softmax LUT modules. The macro contains 32 partitions enabling scalability and parallelism.

The split softmax implementation approximates numerator and denominator calculations using one-dimensional full-precision LUTs, enabling pipelined activation-to-activation computation. This approach reduces latency and eliminates the floating-point datatype conversions inherent to traditional softmax computation. Local storage of Q and V in SRAM, streaming $K^{\mathrm{T}}$ as input, and quantized matrix operations minimize memory accesses and intermediate data movement.
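A minimal sketch of the idea, assuming INT8 attention scores and a single 256-entry exponent table, is shown here. The score scale `SCALE`, the fixed offset by the largest INT8 code (used to avoid a separate running-max pass), and the function names are assumptions; the denominator is inverted with an ordinary division because this summary does not describe how CIMple realizes that path in its LUTs.

```python
import math

SCALE = 1.0 / 16.0  # assumed: real value represented by one INT8 score step

# One-dimensional LUT over all 256 INT8 codes. Entries are offset by the
# largest code (127), so every entry is <= 1 and no max search is needed.
EXP_LUT = [math.exp((code - 127) * SCALE) for code in range(-128, 128)]

def split_softmax_lut(scores_int8):
    """Softmax over INT8 scores with one table lookup per element."""
    numerators = []
    denominator = 0.0
    for s in scores_int8:            # single streaming pass over the inputs
        e = EXP_LUT[s + 128]         # numerator: table lookup, no exp() call
        numerators.append(e)
        denominator += e             # denominator: accumulated alongside
    inv = 1.0 / denominator          # one reciprocal per row
    return [n * inv for n in numerators]

def reference_softmax(scores_int8):
    """Conventional max-subtracted softmax on the dequantized scores."""
    xs = [s * SCALE for s in scores_int8]
    m = max(xs)                               # pass 1: find the maximum
    exps = [math.exp(x - m) for x in xs]      # pass 2: exponentiate
    total = sum(exps)
    return [e / total for e in exps]          # pass 3: normalize

row = [40, -3, 127, 0, -128]
probs = split_softmax_lut(row)
```

Because softmax is invariant to a constant shift of its inputs, the fixed offset gives the same probabilities as the max-subtracted reference while letting the numerator lookups start as soon as the first score arrives.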

Figure 3: Diagram of the LUT-based split softmax, illustrating the separation of numerator and denominator computation, quantization, and integration with the CIM macro.

SRAM cells are synthesized as standard cells, facilitating a fully digital flow (RTL-to-GDSII) and enabling high area efficiency through distributed placement/routing. This removes manual macro layout and allows rapid porting to smaller nodes.

Figure 4: Physical layout of the CIMple accelerator, demonstrating distributed SRAM cell integration as standard cells in 28nm FD-SOI.

Transformer Mapping and Configurability

CIMple natively supports encoder-only, decoder-only, and encoder-decoder transformer mappings. Encoder-only computation exploits parallel token processing, buffering Q and V in local SRAM during attention scoring. Decoder-only computation handles autoregressive token inference, using intermediate cache for K and V, and sequential input tokens in the attention loop. Encoder-decoder models are accommodated via tiled buffer structures and repeated attention layers, with selective read/write paths in CIM and input buffers.
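The decoder-only flow, where tokens arrive one at a time and earlier keys and values are cached, can be sketched in a few lines. This is a generic floating-point illustration of autoregressive attention with a K/V cache, not CIMple's dataflow: the identity projections and the function names `attend` and `decode` are assumptions.

```python
import math

def attend(q, k_cache, v_cache):
    """Single-query attention over every cached key/value pair."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
              for k in k_cache]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    width = len(v_cache[0])
    return [sum((w / total) * v[j] for w, v in zip(weights, v_cache))
            for j in range(width)]

def decode(token_vectors):
    """Process tokens sequentially, growing the K/V cache by one entry each step."""
    k_cache, v_cache, outputs = [], [], []
    for x in token_vectors:
        q, k, v = x, x, x            # assumed identity projections for brevity
        k_cache.append(k)            # cache grows instead of recomputing history
        v_cache.append(v)
        outputs.append(attend(q, k_cache, v_cache))
    return outputs, len(k_cache)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, cache_len = decode(tokens)
```

Each step touches only the new token's query plus the cached entries, which is why the cache size, rather than recomputation, sets the memory traffic in this mode.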

Figure 5: High-level schematic of transformer types—encoder-only, decoder-only, encoder-decoder—indicating parallel sequences and token mapping.

Numerical Results and Evaluation

CIMple achieves a peak energy efficiency of 26.1 TOPS/W at 0.85V and 417MHz, with area efficiency of 2.31 TOPS/mm$^2$ at 1.2V and 770MHz (post-synthesis/layout). Excluding the global buffer, energy efficiency reaches 57.9 TOPS/W and area efficiency 2.71 TOPS/mm$^2$. Power analysis indicates the CIM core consumes 94.7% of energy, predominantly in the adder tree, whereas the softmax LUT incurs minimal overhead (0.34%).
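Efficiency figures in TOPS/W convert directly to energy per operation, which can be easier to compare against memory-access costs. The short sketch below performs that unit conversion for the two reported values; the helper name is illustrative and nothing beyond the two quoted figures is taken from the paper.

```python
def femtojoules_per_op(tops_per_watt):
    """Convert TOPS/W to energy per operation in femtojoules.

    1 TOPS/W equals 1e12 operations per joule, so the energy per
    operation is the reciprocal, scaled from joules to femtojoules.
    """
    joules_per_op = 1.0 / (tops_per_watt * 1e12)
    return joules_per_op * 1e15

with_buffer = femtojoules_per_op(26.1)     # about 38.3 fJ per INT8 operation
without_buffer = femtojoules_per_op(57.9)  # about 17.3 fJ per INT8 operation
```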

Figure 6: Energy efficiency (TOPS/W) versus activation sparsity and voltage scaling, demonstrating CIMple's benefits from sparse matrix operations with optimal voltage at 0.85V.

Figure 7: Power consumption and area breakdown of the CIMple accelerator, highlighting dominant contribution from the CIM core and SRAM bitcells.

Compared to contemporaneous CIM transformer accelerators, CIMple achieves higher area efficiency and offers configurability across transformer variants. It does not reach the peak energy efficiency of sparsity-exploiting designs (e.g., MultCIM, which relies on aggressive pruning techniques that impact accuracy), but it maintains accuracy and flexibility.

Latency and Accuracy Impacts of Split Softmax

The LUT-based split softmax yields a 33% reduction in activation-to-activation latency, enabled by earlier pipelining and by avoiding both the datatype conversions and the three-pass input traversal characteristic of conventional softmax. Accuracy evaluation on a quantized TinyLlama transformer model (INT8) using the lm-evaluation-harness framework shows minimal performance loss (≤0.6% across several tasks); some tasks even show marginal improvement, attributable to improved numerical stability in the LUT-based softmax.

Figure 8: Accuracy comparison between baseline PyTorch LogSoftmax and CIMple’s LUT-based split softmax on language tasks, quantized to INT8.

CIMple’s full-precision LUTs ensure that observed accuracy drops originate strictly from the softmax approximation, not quantization error or architecture-induced mismatch.

Practical and Theoretical Implications

CIMple advances CIM-based transformer acceleration through:

Enabling fully digital, configurable self-attention acceleration for INT8 transformers;
Integrating nonlinear softmax within CIM, eliminating off-core computation and data movement;
Achieving significant energy and area efficiency gains in a scalable, synthesis-friendly architecture;
Providing robust mapping flexibility across transformer variants.

This positions CIMple as a practical solution for transferring transformer capabilities to edge devices with stringent compute, power, and area constraints. The theoretical implication is the feasibility of efficient, pipelined nonlinear computation in digital CIM, challenging the canonical requirement for analog computation or external processing units.

Future Directions

Further development should target end-to-end dataflow optimizations in full transformer pipelines, extending acceleration beyond self-attention to feedforward and token-generation modules. Enhanced accuracy may be achievable via application-specific LUT tuning and retraining. External memory access remains an energy bottleneck for large LLM inference; future work must optimize the mapping between on-chip and external memory to minimize data movement and maximize weight reuse.

Conclusion

CIMple demonstrates a standard-cell SRAM-based CIM architecture with integrated LUT-based split softmax, offering configurable, efficient acceleration for transformer self-attention layers on resource-constrained platforms. Its innovations facilitate area, energy, and latency gains without sacrificing accuracy or configurability, and its modular, digital flow enables rapid porting and scalability. CIMple stands as an instructive blueprint for future CIM-based AI accelerators, and its principles extend to broader classes of nonlinear neural computations.
