EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models

Published 13 Apr 2026 in cs.AR and cs.AI | (2604.11512v1)

Abstract: The growing demand for deploying Small Language Models (SLMs) on edge devices, including laptops, smartphones, and embedded platforms, has exposed fundamental inefficiencies in existing accelerators. While GPUs handle prefill workloads efficiently, the autoregressive decoding phase is dominated by GEMV operations that are inherently memory-bound, resulting in poor utilization and prohibitive energy costs at the edge. In this work, we present EdgeCIM, a hardware-software co-design framework that rethinks accelerator design for end-to-end decoder-only inference. At its core is a CIM macro, implemented in 65nm, coupled with a tile-based mapping strategy that balances pipeline stages, maximizing parallelism while alleviating DRAM bandwidth bottlenecks. Our simulator enables design space exploration of SLMs up to 4B parameters, identifying Pareto-optimal configurations in terms of latency and energy. Compared to an NVIDIA Orin Nano, EdgeCIM achieves up to 7.3x higher throughput and 49.59x better energy efficiency on LLaMA3.2-1B, and delivers 9.95x higher throughput than Qualcomm SA8255P on LLaMA3.2-3B. Extensive benchmarks on TinyLLaMA-1.1B, LLaMA3.2 (1B, 3B), Phi-3.5-mini-3.8B, Qwen2.5 (0.5B, 1.5B, 3B), SmolLM2-1.7B, SmolLM3-3B, and Qwen3 (0.6B, 1.7B, 4B) reveal that our accelerator, under INT4 precision, achieves on average 336.42 tokens/s and 173.02 tokens/J. These results establish EdgeCIM as a compelling solution towards real-time, energy-efficient edge-scale SLM inference.

Summary

The paper introduces a hardware-software co-design that leverages compute-in-memory to optimize the GEMV-dominated decoding phase in small language models.
It employs an active-tile pipelined mapping strategy and a genetic DSE framework to balance memory transfers with computation, significantly improving throughput.
Experimental results show up to 9.95× throughput gains and 49.59× energy efficiency improvements over conventional edge accelerators.

EdgeCIM: A Hardware-Software Co-Design Approach for Compute-in-Memory Acceleration of Small LLMs on the Edge

Introduction and Motivation

Deploying Small Language Models (SLMs) on resource-constrained edge devices requires hardware acceleration tailored to the decoder-only transformer architectures used in interactive inference workloads. The bottleneck is the GEMV-dominated autoregressive decoding phase, whose memory-bandwidth and energy costs outweigh those of the GEMM-dominated prefill stage. Conventional accelerators (GPUs and NPUs) deliver insufficient GEMV throughput and scale poorly in energy because of underutilization and frequent DRAM accesses. "EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models" (2604.11512) introduces an end-to-end CIM-based accelerator architecture and a hardware-software co-optimization framework designed for decoder-only SLM inference on edge platforms.

Decoder-Only Inference Characteristics

Autoregressive SLM inference consists of two distinct phases: a GEMM-heavy prefill and a memory-bound, sequential, GEMV-based decoding phase. Profiling shows that, under typical edge scenarios (batch size = 1), decoding can account for 70–96% of total inference latency. This results from recurring off-chip Key-Value (KV) cache fetches, minimal data reuse, and the inherently sequential nature of token generation.
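
The memory-bound character of decoding can be illustrated with a rough arithmetic-intensity estimate. The sketch below counts operations per byte of weight traffic only (activation and KV cache traffic are ignored), and the 2048×2048 layer size and 128-token prefill are illustrative assumptions, not figures from the paper.

```python
def arithmetic_intensity(rows: int, cols: int, batch: int, bytes_per_weight: float) -> float:
    """Operations per byte of weight traffic for a (rows x cols) weight matrix
    applied to `batch` input vectors. One multiply-accumulate counts as 2 ops."""
    ops = 2 * rows * cols * batch
    weight_bytes = rows * cols * bytes_per_weight
    return ops / weight_bytes

# Assumed layer shape and INT4 weights (0.5 byte each); values are illustrative only.
gemv_int4 = arithmetic_intensity(2048, 2048, batch=1, bytes_per_weight=0.5)    # decoding step
gemm_int4 = arithmetic_intensity(2048, 2048, batch=128, bytes_per_weight=0.5)  # 128-token prefill

print(f"decode (GEMV):  {gemv_int4:.0f} ops/byte")
print(f"prefill (GEMM): {gemm_int4:.0f} ops/byte")
```

Under these assumptions each weight fetched during decoding is used for a single multiply-accumulate, whereas prefill reuses it once per prompt token, which is why decoding throughput tends to track memory bandwidth rather than peak compute.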

Figure 1: Inference process in decoder-only SLMs, highlighting the sequential decoding bottleneck.

CIM Accelerator Architecture and EdgeCIM Framework

The core of the EdgeCIM system is a hierarchical, tiled array of digital SRAM-based CIM (DCIM) macros. The macros operate in bit-serial input mode and support flexible INT4/INT8 quantization, which maximizes energy efficiency while maintaining accuracy by exploiting the robustness of transformers to quantization. The accelerator design incorporates several architectural features:

Tiled hierarchy consisting of clusters, tiles, and PEs, with a configurable number of elements exposed to design space exploration (DSE).
Weight-stationary mapping to optimize for MAC operation reuse and minimize costly data movement.
Dedicated functional units for auxiliary operators (activation, normalization, quantization, Softmax).
Hierarchical 2D mesh interconnect for scalable intra- and inter-tile/cluster data traffic.
Partitioned on-chip buffering and prefetching, explicitly addressing the DRAM bandwidth bottleneck.

Figure 2: High-level architecture of EdgeCIM, illustrating hierarchical clusters, tiles, and specialized units.

Figure 3: The EdgeCIM co-design framework encompassing optimization and simulation.
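
To make the bit-serial input mode concrete, the following is a minimal functional sketch of a bit-serial dot product with stationary weights. It assumes signed two's-complement INT4 inputs streamed least-significant bit first; the function name and the bit ordering are assumptions for illustration, and the sketch models arithmetic only, not the macro's circuit or timing.

```python
def bit_serial_dot(inputs, weights, bits=4):
    """Dot product computed one input bit-plane per step.

    inputs:  signed integers representable in `bits` bits (two's complement)
    weights: stationary integer weights, one per input
    """
    acc = 0
    for b in range(bits):
        # One step: bit b of every input gates its stationary weight into an adder tree.
        partial = sum(w for x, w in zip(inputs, weights) if (x >> b) & 1)
        # In two's complement the most significant bit carries weight -2^(bits-1).
        scale = -(1 << b) if b == bits - 1 else (1 << b)
        acc += scale * partial
    return acc

x = [3, -8, 7, -1]   # INT4 activations
w = [-2, 5, 7, -8]   # INT4 weights held in the array
assert bit_serial_dot(x, w) == sum(xi * wi for xi, wi in zip(x, w))
```

In this model an INT8 input simply takes eight steps instead of four (`bits=8`), which is one way a single array can serve both precisions.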

SLM Dataflow Mapping and Pipeline Optimization

EdgeCIM introduces an active-tile pipelined mapping strategy. Here, only a tunable subset of tiles participates in computation per partition, allowing memory transfers (preloading of weights/KV cache) for inactive tiles to overlap with GEMV stages, thus balancing parallelism against off-chip bandwidth.
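
The trade-off behind the active-tile count can be sketched with a simple two-stage pipeline model, in which loading the next partition overlaps with computing the current one. Everything here is a simplifying assumption for illustration (uniform partitions, a fixed per-partition compute time, a single DRAM channel); it is not the paper's simulator.

```python
import math

def pipelined_latency(n_partitions, t_load, t_compute):
    """Latency when the load of partition i+1 overlaps the compute of partition i."""
    if n_partitions == 0:
        return 0.0
    return t_load + (n_partitions - 1) * max(t_load, t_compute) + t_compute

def layer_latency(weight_tiles, active_tiles, tile_bytes, dram_bytes_per_s, t_tile_compute):
    """Hypothetical layer model: each partition fills `active_tiles` tiles from DRAM,
    and all active tiles compute in parallel for `t_tile_compute` seconds."""
    n_partitions = math.ceil(weight_tiles / active_tiles)
    t_load = active_tiles * tile_bytes / dram_bytes_per_s
    return pipelined_latency(n_partitions, t_load, t_tile_compute)

# Sweep the active-tile count for one assumed layer (all numbers are placeholders).
for active in (1, 2, 4, 8, 16):
    t = layer_latency(weight_tiles=64, active_tiles=active, tile_bytes=8192,
                      dram_bytes_per_s=16e9, t_tile_compute=2e-6)
    print(f"active_tiles={active:2d}  latency={t * 1e6:.1f} us")
```

In this toy model, too few active tiles leave the pipeline compute-bound with many partitions, while too many make each partition's load exceed its compute time, so the pipeline becomes bandwidth-bound.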

Projection Stage: Query, key, and value projections are partitioned and assigned to clusters in parallel; in cases where the number of heads exceeds cluster count, sequential processing is performed.
Attention: KV cache is loaded in blocks (inspired by FlashAttention), minimizing memory stalls and enabling block-wise pipelined computation for softmax and subsequent GEMVs.
Linear and FFN: Weight matrices are partitioned, mapped onto the tile array, and reduced through hierarchical aggregators. The feed-forward stage exploits tile parallelism for the up/gate projections.
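
The block-wise attention step can be sketched with the online-softmax recurrence used by FlashAttention-style kernels: the KV cache is consumed one block at a time while a running maximum, a running denominator, and a running output accumulator are maintained. This is a functional sketch for a single query and a single head; the block size and function names are assumptions, and the paper's exact scheduling may differ.

```python
import math

def blockwise_attention(q, keys, values, block=4):
    """Single-query attention computed over KV blocks with an online softmax."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            p = math.exp(s - new_max)
            denom += p
            acc = [a + p * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

def reference_attention(q, keys, values):
    """Conventional softmax attention over the whole KV cache, for comparison."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)
    probs = [math.exp(s - top) for s in scores]
    total = sum(probs)
    return [sum(p * v[j] for p, v in zip(probs, values)) / total
            for j in range(len(values[0]))]
```

Because each block needs only the running statistics from earlier blocks, the next KV block can be fetched while the current one is being processed.
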
Figure 4: Dataflow architectural mapping during the GEMV-heavy decoding process in EdgeCIM.

Figure 5: Partition-based assignment of SLM weight matrices onto EdgeCIM tiles.
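
A partitioned GEMV with hierarchical aggregation can be sketched as follows. The sketch assumes a column-wise split, where each tile holds a slice of the weight columns and the matching slice of the input vector, followed by a pairwise reduction tree; the split direction and the function name are illustrative assumptions rather than the paper's documented mapping.

```python
def tiled_gemv(weight, x, n_tiles):
    """Compute weight @ x by splitting columns across tiles, then tree-reducing partial sums."""
    rows, cols = len(weight), len(x)
    chunk = -(-cols // n_tiles)  # ceiling division: columns per tile
    partials = []
    for t in range(n_tiles):
        lo, hi = t * chunk, min((t + 1) * chunk, cols)
        # Each tile produces a full-length partial output from its column slice.
        partials.append([sum(weight[r][c] * x[c] for c in range(lo, hi))
                         for r in range(rows)])
    # Hierarchical aggregation: add partial vectors pairwise until one remains.
    while len(partials) > 1:
        merged = []
        for i in range(0, len(partials), 2):
            if i + 1 < len(partials):
                merged.append([a + b for a, b in zip(partials[i], partials[i + 1])])
            else:
                merged.append(partials[i])
        partials = merged
    return partials[0]

W = [[1, 2, 3, 4, 5],
     [6, 7, 8, 9, 10]]
x = [1, 0, -1, 2, 1]
assert tiled_gemv(W, x, n_tiles=3) == [11, 26]
```

With a tree of this shape the number of reduction levels grows logarithmically with the tile count, which is one motivation for aggregating hierarchically rather than through a single accumulator.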

Hardware-Software Co-Optimization and Genetic DSE

A genetic algorithm-based DSE framework is used to automatically select optimal architectural parameters based on a weighted energy-latency cost function:

$\min_{h\in\mathcal{H}} L(h)^\alpha \cdot E(h)^{1-\alpha}$

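where $L(h)$ and $E(h)$ are the latency and energy of hardware configuration $h$ drawn from the design space $\mathcal{H}$, and $\alpha \in [0, 1]$ controls the trade-off: larger $\alpha$ emphasizes latency, smaller $\alpha$ emphasizes energy. The design space is parametrized over cluster, tile, and PE counts, the number of active tiles, and bus widths. The DSE relies on analytical models calibrated with CACTI and HSPICE (65nm) and on extensive cross-layer modeling of data movement.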
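
A genetic search over this cost function can be sketched as below. The parameter ranges and the `toy_latency` and `toy_energy` models are placeholders invented for illustration and are not the paper's calibrated models; only the cost function $L(h)^\alpha \cdot E(h)^{1-\alpha}$ and the list of parameters come from the text. The operators used (truncation selection with elitism, uniform crossover, single-gene mutation) are one common choice and an assumption here.

```python
import random

# Hypothetical discrete design space; the paper's actual ranges are not stated here.
SPACE = {
    "clusters": [1, 2, 4, 8],
    "tiles": [2, 4, 8, 16],
    "pes": [4, 8, 16],
    "active_tiles": [1, 2, 4, 8],
    "bus_bits": [64, 128, 256],
}

def toy_latency(h):
    # Placeholder: compute time shrinks with active parallelism, transfer time with bus width.
    active = min(h["active_tiles"], h["tiles"])
    return 4096 / (h["clusters"] * active * h["pes"]) + 4096 / h["bus_bits"]

def toy_energy(h):
    # Placeholder: static energy grows with array size and runtime, plus a bus-width cost.
    array = h["clusters"] * h["tiles"] * h["pes"]
    return 1.0 + 0.01 * array * toy_latency(h) + 0.002 * h["bus_bits"]

def cost(h, alpha):
    return toy_latency(h) ** alpha * toy_energy(h) ** (1 - alpha)

def evolve(alpha, generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    pop = [{k: rng.choice(v) for k, v in SPACE.items()} for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda h: cost(h, alpha))
        parents = pop[: pop_size // 2]          # elitism: the best half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = {k: rng.choice((a[k], b[k])) for k in SPACE}  # uniform crossover
            if rng.random() < 0.2:                                # single-gene mutation
                k = rng.choice(list(SPACE))
                child[k] = rng.choice(SPACE[k])
            children.append(child)
        pop = parents + children
    return min(pop, key=lambda h: cost(h, alpha))

best = evolve(alpha=0.5)
print(best, round(cost(best, 0.5), 3))
```

Rerunning `evolve` with different values of `alpha` yields configurations that favor latency or energy to different degrees.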

Figure 6: Energy-latency trade-off curve for LLaMA3.2-3B under varying $\alpha$ (cost function emphasis).
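
Pareto-optimal configurations in the latency-energy plane can be extracted from a set of evaluated design points with a short filter such as the one below. The helper name and the sample points are hypothetical; a point is kept only if no other point is at least as good in both metrics.

```python
def pareto_front(points):
    """Return the non-dominated (latency, energy) pairs, sorted by increasing latency."""
    front = []
    best_energy = float("inf")
    for latency, energy in sorted(set(points)):
        # Every earlier point has lower or equal latency, so a point survives
        # only if it strictly improves on the best energy seen so far.
        if energy < best_energy:
            front.append((latency, energy))
            best_energy = energy
    return front

# Made-up (latency, energy) samples for illustration.
samples = [(1, 10), (2, 8), (2, 9), (3, 8), (4, 3), (5, 4)]
assert pareto_front(samples) == [(1, 10), (2, 8), (4, 3)]
```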

Experimental Results and Benchmarking

EdgeCIM was evaluated on a diverse suite of SLMs (e.g., TinyLLaMA-1.1B, LLaMA3.2-1B/3B, Phi-3.5-mini-3.8B, Qwen2.5-0.5B/1.5B/3B, SmolLM2-1.7B, SmolLM3-3B, Qwen3-0.6B/1.7B/4B) at INT4/INT8 precisions.

Key numerical results:

Under INT4 quantization, average throughput of 336.42 tokens/s and 173.02 tokens/J across models.
On LLaMA3.2-1B, throughput improvements of 7.3 $\times$ and energy efficiency improvements of 49.59 $\times$ over NVIDIA Orin Nano.
On LLaMA3.2-3B, 9.95 $\times$ higher throughput than Qualcomm SA8255P.
For smaller SLMs (Qwen2.5-0.5B), exceeds 1000 tokens/s at over 600 tokens/J.
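
The two averaged INT4 figures can be related by simple unit arithmetic, since tokens/s divided by tokens/J has units of J/s, that is, watts. The values derived below are only indicative: both inputs are averages across models, and a ratio of averages is not the same as the average of per-model ratios.

```python
tokens_per_s = 336.42   # reported average INT4 throughput
tokens_per_j = 173.02   # reported average INT4 energy efficiency

implied_power_w = tokens_per_s / tokens_per_j   # (tokens/s) / (tokens/J) = J/s = W
energy_per_token_mj = 1e3 / tokens_per_j        # millijoules per generated token
latency_per_token_ms = 1e3 / tokens_per_s       # milliseconds per generated token

print(f"implied average power: {implied_power_w:.2f} W")
print(f"energy per token:      {energy_per_token_mj:.2f} mJ")
print(f"latency per token:     {latency_per_token_ms:.2f} ms")
```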

Throughput and energy efficiency are also characterized as functions of the number of prefill and generated tokens. As expected, cost grows with output sequence length because tokens are generated sequentially.

Figure 7: Decoding energy-latency product scaling with prefill/generated tokens.

Figure 8: Throughput, energy efficiency, and area results for a range of SLMs under INT4/INT8 precision.

Figure 9: Comparative throughput and energy efficiency for EdgeCIM versus NVIDIA Jetson GPUs (INT4).

Comparison with Prior Work and Architectural Implications

Unlike previous CIM accelerators (X-Former, TranCIM, ReTransformer, iMTransformer) that targeted encoder-style kernels or isolated attention submodules, EdgeCIM is explicitly designed and optimized for end-to-end decoder-only, GEMV-dominated pipelines as deployed at the edge. The co-design and DSE framework yield normalized efficiency as high as 7.03 TOPS/W/mm², substantially higher than prior CIM architectures that support only partial or encoder-based inference paths.

The systematic pipeline and partitioning design, active-tile pipelining, and bandwidth-latency optimization are essential for sustaining high throughput and efficiency in a setting where DRAM bandwidth is the principal limiting factor. Area results (18.4–103.6 mm² for the optimal configurations) are well within the constraints of commodity edge silicon.

Implications and Future Directions

The study demonstrates that targeted co-design, in which the SLM dataflow and the hardware microarchitecture are jointly optimized, yields significant gains under edge deployment, energy, and latency constraints. This work provides strong evidence for a migration away from general-purpose NPU/GPU fabrics toward CIM-centric, fine-grained pipelined architectures for edge NLP inference. The active-tile and partition-based mapping can be generalized to other non-GEMM, memory-bound workloads that emerge in transformer and non-transformer models.

Future avenues involve: extending the framework to multi-modal SLMs, supporting incremental/on-device training, applying similar DSE-driven co-design to encoder-decoder or multi-batch inference, and integrating adaptive runtime scheduling for variable precision and dynamic memory-aware execution.

Conclusion

EdgeCIM establishes a systematic, co-designed methodology and hardware architecture for maximizing both throughput and energy efficiency for decoder-only SLMs on edge platforms. By explicitly focusing on GEMV-dominated phases, employing highly configurable DCIM macros, and thoroughly exploring the latency-energy-area optimization space, it demonstrates substantial gains over state-of-the-art edge GPUs and prior CIM accelerators. These findings pave the way for highly performant and energy-scalable edge SLM deployments across a range of resource-constrained environments.