Papers
Topics
Authors
Recent
Search
2000 character limit reached

FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture

Published 28 Apr 2026 in cs.AR | (2604.25317v1)

Abstract: In this paper, we propose FusionCIM, an operator-fusion-driven compute-in-memory (CIM) accelerator architecture for efficient and scalable LLM inference, with three key innovations: (1) a hybrid CIM pipeline architecture that maps QKT computation on inner-product-based CIM (IP-CIM) and PV aggregation on outer-product-based CIM (OP-CIM) for efficient matrix multiplications fusion; (2) a QO-stationary dataflow that eliminates repeated KV loading in CIM and K-matrix access in buffer under transpose fusion, significantly improving data reuse on chip; and (3) a pattern-aware online-softmax mechanism that exploits distribution regularities of attention scores to reduce exponential rescaling overhead for non-linear fusion. Experimental results on LLaMA-3 model show that FusionCIM achieves up to 3.86x energy saving, and 1.98x speedup compared with prior SOTA CIM-based designs with 29.4 TOPS/W energy efficiency at the system level.

Summary

  • The paper introduces FusionCIM, a hybrid operator-fusion architecture that minimizes data movement by integrating compute-in-memory with a QO-stationary dataflow for LLM inference.
  • It achieves up to 3.85× energy reduction and 1.98× speedup by cutting off-chip memory access by nearly 59% and on-chip traffic by up to 71%.
  • The approach incorporates pattern-aware softmax scheduling to drastically reduce nonlinear rescaling, ensuring robust performance under long-context workloads.

FusionCIM: Fusion-Driven Computing-in-Memory Architecture for Efficient LLM Inference

Introduction and Motivation

The persistent challenge in LLM inference is the high overhead due to computation intensity and memory bandwidth constraints, especially under long-context workloads. Conventional solutions leveraging GPUs/TPUs, despite rapid growth in compute density, are fundamentally limited by frequent off-chip DRAM access—predominantly for activations—leading to excessive energy consumption and inferior throughput. Compute-in-memory (CIM) architectures offer a viable route by colocating arithmetic operations within memory arrays, reducing weight-activation movement. However, extant CIM-based accelerators manifest bottlenecks in activation handling, inefficient on-chip memory access due to tiling and transposition, and excessive overhead from elementwise nonlinear softmax fusion. Figure 1

Figure 1: Model framework and inference in LLM, showcasing the fundamental flow of attention and feed-forward computation.

Figure 2

Figure 2: Three challenges in current CIM architectures for LLMs and the corresponding solutions embedded in FusionCIM.

This work presents FusionCIM, which overcomes the aforementioned limitations via a hybrid operator-fusion architecture, a QO-stationary dataflow strategy, and a pattern-aware softmax scheduling approach. These innovations jointly minimize redundant data movement, off-chip/on-chip access, and nonlinear fusion workload, providing substantial improvements in system energy and area efficiency for large-scale LLM applications.

FusionCIM Architecture

FusionCIM employs a hierarchical design comprised of a global buffer, a centralized scheduler, a DRAM memory controller, and multiple hybrid compute engines (HEs). Each HE integrates inner-product CIM (IP-CIM) and outer-product CIM (OP-CIM) macros, along with a lightweight, parallel softmax pipeline. This hybrid enables full operator fusion—GEMM, transposition, and softmax aggregation—directly within the memory array, thereby minimizing extraneous memory access patterns. Figure 3

Figure 3: The overall architecture of FusionCIM, partitioned into multiple hybrid engines delivering efficient memory-compute fusion.

The IP-CIM macro is designated for QKTQK^T (attention score) computation with Q-stationary mapping, while the OP-CIM macro handles PVPV (context aggregation) through O-stationary mapping. Integration of a streamlined softmax engine allows real-time normalization within the attention pipeline, obviating the need for intermediate buffering or additional circuit resources for critical path arithmetic.

QO-Stationary Dataflow and Pipeline

Distinct from prior approaches that statically map KV pairs and repeatedly reload tiles with accompanying transposition, FusionCIM leverages a QO-stationary dataflow. Query and output activations are maintained in situ within the compute pipelines, enabling efficient, in-place transposition and maximizing data reuse. This mechanism drastically curtails repeated KV loading and associated high-energy write operations endemic to existing CIM paradigms.

Attention execution in each hybrid engine is orchestrated as a two-stage pipeline:

  1. At the vector level, KK and VV vectors are streamed in a staggered manner for successive QKTQK^T, softmax, and PVPV computation.
  2. Internally, a bit-serial pipeline processes each 8-bit vector element with cycle-aligned Taylor-expansion softmax, ensuring minimal throughput loss due to nonlinear units. Figure 4

    Figure 4: Two-level pipelined attention execution in FusionCIM, highlighting single Q loading and continuous data reuse.

Pattern-Aware Online Softmax Fusion

Standard online softmax introduces severe rescaling overheads, particularly with tile-wise scheduling that cannot anticipate score maxima until all tiles are processed. FusionCIM mitigates this by leveraging the empirical observation that attention score maxima tend to be diagonally distributed. Thus, the system implements a diagonal-first scheduling in both inter- and intra-tile scopes, identifying row maxima earlier and suppressing redundant output rescaling. Figure 5

Figure 5: Inter- and intra-tile pattern-aware scheduling for online-softmax; the diagonal-first traversal reduces softmax rescaling overhead.

Empirical analysis demonstrates that this pattern-aware softmax approach can eliminate 58–61% of output rescalings for LLaMA3 with rotary encoding, and over 98% when combined with ALiBi positional encoding—delivering marked improvement in nonlinear latency and energy efficiency without loss in standard softmax fidelity.

Experimental Evaluation

Latency and Throughput

The comparative performance across baseline digital-CIM and prior pipeline CIM architectures establishes the efficacy of FusionCIM's operator-fusion and QO-stationary dataflow. Under representative LLM workloads (LLaMA3-8B, context up to 8K), FusionCIM achieves a speedup of up to 1.98× over monolithic-CIM and 21–40% over static-KV pipeline designs as the context length increases. Figure 6

Figure 6: Normalized latency comparison—FusionCIM exhibits consistently lower latency across long-sequence inference tasks.

Energy and Memory Access

FusionCIM delivers 3.85× reduction in total energy consumption relative to conventional architectures at 4K sequence length, with the energy advantage amplifying for larger contexts due to dominant activation access cost. Figure 7

Figure 7: Normalized energy consumption across baselines, with FusionCIM's hybrid pipeline achieving maximal energy savings.

Off-chip memory access is reduced by up to 58.9% and on-chip traffic by up to 71% via QO-stationary scheduling and transposition-elimination mechanisms. Figure 8

Figure 8: Normalized off-chip memory access—FusionCIM maintains significant reduction independent of sequence length growth.

Figure 9

Figure 9: Normalized on-chip memory access and ablation studies demonstrating the contribution of each optimization within FusionCIM.

Nonlinear Rescaling Reduction

Implementing pattern-aware scheduling directly translates into substantial reductions in the number of required output rescaling operations for online softmax, robustly across various models and prompt structures. Figure 10

Figure 10: Normalized output rescaling number comparison showing substantial reduction under different network and prompt configurations.

System Efficiency

FusionCIM demonstrates 29.4 TOPS/W energy efficiency and 2.03 TOPS/mm2^2 area efficiency at system scale, surpassing prior solutions (P3ViT, TranCIM) by factors ranging from 1.26×–2.35× (energy) and 2.5×–9.2× (area). The architecture scales efficiently and exhibits manufacturability and deployability using mature 28 nm technology nodes.

Implications and Future Directions

FusionCIM advances the boundary of CIM-based LLM acceleration by resolving core activation and nonlinearity bottlenecks, enabling efficient, scalable inference for long-context transformer architectures. The integrated hybrid pipeline combined with a QO-stationary dataflow paves the way for further innovations in fusion-driven accelerator design—including dynamic tiling/adaptivity based on prompt and model structure, and seamless support for advanced attention mechanisms (e.g., grouped-query, multi-token decoding). Integration with further co-designed HW/SW scheduling and advanced positional encoding techniques offers a promising route toward ubiquitous, low-power edge deployment of LLMs.

Conclusion

FusionCIM systematically addresses data movement and on-chip resource constraints endemic to LLM inference through hybrid operator fusion, QO-stationary dataflow, and pattern-aware nonlinear scheduling. Empirical results demonstrate strong improvements—up to 3.85×\times in energy savings and 1.98×\times latency reduction—while maintaining high area and computational efficiency. The proposed architecture substantiates the viability of tightly coupled compute-memory fusion for future AI accelerators capable of sustaining next-generation LLM workloads.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.