The paper presents a comprehensive design for a chiplet-based hybrid architecture that enables on-device inference of LLMs with up to 70 billion parameters. The work is motivated by two major challenges encountered in edge inference of LLMs: the huge memory footprint and extremely low arithmetic intensity during single-batch decoding, which result in severe memory bandwidth bottlenecks and disproportionate energy costs from data movement.
The key contributions and technical innovations include:
Hybrid Chiplet Architecture
- Integration of NPU and Flash with On-Die Processing:
The architecture tightly couples a neural processing unit (NPU) with a dedicated NAND flash chip that is augmented with on-die processing capabilities. The flash die is connected to the NPU via a high-speed Die-to-Die link enabled by chiplet technology. This design leverages the high storage density of flash while integrating minimal on-die compute logic to perform general matrix-vector (GeMV) multiplications directly within flash memory.
- Task Partitioning for LLM Inference:
LLM inference is partitioned into three categories:
1. GeMV Operations: These operations, which account for the majority of the computation, are executed cooperatively between the flash (using on-die processing) and the NPU to maximize parallelism and minimize redundant data transfers.
2. KV Cache Processing: The key and value cache, which is relatively small and dynamically updated, is stored in DRAM and processed exclusively by the NPU.
3. Data Transfer Operations: An optimized read-request schedule manages the transfer of weights and intermediate results between flash and the NPU.
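To make the three-way split concrete, here is a minimal dispatcher sketch; the operation names, element counts, and the `assign_engine` API are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Engine(Enum):
    FLASH_NPU_COOP = auto()   # GeMV on flash-resident weights, split between flash and NPU
    NPU_ONLY = auto()         # KV-cache attention, small and DRAM-resident
    CHANNEL_DMA = auto()      # plain weight/result transfers over the flash channel

@dataclass
class Op:
    name: str
    kind: str                 # "gemv", "kv_attention", or "transfer"
    approx_elements: int      # rough size, purely illustrative

def assign_engine(op: Op) -> Engine:
    """Mirror the three-way task partitioning described above (hypothetical API)."""
    if op.kind == "gemv":
        return Engine.FLASH_NPU_COOP
    if op.kind == "kv_attention":
        return Engine.NPU_ONLY
    return Engine.CHANNEL_DMA

# Example: one decoder layer's operations for a single decoded token
# (dimensions loosely modeled on a 7B-class model, for illustration only)
layer_ops = [
    Op("qkv_proj", "gemv", 3 * 4096 * 4096),
    Op("attention_over_kv_cache", "kv_attention", 2 * 4096 * 1024),
    Op("mlp_up_proj", "gemv", 4096 * 11008),
    Op("stream_next_layer_weights", "transfer", 4096 * 11008),
]
for op in layer_ops:
    print(f"{op.name:32s} -> {assign_engine(op).name}")
```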
Hardware-Aware Tiling Strategy
- Optimal Tile Partitioning:
- By formulating the transfer cost as a function of the tile dimensions and applying the AM–GM inequality, the authors derive the optimal tile dimensions (a sketch of one possible derivation follows this list).
- This optimal mapping ensures maximal reuse of broadcast input vectors across multiple compute cores and minimizes redundant data transfer, thereby reducing energy overhead and latency.
- Workload Distribution Between NPU and Flash:
The tiling strategy is paired with an optimal distribution mechanism that balances the execution times of read-compute requests (handled on flash) and typical read requests (handled over the flash channel for weight transfers). A proportion factor is calculated to ensure that both domains finish their respective workloads in roughly equal time, thereby maximizing overall throughput.
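A sketch of how such a derivation could look, under the assumptions that only the broadcast input slices and partial-output write-backs cross the die-to-die link, that a T_r × T_c tile must fit in an on-die buffer of B elements, and that R_rc and R_rd denote the effective rates of read-compute and plain read requests; this notation is illustrative, not the paper's.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
\emph{Tile size.} For a GeMV $y = Wx$ with $W \in \mathbb{R}^{M \times N}$ tiled into
$T_r \times T_c$ blocks, each tile consumes a length-$T_c$ slice of $x$ and emits a
length-$T_r$ partial sum, so the per-token link traffic is
\[
  C(T_r, T_c) = \frac{M}{T_r}\cdot\frac{N}{T_c}\,(T_c + T_r)
              = MN\left(\frac{1}{T_r} + \frac{1}{T_c}\right).
\]
Under the buffer constraint $T_r T_c \le B$, the AM--GM inequality yields
\[
  \frac{1}{T_r} + \frac{1}{T_c} \ \ge\ \frac{2}{\sqrt{T_r T_c}} \ \ge\ \frac{2}{\sqrt{B}},
\]
with equality iff $T_r = T_c = \sqrt{B}$: a square tile that fills the buffer minimizes
the transfer cost.

\emph{Workload split.} If a fraction $p$ of the GeMV rows is served by read-compute
requests at effective rate $R_{\mathrm{rc}}$ and the rest by plain reads at rate
$R_{\mathrm{rd}}$, equal finish times require
\[
  \frac{p}{R_{\mathrm{rc}}} = \frac{1 - p}{R_{\mathrm{rd}}}
  \quad\Longrightarrow\quad
  p = \frac{R_{\mathrm{rc}}}{R_{\mathrm{rc}} + R_{\mathrm{rd}}}.
\]
\end{document}
```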
On-Die Error Correction Mechanism
- Outlier-Focused ECC:
- For each flash page, the largest 1% of values are identified, and their addresses, together with two copies of each value, are stored in a dedicated area of the spare space.
- A majority vote over the stored copies during on-die decoding dramatically reduces the effective bit error rate of the protected values when two copies are employed.
- The ECC structure employs additional Hamming-code-based protection for the addresses, ensuring that the location information is reliably recovered even in the presence of errors.
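A toy model of this outlier-focused path might look as follows. The page size, 16-bit values, ranking by magnitude, and a bitwise vote over the original plus the two spare-area copies are assumptions made for illustration; only the ~1% outlier fraction and the two-copy scheme come from the summary above.

```python
import numpy as np

def encode_page(values: np.ndarray, outlier_frac: float = 0.01):
    """Pick the largest-magnitude values and keep two extra copies of each (assumed layout)."""
    k = max(1, int(len(values) * outlier_frac))
    outlier_addrs = np.argsort(np.abs(values))[-k:]                  # addresses of the top ~1%
    spare = {int(a): (values[a], values[a]) for a in outlier_addrs}  # two copies per outlier
    return values.copy(), spare

def majority_vote(a, b, c) -> np.float16:
    """Bitwise majority over three 16-bit encodings of the same value."""
    bits = [np.frombuffer(np.float16(v).tobytes(), dtype=np.uint16)[0] for v in (a, b, c)]
    voted = (bits[0] & bits[1]) | (bits[0] & bits[2]) | (bits[1] & bits[2])
    return np.frombuffer(np.uint16(voted).tobytes(), dtype=np.float16)[0]

def decode_page(page: np.ndarray, spare: dict) -> np.ndarray:
    """On-die decode: correct only the protected outliers, pass everything else through."""
    out = page.copy()
    for addr, (c1, c2) in spare.items():
        out[addr] = majority_vote(page[addr], c1, c2)
    return out

# Example: flip one bit in a protected outlier and recover it via the vote
rng = np.random.default_rng(0)
page = rng.standard_normal(4096).astype(np.float16)
stored, spare = encode_page(page)
addr = next(iter(spare))
corrupted = np.frombuffer(np.float16(stored[addr]).tobytes(), dtype=np.uint16)[0] ^ 0x4000
stored[addr] = np.frombuffer(np.uint16(corrupted).tobytes(), dtype=np.float16)[0]
print(np.isclose(decode_page(stored, spare)[addr], page[addr]))  # True: outlier recovered
```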
Control and Scheduling Enhancements
- Slice Control for Read Requests:
- Instead of transmitting large read requests that could block subsequent read-compute operations, the data is segmented into smaller “slices” that are interleaved with read-compute requests.
- This mechanism significantly improves channel utilization, reducing idle periods and preventing critical read-compute operations from being blocked (a scheduling sketch follows this list).
- Compute Core Design within Flash Die:
- The Compute Core incorporates multiple arithmetic units, local buffers, and the lightweight error correction unit.
- Logical scheduling via dedicated Compute and Slice Controls enables flexible execution of varying GeMV operations across different weight matrix shapes.
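One way such slice interleaving could be expressed in a scheduler is sketched below; the Request structure, the 4 KB slice size, and the simple alternating policy are illustrative assumptions rather than the paper's controller design.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    kind: str      # "read" (weight transfer) or "read_compute" (on-die GeMV)
    size_kb: int

def slice_and_interleave(reads, read_computes, slice_kb: int = 4):
    """Split long reads into small slices and interleave them with read-compute
    requests, so a bulky weight transfer never blocks the on-die compute stream."""
    sliced = deque(
        Request(f"{r.name}[{i}]", "read", min(slice_kb, r.size_kb - i * slice_kb))
        for r in reads
        for i in range((r.size_kb + slice_kb - 1) // slice_kb)
    )
    rc = deque(read_computes)
    schedule = []
    while sliced or rc:
        if rc:
            schedule.append(rc.popleft())      # keep the compute cores fed first
        if sliced:
            schedule.append(sliced.popleft())  # then issue one read slice
    return schedule

# Example: one 16 KB weight read interleaved with four read-compute requests
reads = [Request("W_mlp_down", "read", 16)]
rcs = [Request(f"gemv_tile_{i}", "read_compute", 4) for i in range(4)]
for req in slice_and_interleave(reads, rcs):
    print(req.name, req.kind, req.size_kb)
```

Alternating one slice per read-compute request keeps the channel busy without letting any single large transfer monopolize it, which is the behavior the slice control aims for.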
Experimental Evaluation and Scalability
- Performance Gains:
- The paper provides detailed comparisons under different flash configurations and shows that the hardware-aware tiling and request slicing contribute speedups of 1.3× to 1.4× and 1.6× to 1.8×, respectively.
- Energy and Data Movement Efficiency:
- Executing GeMV operations on the flash die and avoiding redundant weight transfers reduces data movement, which translates into lower energy consumption per token, an essential metric for edge devices.
- Scalability and Cost Benefits:
- A cost analysis demonstrates that by storing the vast majority of model weights in flash rather than DRAM, the proposed solution offers significant cost savings while still meeting the performance constraints of 70B parameter models.
Overall, the paper provides a well-rounded technical exploration that integrates architectural co-design, optimal tiling, and error-correction mechanisms. The design addresses key challenges in deploying high-parameter LLMs on edge devices by combining the strengths of flash memory (density and low-power data movement) with the compute capabilities of NPUs, while mitigating flash memory's inherent susceptibility to errors through a tailored on-die ECC scheme.