The paper presents a comprehensive design for a chiplet-based hybrid architecture that enables on-device inference of LLMs with up to 70 billion parameters. The work is motivated by two major challenges encountered in edge inference of LLMs: the huge memory footprint and extremely low arithmetic intensity during single-batch decoding, which result in severe memory bandwidth bottlenecks and disproportionate energy costs from data movement.
The key contributions and technical innovations include:
Hybrid Chiplet Architecture
- Integration of NPU and Flash with On-Die Processing:
The architecture tightly couples a neural processing unit (NPU) with a dedicated NAND flash chip that is augmented with on-die processing capabilities. The flash die is connected to the NPU via a high-speed Die-to-Die link enabled by chiplet technology. This design leverages the high storage density of flash while integrating minimal on-die compute logic to perform general matrix-vector (GeMV) multiplications directly within flash memory.
- Task Partitioning for LLM Inference:
LLM inference is partitioned into three categories:
1. GeMV Operations: These operations, which account for the majority of the computation, are executed cooperatively between the flash (using on-die processing) and the NPU to maximize parallelism and minimize redundant data transfers.
2. KV Cache Processing: The key and value cache, which is relatively small and dynamically updated, is stored in DRAM and processed exclusively by the NPU.
3. Data Transfer Operations: An optimized read-request schedule manages the transfer of weights and intermediate results between flash and the NPU.
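To make the three-way split concrete, here is a minimal dispatcher sketch; the operation names, element counts, and the `assign_engine` API are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Engine(Enum):
    FLASH_NPU_COOP = auto()   # GeMV on flash-resident weights, split between flash and NPU
    NPU_ONLY = auto()         # KV-cache attention, small and DRAM-resident
    CHANNEL_DMA = auto()      # plain weight/result transfers over the flash channel

@dataclass
class Op:
    name: str
    kind: str                 # "gemv", "kv_attention", or "transfer"
    approx_elements: int      # rough size, purely illustrative

def assign_engine(op: Op) -> Engine:
    """Mirror the three-way task partitioning described above (hypothetical API)."""
    if op.kind == "gemv":
        return Engine.FLASH_NPU_COOP
    if op.kind == "kv_attention":
        return Engine.NPU_ONLY
    return Engine.CHANNEL_DMA

# Example: one decoder layer's operations for a single decoded token
# (dimensions loosely modeled on a 7B-class model, for illustration only)
layer_ops = [
    Op("qkv_proj", "gemv", 3 * 4096 * 4096),
    Op("attention_over_kv_cache", "kv_attention", 2 * 4096 * 1024),
    Op("mlp_up_proj", "gemv", 4096 * 11008),
    Op("stream_next_layer_weights", "transfer", 4096 * 11008),
]
for op in layer_ops:
    print(f"{op.name:32s} -> {assign_engine(op).name}")
```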
Hardware-Aware Tiling Strategy
- Optimal Tile Partitioning:
- By formulating the transfer cost as a function of the tile dimensions and applying the AM–GM inequality, the authors derive the optimal tile dimensions (a sketch of one possible derivation follows this list).
- This optimal mapping ensures maximal reuse of broadcast input vectors across multiple compute cores and minimizes redundant data transfer, thereby reducing energy overhead and latency.
- Workload Distribution Between NPU and Flash:
The tiling strategy is paired with an optimal distribution mechanism that balances the execution times of read-compute requests (handled on flash) and typical read requests (handled over the flash channel for weight transfers). A proportion factor is calculated to ensure that both domains finish their respective workloads in roughly equal time, thereby maximizing overall throughput.
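A sketch of how such a derivation could look, under the assumptions that only the broadcast input slices and partial-output write-backs cross the die-to-die link, that a T_r × T_c tile must fit in an on-die buffer of B elements, and that R_rc and R_rd denote the effective rates of read-compute and plain read requests; this notation is illustrative, not the paper's.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
\emph{Tile size.} For a GeMV $y = Wx$ with $W \in \mathbb{R}^{M \times N}$ tiled into
$T_r \times T_c$ blocks, each tile consumes a length-$T_c$ slice of $x$ and emits a
length-$T_r$ partial sum, so the per-token link traffic is
\[
  C(T_r, T_c) = \frac{M}{T_r}\cdot\frac{N}{T_c}\,(T_c + T_r)
              = MN\left(\frac{1}{T_r} + \frac{1}{T_c}\right).
\]
Under the buffer constraint $T_r T_c \le B$, the AM--GM inequality yields
\[
  \frac{1}{T_r} + \frac{1}{T_c} \ \ge\ \frac{2}{\sqrt{T_r T_c}} \ \ge\ \frac{2}{\sqrt{B}},
\]
with equality iff $T_r = T_c = \sqrt{B}$: a square tile that fills the buffer minimizes
the transfer cost.

\emph{Workload split.} If a fraction $p$ of the GeMV rows is served by read-compute
requests at effective rate $R_{\mathrm{rc}}$ and the rest by plain reads at rate
$R_{\mathrm{rd}}$, equal finish times require
\[
  \frac{p}{R_{\mathrm{rc}}} = \frac{1 - p}{R_{\mathrm{rd}}}
  \quad\Longrightarrow\quad
  p = \frac{R_{\mathrm{rc}}}{R_{\mathrm{rc}} + R_{\mathrm{rd}}}.
\]
\end{document}
```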
On-Die Error Correction Mechanism
- Outlier-Focused ECC:
- For each flash page, the largest 1% of values are identified, and their addresses, together with two copies of each value, are stored in a dedicated area of the spare space.
- A majority vote over the stored copies during on-die decoding dramatically reduces the effective bit error rate of the protected values when two copies are employed.
- The ECC structure employs additional Hamming-code-based protection for the addresses, ensuring that the location information is reliably recovered even in the presence of errors.
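A toy model of this outlier-focused path might look as follows. The page size, 16-bit values, ranking by magnitude, and a bitwise vote over the original plus the two spare-area copies are assumptions made for illustration; only the ~1% outlier fraction and the two-copy scheme come from the summary above.

```python
import numpy as np

def encode_page(values: np.ndarray, outlier_frac: float = 0.01):
    """Pick the largest-magnitude values and keep two extra copies of each (assumed layout)."""
    k = max(1, int(len(values) * outlier_frac))
    outlier_addrs = np.argsort(np.abs(values))[-k:]                  # addresses of the top ~1%
    spare = {int(a): (values[a], values[a]) for a in outlier_addrs}  # two copies per outlier
    return values.copy(), spare

def majority_vote(a, b, c) -> np.float16:
    """Bitwise majority over three 16-bit encodings of the same value."""
    bits = [np.frombuffer(np.float16(v).tobytes(), dtype=np.uint16)[0] for v in (a, b, c)]
    voted = (bits[0] & bits[1]) | (bits[0] & bits[2]) | (bits[1] & bits[2])
    return np.frombuffer(np.uint16(voted).tobytes(), dtype=np.float16)[0]

def decode_page(page: np.ndarray, spare: dict) -> np.ndarray:
    """On-die decode: correct only the protected outliers, pass everything else through."""
    out = page.copy()
    for addr, (c1, c2) in spare.items():
        out[addr] = majority_vote(page[addr], c1, c2)
    return out

# Example: flip one bit in a protected outlier and recover it via the vote
rng = np.random.default_rng(0)
page = rng.standard_normal(4096).astype(np.float16)
stored, spare = encode_page(page)
addr = next(iter(spare))
corrupted = np.frombuffer(np.float16(stored[addr]).tobytes(), dtype=np.uint16)[0] ^ 0x4000
stored[addr] = np.frombuffer(np.uint16(corrupted).tobytes(), dtype=np.float16)[0]
print(np.isclose(decode_page(stored, spare)[addr], page[addr]))  # True: outlier recovered
```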
Control and Scheduling Enhancements
- Slice Control for Read Requests:
- Instead of transmitting large read requests that could block subsequent read-compute operations, the data is segmented into smaller “slices” that are interleaved with read-compute requests.
- This mechanism significantly improves channel utilization, reducing idle periods and preventing critical read-compute operations from being blocked (a scheduling sketch follows this list).
- Compute Core Design within Flash Die:
- The Compute Core incorporates multiple arithmetic units, local buffers, and the lightweight error correction unit.
- Logical scheduling via dedicated Compute and Slice Controls enables flexible execution of varying GeMV operations across different weight matrix shapes.
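One way such slice interleaving could be expressed in a scheduler is sketched below; the Request structure, the 4 KB slice size, and the simple alternating policy are illustrative assumptions rather than the paper's controller design.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    name: str
    kind: str      # "read" (weight transfer) or "read_compute" (on-die GeMV)
    size_kb: int

def slice_and_interleave(reads, read_computes, slice_kb: int = 4):
    """Split long reads into small slices and interleave them with read-compute
    requests, so a bulky weight transfer never blocks the on-die compute stream."""
    sliced = deque(
        Request(f"{r.name}[{i}]", "read", min(slice_kb, r.size_kb - i * slice_kb))
        for r in reads
        for i in range((r.size_kb + slice_kb - 1) // slice_kb)
    )
    rc = deque(read_computes)
    schedule = []
    while sliced or rc:
        if rc:
            schedule.append(rc.popleft())      # keep the compute cores fed first
        if sliced:
            schedule.append(sliced.popleft())  # then issue one read slice
    return schedule

# Example: one 16 KB weight read interleaved with four read-compute requests
reads = [Request("W_mlp_down", "read", 16)]
rcs = [Request(f"gemv_tile_{i}", "read_compute", 4) for i in range(4)]
for req in slice_and_interleave(reads, rcs):
    print(req.name, req.kind, req.size_kb)
```

Alternating one slice per read-compute request keeps the channel busy without letting any single large transfer monopolize it, which is the behavior the slice control aims for.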
Experimental Evaluation and Scalability
- Performance Gains:
- The paper provides detailed comparisons under different flash configurations and shows that the hardware-aware tiling and request slicing contribute speedups of 1.3× to 1.4× and 1.6× to 1.8×, respectively.
- Energy and Data Movement Efficiency:
- Executing GeMV operations on the flash die and avoiding redundant weight transfers reduces data movement, which translates into lower energy consumption per token, an essential metric for edge devices.
- Scalability and Cost Benefits:
- A cost analysis demonstrates that by storing the vast majority of model weights in flash rather than DRAM, the proposed solution offers significant cost savings while still meeting the performance constraints of 70B parameter models.
Overall, the paper provides a well-rounded technical exploration that integrates architectural co-design, optimal tiling, and error-correction mechanisms. The design addresses key challenges in deploying high-parameter LLMs on edge devices by combining the strengths of flash memory (density and low-power data movement) with the compute capabilities of NPUs, while mitigating flash memory's inherent susceptibility to errors through a tailored on-die ECC scheme.