Efficient LLM inference solution on Intel GPU (2401.05391v2)

Published 19 Dec 2023 in cs.AR and cs.AI

Abstract: Transformer-based LLMs have been widely used in many fields, and the efficiency of LLM inference has become a hot topic in real applications. However, LLMs usually have complicated model structures with massive operations and perform inference in the auto-regressive mode, making it challenging to design a system with high efficiency. In this paper, we propose an efficient LLM inference solution with low latency and high throughput. Firstly, we simplify the LLM decoder layer by fusing data movement and element-wise operations to reduce memory access frequency and lower system latency. We also propose a segment KV cache policy to keep the key/values of the request and response tokens in separate physical memory for effective device memory management, helping enlarge the runtime batch size and improve system throughput. A customized Scaled-Dot-Product-Attention kernel is designed to match our fusion policy based on the segment KV cache solution. We implement our LLM inference solution on Intel GPU and publish it publicly. Compared with the standard HuggingFace implementation, the proposed solution achieves up to 7x lower token latency and 27x higher throughput for some popular LLMs on Intel GPU.

Efficient LLM Inference Solution on Intel GPU

Introduction

The paper addresses the complexity and resource-intensiveness of Transformer-based LLMs during inference, particularly on Intel GPUs. Traditional LLMs, characterized by large parameter sizes and intricate design, necessitate improved inference methods to cater to both latency-critical online applications and throughput-focused offline deployments.

Proposed Methodology

The authors present a dual approach to enhance LLM inference: structural simplification of the decoder layer and effective device memory management through a segment KV cache policy.

  1. Model Structure Simplification:
    • The LLM decoder layers are simplified by fusing data movement and element-wise operations. In particular, memory accesses are reduced by merging the operations of the Root Mean Square Layer Normalization (RMSNorm), Rotary Position Embedding (RoPE), and Scaled Dot Product Attention (SDPA) modules into single kernels (a conceptual sketch of this fusion follows the list).
    • All computations within the SDPA module are fused as well, including the index-selection steps that beam search may require, streamlining the operation further.
  2. Segment KV Cache Policy:
    • To counter the memory consumption challenges of auto-regressive decoding, the authors propose storing prompt and response key/values in separate memory segments. This avoids redundant copies of cached key/values and optimizes memory use, which enables larger batch sizes and thus higher throughput.
    • The KV cache is further optimized by dynamically adjusting the segment size to the actual sequence lengths, reducing unnecessary memory allocation and fragmentation (a sketch of such a segment cache also follows the list).
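
To make the fusion idea concrete, below is a minimal PyTorch sketch of the operations involved, assuming standard RMSNorm/RoPE formulations and illustrative tensor shapes; the paper's solution implements these as custom fused kernels on Intel GPU rather than Python-level calls. The point is that chaining normalization, the QKV projections, and the rotary embedding in one pass lets intermediates stay close to the compute units instead of being written to and re-read from device memory.

```python
import torch

def rmsnorm(x, weight, eps=1e-6):
    # Standalone RMSNorm: one full read and write of the activation tensor.
    variance = x.pow(2).mean(-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight

def rotate_half(t):
    t1, t2 = t.chunk(2, dim=-1)
    return torch.cat((-t2, t1), dim=-1)

def apply_rope(q, k, cos, sin):
    # Standalone RoPE: another full read and write of q and k.
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin

def norm_qkv_rope(x, norm_w, wq, wk, wv, cos, sin):
    # Chained in one function so a real fused kernel could keep the intermediates
    # in registers/local memory instead of round-tripping through device memory.
    # Illustration of the fusion idea only; the paper implements this as custom
    # kernels on Intel GPU, not as separate PyTorch ops.
    h = rmsnorm(x, norm_w)
    q, k, v = h @ wq, h @ wk, h @ wv
    q, k = apply_rope(q, k, cos, sin)
    return q, k, v
```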
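
The segment KV cache can be pictured with the rough sketch below. The class, its method names, and the fixed growth step are hypothetical illustrations rather than the paper's API; they only show the idea of writing prompt key/values once into one segment while response key/values grow in a separately sized segment that tracks the actual generated length.

```python
import torch

class SegmentKVCache:
    """Illustrative segment KV cache (hypothetical names): prompt keys/values live in
    one fixed block written at prefill time, while response keys/values go into a
    separate, dynamically grown segment."""

    def __init__(self, num_heads, head_dim, response_segment=64,
                 dtype=torch.float16, device="cpu"):
        self.num_heads, self.head_dim = num_heads, head_dim
        self.segment = response_segment          # tokens reserved per growth step
        self.dtype, self.device = dtype, device
        self.prompt_k = self.prompt_v = None     # written once at prefill
        self.resp_k = self.resp_v = None         # grown during decoding
        self.resp_len = 0

    def write_prompt(self, k, v):
        # k, v: [num_heads, prompt_len, head_dim]; stored once, not re-copied later.
        self.prompt_k, self.prompt_v = k, v

    def append_response(self, k, v):
        # k, v: [num_heads, 1, head_dim] for the newly generated token.
        if self.resp_k is None or self.resp_len == self.resp_k.shape[1]:
            # Grow the response segment by a fixed chunk instead of reserving the
            # maximum length up front, so memory tracks the actual sequence length.
            new_cap = self.resp_len + self.segment
            new_k = torch.empty(self.num_heads, new_cap, self.head_dim,
                                dtype=self.dtype, device=self.device)
            new_v = torch.empty_like(new_k)
            if self.resp_k is not None:
                new_k[:, :self.resp_len] = self.resp_k[:, :self.resp_len]
                new_v[:, :self.resp_len] = self.resp_v[:, :self.resp_len]
            self.resp_k, self.resp_v = new_k, new_v
        self.resp_k[:, self.resp_len] = k.squeeze(1)
        self.resp_v[:, self.resp_len] = v.squeeze(1)
        self.resp_len += 1

    def full_kv(self):
        # View consumed by the attention kernel: prompt segment plus the filled
        # part of the response segment.
        if self.resp_len == 0:
            return self.prompt_k, self.prompt_v
        k = torch.cat([self.prompt_k, self.resp_k[:, :self.resp_len]], dim=1)
        v = torch.cat([self.prompt_v, self.resp_v[:, :self.resp_len]], dim=1)
        return k, v
```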

Performance Evaluation

The solution is tested on various LLMs, demonstrating significant resource savings and performance improvements on Intel GPUs. The authors report:

  • Latency Reduction: The proposed method achieves up to a 7x reduction in token latency compared to the standard HuggingFace implementation, an improvement attributed to optimized data movement and computation fusion.
  • Throughput Enhancement: Through careful memory management and structural optimization, the method achieves up to 27x higher throughput. The segment KV cache policy is pivotal, allowing larger batch sizes and better hardware resource utilization (a minimal timing sketch of how such metrics are measured follows).
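
For reference, per-token latency and throughput of a HuggingFace baseline can be measured with a simple timing harness such as the sketch below; the checkpoint name is a placeholder and device placement (for example, moving the model to Intel's XPU device via Intel's PyTorch extension) is omitted, so absolute numbers will differ from the paper's.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint, not the paper's exact setup
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

inputs = tok("Efficient LLM inference requires", return_tensors="pt")

max_new = 128
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=max_new, do_sample=False)
elapsed = time.perf_counter() - start

# Average decode latency per generated token and overall generation throughput.
generated = out.shape[1] - inputs["input_ids"].shape[1]
print(f"avg token latency: {elapsed / generated * 1e3:.1f} ms")
print(f"throughput: {generated / elapsed:.1f} tokens/s")
```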

Implications and Future Work

The results indicate notable improvements in LLM inference efficiency for applications using Intel GPU hardware. The fusion policies and memory management strategies offer a blueprint for optimizing other memory-bound LLM workloads. Looking ahead, it would be intriguing to explore the impact of these optimizations on emerging architectures and their scalability across diverse hardware environments.

Future developments could include refining these methodologies for multi-GPU setups and exploring their integration with other efficiency strategies such as quantization and pruning. The authors' approach could also inspire similar enhancements across other GPU platforms, broadening its applicability beyond Intel's GPU ecosystem.

Authors (11)
  1. Hui Wu (54 papers)
  2. Yi Gan (2 papers)
  3. Feng Yuan (262 papers)
  4. Jing Ma (136 papers)
  5. Wei Zhu (290 papers)
  6. Yutao Xu (2 papers)
  7. Hong Zhu (52 papers)
  8. Yuhua Zhu (26 papers)
  9. Xiaoli Liu (37 papers)
  10. Jinghui Gu (1 paper)
  11. Peng Zhao (162 papers)