Efficient LLM Inference Solution on Intel GPU
Introduction
The paper addresses the complexity and heavy resource demands of Transformer-based LLM inference, particularly on Intel GPUs. LLMs, with their large parameter counts and intricate decoder structure, require more efficient inference methods to serve both latency-critical online applications and throughput-oriented offline deployments.
Proposed Methodology
The authors present a dual approach to enhance LLM inference: structural simplification of the decoder layer and effective device memory management through a segment KV cache policy.
- Model Structure Simplification:
- The LLM decoder layers are simplified by fusing data-movement and element-wise operations. In particular, the operations inside Root Mean Square Layer Normalization (RMSNorm), Rotary Position Embedding (RoPE), and Scaled Dot Product Attention (SDPA) are each merged into a single kernel, cutting redundant memory accesses and kernel launches.
- The SDPA module is fused end to end into a single kernel, including the index-selection step required by beam search, streamlining the attention path further (a minimal fusion sketch follows this list).
- Segment KV Cache Policy:
- Because auto-regressive decoding continually grows the KV cache, the authors store prompt and response key/values in separate memory segments. Keeping the prompt KVs in their own segment avoids duplicating them in storage, which reduces memory use, enables larger batch sizes, and thus improves throughput.
- The KV cache is further optimized by dynamically adjusting the segment size in response to actual sequence lengths, reducing unnecessary memory allocation and fragmentation.
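To make the fusion idea concrete, here is a minimal PyTorch sketch. It is not the authors' Intel GPU kernels: the eager RMSNorm shows how each element-wise step becomes its own kernel and memory round-trip, `torch.compile` stands in for the hand-written fused kernel, and `F.scaled_dot_product_attention` stands in for the paper's single-kernel SDPA (the fused beam-search index selection is not reproduced here). The function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def rmsnorm_eager(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Eager mode: pow, mean, rsqrt, and the two multiplies each launch a
    # separate kernel, so the activation tensor is read and written repeatedly.
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight

# Stand-in for a hand-written fused kernel: compiling the same function lets
# the element-wise chain be emitted as one fused kernel, which is the kind of
# memory-traffic reduction the paper targets.
rmsnorm_fused = torch.compile(rmsnorm_eager)

def attention_fused(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Fused SDPA primitive: softmax(q @ k^T / sqrt(d)) @ v in a single call,
    # analogous in spirit to the paper's single-kernel SDPA.
    return F.scaled_dot_product_attention(q, k, v)

if __name__ == "__main__":
    x = torch.randn(2, 16, 64)
    w = torch.ones(64)
    q = k = v = torch.randn(2, 4, 16, 16)  # (batch, heads, seq, head_dim)
    print(rmsnorm_fused(x, w).shape, attention_fused(q, k, v).shape)
```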
Performance Evaluation
The solution is tested on various LLMs, demonstrating significant resource savings and performance improvements on Intel GPUs. The authors report:
- Latency Reduction: The proposed method achieves up to 7x reduction in token latency compared to the standard HuggingFace implementation. This improvement is attributed to optimized data movement and computation fusion.
- Throughput Enhancement: Through careful memory management and structural optimization, the method achieves up to 27x higher throughput. The segment KV cache policy is pivotal, allowing increased batch size and improved hardware resource utilization (a toy allocation sketch follows this list).
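To illustrate the segment KV cache idea described under the methodology, the toy sketch below stores the prompt key/values once and appends per-beam response key/values in fixed-size segments, so memory grows with the actual sequence length rather than a preallocated maximum. The class and method names (`SegmentKVCache`, `set_prompt`, `append`, `gather`) and the segment size are assumptions for this example, not the paper's implementation.

```python
import torch

class SegmentKVCache:
    """Toy segment KV cache: one shared prompt segment per request plus
    per-beam response segments allocated in fixed-size chunks."""

    def __init__(self, num_heads: int, head_dim: int, segment_size: int = 128):
        self.num_heads, self.head_dim = num_heads, head_dim
        self.segment_size = segment_size
        self.prompt_kv = None          # prompt K/V, stored once and shared
        self.response_segments = {}    # beam_id -> list of (k, v) segment tensors
        self.response_len = {}         # beam_id -> tokens written so far

    def set_prompt(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # Prompt K/V live in their own segment and are never copied per beam.
        # Expected shape: (prompt_len, num_heads, head_dim).
        self.prompt_kv = (k, v)

    def append(self, beam_id: int, k: torch.Tensor, v: torch.Tensor) -> None:
        # Append one generated token's K/V (shape: (num_heads, head_dim)).
        # A new fixed-size segment is allocated only when the current one is
        # full, sizing storage to the actual response length.
        segments = self.response_segments.setdefault(beam_id, [])
        used = self.response_len.get(beam_id, 0)
        if used % self.segment_size == 0:
            shape = (self.segment_size, self.num_heads, self.head_dim)
            segments.append((torch.empty(shape), torch.empty(shape)))
        slot = used % self.segment_size
        segments[-1][0][slot] = k
        segments[-1][1][slot] = v
        self.response_len[beam_id] = used + 1

    def gather(self, beam_id: int):
        # Concatenate the shared prompt with this beam's response tokens to
        # form the full K/V sequence consumed by attention.
        prompt_k, prompt_v = self.prompt_kv
        used = self.response_len.get(beam_id, 0)
        ks, vs = [prompt_k], [prompt_v]
        for i, (seg_k, seg_v) in enumerate(self.response_segments.get(beam_id, [])):
            n = min(self.segment_size, used - i * self.segment_size)
            ks.append(seg_k[:n])
            vs.append(seg_v[:n])
        return torch.cat(ks, dim=0), torch.cat(vs, dim=0)

if __name__ == "__main__":
    cache = SegmentKVCache(num_heads=4, head_dim=16, segment_size=8)
    cache.set_prompt(torch.randn(10, 4, 16), torch.randn(10, 4, 16))
    for beam in range(2):                      # two beams share one prompt copy
        for _ in range(12):                    # 12 generated tokens each
            cache.append(beam, torch.randn(4, 16), torch.randn(4, 16))
    k, v = cache.gather(0)
    print(k.shape)                             # torch.Size([22, 4, 16])
```

In a beam-search setting, each beam appends only its own generated tokens while sharing the single prompt segment, which is the storage sharing that the summary above credits with enabling larger batches.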
Implications and Future Work
The results indicate notable improvements in LLM inference efficiency for applications using Intel GPU hardware. The fusion policies and memory management strategies offer a blueprint for optimizing other memory-bound LLM workloads. Looking ahead, it would be intriguing to explore the impact of these optimizations on emerging architectures and their scalability across diverse hardware environments.
Future developments could include refining these methodologies for multi-GPU setups and exploring their integration with other efficiency strategies such as quantization and pruning. The authors' approach could also inspire similar enhancements across other GPU platforms, broadening its applicability beyond Intel's GPU ecosystem.