High-Efficiency Deployment of LLMs on Mobile Phone GPUs
Introduction
The paper introduces methodologies for efficiently deploying LLMs on mobile phone GPUs. Given the computational and memory-bandwidth constraints of mobile phones, existing deployment methods yield slow inference speeds and degrade the user experience. The authors propose four optimization techniques to address these challenges: a symbolic-expression-based approach for dynamic shape model inference; operator optimization and execution priority setting; an FP4 quantization method named M0E4; and a sub-tensor-based technique that avoids copying the KV cache after inference. These optimizations are implemented in a new mobile inference engine, Transformer-Lite, which the paper shows can deploy LLMs on mobile platforms with substantial speed improvements over existing solutions.
Key Optimizations
The paper outlines four primary optimization strategies for enhancing LLM deployment on mobile device GPUs:
- Symbolic Expression-Based Dynamic Shape Inference: Addresses the challenge of dynamic input shapes in LLM deployment by using symbolic expressions to infer and manage the dynamic shapes of tensors efficiently (a minimal sketch follows this list).
- Operator Optimization and Execution Priority Setting: Optimizes operators and sets their execution priorities to improve performance and reduce device lag, focusing on the characteristics of LLM operations such as matrix multiplication.
- M0E4 FP4 Quantization Method: A quantization technique that minimizes performance overhead during dequantization, enabling efficient matrix multiplication with half-precision activations and 4-bit quantized weights (see the second sketch below).
- Sub-Tensor-Based KV Cache Optimization: Eliminates redundant copying of the KV cache after inference by treating the cache as sub-tensors of a larger buffer, thereby reducing memory consumption and inference time (see the third sketch below).
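To make the dynamic shape idea concrete, here is a minimal sketch of symbolic shape derivation using sympy. The symbol names, hidden dimensions, and the concretize helper are illustrative assumptions rather than the Transformer-Lite implementation; the point is only that tensor shapes are expressed once as symbolic formulas and resolved to integers when the actual sequence lengths are known.

```python
# Minimal sketch of symbolic-expression-based dynamic shape inference.
# Assumptions: symbol names, dimensions, and the helper below are illustrative.
import sympy as sp

# Dynamic dimensions are represented once as named symbols.
seq_len = sp.Symbol("seq_len", integer=True, positive=True)
past_len = sp.Symbol("past_len", integer=True, nonnegative=True)

HIDDEN = 4096
NUM_HEADS = 32
HEAD_DIM = HIDDEN // NUM_HEADS

# Shapes are derived symbolically when the graph is built, so the engine
# knows how every intermediate tensor scales with the dynamic dimensions.
q_shape = (1, NUM_HEADS, seq_len, HEAD_DIM)                   # query
k_shape = (1, NUM_HEADS, past_len + seq_len, HEAD_DIM)        # key incl. cache
attn_scores_shape = (1, NUM_HEADS, seq_len, past_len + seq_len)

def concretize(shape, bindings):
    """Substitute concrete values for the symbols at inference time."""
    return tuple(int(d.subs(bindings)) if isinstance(d, sp.Expr) else d
                 for d in shape)

# Prefill with 128 prompt tokens and an empty cache.
print(concretize(attn_scores_shape, {seq_len: 128, past_len: 0}))  # (1, 32, 128, 128)

# One decoding step with 128 cached tokens.
print(concretize(attn_scores_shape, {seq_len: 1, past_len: 128}))  # (1, 32, 1, 129)
```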
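The second sketch illustrates the general flow of matrix multiplication with half-precision activations and group-wise 4-bit quantized weights. The group size, scaling scheme, and function names are assumptions chosen for illustration; the specific M0E4 bit layout and its low-overhead dequantization trick are not reproduced here.

```python
# Minimal sketch of group-wise 4-bit weight quantization with FP16 activations.
# This shows the general flow only; it is not the paper's M0E4 format.
import numpy as np

GROUP = 128  # assumed per-group quantization granularity

def quantize_4bit(w_fp16, group=GROUP):
    """Symmetric per-group quantization of an FP16 weight matrix to 4-bit values."""
    rows, cols = w_fp16.shape
    w = w_fp16.astype(np.float32).reshape(rows, cols // group, group)
    scale = np.abs(w).max(axis=-1, keepdims=True) / 7.0        # int4 range: [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q.reshape(rows, cols), scale.astype(np.float16)

def dequant_matmul(x_fp16, q, scale, group=GROUP):
    """Dequantize weights on the fly and multiply with half-precision activations."""
    rows, cols = q.shape
    w = q.reshape(rows, cols // group, group).astype(np.float32) * scale.astype(np.float32)
    return (x_fp16.astype(np.float32) @ w.reshape(rows, cols).T).astype(np.float16)

# Tiny usage example with hypothetical sizes.
rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float16)
x = rng.standard_normal((4, 256)).astype(np.float16)
q, s = quantize_4bit(w)
print(dequant_matmul(x, q, s).shape)  # (4, 256)
```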
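Finally, a minimal sketch of a sub-tensor-based KV cache: one buffer is allocated up front, and each step writes new keys and values into a view of that buffer, so the existing cache is never copied after inference. The class name, shapes, and maximum length are hypothetical and stand in for the engine's actual data layout.

```python
# Minimal sketch of a sub-tensor (view-based) KV cache.
# Assumption: one preallocated buffer per layer sized to the maximum context length.
import numpy as np

MAX_LEN, NUM_HEADS, HEAD_DIM = 2048, 32, 128

class SubTensorKVCache:
    def __init__(self):
        # One large buffer allocated up front; later steps write into views
        # of it instead of concatenating and copying the whole cache.
        self.k = np.zeros((NUM_HEADS, MAX_LEN, HEAD_DIM), dtype=np.float16)
        self.v = np.zeros((NUM_HEADS, MAX_LEN, HEAD_DIM), dtype=np.float16)
        self.length = 0

    def append(self, new_k, new_v):
        """Write new keys/values in place; the existing cache is not copied."""
        n = new_k.shape[1]
        self.k[:, self.length:self.length + n, :] = new_k
        self.v[:, self.length:self.length + n, :] = new_v
        self.length += n

    def view(self):
        """Sub-tensors covering only the valid prefix, shared with the buffer."""
        return self.k[:, :self.length, :], self.v[:, :self.length, :]

# Usage: prefill with 16 tokens, then one decoding step.
cache = SubTensorKVCache()
cache.append(np.ones((NUM_HEADS, 16, HEAD_DIM), np.float16),
             np.ones((NUM_HEADS, 16, HEAD_DIM), np.float16))
cache.append(np.ones((NUM_HEADS, 1, HEAD_DIM), np.float16),
             np.ones((NUM_HEADS, 1, HEAD_DIM), np.float16))
k, v = cache.view()
print(k.shape)  # (32, 17, 128)
```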
Experimental Evaluation
The empirical evaluation demonstrates Transformer-Lite's superior performance compared with the CPU-based FastLLM and GPU-based MLC-LLM engines. Speedups of more than 10x in prefill speed and 2-3x in decoding speed confirm the efficacy of the proposed optimizations across various LLM architectures with parameter counts ranging from 2B to 14B. Furthermore, the engine's ability to run models of up to 14B parameters on mobile devices underscores the potential to bring advanced AI applications directly to end users without compromising performance.
Implications and Future Work
These optimizations have notable implications for the deployment of LLMs on mobile devices, offering a pathway to achieving high-efficiency, real-time AI applications directly on user devices. The advancements not only promise improved user experiences by enabling faster inference times but also hint at a significant reduction in reliance on cloud-based models, thus enhancing privacy and accessibility of AI technologies. Looking forward, the exploration of more efficient matrix multiplication implementations, the incorporation of additional acceleration techniques, and the refinement of model structures represent potential areas for further improvement in deploying LLMs on mobile GPUs.
Summary
This paper presents Transformer-Lite, a new mobile inference engine that integrates a suite of optimization techniques for the efficient deployment of LLMs on mobile devices. The proposed methods demonstrate remarkable speed improvements, establishing a solid foundation for the future development of mobile-based AI applications. The findings point to significant potential for on-device AI processing and suggest a promising trajectory for research and development in mobile AI technologies.