LLM as a System Service on Mobile Devices (2403.11805v1)
Abstract: As LLMs become more powerful and more deeply involved in user-device interactions, there is strong motivation to execute them on-device to better preserve user privacy. In this work, we propose a new paradigm of mobile AI: LLM as a system service on mobile devices (LLMaaS). Unlike traditional DNNs, which execute statelessly, such a system service is stateful: LLM execution often needs to maintain persistent state (mainly the KV cache) across multiple invocations. To minimize LLM context-switching overhead under a tight device memory budget, this work presents LLMS, which decouples the memory management of app and LLM contexts with a key idea of fine-grained, chunk-wise, globally optimized KV cache compression and swapping. By fully leveraging the KV cache's unique characteristics, it proposes three novel techniques: (1) Tolerance-Aware Compression: chunks are compressed according to their measured accuracy tolerance to compression. (2) IO-Recompute Pipelined Loading: recomputation is overlapped with I/O to accelerate swapping-in. (3) Chunk Lifecycle Management: the memory activities of chunks are optimized with ahead-of-time swapping-out and an LCTRU (Least Compression-Tolerable and Recently-Used) queue based eviction. In evaluations on well-established traces and various edge devices, LLMS reduces context-switching latency by up to two orders of magnitude compared to competitive baseline solutions.
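The LCTRU eviction policy named above can be thought of as a priority ordering over KV-cache chunks. The following is a minimal illustrative sketch, not the paper's implementation: the class and method names are hypothetical, and it assumes one plausible reading of the acronym, namely that chunks with low measured compression tolerance and stale recency are evicted first.

```python
class LCTRUQueue:
    """Sketch of a Least Compression-Tolerable and Recently-Used (LCTRU)
    eviction queue for KV-cache chunks. Illustrative only: fields and the
    exact ordering are assumptions, not taken from the paper."""

    def __init__(self):
        self._tick = 0       # monotonic counter standing in for recency
        self._entries = {}   # chunk_id -> (tolerance, last_used_tick)

    def touch(self, chunk_id, tolerance):
        """Record a use of a chunk along with its measured accuracy
        tolerance to compression (higher = tolerates compression better)."""
        self._tick += 1
        self._entries[chunk_id] = (tolerance, self._tick)

    def pick_victim(self):
        """Return and remove the next eviction victim: the chunk with the
        lowest compression tolerance, breaking ties by least-recent use."""
        if not self._entries:
            return None
        victim = min(self._entries, key=lambda c: self._entries[c])
        del self._entries[victim]
        return victim
```

For example, after `touch("a", 0.9)` and `touch("b", 0.2)`, `pick_victim()` returns `"b"`: the chunk least tolerant of compression is evicted before the more tolerant, more recently used one.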
Authors: Wangsong Yin, Mengwei Xu, Yuanchun Li, Xuanzhe Liu