LLMCad: Fast and Scalable On-device Large Language Model Inference (2309.04255v1)
Abstract: Generative tasks, such as text generation and question answering, are central to mobile applications. Because these tasks are privacy-sensitive, there is growing demand to execute them directly on mobile devices. They currently rely on LLMs, but the limited memory capacity of mobile devices poses a formidable challenge to scaling such models. In our research, we introduce LLMCad, an on-device inference engine designed for efficient generative NLP tasks. The core idea behind LLMCad is model collaboration: a compact LLM, residing in memory, generates the most straightforward tokens, while a high-precision LLM validates these tokens and rectifies any errors. LLMCad incorporates three novel techniques: (1) instead of generating candidate tokens sequentially, the smaller LLM constructs a token tree covering a wider range of plausible token paths, which the larger LLM then validates all at once; (2) a self-adjusting fallback strategy promptly initiates verification whenever the smaller LLM generates an erroneous token; (3) to keep token generation flowing during verification, LLMCad speculatively generates tokens using a compute-IO pipeline. In an extensive series of experiments, LLMCad achieves token generation speeds up to 9.3x faster than existing inference engines.
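The draft-then-verify loop the abstract describes follows the speculative-decoding pattern: the small model cheaply proposes tokens, and the large model checks them in one pass, keeping the longest verified prefix. Below is a minimal Python sketch of that pattern under stated assumptions: the `draft_model` and `target_model` objects with a `logits(token_ids)` method are hypothetical stand-ins, the sketch drafts a single linear sequence rather than the paper's token tree, and it omits the fallback tuning and compute-IO pipeline. It is an illustration of the general technique, not LLMCad's actual implementation.

```python
import numpy as np

def speculative_decode(draft_model, target_model, prefix, n_draft=4):
    """One round of draft-then-verify decoding (simplified sketch).

    draft_model / target_model are hypothetical objects exposing
    logits(token_ids) -> np.ndarray of next-token logits over the
    vocabulary. The small model drafts n_draft tokens; the large
    model checks each drafted position and keeps the longest
    verified prefix, substituting its own token at the first
    mismatch (greedy acceptance for clarity).
    """
    # 1. The memory-resident small model drafts tokens autoregressively.
    draft = list(prefix)
    for _ in range(n_draft):
        draft.append(int(np.argmax(draft_model.logits(draft))))

    # 2. The large model scores every drafted position; a real engine
    #    would batch these into a single forward pass, which is what
    #    makes verification cheaper than generating token by token.
    accepted = list(prefix)
    for i in range(len(prefix), len(draft)):
        target_tok = int(np.argmax(target_model.logits(draft[:i])))
        if target_tok == draft[i]:
            accepted.append(draft[i])      # draft token verified
        else:
            accepted.append(target_tok)    # correct and stop early
            break
    return accepted
```

The paper's token tree generalizes step 1: instead of one linear draft, the small model emits several branching continuations, and the verification pass scores all root-to-leaf paths in a single batch, raising the odds that at least one path is accepted per round.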
- Daliang Xu
- Wangsong Yin
- Xin Jin
- Ying Zhang
- Shiyun Wei
- Mengwei Xu
- Xuanzhe Liu