CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference (2401.11240v1)
Abstract: Pre-trained LLMs often need specialization for domain-specific tasks. Low-Rank Adaptation (LoRA) is a popular approach that adapts a base model to multiple tasks by adding lightweight trainable adapters. In this paper, we present CaraServe, a system that efficiently serves many LoRA adapters derived from a common base model. CaraServe maintains the base model on GPUs and dynamically loads activated LoRA adapters from main memory. Because loading an adapter onto the GPU incurs a cold start that substantially delays token generation, CaraServe takes a CPU-assisted approach: it starts the activated adapters early on CPUs to serve prefill while they are being loaded onto GPUs, and switches to the GPUs for generative LoRA inference once loading completes. CaraServe develops a highly optimized synchronization mechanism to efficiently coordinate LoRA computation on the CPU and GPU. Moreover, CaraServe employs a rank-aware scheduling algorithm to optimally schedule heterogeneous LoRA requests for maximum service-level objective (SLO) attainment. We have implemented CaraServe and evaluated it against state-of-the-art LoRA serving systems. Our results demonstrate that CaraServe reduces the average request serving latency by up to 1.4× and achieves SLO attainment of up to 99%.
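To make the CPU-assisted cold start concrete, below is a minimal PyTorch sketch of the idea described in the abstract: the adapter's low-rank matmuls for prefill run on the CPU while the adapter weights are copied to the GPU on a side CUDA stream, and subsequent steps use the GPU-resident adapter once the copy has completed. The names (`CpuAssistedAdapter`, `lora_delta`) and the event-based readiness check are our own illustration, not CaraServe's actual implementation.

```python
import torch

def lora_delta(x, A, B, scaling):
    # LoRA update on top of the frozen base layer:
    # delta = (x @ A) @ B * (alpha / r)
    return (x @ A) @ B * scaling

class CpuAssistedAdapter:
    """Illustrative sketch: overlap the adapter's host-to-device copy
    with CPU-side prefill computation."""

    def __init__(self, A_cpu, B_cpu, scaling, device="cuda"):
        # Pinned host memory makes the H2D copies truly asynchronous.
        self.A_cpu = A_cpu.pin_memory()
        self.B_cpu = B_cpu.pin_memory()
        self.scaling = scaling
        copy_stream = torch.cuda.Stream()
        self.loaded = torch.cuda.Event()
        with torch.cuda.stream(copy_stream):
            self.A_gpu = self.A_cpu.to(device, non_blocking=True)
            self.B_gpu = self.B_cpu.to(device, non_blocking=True)
            self.loaded.record(copy_stream)

    def delta(self, x_gpu):
        if self.loaded.query():
            # Warm path: the adapter is already resident on the GPU.
            return lora_delta(x_gpu, self.A_gpu, self.B_gpu, self.scaling)
        # Cold path: run the low-rank matmuls on the CPU while the copy
        # is still in flight, so prefill is not blocked on adapter loading.
        x_cpu = x_gpu.to("cpu")
        return lora_delta(x_cpu, self.A_cpu, self.B_cpu,
                          self.scaling).to(x_gpu.device)
```

In the full system, the base model's computation stays on the GPU throughout; only the adapter's low-rank update is computed on the CPU during the cold start and added back to the base output, which is why the cheap CPU-GPU synchronization highlighted in the abstract matters.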
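The rank-aware scheduling claim can likewise be illustrated with a toy admission check. In batched multi-tenant LoRA kernels, requests in one batch are typically padded to the largest rank present, so a batch's per-step latency depends on both its size and its maximum rank; a scheduler that ignores rank can let one high-rank request blow the SLOs of many low-rank ones. Everything below (`Request`, `predict_step_latency`, the coefficients `a` and `b`) is a hypothetical cost model for illustration, not the paper's algorithm.

```python
from dataclasses import dataclass

@dataclass
class Request:
    rank: int        # rank of the LoRA adapter this request activates
    deadline: float  # per-step latency budget implied by its SLO (seconds)

def predict_step_latency(batch: list[Request],
                         a: float = 2e-3, b: float = 4e-4) -> float:
    # Hypothetical cost model: batched LoRA kernels pad all requests to
    # the largest rank in the batch, so latency grows with both batch
    # size and the batch's maximum rank (a and b are made-up constants).
    max_rank = max(r.rank for r in batch)
    return a * len(batch) + b * len(batch) * max_rank

def can_admit(batch: list[Request], req: Request) -> bool:
    """Admit req only if every request in the enlarged batch would still
    meet its deadline under the rank-padded latency estimate."""
    trial = batch + [req]
    latency = predict_step_latency(trial)
    return all(latency <= r.deadline for r in trial)

# A rank-64 request can push a batch of rank-8 requests past their
# deadlines even though it fits by batch size alone.
batch = [Request(rank=8, deadline=0.05) for _ in range(4)]
print(can_admit(batch, Request(rank=8, deadline=0.05)))   # True  (0.026 s)
print(can_admit(batch, Request(rank=64, deadline=0.05)))  # False (0.138 s)
```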
Authors: Suyi Li, Hanfeng Lu, Tianyuan Wu, Minchen Yu, Qizhen Weng, Xusheng Chen, Yizhou Shan, Binhang Yuan, Wei Wang