Punica: Multi-Tenant LoRA Serving (2310.18547v1)

Published 28 Oct 2023 in cs.DC and cs.LG

Abstract: Low-rank adaptation (LoRA) has become an important and popular method to adapt pre-trained models to specific domains. We present Punica, a system to serve multiple LoRA models in a shared GPU cluster. Punica contains a new CUDA kernel design that allows batching of GPU operations for different LoRA models. This allows a GPU to hold only a single copy of the underlying pre-trained model when serving multiple, different LoRA models, significantly enhancing GPU efficiency in terms of both memory and computation. Our scheduler consolidates multi-tenant LoRA serving workloads in a shared GPU cluster. With a fixed-sized GPU cluster, our evaluations show that Punica achieves 12x higher throughput in serving multiple LoRA models compared to state-of-the-art LLM serving systems while only adding 2ms latency per token. Punica is open source at https://github.com/punica-ai/punica .

Overview of Punica: A System for Efficient Multi-LoRA Model Serving on Shared GPU Clusters

The paper introduces Punica, a system designed to optimize the serving of Low-Rank Adaptation (LoRA) models on shared GPU clusters. LoRA is widely used to adapt large pre-trained models to specific domains while training only a small fraction of their parameters. Adaptation is therefore cheap, but deployment is challenging: a serving cluster may need to host a large number of distinct LoRA models concurrently.
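
For concreteness, LoRA freezes the pretrained weight W and learns a low-rank update BA, so each adapter carries only r(d_in + d_out) extra parameters. The snippet below is a minimal NumPy illustration of this parameterization (shapes are assumed, roughly 7B-scale projections; this is not Punica's API):

```python
import numpy as np

# Illustrative LoRA-adapted projection layer (assumed shapes, not Punica's API).
d_in, d_out, rank = 4096, 4096, 16

W = np.random.randn(d_out, d_in).astype(np.float32)  # frozen pretrained weight
A = np.random.randn(rank, d_in).astype(np.float32)   # trainable low-rank factor
B = np.zeros((d_out, rank), dtype=np.float32)        # trainable low-rank factor

x = np.random.randn(d_in).astype(np.float32)
y = W @ x + B @ (A @ x)          # adapted forward pass: y = (W + B A) x

print(W.size)                    # 16,777,216 frozen parameters
print(A.size + B.size)           # 131,072 adapter parameters (~0.8% of W)
```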

Punica distinguishes itself through a novel CUDA kernel design, Segmented Gather Matrix-Vector Multiplication (SGMV), which enables batching of operations across multiple distinct LoRA models. This is achieved while maintaining only a single copy of the pre-trained model per GPU, enhancing both memory efficiency and computational throughput.
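
The computation SGMV batches can be stated as a simple reference loop: requests in a batch are grouped into contiguous segments, one segment per LoRA adapter, and each segment is multiplied by its own (gathered) adapter matrix. The NumPy sketch below only spells out these semantics; the paper's contribution is the fused CUDA kernel that performs all segments in a single launch, and the names and shapes here are illustrative:

```python
import numpy as np

def sgmv_reference(x, adapter_weights, segments):
    """Reference semantics of Segmented Gather Matrix-Vector multiplication.

    x               : (batch, d_in) inputs for the whole batch, stacked
    adapter_weights : list of (d_in, d_out) LoRA factor matrices, one per adapter
    segments        : boundaries; rows segments[i]:segments[i+1] use adapter i
    """
    out = np.empty((x.shape[0], adapter_weights[0].shape[1]), dtype=x.dtype)
    for i in range(len(segments) - 1):
        s, e = segments[i], segments[i + 1]
        # One small GEMM per segment; Punica's kernel fuses these so requests
        # for different adapters share a single batched GPU launch.
        out[s:e] = x[s:e] @ adapter_weights[i]
    return out

# Example: a batch of 5 requests spread across 3 different LoRA adapters.
d_in, rank = 4096, 16
x = np.random.randn(5, d_in).astype(np.float32)
adapters = [np.random.randn(d_in, rank).astype(np.float32) for _ in range(3)]
y = sgmv_reference(x, adapters, segments=[0, 2, 3, 5])   # y.shape == (5, 16)
```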

Key Contributions

  1. Segmented Gather Matrix-Vector Multiplication (SGMV): The paper introduces SGMV as a critical innovation enabling concurrent batching of different LoRA models’ workloads. By allowing diverse requests to be served simultaneously, SGMV achieves efficient utilization of GPU resources.
  2. Innovative Scheduling Mechanisms: Punica's scheduler consolidates multi-tenant LoRA serving workloads, dynamically routing requests to a minimal set of active GPUs to sustain high throughput without over-allocating resources (a simplified packing sketch follows this list).
  3. Scalability with Minimal Latency Overhead: The system demonstrates scalability by achieving 12x higher throughput compared to existing state-of-the-art LLM serving solutions with a mere 2ms additional latency per token.
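
As referenced in item 2, a minimal sketch of the consolidation idea follows, under the simplifying assumption that the scheduler greedily packs new requests onto already-busy GPUs before touching idle ones (the paper's actual policy also handles migration and memory constraints; the names and capacity model here are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Gpu:
    gpu_id: int
    capacity: int                 # max concurrent requests (assumed fixed here)
    active: list = field(default_factory=list)

def route_request(request_id, gpus):
    """Pack onto the busiest GPU that still has room; only then wake an idle one."""
    candidates = [g for g in gpus if len(g.active) < g.capacity]
    if not candidates:
        raise RuntimeError("cluster full; request must queue")
    # Prefer the most-loaded non-full GPU so idle GPUs stay idle (consolidation).
    target = max(candidates, key=lambda g: len(g.active))
    target.active.append(request_id)
    return target.gpu_id

gpus = [Gpu(i, capacity=32) for i in range(4)]
print([route_request(r, gpus) for r in range(5)])   # all land on one GPU: [0, 0, 0, 0, 0]
```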

Performance Insights

Evaluation of Punica on NVIDIA A100 GPUs revealed substantial throughput gains when serving multiple LoRA models across popularity distributions ranging from Distinct to Identical. GPU memory was managed through rapid, on-demand loading of LoRA parameters, which only marginally affects performance because the adapters are small relative to the backbone model.

The architectural choice to maintain a single instance of the backbone model per GPU is instrumental to this efficiency. In addition, the paged KV cache strategy mitigates the memory fragmentation issues known to affect similar serving systems.
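
For context, a paged KV cache allocates the attention key/value cache in fixed-size pages through an indirection table rather than one contiguous buffer per request, so requests of very different lengths do not fragment GPU memory. The toy bookkeeping below is a generic illustration of the idea, not Punica's internal data structure:

```python
PAGE_SIZE = 16   # tokens per KV page (illustrative)

class PagedKvCache:
    """Toy page table: maps each request's cached tokens to physical pages."""

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.page_table = {}     # request_id -> list of physical page ids
        self.lengths = {}        # request_id -> number of tokens cached so far

    def append_token(self, request_id):
        n = self.lengths.get(request_id, 0)
        if n % PAGE_SIZE == 0:                       # crossing a page boundary
            self.page_table.setdefault(request_id, []).append(self.free_pages.pop())
        self.lengths[request_id] = n + 1

    def release(self, request_id):
        # Return all pages when the request finishes; nothing is stranded
        # inside another request's contiguous buffer.
        self.free_pages.extend(self.page_table.pop(request_id, []))
        self.lengths.pop(request_id, None)

cache = PagedKvCache(num_pages=1024)
for _ in range(40):                      # 40 decoded tokens -> ceil(40/16) = 3 pages
    cache.append_token("req-0")
print(len(cache.page_table["req-0"]))    # 3
cache.release("req-0")
```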

Implications and Future Directions

Punica's approach represents a significant advancement in serving efficiency for systems reliant on fine-tuning large-scale models. By enabling effective aggregation of model requests, it addresses a critical bottleneck in the deployment of domain-specific adaptations at scale. Its impact extends beyond immediate computational gains, offering a robust framework that can accommodate future advances in both hardware capabilities and model architectures.

Going forward, exploring integration with quantization techniques and further optimizations in the self-attention mechanisms may unlock additional efficiencies. Moreover, applying Punica in a production setting across varied LLM applications could yield practical insights, potentially inspiring enhancements in future iterations of LLM serving systems.

This research makes a compelling case for the strategic consolidation of GPU resources in multi-model inference scenarios, exemplifying a careful balance between innovative kernel design and real-world scalability concerns.

Authors (6)
  1. Lequn Chen (18 papers)
  2. Zihao Ye (16 papers)
  3. Yongji Wu (19 papers)
  4. Danyang Zhuo (33 papers)
  5. Luis Ceze (38 papers)
  6. Arvind Krishnamurthy (37 papers)
Citations (21)