A System for Microserving of LLMs (2412.12488v1)

Published 17 Dec 2024 in cs.DC

Abstract: The recent advances in LLMs bring a strong demand for efficient system support to improve overall serving efficiency. As LLM inference scales towards multiple GPUs and even multiple compute nodes, various coordination patterns, such as prefill-decode disaggregation and context migration, arise in serving systems. Most inference services today expose a coarse-grained request-level API with a pre-configured coordination strategy, limiting the ability to customize and dynamically reconfigure the coordination. In this paper, we propose LLM microserving, a multi-level architecture for structuring and programming LLM inference services. We introduce simple yet effective microserving APIs to support fine-grained sub-request level actions. A programmable router transforms user requests into sub-request calls, enabling the dynamic reconfiguration of serving patterns. To support diverse execution patterns, we develop a unified KV cache interface that handles various KV compute, transfer, and reuse scenarios. Our evaluation shows that LLM microserving can be reconfigured to support multiple disaggregation orchestration strategies in a few lines of Python code while maintaining state-of-the-art performance for LLM inference tasks. Additionally, it allows us to explore new strategy variants that reduce up to 47% of job completion time compared to the existing strategies.

Summary

  • The paper introduces LLM microserving, a novel architecture for efficient LLM inference that enhances configurability and dynamic reconfiguration at a fine-grained sub-request level.
  • The system uses a programmable router to transform requests into sub-request calls and incorporates a unified KV cache interface for efficient management of various compute, transfer, and reuse scenarios.
  • Evaluation shows the architecture maintains state-of-the-art performance and enables novel orchestration strategies, such as balanced prefill-decode disaggregation, which reduces job completion time by up to 47%.

The paper presents a novel architectural framework termed "LLM microserving" for improving the efficiency of LLM inference services, addressing the current limitations in configurability and dynamic reconfiguration within existing LLM serving systems. The proposed architecture introduces a highly modular and programmable system that facilitates fine-grained sub-request level processing of LLM inference tasks.

Key Contributions:

  • Microserving Architecture: The LLM microserving architecture is designed to allow dynamic reconfiguration of inference patterns by transforming user requests into sub-request calls through a programmable router. This router enables diverse orchestration strategies by supporting actions at a sub-request level, offering a significant increase in flexibility compared to traditional coarse-grained request-level APIs.
  • Unified KV Cache Interface: A crucial element of the proposed architecture is the unified Key-Value (KV) cache interface. This interface is capable of efficiently managing various KV compute, transfer, and reuse scenarios common in LLM inference, thus facilitating a range of execution patterns.
  • Performance and Flexibility: The authors demonstrate that their system not only maintains state-of-the-art performance for LLM inference but also enables the exploration of novel orchestration strategies. By changing only a few lines of Python code in the router, the system can be reconfigured to support multiple disaggregation orchestration strategies; a sketch of such a router program appears after this list. A notable strategy explored in this work is balanced prefill-decode disaggregation, which reduces job completion time by up to 47% compared to existing strategies.
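
To make the sub-request programming model concrete, the sketch below shows what a router program for (balanced) prefill-decode disaggregation could look like. The Engine stub and the method names (prep_recv, remote_send, start_generate) are illustrative assumptions standing in for the paper's microserving APIs; the real signatures and semantics may differ.

```python
# Minimal router sketch for (balanced) prefill-decode disaggregation.
# The Engine stub and its method names are illustrative stand-ins for
# sub-request microserving APIs; real signatures may differ.

from dataclasses import dataclass, field


@dataclass
class Engine:
    """Stand-in for one inference engine (e.g. one GPU group)."""
    name: str
    kv_cache: dict = field(default_factory=dict)  # seq_id -> list of KV entries

    def prep_recv(self, seq_id: str, end: int) -> int:
        # Reserve KV slots for tokens [0, end) and return a handle a sender can target.
        self.kv_cache.setdefault(seq_id, [None] * end)
        return id(self.kv_cache[seq_id])

    def remote_send(self, seq_id: str, tokens: list[int], begin: int, end: int,
                    dst: "Engine", dst_handle: int) -> None:
        # Compute KV for tokens[begin:end] locally, then "transfer" it to dst.
        # dst_handle identifies the allocation made by prep_recv; unused in this stub.
        kv = [f"kv({t})" for t in tokens[begin:end]]   # pretend attention prefill
        dst.kv_cache[seq_id][begin:end] = kv           # pretend NVSHMEM-style transfer

    def start_generate(self, seq_id: str, tokens: list[int], begin: int,
                       max_new_tokens: int) -> list[int]:
        # Prefill the remaining tokens [begin:] locally, then decode.
        self.kv_cache[seq_id][begin:] = [f"kv({t})" for t in tokens[begin:]]
        return [0] * max_new_tokens                    # dummy decoded tokens


def route_request(tokens: list[int], prefill: Engine, decode: Engine,
                  balance_ratio: float = 1.0, max_new_tokens: int = 16) -> list[int]:
    """Balanced disaggregation: the prefill engine computes KV for the first
    balance_ratio fraction of the prompt; the decode engine prefills the rest
    and then decodes."""
    split = int(len(tokens) * balance_ratio)
    handle = decode.prep_recv("req0", len(tokens))
    prefill.remote_send("req0", tokens, 0, split, dst=decode, dst_handle=handle)
    return decode.start_generate("req0", tokens, begin=split,
                                 max_new_tokens=max_new_tokens)


if __name__ == "__main__":
    out = route_request(list(range(32)), Engine("prefill"), Engine("decode"),
                        balance_ratio=0.9)
    print(len(out), "tokens generated")
```

The point of the sketch is that the coordination strategy lives entirely in the router function: changing the split point, the engines involved, or the order of sub-request calls reconfigures the serving pattern without modifying the engines themselves.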

Evaluation:

The authors rigorously evaluate their architecture across several dimensions:

  1. Programmable Patterns: Through experimentation, it is shown that LLM microserving supports multiple serving patterns such as data parallel, prefill-decode disaggregation, and context cache migration. This is achieved with a concise and adaptable router implementation that adjusts dynamically to workload variations.
  2. Performance Benefits in Various Contexts: The system demonstrates clear performance advantages in scenarios with longer input requests, highlighting the capacity for prefill-decode disaggregation to decrease latency and balance workloads without extensive engine reconfiguration.
  3. KV Migration Efficiency: The proposed architecture efficiently leverages global context caches to reduce computation overhead during prefill operations, achieving around a 1.7× improvement in prefill time for certain input lengths by minimizing redundant KV processing (a sketch of this prefix-reuse idea follows this list).
  4. Impact of Prefill-Decode Balance: By analyzing different prefill-decode balance ratios, the paper underscores the importance of dynamic workload adjustment based on request loads, demonstrating that higher ratios can better balance system pressure in longer input scenarios.
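
To illustrate how a global context cache can cut redundant prefill work (point 3 above), the following sketch matches a request against previously computed KV entries and prefills only the uncached suffix. The cache layout and helper names are assumptions made for illustration, not the system's actual data structures.

```python
# Sketch of KV reuse through a global context cache: before prefilling a request,
# match the longest cached prompt prefix and compute KV only for the uncached
# suffix. The cache layout is an illustrative assumption.


# Maps a cached prompt prefix (as a token tuple) to its computed KV entries.
GlobalContextCache = dict[tuple[int, ...], list[str]]


def prefill_with_reuse(tokens: list[int], cache: GlobalContextCache) -> list[str]:
    """Return KV entries for `tokens`, reusing the longest cached prefix."""
    best_len, best_kv = 0, []
    for prefix, kv in cache.items():
        n = len(prefix)
        if n > best_len and tuple(tokens[:n]) == prefix:
            best_len, best_kv = n, kv

    # Only the uncached suffix pays the prefill cost.
    new_kv = [f"kv({t})" for t in tokens[best_len:]]   # pretend attention prefill
    full_kv = list(best_kv) + new_kv

    # Publish the full prompt's KV so later requests can reuse it.
    cache[tuple(tokens)] = full_kv
    return full_kv


if __name__ == "__main__":
    cache: GlobalContextCache = {}
    shared_context = list(range(100))                        # e.g. a long shared system prompt
    prefill_with_reuse(shared_context, cache)                # cold: computes 100 KV entries
    kv = prefill_with_reuse(shared_context + [200], cache)   # warm: computes only 1 new entry
    print(len(kv))
```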

Design and Implementation:

  • The system uses NVSHMEM, a library for efficient GPU-to-GPU communication, to perform low-overhead, asynchronous KV transfers that overlap with computation and mitigate potential bottlenecks.
  • The proposed API is simple yet effective, abstracting complex operations such as KV transfer and recomputation in a way that is easily programmable and adaptable to future strategy variants (a sketch of what such a unified interface could look like appears below).
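
As a rough illustration of what a unified KV cache interface could cover, the sketch below groups KV compute, transfer, and prefix reuse behind a single abstraction. The method set and signatures are assumptions for illustration, not the interface defined in the paper.

```python
# Sketch of a unified KV cache abstraction covering compute, transfer, and reuse,
# so that router programs never touch device-level details. The method set and
# signatures are illustrative assumptions.

from abc import ABC, abstractmethod


class KVCacheInterface(ABC):
    """Unified view of KV compute, transfer, and reuse for one engine."""

    @abstractmethod
    def allocate(self, seq_id: str, num_tokens: int) -> int:
        """Reserve KV slots for a sequence; return a handle remote senders can target."""

    @abstractmethod
    def compute(self, seq_id: str, tokens: list[int], begin: int, end: int) -> None:
        """Run local prefill for tokens[begin:end], filling the corresponding KV slots."""

    @abstractmethod
    def send(self, seq_id: str, begin: int, end: int, dst_handle: int) -> None:
        """Asynchronously transfer KV[begin:end] to a remote engine (for example over
        NVSHMEM), overlapping the transfer with ongoing computation."""

    @abstractmethod
    def match_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens already have KV available for reuse."""
```

Under an abstraction of this kind, the router-level operations sketched earlier (prefill, transfer, reuse) could be expressed as compositions of a few primitives, which is what makes serving patterns reconfigurable in a few lines of code.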

In conclusion, the LLM microserving system offers a robust platform for LLM inference tasks, enhancing both configurability and performance through its modular design and advanced orchestration capabilities. This work paves the way for more responsive and efficient LLM serving systems that can dynamically adapt to varying demands and computational patterns.