- The paper introduces KVDirect, a framework that reduces LLM inference latency by 55% via optimized KV cache transfers.
- It utilizes a novel tensor-centric RDMA communication mechanism to bypass CPU overhead, enhancing GPU-NIC interaction efficiency.
- The framework implements dynamic GPU resource scheduling and a pull-based cache strategy, significantly improving scalability in distributed inference.
KVDirect: Distributed Disaggregated LLM Inference
Introduction
The paper "KVDirect: Distributed Disaggregated LLM Inference" addresses the inefficiencies in the current strategies for disaggregated inference in LLMs. Disaggregated inference separates the prefill and decode stages to enhance hardware utilization and improve service quality. However, existing solutions are restricted to single-node deployments due to constraints in inter-node communication, significantly impairing scalability and flexibility. This paper proposes KVDirect, a framework that optimizes KV cache transfer to facilitate distributed disaggregated LLM inference, achieving a notable reduction in per-request latency.
Disaggregated LLM Inference
Disaggregated inference divides the workload between two distinct workers: a prefill worker that computes the KV cache for all prompt tokens, and a decode worker that generates the response using this cache. Existing systems confine disaggregated inference to a single node, relying primarily on fast intra-node NVLink for KV cache transfer, which makes resource allocation inflexible and reduces overall service capacity. The paper identifies redundant synchronization and data movement as the critical issues, leaving only a small fraction of message-passing-based communication time spent on effective data transfer (Figure 1).
Figure 1: The workflow of disaggregated LLM inference with an emphasis on KV cache.
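To make the division of labor concrete, below is a minimal sketch of the prefill/decode handoff. It assumes a transformer-style per-layer KV cache of shape [num_heads, seq_len, head_dim] and uses random tensors in place of a real model forward pass; the function names and constants are illustrative, not KVDirect's actual API.

```python
import torch

NUM_LAYERS, NUM_HEADS, HEAD_DIM = 4, 8, 64


def prefill_worker(prompt_tokens: list[int]) -> list[tuple[torch.Tensor, torch.Tensor]]:
    """Compute a per-layer KV cache for the whole prompt (random stand-in for a real forward pass)."""
    seq_len = len(prompt_tokens)
    return [(torch.randn(NUM_HEADS, seq_len, HEAD_DIM),
             torch.randn(NUM_HEADS, seq_len, HEAD_DIM))
            for _ in range(NUM_LAYERS)]


def decode_worker(kv_cache: list[tuple[torch.Tensor, torch.Tensor]],
                  max_new_tokens: int = 4) -> list[int]:
    """Generate tokens one at a time, appending each step's keys/values to the received cache."""
    generated = []
    for step in range(max_new_tokens):
        for layer, (k, v) in enumerate(kv_cache):
            new_k = torch.randn(NUM_HEADS, 1, HEAD_DIM)
            new_v = torch.randn(NUM_HEADS, 1, HEAD_DIM)
            kv_cache[layer] = (torch.cat([k, new_k], dim=1),
                               torch.cat([v, new_v], dim=1))
        generated.append(step)  # placeholder token id
    return generated


if __name__ == "__main__":
    cache = prefill_worker(prompt_tokens=[1, 2, 3, 4, 5])  # prefill stage
    print(decode_worker(cache))                            # decode stage reuses the cache
```

In a disaggregated deployment, these two functions run on different nodes, and everything the decode worker needs is contained in the KV cache handed over between them, which is why the transfer mechanism matters so much for end-to-end latency.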
KVDirect's Innovations
The paper presents several key innovations:
- Tensor-centric Communication Mechanism: KVDirect introduces a communication mechanism that significantly reduces synchronization overhead by enabling direct RDMA-based transfers. This approach bypasses the CPU, allowing efficient GPU-to-NIC communication.
- Dynamic GPU Resource Scheduling: The system supports dynamic GPU resource scheduling that facilitates efficient KV cache transfers and optimizes resource utilization.
- Pull-based KV Cache Transfer Strategy: By adopting a pull-based strategy, decode workers read the KV cache directly from prefill workers instead of waiting for it to be pushed. This minimizes GPU idling and reduces latency, improving performance under high queries-per-second (QPS) loads (Figure 2); a minimal sketch of the pull semantics follows the figure.
Figure 2: The message-based KV cache transfer with 4KB block size, where the blue and red arrows represent the communication over PCIe and network, respectively.
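As a rough illustration of the pull semantics, the sketch below simulates one-sided RDMA READ behavior in plain Python: the prefill worker registers its KV tensors and publishes small descriptors, and the decode worker pulls each tensor on its own schedule with no send/receive handshake on the prefill side. All names here (TensorDescriptor, register_kv_tensor, rdma_read, REGISTERED_MEMORY) are hypothetical stand-ins, not KVDirect's actual API; a real implementation would issue GPUDirect RDMA reads so the NIC moves GPU memory directly, which is what lets the transfer bypass the CPU.

```python
from dataclasses import dataclass

# Stand-in for memory regions the prefill worker has registered with the NIC.
REGISTERED_MEMORY: dict[int, bytes] = {}


@dataclass
class TensorDescriptor:
    rkey: int        # remote key identifying the registered region
    num_bytes: int   # size of the KV tensor in bytes


def register_kv_tensor(rkey: int, payload: bytes) -> TensorDescriptor:
    """Prefill side: expose one KV tensor for remote one-sided reads."""
    REGISTERED_MEMORY[rkey] = payload
    return TensorDescriptor(rkey=rkey, num_bytes=len(payload))


def rdma_read(desc: TensorDescriptor) -> bytes:
    """Decode side: pull the tensor directly; no prefill-side CPU involvement."""
    return REGISTERED_MEMORY[desc.rkey][:desc.num_bytes]


if __name__ == "__main__":
    # Prefill worker publishes descriptors for two (fake) KV tensors.
    descs = [register_kv_tensor(rkey=i, payload=bytes(4096)) for i in range(2)]
    # Decode worker pulls them only when it has reserved memory for decoding.
    pulled = [rdma_read(d) for d in descs]
    print([len(t) for t in pulled])  # [4096, 4096]
```

The key design point is the direction of the transfer: because the decode worker fetches the cache exactly when it is ready to use it, neither side blocks waiting on the other.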
Experimental Results
In evaluation, KVDirect reduced per-request latency by 55% compared to the baseline across various workloads. The evaluation reports improvements in Time To First Token (TTFT) and Time Between Tokens (TBT), two key metrics for LLM inference efficiency. KVDirect also achieves high bandwidth utilization, demonstrating its effectiveness on long prompts and long response generation without the constraints seen in previous systems (Figure 3).
Figure 3: The achieved bandwidth of UCX message-sending.
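For reference, TTFT and TBT can be computed from per-token emission timestamps as shown below; these are the standard definitions of the two metrics, not code taken from the paper.

```python
def ttft_and_tbt(request_start: float, token_times: list[float]) -> tuple[float, float]:
    """TTFT = time from request arrival to the first generated token.
    TBT   = average gap between consecutive generated tokens."""
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tbt = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, tbt


# Example: a request arriving at t=0.0s whose tokens are emitted at these times.
print(ttft_and_tbt(0.0, [0.35, 0.40, 0.46, 0.51]))  # TTFT=0.35s, TBT≈0.053s
```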
Implications and Future Work
The implications of KVDirect are substantial, offering a path toward scalable, efficient distributed inference for LLMs. By removing the single-node constraint, this work could support richer, more complex LLM deployments, unlocking potential use cases in real-time language processing and large-scale data interpretation tasks.
Future work suggested by the authors includes refining resource allocation to adapt dynamically to changing workloads and further reducing latency through smarter GPU utilization. The potential for broader adoption across diverse AI applications and industries is evident as demand for sophisticated LLMs continues to grow.
Conclusion
KVDirect represents a significant advancement in distributed disaggregated LLM inference. By optimizing inter-node communication and resource allocation, it provides a framework for more efficient and scalable inference operations. The reduction in latency and improved resource utilization positions KVDirect as a transformative solution in the evolving landscape of large-scale AI deployments.