
DéjàVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving (2403.01876v1)

Published 4 Mar 2024 in cs.DC

Abstract: Distributed LLM serving is costly and often underutilizes hardware accelerators due to three key challenges: bubbles in pipeline-parallel deployments caused by the bimodal latency of prompt and token processing, GPU memory overprovisioning, and long recovery times in case of failures. In this paper, we propose DéjàVu, a system to address all these challenges using a versatile and efficient KV cache streaming library (DéjàVuLib). Using DéjàVuLib, we propose and implement efficient prompt-token disaggregation to reduce pipeline bubbles, microbatch swapping for efficient GPU memory management, and state replication for fault-tolerance. We highlight the efficacy of these solutions on a range of large models across cloud deployments.

Summary

  • The paper introduces efficient KV-cache streaming with prompt-token disaggregation, microbatch swapping, and state replication to improve latency and fault tolerance.
  • The microbatch swapping technique dynamically manages GPU memory, enabling up to 1.8× throughput improvement by reducing memory footprint.
  • The state replication mechanism ensures rapid recovery from failures, significantly minimizing downtime in distributed LLM operations.

DéjàVu: Enhancing Generative LLM Serving with KV-cache Streaming

Introduction to DéjàVu

The advent of distributed serving for LLMs like GPT-3, OPT, and BLOOM has transformed numerous applications, including chatbots, code generation, and text summarization. However, the underlying infrastructure often suffers from hardware underutilization, overprovisioned GPU memory, and long recovery times after failures. To address these challenges, the paper presents DéjàVu, a system built around an efficient Key-Value (KV) cache streaming library, DéjàVuLib. On top of this library, DéjàVu introduces prompt-token disaggregation, microbatch swapping, and state replication to provide fast, scalable, and fault-tolerant generative LLM serving.

Disaggregating Prompt Processing from Token Generation

Prompt processing (prefill) and per-token generation have markedly different latencies, and this bimodal behavior creates stalls, or "pipeline bubbles", in pipeline-parallel deployments. DéjàVu addresses this by disaggregating prompt processing from token generation, assigning each phase its own machines so that resources can be provisioned to match each phase's demands and GPU utilization stays high. The KV cache produced during prompt processing is then streamed to the token-generation workers via DéjàVuLib, keeping the handoff between the two phases fast and off the critical path.
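
To make the disaggregated flow concrete, here is a minimal Python sketch under simplifying assumptions (dummy tensors stand in for real attention, and an in-process generator stands in for DéjàVuLib's network streaming; the function names and shapes are illustrative, not the paper's API). A prompt worker runs the prefill phase, its KV cache is streamed layer by layer, and a token worker resumes decoding from the received cache:

```python
# Hypothetical sketch of prompt-token disaggregation; not DéjàVu's actual code.
import torch

NUM_LAYERS, NUM_HEADS, HEAD_DIM = 4, 8, 64

def prompt_worker(prompt_len: int):
    """Run the prefill phase and return the resulting KV cache (one (K, V) pair per layer)."""
    kv_cache = []
    for _ in range(NUM_LAYERS):
        k = torch.randn(NUM_HEADS, prompt_len, HEAD_DIM)  # stand-in for real attention keys
        v = torch.randn(NUM_HEADS, prompt_len, HEAD_DIM)  # stand-in for real attention values
        kv_cache.append((k, v))
    return kv_cache

def stream_kv_cache(kv_cache):
    """Stand-in for library-style streaming: hand over the cache layer by layer."""
    for layer_id, (k, v) in enumerate(kv_cache):
        yield layer_id, k.clone(), v.clone()  # clone() models a copy over the network

def token_worker(kv_stream, max_new_tokens: int):
    """Receive the streamed cache, then decode, appending one token's KV entries per step."""
    kv_cache = [None] * NUM_LAYERS
    for layer_id, k, v in kv_stream:
        kv_cache[layer_id] = (k, v)
    for _ in range(max_new_tokens):
        for layer_id, (k, v) in enumerate(kv_cache):
            new_k = torch.randn(NUM_HEADS, 1, HEAD_DIM)  # KV entries of the new token
            new_v = torch.randn(NUM_HEADS, 1, HEAD_DIM)
            kv_cache[layer_id] = (torch.cat([k, new_k], dim=1), torch.cat([v, new_v], dim=1))
    return kv_cache

cache = prompt_worker(prompt_len=128)
final_cache = token_worker(stream_kv_cache(cache), max_new_tokens=16)
print(final_cache[0][0].shape)  # torch.Size([8, 144, 64]): 128 prompt + 16 generated tokens
```

In a real deployment, the layer-by-layer transfer would be overlapped with ongoing prompt processing so that the token workers can start as soon as the first layers arrive, rather than waiting for the full cache.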

Microbatch Swapping for GPU Memory Management

Conventional systems tend to overprovision GPU memory by preallocating KV cache space for all microbatches, even though the microbatches are processed sequentially. DéjàVu introduces microbatch swapping, which dynamically manages GPU memory by swapping KV cache state between the GPU and the CPU so that only the active microbatch's cache occupies GPU memory. This significantly reduces the GPU memory footprint, allowing larger batch sizes and better resource utilization.
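
The sketch below illustrates the idea under simple assumptions (PyTorch, a single GPU, a dummy compute step; the buffer layout and function names are hypothetical, not the paper's implementation). Each microbatch's KV cache lives in pinned host memory and is copied to the GPU on a side CUDA stream only while that microbatch is being processed:

```python
# Hypothetical sketch of microbatch swapping; not DéjàVu's actual implementation.
import torch

assert torch.cuda.is_available(), "this sketch assumes a CUDA device"
device = torch.device("cuda")
copy_stream = torch.cuda.Stream()  # side stream so swaps can overlap with compute

NUM_MICROBATCHES, CACHE_SHAPE = 4, (8, 1024, 64)

# All microbatch caches start in pinned host memory (cheap to DMA to/from the GPU).
host_caches = [torch.zeros(CACHE_SHAPE, pin_memory=True) for _ in range(NUM_MICROBATCHES)]

def process_microbatch(mb_id: int):
    # 1. Swap in: copy this microbatch's cache to the GPU on the side stream.
    with torch.cuda.stream(copy_stream):
        gpu_cache = host_caches[mb_id].to(device, non_blocking=True)
    torch.cuda.current_stream().wait_stream(copy_stream)  # compute waits for the swap-in

    # 2. Compute: stand-in for one pipeline stage's attention over the cache.
    gpu_cache += 1.0

    # 3. Swap out: copy the updated cache back to pinned host memory. A real system
    #    would overlap this with the next microbatch's compute; here we simply wait.
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        host_caches[mb_id].copy_(gpu_cache, non_blocking=True)
    copy_stream.synchronize()

for step in range(2):  # two decoding steps over all microbatches
    for mb in range(NUM_MICROBATCHES):
        process_microbatch(mb)
torch.cuda.synchronize()
print(host_caches[0].mean().item())  # 2.0 after two increments
```

The key design point is that only one microbatch's cache needs to reside on the GPU at a time; the freed memory can then hold more or longer sequences per batch.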

State Replication for Fault Tolerance

The paper identifies the stateful nature of LLM serving, inherent in KV caching, as a vulnerability, particularly in distributed setups prone to hardware or software failures. DéjàVu enhances fault tolerance through state replication, ensuring that KV cache states are replicated across different nodes. This approach avoids the redundant computation otherwise needed to rebuild lost state after a failure, thus reducing recovery times and improving overall reliability.
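
As a rough, in-process illustration (a stand-in, not DéjàVu's actual replication protocol or API), the sketch below mirrors each newly generated token's KV entries to a replica store; after a simulated failure, the worker restores the cache from the replica rather than recomputing the prompt and all generated tokens:

```python
# Hypothetical sketch of KV-cache state replication; not the paper's protocol.
import torch

HEADS, HEAD_DIM = 8, 64

class ReplicaStore:
    """Stand-in for a peer node (or CPU/remote storage) that mirrors the KV cache."""
    def __init__(self):
        self._chunks = {}  # request_id -> list of per-token (K, V) chunks
    def append(self, request_id, kv_chunk):
        self._chunks.setdefault(request_id, []).append(kv_chunk)
    def restore(self, request_id):
        chunks = self._chunks[request_id]
        ks = torch.cat([k for k, _ in chunks], dim=1)
        vs = torch.cat([v for _, v in chunks], dim=1)
        return ks, vs

replica = ReplicaStore()

def decode_step(request_id, kv_cache):
    """Generate one token's KV entries, extend the local cache, and replicate the delta."""
    new_k, new_v = torch.randn(HEADS, 1, HEAD_DIM), torch.randn(HEADS, 1, HEAD_DIM)
    replica.append(request_id, (new_k, new_v))  # replicate only the new entries
    if kv_cache is None:
        return new_k, new_v
    k, v = kv_cache
    return torch.cat([k, new_k], dim=1), torch.cat([v, new_v], dim=1)

# Normal operation: a few decode steps for request "r0".
cache = None
for _ in range(5):
    cache = decode_step("r0", cache)

# Simulated failure: the local cache is lost...
cache = None
# ...and recovery pulls the replicated state instead of recomputing it.
cache = replica.restore("r0")
print(cache[0].shape)  # torch.Size([8, 5, 64])
```

Replicating only the incremental KV entries keeps the replication traffic proportional to the tokens generated, rather than re-sending the whole cache at every step.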

Evaluating DéjàVu: Findings and Implications

The empirical evaluation of DéjàVu highlights its effectiveness across several metrics. Compared to existing LLM serving systems, DéjàVu demonstrates up to a 2× improvement in throughput. Moreover, the microbatch swapping mechanism enables up to a 1.8× throughput improvement by supporting larger batch sizes. In scenarios with system failures, DéjàVu achieves a notable decrease in recovery time, affirming its fault-tolerance capabilities.

Beyond these immediate performance gains, DéjàVu's contributions have broader implications. By addressing the inefficiencies in distributed LLM serving, DéjàVu not only improves the utilization of computational resources but also makes it more practical to serve larger and more complex models. Future developments could explore further optimizations in KV cache management and fault tolerance, alongside expanding DéjàVu's applicability to a broader range of distributed AI serving tasks.

Conclusion

In summary, DéjàVu emerges as a pivotal solution to the prevailing challenges in the distributed serving of generative LLMs. Through strategies like prompt-token disaggregation, microbatch swapping, and state replication, DéjàVu sets a new standard for efficiency, scalability, and resilience in LLM serving systems. As the field continues to evolve, the principles and mechanisms introduced by DéjàVu are likely to influence future research and development in AI and machine learning infrastructure.
