Papers

Topics

Authors

Recent

View all

Detailed Answer

Quick Answer

Concise responses based on abstracts only

Detailed Answer

Well-researched responses based on abstracts and relevant paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses

Gemini 2.5 Flash

Gemini 2.5 Flash 59 tok/s

Gemini 2.5 Pro 52 tok/s Pro

GPT-5 Medium 40 tok/s Pro

GPT-5 High 27 tok/s Pro

GPT-4o 104 tok/s Pro

Kimi K2 195 tok/s Pro

GPT OSS 120B 467 tok/s Pro

Claude Sonnet 4 37 tok/s Pro

2000 character limit reached

MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool (2406.17565v3)

Published 25 Jun 2024 in cs.DC

Abstract: LLM serving has transformed from stateless to stateful systems, utilizing techniques like context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, necessitating a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemPool, an elastic memory pool managing distributed memory and KV caches across serving instances. Using MemPool APIs, MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that enhances cache reuse through a global prompt tree-based locality-aware policy. Tests show that MemServe significantly improves job completion time and time-to-first-time.

Citations (10)

View on Semantic Scholar