Papers
Topics
Authors
Recent
2000 character limit reached

KVSwap: Disk-aware KV Cache Offloading for Long-Context On-device Inference (2511.11907v1)

Published 14 Nov 2025 in cs.DC and cs.AI

Abstract: LLMs (LMs) underpin emerging mobile and embedded AI applications like meeting and video summarization and document analysis, which often require processing multiple long-context inputs. Running an LM locally on-device improves privacy, enables offline use, and reduces cost, but long-context inference quickly hits a \emph{memory capacity wall} as the key-value (KV) cache grows linearly with context length and batch size. We present KVSwap, a software framework to break this memory wall by offloading the KV cache to non-volatile secondary storage (disk). KVSwap leverages the observation that only a small, dynamically changing subset of KV entries is critical for generation. It stores the full cache on disk, uses a compact in-memory metadata to predict which entries to preload, overlaps computation with hardware-aware disk access, and orchestrates read patterns to match storage device characteristics. Our evaluation shows that across representative LMs and storage types, KVSwap delivers higher throughput under tight memory budgets while maintaining the generation quality when compared with existing KV cache offloading schemes.

Summary

We haven't generated a summary for this paper yet.

Slide Deck Streamline Icon: https://streamlinehq.com

Whiteboard

Dice Question Streamline Icon: https://streamlinehq.com

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Lightbulb Streamline Icon: https://streamlinehq.com

Continue Learning

We haven't generated follow-up questions for this paper yet.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.