Efficient Knowledge Caching for Retrieval-Augmented Generation with RAGCache
Introduction to RAGCache
Retrieval-Augmented Generation (RAG) significantly enhances large language models (LLMs) by integrating external knowledge bases. While RAG substantially improves generation quality, it also increases memory and computation overhead because each request is augmented with long sequences of retrieved text. These costs highlight the need for efficient mechanisms to manage and optimize resource utilization in RAG systems.
We propose RAGCache, a multilevel dynamic caching system designed to address these issues. RAGCache organizes and manages the intermediate states (the key-value tensors) of retrieved knowledge, improving the efficiency of both the retrieval and generation phases of RAG systems.
System Characterization and Motivation
RAGCache is grounded in a detailed system characterization that identifies the LLM generation step as the primary performance bottleneck: augmenting user requests with external documents produces long input sequences, which inflate both memory usage and computation, particularly during prefill.
Examining current caching mechanisms, we identified significant headroom in caching the intermediate states of frequently retrieved knowledge. Experiments show that the standard practice of caching only per-request LLM inference states falls short, because augmented requests carry much longer sequences. Our characterization also found that a small subset of documents accounts for a large share of retrieval requests, an access skew that presents a prime opportunity for caching.
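To illustrate why this skew matters, the sketch below simulates retrievals under a hypothetical Zipf-like access distribution. The document count, cache size, and skew parameter are illustrative assumptions rather than measurements from RAGCache; the point is simply that a small cache of document states can absorb most accesses when retrievals are skewed.

```python
# Minimal simulation of skewed document retrieval (hypothetical numbers,
# not measurements from the RAGCache paper). A small LRU cache over
# per-document states already captures most accesses when retrievals
# follow a Zipf-like distribution.
from collections import OrderedDict

import numpy as np

NUM_DOCS = 100_000        # documents in the knowledge base (assumed)
NUM_REQUESTS = 200_000    # retrieval requests to simulate (assumed)
CACHE_CAPACITY = 1_000    # document states the cache can hold (assumed, ~1%)

rng = np.random.default_rng(0)
# Zipf-distributed document ids: a few "hot" documents dominate.
doc_ids = rng.zipf(a=1.2, size=NUM_REQUESTS) % NUM_DOCS

cache = OrderedDict()
hits = 0
for doc in doc_ids:
    if doc in cache:
        hits += 1
        cache.move_to_end(doc)          # refresh recency on a hit
    else:
        cache[doc] = None
        if len(cache) > CACHE_CAPACITY:
            cache.popitem(last=False)   # evict the least recently used entry

print(f"hit rate with a 1% cache: {hits / NUM_REQUESTS:.1%}")
```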
Architectural Overview
At its core, RAGCache introduces a knowledge tree that maps onto the memory hierarchy: cached intermediate states of frequently accessed documents reside in GPU memory, while less frequently accessed states are kept in host memory.
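The following is a minimal sketch of how such a tree might be represented; the class and field names are illustrative assumptions, not RAGCache's actual data structures. Each path from the root encodes an ordered sequence of retrieved documents, because a document's key-value tensors depend on everything that precedes it in the prompt.

```python
# Illustrative knowledge-tree node (field names are assumptions, not the
# actual RAGCache implementation). A path root -> node corresponds to an
# ordered document sequence whose KV states have been cached.
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    GPU = "gpu"    # hot states kept close to the model
    HOST = "host"  # colder states offloaded to CPU memory


@dataclass
class KnowledgeNode:
    doc_id: str
    kv_handle: object | None = None                  # handle to cached KV tensors
    tier: Tier = Tier.HOST
    children: dict[str, "KnowledgeNode"] = field(default_factory=dict)

    def lookup(self, doc_sequence: list[str]) -> "KnowledgeNode | None":
        """Return the deepest cached node matching the ordered document prefix."""
        node, match = self, None
        for doc_id in doc_sequence:
            child = node.children.get(doc_id)
            if child is None:
                break
            node, match = child, child
        return match
```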
The cache is governed by a prefix-aware Greedy-Dual-Size-Frequency (PGDSF) replacement policy that weighs document access frequency, size, and recency, enabling eviction decisions that match the access patterns of RAG workloads. In addition, RAGCache employs dynamic speculative pipelining, overlapping knowledge retrieval with LLM inference to hide retrieval latency and reduce end-to-end delay.
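PGDSF builds on the classic Greedy-Dual-Size-Frequency scheme. Below is a minimal sketch of a GDSF-style priority and eviction step; the field names and formula follow the standard GDSF formulation and are not necessarily the exact prefix-aware variant RAGCache uses, which also accounts for document order within a prompt.

```python
# Sketch of a Greedy-Dual-Size-Frequency style eviction priority
# (illustrative only; RAGCache's prefix-aware variant differs in detail).
# Entries with the lowest priority are evicted first; the clock term
# ages old entries so recency matters alongside frequency and size.
from dataclasses import dataclass


@dataclass
class CacheEntry:
    frequency: int         # how often this document's states were reused
    size_tokens: int       # size of the cached KV states, in tokens
    recompute_cost: float  # estimated cost to rebuild the states (e.g. prefill time)


def gdsf_priority(entry: CacheEntry, clock: float) -> float:
    """Lower priority means evicted earlier."""
    return clock + entry.frequency * entry.recompute_cost / entry.size_tokens


def evict_one(entries: dict[str, CacheEntry], clock: float) -> tuple[str, float]:
    """Evict the lowest-priority entry and advance the clock to its priority."""
    victim = min(entries, key=lambda k: gdsf_priority(entries[k], clock))
    new_clock = gdsf_priority(entries[victim], clock)
    del entries[victim]
    return victim, new_clock
```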
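Likewise, the overlap between retrieval and inference can be sketched as a simple speculative scheme: start prefill on the retriever's provisional results and keep the work only if the final results agree. Everything below (helper functions, timings, document IDs) is a simulated stand-in, not RAGCache's actual pipeline.

```python
# Minimal asyncio sketch of overlapping knowledge retrieval with speculative
# LLM prefill (all helpers are simulated stand-ins, not the real pipeline).
import asyncio


async def retrieve(query: str, *, final: bool) -> list[str]:
    # Simulated vector search: provisional results arrive early,
    # final results only after the full search completes.
    await asyncio.sleep(0.05 if not final else 0.15)
    return ["doc-42", "doc-7"]


async def prefill(docs: list[str], query: str) -> str:
    # Simulated LLM prefill over the retrieved documents plus the query.
    await asyncio.sleep(0.2)
    return f"kv-states({'+'.join(docs)}|{query})"


async def speculative_generate(query: str) -> str:
    provisional = await retrieve(query, final=False)
    task = asyncio.create_task(prefill(provisional, query))  # start prefill early
    final = await retrieve(query, final=True)
    if final == provisional:
        return await task          # speculation succeeded: retrieval cost hidden
    task.cancel()                  # mis-speculation: discard and redo prefill
    return await prefill(final, query)


print(asyncio.run(speculative_generate("what is RAGCache?")))
```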
Implementation and Performance
Implemented on top of vLLM and evaluated across a range of LLM configurations and datasets, RAGCache delivers substantial performance improvements:
- Reduction in Time to First Token (TTFT) by up to 4x compared to existing RAG systems on various benchmarks.
- Throughput improvements of up to 2.1x under comparable computational resources.
Future Directions
While RAGCache marks a significant step forward, there is room to push RAG systems further. Promising directions include more advanced predictive prefetching for the cache and deeper integration with different types of external knowledge bases, potentially extending beyond text to multi-modal data.
Conclusion
RAGCache offers a novel approach to optimizing retrieval-augmented generation, addressing its key performance bottlenecks through efficient knowledge caching and speculative scheduling of retrieval and inference. By combining a knowledge tree with an access-aware replacement policy, RAGCache coordinates the retrieval and generation steps and lays a foundation for future work on efficient generative AI systems.