
RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation (2404.12457v2)

Published 18 Apr 2024 in cs.DC, cs.CL, and cs.LG

Abstract: Retrieval-Augmented Generation (RAG) has shown significant improvements in various natural language processing tasks by integrating the strengths of LLMs and external knowledge databases. However, RAG introduces long sequence generation and leads to high computation and memory costs. We propose RAGCache, a novel multilevel dynamic caching system tailored for RAG. Our analysis benchmarks current RAG systems, pinpointing the performance bottleneck (i.e., long sequence due to knowledge injection) and optimization opportunities (i.e., caching knowledge's intermediate states). Based on these insights, we design RAGCache, which organizes the intermediate states of retrieved knowledge in a knowledge tree and caches them in the GPU and host memory hierarchy. RAGCache proposes a replacement policy that is aware of LLM inference characteristics and RAG retrieval patterns. It also dynamically overlaps the retrieval and inference steps to minimize the end-to-end latency. We implement RAGCache and evaluate it on vLLM, a state-of-the-art LLM inference system, and Faiss, a state-of-the-art vector database. The experimental results show that RAGCache reduces the time to first token (TTFT) by up to 4x and improves the throughput by up to 2.1x compared to vLLM integrated with Faiss.

Efficient Knowledge Caching for Retrieval-Augmented Generation with RAGCache

Introduction to RAGCache

Retrieval-Augmented Generation (RAG) enhances the performance of LLMs by integrating external knowledge bases. While RAG substantially improves generation quality, it also raises memory and computation overhead because the retrieved documents lengthen the input sequences the model must process. These costs motivate efficient mechanisms for managing and optimizing resource utilization in RAG systems.

We propose RAGCache, a multilevel dynamic caching system that addresses these issues. RAGCache organizes and manages the intermediate (key-value) states of retrieved knowledge, improving the efficiency of both the retrieval and generation phases of RAG systems.

System Characterization and Inspiration

RAGCache is motivated by a detailed system characterization that identifies the primary performance bottleneck in the LLM inference step, where the long input sequences produced by knowledge injection must be processed. Augmenting user requests with external documents inflates both memory and computation demands at this stage.
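To make the overhead concrete, the sketch below estimates the KV-cache footprint of an augmented request using the standard per-token formula (2 × layers × KV heads × head dimension × bytes per value). The model dimensions and document lengths are illustrative assumptions, roughly a 7B-parameter model served in FP16, not figures taken from the paper.

```python
# Back-of-envelope estimate of how knowledge injection inflates the KV cache.
# Model dimensions are assumptions (roughly a 7B-parameter model in FP16).
NUM_LAYERS = 32
NUM_KV_HEADS = 32
HEAD_DIM = 128
BYTES_PER_VALUE = 2                      # FP16

# Keys and values for every layer and head at every position.
kv_bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE

question_tokens = 100                    # original user request (hypothetical length)
doc_tokens = 5 * 512                     # five retrieved documents of ~512 tokens each

plain = question_tokens * kv_bytes_per_token
augmented = (question_tokens + doc_tokens) * kv_bytes_per_token
print(f"KV cache, plain request:     {plain / 2**20:7.1f} MiB")
print(f"KV cache, augmented request: {augmented / 2**20:7.1f} MiB")
# Here the augmented request is ~26x larger, and prefill compute grows with
# sequence length as well, which is why recomputing document states for every
# request is so costly.
```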

Investigating current caching mechanisms, we identified significant potential for improvement in caching the intermediate states of frequently retrieved knowledge. Experiments show that the standard practice of caching only per-request LLM inference states is insufficient, because augmented requests produce much longer sequences. Our evaluation also revealed that a small subset of documents accounts for a large share of retrieval operations, a skew that creates a strong opportunity for caching optimization, as the sketch below illustrates.
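As a rough illustration of this skew, the following sketch simulates Zipf-like document popularity against a small fixed-size cache of per-document intermediate states and reports the resulting hit ratio. All parameters (corpus size, cache slots, skew exponent) and the simple LRU eviction are illustrative assumptions, not measurements or mechanisms from the paper.

```python
# Minimal simulation: skewed (Zipf-like) retrievals hitting a small LRU cache
# of per-document states. Every parameter here is hypothetical.
import random
from collections import OrderedDict

NUM_DOCS = 10_000        # documents in the knowledge base
CACHE_SLOTS = 200        # documents whose cached states fit in fast memory
NUM_REQUESTS = 50_000
ZIPF_S = 1.0             # skew exponent; larger means a hotter head

# Zipf weights over document ranks: rank r is retrieved with weight 1 / r^s.
weights = [1.0 / (rank ** ZIPF_S) for rank in range(1, NUM_DOCS + 1)]
doc_ids = list(range(NUM_DOCS))

cache = OrderedDict()    # doc_id -> placeholder for its cached state
hits = 0
for doc_id in random.choices(doc_ids, weights=weights, k=NUM_REQUESTS):
    if doc_id in cache:
        hits += 1
        cache.move_to_end(doc_id)      # refresh recency on a hit
    else:
        cache[doc_id] = None           # "compute" and cache the state
        if len(cache) > CACHE_SLOTS:
            cache.popitem(last=False)  # evict the least recently used document

print(f"Hit ratio with {CACHE_SLOTS} of {NUM_DOCS} documents cached: "
      f"{hits / NUM_REQUESTS:.1%}")
```

Even though only 2% of the corpus fits in this toy cache, the skewed access pattern yields a hit ratio far above 2%, which is the kind of opportunity a knowledge cache can exploit.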

Architectural Overview

At its core, RAGCache introduces a knowledge tree data structure that maps onto the hierarchical memory model, organizing cached intermediate states in GPU memory for frequently accessed documents and in host memory for less frequently accessed ones.
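One way to picture the knowledge tree is as a prefix tree keyed by the ordered IDs of the documents prepended to a request, where each node records where its cached state currently lives. The sketch below is a simplified, hypothetical rendering of that idea; the class names, fields, and methods are not the paper's actual implementation.

```python
# Schematic knowledge tree: each path of document IDs maps to cached
# intermediate states, with per-node placement (GPU vs. host memory).
# Names and fields are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class KnowledgeNode:
    doc_id: Optional[str] = None            # document this node represents (None at the root)
    location: str = "none"                  # "gpu", "host", or "none" (not cached)
    kv_handle: Optional[object] = None      # handle to the cached tensors, if any
    children: Dict[str, "KnowledgeNode"] = field(default_factory=dict)


class KnowledgeTree:
    def __init__(self) -> None:
        self.root = KnowledgeNode()

    def lookup(self, doc_ids: List[str]) -> List[KnowledgeNode]:
        """Return the longest cached prefix of the retrieved-document sequence."""
        node, matched = self.root, []
        for doc_id in doc_ids:
            child = node.children.get(doc_id)
            if child is None or child.location == "none":
                break
            matched.append(child)
            node = child
        return matched

    def insert(self, doc_ids: List[str], location: str = "gpu") -> None:
        """Record freshly computed states for a document prefix."""
        node = self.root
        for doc_id in doc_ids:
            node = node.children.setdefault(doc_id, KnowledgeNode(doc_id))
            node.location = location
```

A request that retrieves documents [A, B] can then reuse any cached prefix ([A] or [A, B]) instead of recomputing it, while rarely reused nodes can be demoted from GPU to host memory.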

The caching mechanism operates under a prefix-aware Greedy-Dual-Size-Frequency (PGDSF) replacement policy that weighs document access frequency, size, and recency, so that cache management matches the access patterns of RAG workloads. In addition, RAGCache speculatively overlaps knowledge retrieval with LLM inference, reducing the end-to-end latency typical of RAG pipelines.
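The sketch below illustrates the general Greedy-Dual-Size-Frequency idea behind such a policy: an entry's eviction priority rises with its reuse frequency and recomputation cost and falls with its size, and a global clock provides aging. The exact weighting and the prefix-aware bookkeeping used by RAGCache differ; treat this as an approximation of the policy's spirit rather than its implementation.

```python
# Greedy-Dual-Size-Frequency-style eviction priority (illustrative, not the
# paper's exact PGDSF formulation).
from dataclasses import dataclass
from typing import List


@dataclass
class CacheEntry:
    frequency: int       # times this document prefix has been reused
    cost: float          # estimated cost to recompute its cached state (e.g., prefill time)
    size: float          # memory footprint of the cached state
    clock: float         # cache "inflation" clock at last access (aging term)

    def priority(self) -> float:
        # Frequent, expensive-to-recompute, small entries are kept longest.
        return self.clock + self.frequency * self.cost / self.size


def pick_victim(entries: List[CacheEntry]) -> CacheEntry:
    """When GPU memory is full, evict (or demote to host) the lowest-priority entry."""
    return min(entries, key=lambda e: e.priority())
```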

Implementation and Performance

Implemented atop vLLM and evaluated across multiple LLM configurations and datasets, RAGCache delivers the following performance improvements:

  • Reduction in Time to First Token (TTFT) by up to 4x compared to vLLM integrated with Faiss across the evaluated benchmarks.
  • Throughput improvements by up to 2.1x, enabling faster processing rates under comparable computational resources.

Future Directions

While RAGCache marks a significant step forward, continuous improvements can further enhance RAG systems. Potential areas include more advanced predictive loading techniques for caching and exploring deeper integrations with different types of external knowledge bases, possibly extending beyond textual data to include multi-modal databases.

Conclusion

RAGCache offers a novel approach to optimizing retrieval-augmented generation systems, addressing a central performance bottleneck through efficient knowledge caching and dynamic overlap of retrieval and inference. Its design, which pairs a knowledge tree with an access-aware replacement policy, coordinates the retrieval and generation steps and lays a foundation for future work on efficient RAG serving and generative AI systems more broadly.

Authors (7)
  1. Chao Jin
  2. Zili Zhang
  3. Xuanlin Jiang
  4. Fangyue Liu
  5. Xin Liu
  6. Xuanzhe Liu
  7. Xin Jin