
Accelerating Retrieval-Augmented Generation (2412.15246v1)

Published 14 Dec 2024 in cs.CL, cs.AI, cs.AR, cs.DC, and cs.IR

Abstract: An evolving solution to address hallucination and enhance accuracy in LLMs is Retrieval-Augmented Generation (RAG), which involves augmenting LLMs with information retrieved from an external knowledge source, such as the web. This paper profiles several RAG execution pipelines and demystifies the complex interplay between their retrieval and generation phases. We demonstrate that while exact retrieval schemes are expensive, they can reduce inference time compared to approximate retrieval variants because an exact retrieval model can send a smaller but more accurate list of documents to the generative model while maintaining the same end-to-end accuracy. This observation motivates the acceleration of the exact nearest neighbor search for RAG. In this work, we design Intelligent Knowledge Store (IKS), a type-2 CXL device that implements a scale-out near-memory acceleration architecture with a novel cache-coherent interface between the host CPU and near-memory accelerators. IKS offers 13.4-27.9x faster exact nearest neighbor search over a 512GB vector database compared with executing the search on Intel Sapphire Rapids CPUs. This higher search performance translates to 1.7-26.3x lower end-to-end inference time for representative RAG applications. IKS is inherently a memory expander; its internal DRAM can be disaggregated and used for other applications running on the server to prevent DRAM, which is the most expensive component in today's servers, from being stranded.

Retrieval-Augmented Generation (RAG) is an increasingly popular technique to enhance LLMs by incorporating external knowledge, mitigating hallucination and enabling access to up-to-date information. A typical RAG pipeline involves a retrieval phase, where relevant documents are fetched from a knowledge source (like a vector database), and a generation phase, where an LLM synthesizes a response based on the input query and retrieved information. This paper profiles the RAG execution pipeline and identifies the retrieval phase, specifically Exact Nearest Neighbor Search (ENNS) over a dense vector database, as a critical performance bottleneck.
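
Concretely, the two phases compose as below. This is a minimal NumPy sketch of such a pipeline, with `embed()` and `generate()` as hypothetical stand-ins for the embedding model and the LLM (neither name comes from the paper):

```python
import numpy as np

def retrieve(query_emb: np.ndarray, doc_embs: np.ndarray, k: int) -> np.ndarray:
    """Retrieval phase: exact nearest neighbor search by inner-product
    similarity over the full embedding matrix (brute-force ENNS)."""
    scores = doc_embs @ query_emb                 # one score per document
    topk = np.argpartition(scores, -k)[-k:]       # indices of the k best scores
    return topk[np.argsort(scores[topk])[::-1]]   # sorted descending by score

def rag_answer(query: str, embed, generate, doc_embs, docs, k: int = 5) -> str:
    """Generation phase: the LLM conditions on the query plus retrieved text."""
    hits = retrieve(embed(query), doc_embs, k)
    context = "\n".join(docs[i] for i in hits)
    return generate(f"Context:\n{context}\n\nQuestion: {query}")
```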

The authors demonstrate that while Approximate Nearest Neighbor Search (ANNS) can be faster, it often compromises retrieval quality. To maintain generation accuracy comparable to ENNS, ANNS must fetch a larger number of documents (higher K). The longer context raises the computational cost of the generation phase, linearly in the feed-forward layers and quadratically in attention, potentially negating any speedup gained during retrieval and even increasing end-to-end inference time. Experiments show that ENNS-based RAG often achieves superior accuracy-throughput trade-offs over ANNS-based RAG, especially when high generation accuracy is required.
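
This trade-off is easy to reproduce with an off-the-shelf vector-search library. The sketch below uses FAISS (an assumption for illustration; the paper does not prescribe a library) to build an exact flat index and an approximate IVF index over the same embeddings. The IVF index visits only `nprobe` clusters, so callers typically raise K to recover recall, which in a RAG pipeline lengthens the generation context:

```python
import faiss
import numpy as np

d, n, k = 768, 100_000, 5
rng = np.random.default_rng(0)
xb = rng.standard_normal((n, d), dtype=np.float32)   # document embeddings
xq = rng.standard_normal((1, d), dtype=np.float32)   # query embedding

# Exact search (ENNS): brute-force inner product over every stored vector.
exact = faiss.IndexFlatIP(d)
exact.add(xb)
_, ids_exact = exact.search(xq, k)          # a small, accurate K suffices

# Approximate search (ANNS): IVF scans only nprobe of 1024 clusters, so it
# can miss true neighbors; a larger K is requested to compensate, which in
# turn inflates the context handed to the generative model.
quantizer = faiss.IndexFlatIP(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)
ivf.train(xb)
ivf.add(xb)
ivf.nprobe = 8
_, ids_approx = ivf.search(xq, 4 * k)       # higher K to preserve accuracy
```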

Current hardware platforms struggle to accelerate ENNS. CPU cores cannot issue enough concurrent memory operations to saturate DRAM bandwidth, so this memory-bound kernel runs well below the bandwidth the memory system could supply. GPUs can accelerate ENNS thanks to their SIMD capabilities, but they are expensive, especially for large knowledge bases that exceed the memory capacity of a single GPU and therefore require multi-GPU setups. Furthermore, GPUs provision significant compute resources that are poorly utilized by the memory-bound ENNS workload. Specialized ANNS accelerators exist but tend to be task-specific because ANNS algorithms are complex and varied, whereas ENNS is algorithmically simple.

Motivated by the need for high-quality, high-performance, and cost-effective retrieval, the paper proposes the Intelligent Knowledge Store (IKS), a specialized CXL Type-2 memory expander designed to accelerate ENNS. IKS is built with three key requirements: cost-effective and scalable memory capacity, userspace-managed near-memory acceleration, and a shared address space between the host CPU and accelerators.

The IKS architecture features a scale-out near-memory processing design. It comprises multiple Near-Memory Accelerator (NMA) chips, each located near an LPDDR5X memory package. This scale-out approach addresses the physical limitations (like die shoreline) that restrict the number of memory channels accessible by a single large accelerator chip. LPDDR5X memory is chosen for its balance of cost, capacity, and bandwidth, fitting the needs of large vector databases compared to more expensive HBM or less general-purpose DDR.

IKS leverages the CXL.mem and CXL.cache protocols to implement a novel, low-overhead interface with the host CPU. The internal IKS DRAM, NMA scratchpads, and configuration registers are mapped into the host address space, allowing seamless data access and offload management via cache-coherent shared memory. The host CPU manages the vector database and initiates an ENNS offload by writing query vectors and metadata to memory-mapped context buffers and then ringing a doorbell register, all synchronized using CXL.cache. This cache-coherent interface provides higher throughput for offload-context communication than the non-temporal writes typical of PCIe MMIO. The CPU is notified of offload completion through a lightweight mechanism (umwait) and performs the final aggregation of the partial top-K results returned by each NMA.
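
The offload handshake can be illustrated with a small behavioral model. The sketch below is an assumption-laden stand-in, not the CXL implementation: ordinary Python attributes model the cache-coherent shared buffers and doorbell, and a plain polling call models umwait-style waiting, so it captures only the protocol shape (write context, ring doorbell, wait, merge partial top-K):

```python
import heapq
import numpy as np

class NMAModel:
    """Behavioral model of one near-memory accelerator: it owns a shard of
    the vector database and exposes a context buffer plus a doorbell flag."""
    def __init__(self, shard: np.ndarray, base: int):
        self.shard, self.base = shard, base   # base = global index offset
        self.ctx, self.doorbell, self.done = None, False, None

    def poll(self, k: int):
        if self.doorbell:                     # host rang the doorbell
            scores = self.shard @ self.ctx    # scan the local shard
            top = np.argpartition(scores, -k)[-k:]
            # partial top-K written back for the host to aggregate
            self.done = [(float(scores[i]), self.base + int(i)) for i in top]
            self.doorbell = False

def offload_enns(nmas, query: np.ndarray, k: int):
    for nma in nmas:                          # write query, then ring doorbell
        nma.ctx, nma.doorbell = query, True
    for nma in nmas:                          # stand-in for umwait-style waiting
        nma.poll(k)
    # host-side final step: merge per-NMA partial top-K into the global top-K
    return heapq.nlargest(k, (hit for nma in nmas for hit in nma.done))
```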

Each NMA chip contains 64 processing engines, each with a query scratchpad, dot-product unit, and Top-K unit. The architecture is designed to saturate the LPDDR5X memory bandwidth for dot-product calculations by evaluating similarity scores for blocks of embedding vectors read from DRAM in column-major order. Data is reused across processing engines to support batching efficiently. The Top-K unit maintains the best similarity scores found so far, offloading this task from the CPU.
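
In software terms, the per-engine computation amounts to a blocked scan that reuses each block of embeddings across the whole query batch while maintaining a running top-K. The NumPy sketch below mirrors that structure; the column-major DRAM layout and bandwidth-saturation details are hardware concerns it does not model:

```python
import numpy as np

def blocked_enns(db: np.ndarray, queries: np.ndarray, k: int, block: int = 4096):
    """Scan the database in fixed-size blocks, score each block against the
    whole query batch (reusing the block across queries, as IKS reuses data
    across processing engines), and fold results into a running top-K."""
    nq = queries.shape[0]
    best_scores = np.full((nq, k), -np.inf)
    best_ids = np.full((nq, k), -1, dtype=np.int64)
    for start in range(0, db.shape[0], block):
        scores = queries @ db[start:start + block].T      # (nq, block)
        ids = np.arange(start, start + scores.shape[1])
        # merge this block's scores into the running top-K per query
        all_s = np.concatenate([best_scores, scores], axis=1)
        all_i = np.concatenate(
            [best_ids, np.broadcast_to(ids, scores.shape)], axis=1)
        keep = np.argpartition(all_s, -k, axis=1)[:, -k:]
        best_scores = np.take_along_axis(all_s, keep, axis=1)
        best_ids = np.take_along_axis(all_i, keep, axis=1)
    return best_scores, best_ids                          # unsorted top-K
```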

The evaluation uses a cycle-approximate simulator calibrated with RTL synthesis data and real software stack overheads. Experiments on RAG workloads (FiDT5, Llama-8B, Llama-70B) demonstrate that IKS provides 13.4x to 27.9x faster ENNS retrieval for a 512GB vector database compared to an Intel Sapphire Rapids CPU. When integrated into the end-to-end RAG pipeline (with generation on an NVIDIA H100 GPU), IKS reduces inference time by 1.7x to 26.3x depending on the model and configuration. The faster retrieval allows IKS-accelerated RAG to achieve significantly higher throughput than ANNS-based systems while maintaining high generation accuracy.

The authors discuss trade-offs, noting that IKS is particularly effective for applications requiring high recall or for datasets not well served by existing ANNS algorithms. Potential inefficiencies include the full-corpus scan inherent to ENNS and low NMA utilization at small batch sizes, suggesting future work on early termination and power-management techniques. Compared to GPUs, IKS is projected to be more cost-effective at large capacities due to the use of LPDDR5X and the smaller silicon area of the NMAs.

In conclusion, the paper identifies the critical role of high-quality retrieval in RAG performance and proposes IKS as a purpose-built, cost-effective CXL device for accelerating ENNS. IKS leverages near-memory processing and a cache-coherent host interface to significantly improve retrieval performance and, consequently, the end-to-end inference time of RAG applications.

Authors (8)
  1. Derrick Quinn (1 paper)
  2. Mohammad Nouri (1 paper)
  3. Neel Patel (26 papers)
  4. John Salihu (1 paper)
  5. Alireza Salemi (21 papers)
  6. Sukhan Lee (4 papers)
  7. Hamed Zamani (88 papers)
  8. Mohammad Alian (7 papers)
