Accelerating Retrieval-Augmented Language Model Serving with Speculation (2401.14021v1)

Published 25 Jan 2024 in cs.LG, cs.CL, and cs.IR

Abstract: Retrieval-augmented LLMs (RaLM) have demonstrated the potential to solve knowledge-intensive NLP tasks by combining a non-parametric knowledge base with a parametric LLM. Compared with fine-tuning a fully parametric model, RaLM offers low-cost adaptation to the latest data and better source attribution mechanisms. Among various RaLM approaches, iterative RaLM delivers better generation quality due to more frequent interaction between the retriever and the LLM. Despite these benefits, iterative RaLM usually incurs high overhead from the frequent retrieval step. To this end, we propose RaLMSpec, a speculation-inspired framework that provides generic speed-up over iterative RaLM while preserving the same model outputs through speculative retrieval and batched verification. By further incorporating prefetching, an optimal speculation stride scheduler, and asynchronous verification, RaLMSpec can automatically exploit the acceleration potential to the fullest. For naive iterative RaLM serving, extensive evaluations over three LLMs on four downstream QA datasets demonstrate that RaLMSpec can achieve speed-up ratios of 1.75-2.39x, 1.04-1.39x, and 1.31-1.77x over the baseline when the retriever is an exact dense retriever, approximate dense retriever, and sparse retriever, respectively. For KNN-LM serving, RaLMSpec can achieve speed-up ratios of up to 7.59x and 2.45x over the baseline when the retriever is an exact dense retriever and approximate dense retriever, respectively.
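The core mechanism described in the abstract (speculate retrievals from a cheap local cache, then verify all speculated queries in one batched call to the true retriever, rolling back on a mismatch so the output matches non-speculative iterative RaLM) can be sketched roughly as below. This is a minimal illustration under assumed interfaces, not the authors' implementation; every name (`lm`, `full_retriever`, `local_cache`, `generate_step`, `batch_retrieve`) is a hypothetical placeholder.

```python
def speculative_ralm_generate(prompt_tokens, lm, full_retriever, local_cache,
                              stride=4, max_new_tokens=64):
    """Sketch of speculative retrieval with batched verification.

    Assumes (hypothetically) that `lm.generate_step(tokens, doc)` returns the
    next token(s) conditioned on a retrieved document, `local_cache.retrieve`
    is a fast but possibly stale retriever, and `full_retriever.batch_retrieve`
    is the expensive ground-truth retriever supporting batched queries.
    """
    tokens = list(prompt_tokens)
    while len(tokens) - len(prompt_tokens) < max_new_tokens:
        # Speculation phase: run `stride` retrieval+generation steps using the
        # cheap local cache instead of the expensive full retriever.
        speculated = []  # (query, cached_doc, token_count_before_step)
        for _ in range(stride):
            query = lm.make_query(tokens)
            doc = local_cache.retrieve(query)          # fast, possibly stale
            speculated.append((query, doc, len(tokens)))
            tokens += lm.generate_step(tokens, doc)

        # Verification phase: one batched call to the true retriever for all
        # speculated queries, amortizing the per-step retrieval overhead.
        true_docs = full_retriever.batch_retrieve([q for q, _, _ in speculated])

        for (query, cached_doc, start), true_doc in zip(speculated, true_docs):
            local_cache.update(query, true_doc)
            if cached_doc != true_doc:
                # First mismatch: discard tokens generated from this step on
                # and redo the step with the verified document, so the final
                # output is identical to non-speculative iterative RaLM.
                tokens = tokens[:start]
                tokens += lm.generate_step(tokens, true_doc)
                break
    return tokens
```

A larger `stride` amortizes more retrieval calls per verification but risks more rollbacks; the paper's speculation stride scheduler adapts this trade-off automatically, which the fixed `stride` argument above does not capture.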

Authors (7)
  1. Zhihao Zhang (61 papers)
  2. Alan Zhu (7 papers)
  3. Lijie Yang (5 papers)
  4. Yihua Xu (5 papers)
  5. Lanting Li (1 paper)
  6. Phitchaya Mangpo Phothilimthana (11 papers)
  7. Zhihao Jia (43 papers)
Citations (7)