PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design (2403.05676v1)

Published 8 Mar 2024 in cs.CL

Abstract: Retrieval-augmented generation (RAG) can enhance the generation quality of LLMs by incorporating external token databases. However, retrievals from large databases can constitute a substantial portion of the overall generation time, particularly when retrievals are periodically performed to align the retrieved content with the latest states of generation. In this paper, we introduce PipeRAG, a novel algorithm-system co-design approach to reduce generation latency and enhance generation quality. PipeRAG integrates (1) pipeline parallelism to enable concurrent retrieval and generation processes, (2) flexible retrieval intervals to maximize the efficiency of pipeline parallelism, and (3) a performance model to automatically balance retrieval quality and latency based on the generation states and underlying hardware. Our evaluation shows that, by combining the three aforementioned methods, PipeRAG achieves up to 2.6× speedup in end-to-end generation latency while improving generation quality. These promising results showcase the effectiveness of co-designing algorithms with underlying systems, paving the way for the adoption of PipeRAG in future RAG systems.
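
The pipelining described in the abstract can be sketched in a few lines of code. The following is a minimal, hypothetical Python illustration, not the authors' implementation: retrieve, generate_step, and choose_effort are stub functions with made-up names and latencies, and only the overall pattern follows the abstract, namely issuing a retrieval with a slightly stale query so that its latency overlaps with the next window of decoding steps, and sizing the search effort so it fits inside that window.

import time
from concurrent.futures import ThreadPoolExecutor


def retrieve(query, effort):
    # Stub vector-database search; `effort` (an nprobe-like knob, assumed here)
    # trades retrieval quality for latency. The sleep emulates search time.
    time.sleep(0.01 * effort)
    return f"<neighbors for ...{query[-12:]} at effort {effort}>"


def generate_step(context, retrieved):
    # Stub LLM decoding step conditioned on the most recent retrieved content.
    time.sleep(0.02)
    return f"tok{len(context)}"


def choose_effort(steps_until_use, step_latency=0.02, unit_cost=0.01):
    # Stand-in for the performance model: pick the largest search effort whose
    # estimated latency still fits inside the upcoming decoding window.
    return max(1, int(steps_until_use * step_latency / unit_cost))


def piperag_decode(prompt, interval=4, max_new_tokens=16):
    context, retrieved = [prompt], None
    with ThreadPoolExecutor(max_workers=1) as pool:
        while len(context) - 1 < max_new_tokens:
            # Issue the next retrieval with a slightly stale query so that it
            # runs concurrently with the next `interval` decoding steps.
            effort = choose_effort(interval)
            pending = pool.submit(retrieve, " ".join(context), effort)
            for _ in range(interval):
                context.append(generate_step(context, retrieved))
            # The retrieval should have finished by now: its latency was
            # hidden behind generation rather than added to it.
            retrieved = pending.result()
    return context


if __name__ == "__main__":
    print(piperag_decode("what is retrieval-augmented generation?"))

In this sketch the retrieval query necessarily lags the decoder by one interval; how much staleness is acceptable, and how the retrieval interval and search effort should adapt to the generation state and hardware, are the algorithm-system trade-offs the abstract describes the performance model as balancing.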
