Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection (2405.16178v1)

Published 25 May 2024 in cs.CL

Abstract: LLMs augmented with retrieval exhibit robust performance and extensive versatility by incorporating external contexts. However, the input length grows linearly with the number of retrieved documents, causing a dramatic increase in latency. In this paper, we propose a novel paradigm named Sparse RAG, which seeks to cut computation costs through sparsity. Specifically, Sparse RAG encodes retrieved documents in parallel, which eliminates the latency introduced by long-range attention over retrieved documents. LLMs then selectively decode the output auto-regressively by attending only to highly relevant caches, which are chosen by prompting the LLM with special control tokens. Notably, Sparse RAG combines the assessment of each individual document and the generation of the response into a single process. This sparse mechanism reduces the number of documents loaded during decoding, accelerating inference of the RAG system; filtering out undesirable contexts also sharpens the model's focus on relevant context, inherently improving generation quality. Evaluation results on two datasets show that Sparse RAG can strike an optimal balance between generation quality and computational efficiency, demonstrating its generalizability across both short- and long-form generation tasks.
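
The sketch below illustrates the control flow described in the abstract: encode each retrieved document into its own cache in parallel, assess relevance per document, and decode while attending only to the selected caches. It is a minimal illustration under assumed names, not the authors' implementation: `encode_context`, `score_relevance`, and the `KVCache` stand-in are hypothetical, and the relevance score is a crude token-overlap placeholder for the paper's control-token prompting.

```python
# Illustrative sketch of the Sparse RAG control flow (hypothetical interface,
# not the authors' API). A real system would use an LLM's key/value caches and
# special control tokens; here those pieces are reduced to simple placeholders.

from dataclasses import dataclass
from typing import List


@dataclass
class KVCache:
    """Stands in for the key/value cache produced by encoding one document."""
    doc_id: int
    text: str


def encode_context(doc_id: int, text: str) -> KVCache:
    # Each retrieved document is encoded independently (no cross-document
    # attention), so in a real system this step can run in parallel.
    return KVCache(doc_id=doc_id, text=text)


def score_relevance(question: str, cache: KVCache) -> float:
    # Placeholder for the LLM rating a document's relevance via special
    # control tokens; here approximated by crude token overlap.
    q_tokens = set(question.lower().split())
    d_tokens = set(cache.text.lower().split())
    return len(q_tokens & d_tokens) / max(len(q_tokens), 1)


def sparse_rag_answer(question: str, documents: List[str], keep_k: int = 2) -> str:
    # 1) Parallel encoding: one cache per retrieved document.
    caches = [encode_context(i, doc) for i, doc in enumerate(documents)]
    # 2) Sparse selection: keep only the caches judged most relevant.
    ranked = sorted(caches, key=lambda c: score_relevance(question, c), reverse=True)
    selected = ranked[:keep_k]
    # 3) Decoding: the model would attend only to the selected caches.
    context = " ".join(c.text for c in selected)
    return f"[generated from {len(selected)}/{len(documents)} documents] {context[:80]}..."


if __name__ == "__main__":
    docs = [
        "Sparse RAG encodes retrieved documents in parallel.",
        "Unrelated passage about cooking pasta.",
        "Decoding attends only to highly relevant key-value caches.",
    ]
    print(sparse_rag_answer("How does Sparse RAG decode with relevant caches?", docs))
```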

Authors (11)
  1. Yun Zhu (52 papers)
  2. Jia-Chen Gu (42 papers)
  3. Caitlin Sikora (1 paper)
  4. Ho Ko (1 paper)
  5. Yinxiao Liu (8 papers)
  6. Chu-Cheng Lin (13 papers)
  7. Lei Shu (82 papers)
  8. Liangchen Luo (15 papers)
  9. Lei Meng (54 papers)
  10. Bang Liu (93 papers)
  11. Jindong Chen (21 papers)
Citations (5)