FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference
Abstract: Retrieval-Augmented Language Modeling (RALM), which augments Large Language Models (LLMs) with relevant documents retrieved from an external corpus, is a proven method for enabling an LLM to generate information beyond the scope of its pre-training corpus. Previous work that uses retrieved content by simply prepending it to the input incurs high runtime cost and degrades inference efficiency, because the LLM cannot reuse its Key-Value (KV) cache. In this paper, we propose FlashBack, a modular RALM designed to improve inference efficiency through an appending-context pattern while maintaining decent performance after fine-tuning with Low-Rank Adaptation (LoRA). FlashBack appends retrieved documents to the end of the context rather than prepending them, so the KV cache can be reused efficiently. We also introduce Marking Tokens, two special prompt tokens that mark the boundary of the appended context during fine-tuning. Our experiments on generation quality show that FlashBack maintains decent generation quality as measured by perplexity, and in our runtime tests its inference speed is up to $4\times$ faster than the prepending counterpart on a 7B LLM (Llama 2). By bypassing unnecessary re-computation, FlashBack achieves significantly faster inference, and this efficiency gain substantially reduces inference cost.
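To make the appending-versus-prepending distinction concrete, the following is a minimal sketch (not the authors' implementation) of why appending retrieved text preserves the KV cache while prepending invalidates it. It assumes a Hugging Face causal LM; the model name, the marker string, and the example texts are illustrative placeholders only.

```python
# Minimal sketch of KV-cache reuse with an appending-context pattern.
# Assumption: a Hugging Face causal LM (model name is a placeholder).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Prefill the existing context once; its KV cache is now available.
context_ids = tokenizer("User context accumulated so far.", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(context_ids, use_cache=True)
cache = out.past_key_values

# Appending pattern (FlashBack-style): retrieved text goes *after* the cached
# prefix, so only the newly appended tokens need a forward pass.
# The boundary marker below is a hypothetical stand-in for FlashBack's Marking Tokens.
retrieved_ids = tokenizer(" [DOC] Retrieved passage text. [/DOC]", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(retrieved_ids, past_key_values=cache, use_cache=True)

# Prepending pattern: retrieved text placed *before* the context shifts every
# position, so no cached KV entries remain valid and the whole sequence is
# re-encoded from scratch.
prepended_ids = torch.cat([retrieved_ids, context_ids], dim=-1)
with torch.no_grad():
    out = model(prepended_ids, use_cache=True)  # full re-computation
```

In this sketch, the appending path re-encodes only the retrieved tokens, whereas the prepending path re-encodes the retrieved tokens plus the entire prior context, which is the source of the runtime gap the abstract describes.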