Neurocache: Efficient Vector Retrieval for Long-range Language Modeling (2407.02486v1)
Abstract: This paper introduces Neurocache, an approach to extend the effective context size of LLMs using an external vector cache that stores their past states. Like recent vector retrieval approaches, Neurocache uses an efficient k-nearest-neighbor (kNN) algorithm to retrieve relevant past states and incorporate them into the attention process. Neurocache improves upon previous methods by (1) storing compressed states, which reduces cache size; (2) performing a single retrieval operation per token, which increases inference speed; and (3) extending the retrieval window to neighboring states, which improves both language modeling and downstream task accuracy. Our experiments show the effectiveness of Neurocache both for models trained from scratch and for pre-trained models such as Llama2-7B and Mistral-7B when enhanced with the cache mechanism. We also compare Neurocache with text retrieval methods and show improvements in single-document question-answering and few-shot learning tasks. The source code is available at: https://github.com/alisafaya/neurocache
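To make the mechanism concrete, below is a minimal, self-contained PyTorch sketch of the three ideas named in the abstract: a cache of compressed states, a single kNN retrieval per token, and a retrieval window extended to neighboring cached states. All class, parameter, and dimension names, the exact similarity measure, and the way retrieved states are fused back in (attention over retrieved vectors followed by a residual add) are illustrative assumptions, not the authors' implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn.functional as F


class NeurocacheSketch(torch.nn.Module):
    """Toy cache-augmented layer: compress, retrieve with kNN, attend, update."""

    def __init__(self, hidden_dim=512, compressed_dim=128, top_k=4, window=1, capacity=4096):
        super().__init__()
        self.compress = torch.nn.Linear(hidden_dim, compressed_dim)  # (1) store compressed states
        self.q_proj = torch.nn.Linear(hidden_dim, compressed_dim)    # retrieval query projection
        self.out_proj = torch.nn.Linear(compressed_dim, hidden_dim)
        self.top_k, self.window, self.capacity = top_k, window, capacity
        self.register_buffer("cache", torch.empty(0, compressed_dim))  # external vector cache

    @torch.no_grad()
    def _retrieve(self, queries):
        # Exact kNN by dot-product similarity; a real system would use an ANN index.
        sims = queries @ self.cache.T                                 # (T, N)
        k = min(self.top_k, self.cache.size(0))
        idx = sims.topk(k, dim=-1).indices                            # (T, k)
        # (3) extend the retrieval window to neighboring cached states.
        offsets = torch.arange(-self.window, self.window + 1, device=queries.device)
        idx = (idx.unsqueeze(-1) + offsets).clamp(0, self.cache.size(0) - 1)
        return self.cache[idx.reshape(queries.size(0), -1)]           # (T, k*(2w+1), d_c)

    def forward(self, hidden):                                        # hidden: (T, hidden_dim)
        compressed = self.compress(hidden)                            # states destined for the cache
        retrieved_out = torch.zeros_like(hidden)
        if self.cache.size(0) > 0:
            queries = self.q_proj(hidden)                             # (2) one retrieval per token
            retrieved = self._retrieve(queries)                       # (T, R, d_c)
            scores = torch.einsum("td,trd->tr", queries, retrieved) / retrieved.size(-1) ** 0.5
            attn = F.softmax(scores, dim=-1)
            retrieved_out = self.out_proj(torch.einsum("tr,trd->td", attn, retrieved))
        # Append the new compressed states, evicting the oldest once capacity is exceeded.
        self.cache = torch.cat([self.cache, compressed.detach()])[-self.capacity:]
        return hidden + retrieved_out                                 # fuse retrieved context back in


# Hypothetical usage: process a long document segment by segment so that later
# segments can retrieve compressed states cached from earlier ones.
layer = NeurocacheSketch()
document = torch.randn(4, 128, 512)           # 4 segments of 128 tokens, hidden size 512
for segment in document:
    enriched = layer(segment)                 # (128, 512), attends over earlier segments' cache
```

Note that the sketch uses exact dot-product kNN over a single flat buffer for readability; the abstract's efficiency claims rest on compressing the cached states and issuing only one retrieval per token, which the sketch mirrors structurally rather than reproducing exactly.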
Authors: Ali Safaya, Deniz Yuret