XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference (2404.15420v3)
Abstract: In-context learning (ICL) approaches typically leverage prompting to condition decoder-only LLM generation on reference information. Just-in-time processing of a context is inefficient due to the quadratic cost of self-attention operations, and caching is desirable. However, caching transformer states can easily require almost as much space as the model parameters. When the right context is not known in advance, caching for ICL can be challenging. This work addresses these limitations by introducing models that, inspired by the encoder-decoder architecture, use cross-attention to condition generation on reference text without the prompt. More precisely, we leverage pre-trained decoder-only models and only train a small number of added layers. We use Question-Answering (QA) as a testbed to evaluate the ability of our models to perform conditional generation and observe that they outperform ICL, are comparable to fine-tuned prompted LLMs, and drastically reduce the space footprint relative to standard KV caching by two orders of magnitude.
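The abstract describes adding a small number of trainable cross-attention layers on top of a frozen pre-trained decoder so that generation can attend to pre-computed context representations instead of re-processing the context as part of the prompt. Below is a minimal, hypothetical sketch of that idea in PyTorch; the module name `CrossAttentionAdapter`, the dimensions, and the residual wiring are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' code): a trainable cross-attention adapter that lets
# frozen decoder states attend to a cached representation of the reference text.
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Hypothetical cross-attention block added on top of a frozen decoder layer."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hidden: torch.Tensor, cached_context: torch.Tensor) -> torch.Tensor:
        # hidden:         (batch, query_len, d_model) decoder states for the current query tokens
        # cached_context: (batch, ctx_len, d_model)   representation of the reference text,
        #                 computed once and stored instead of a full per-layer KV cache
        attended, _ = self.cross_attn(
            self.norm(hidden), cached_context, cached_context, need_weights=False
        )
        return hidden + attended  # residual connection; only the adapter parameters are trained

# Usage sketch: encode the context once, then reuse it for any later question.
adapter = CrossAttentionAdapter(d_model=64, n_heads=4)
cached_context = torch.randn(1, 128, 64)  # stands in for the stored context representation
decoder_states = torch.randn(1, 16, 64)   # stands in for frozen-decoder hidden states
out = adapter(decoder_states, cached_context)
print(out.shape)  # torch.Size([1, 16, 64])
```

Because only a compact context representation is stored and reused, rather than per-layer key/value tensors for every decoder layer, the cached footprint can be much smaller than standard KV caching; this is roughly the intuition behind the space savings the abstract reports.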
Authors: Étienne Marcotte, Pierre-André Noël, Valentina Zantedeschi, Nicolas Chapados, Christopher Pal, Perouz Taslakian, João Monteiro, David Vázquez