XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference (2404.15420v3)

Published 23 Apr 2024 in cs.CL and cs.AI

Abstract: In-context learning (ICL) approaches typically leverage prompting to condition decoder-only LLM generation on reference information. Just-in-time processing of a context is inefficient due to the quadratic cost of self-attention operations, and caching is desirable. However, caching transformer states can easily require almost as much space as the model parameters. When the right context isn't known in advance, caching ICL can be challenging. This work addresses these limitations by introducing models that, inspired by the encoder-decoder architecture, use cross-attention to condition generation on reference text without the prompt. More precisely, we leverage pre-trained decoder-only models and only train a small number of added layers. We use Question-Answering (QA) as a testbed to evaluate the ability of our models to perform conditional generation and observe that they outperform ICL, are comparable to fine-tuned prompted LLMs, and drastically reduce the space footprint relative to standard KV caching by two orders of magnitude.

Authors (8)
  1. Étienne Marcotte
  2. Pierre-André Noël
  3. Valentina Zantedeschi
  4. Nicolas Chapados
  5. Christopher Pal
  6. Perouz Taslakian
  7. João Monteiro
  8. David Vázquez
Citations (4)

Summary

  • The paper presents a novel caching mechanism that reduces cache size by up to 98% through cross-attention layers.
  • It utilizes an encoder-decoder framework to bypass full prompt reliance and streamline LLM inference.
  • Empirical evaluations demonstrate competitive QA performance with significantly lower computational overhead.

Exploring Efficient Caching Mechanisms for LLM Inference with XC-Cache

Introduction to XC-Cache

The research introduces XC-Cache, a caching approach targeting efficient LLM inference. Recognizing the quadratic cost that standard In-Context Learning (ICL) incurs through self-attention over long prompts, this work takes inspiration from encoder-decoder architectures and conditions generation on reference text through cross-attention rather than prompting. By starting from pre-trained decoder-only LLMs and training only a small number of added cross-attention layers, XC-Cache tackles the high computational and space costs of traditional KV caching and offers a streamlined alternative that drastically reduces space requirements while maintaining competitive performance.

Caching and Inference Efficiency

Caching mechanisms, integral to managing extensive computation costs in LLMs, must balance space consumption and processing efficiency. XC-Cache leverages a cross-context caching methodology, greatly reducing the memory footprint required per token of context:

  • ICL and KV Caching Issues: Standard KV caching stores key and value states for every layer and every token, so the cache for a long context can approach the size of the model parameters themselves.
  • Proposed XC Caching Mechanism: Introduces a lightweight caching approach that stores only the necessary encoder outputs instead of the entire set of intermediate decoder states, reducing cache size by up to 98% compared to conventional KV caching (see the back-of-the-envelope sketch after this list).
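
To make the space argument concrete, the calculation below compares the per-token footprint of a full KV cache against caching a single representation vector per context token. The layer count, hidden size, and fp16 precision are illustrative assumptions (roughly in the range of a 7B-parameter decoder), not figures taken from the paper.

```python
# Rough cache-footprint comparison (illustrative dimensions, not the paper's exact setup).
N_LAYERS = 32      # assumed decoder depth
HIDDEN = 4096      # assumed hidden size
BYTES = 2          # fp16 values

def kv_cache_bytes_per_token(n_layers=N_LAYERS, hidden=HIDDEN, bytes_per_val=BYTES):
    """Standard KV caching stores a key and a value vector per layer, per token."""
    return 2 * n_layers * hidden * bytes_per_val

def xc_cache_bytes_per_token(hidden=HIDDEN, bytes_per_val=BYTES):
    """XC-style caching keeps only one cached context vector per token,
    which the added cross-attention layers later attend to."""
    return hidden * bytes_per_val

if __name__ == "__main__":
    kv = kv_cache_bytes_per_token()
    xc = xc_cache_bytes_per_token()
    print(f"KV cache : {kv / 1024:.1f} KiB per context token")
    print(f"XC cache : {xc / 1024:.1f} KiB per context token")
    print(f"Reduction: {100 * (1 - xc / kv):.1f}%  ({kv // xc}x smaller)")
```

With these assumed dimensions the reduction works out to roughly 98%, in line with the summary's headline figure; the exact savings depend on the decoder depth and on how many vectors per context token are cached.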

Model Architectures and Training

Two distinct architectures are presented for testing XC-Cache's effectiveness:

  1. XC-Llama: Adds a minimal set of trainable cross-attention layers to a frozen pre-trained decoder (see the sketch below).
  2. XC-LlamaEnc: Pairs a lightweight bi-directional encoder with the frozen pre-trained decoder, potentially reducing context pre-processing cost when such pre-processing is feasible.
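
The following is a minimal PyTorch sketch of the general idea behind these adapter-style models: the pre-trained decoder layers stay frozen, and small trainable cross-attention blocks are interleaved so that decoder hidden states can attend to cached context representations. Module names, dimensions, adapter placement, and the stand-in decoder layers are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Trainable block letting decoder hidden states attend to cached context states."""
    def __init__(self, hidden_size: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, hidden_states, context_states):
        # Queries come from the decoder; keys and values come from the cached context.
        attended, _ = self.cross_attn(self.norm(hidden_states), context_states, context_states)
        return hidden_states + attended  # residual keeps the frozen path intact

class DecoderWithXCAdapters(nn.Module):
    """Frozen stack of decoder layers with cross-attention adapters inserted every few layers."""
    def __init__(self, decoder_layers: nn.ModuleList, hidden_size: int, insert_every: int = 4):
        super().__init__()
        self.layers = decoder_layers
        for p in self.layers.parameters():
            p.requires_grad_(False)  # only the adapters are trained
        self.adapters = nn.ModuleDict({
            str(i): CrossAttentionAdapter(hidden_size)
            for i in range(len(decoder_layers)) if i % insert_every == 0
        })

    def forward(self, hidden_states, context_states):
        for i, layer in enumerate(self.layers):
            if str(i) in self.adapters:
                hidden_states = self.adapters[str(i)](hidden_states, context_states)
            hidden_states = layer(hidden_states)
        return hidden_states

# Toy usage with stand-in layers; a real setup would wrap a pre-trained LLM's decoder blocks.
layers = nn.ModuleList([nn.TransformerEncoderLayer(256, 4, batch_first=True) for _ in range(8)])
model = DecoderWithXCAdapters(layers, hidden_size=256, insert_every=4)
query_states = torch.randn(1, 16, 256)     # hidden states for the question tokens
cached_context = torch.randn(1, 128, 256)  # pre-computed, cached context representations
print(model(query_states, cached_context).shape)  # torch.Size([1, 16, 256])
```

Because only the adapter parameters receive gradients, the approach preserves the pre-trained decoder while adding a comparatively small number of trainable weights.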

The models are trained on question-answering data designed to evaluate conditional generation without explicit prompts. Multitask training strategies, including an auxiliary context-repetition task, are employed to enhance model robustness and data handling capacity (a data-formatting sketch follows below).
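
The snippet below illustrates one way such a multitask mix could be assembled: the main task conditions answer generation on a cached context, while the auxiliary context-repetition task asks the model to reproduce the context itself, encouraging the cross-attention layers to actually use the cached states. The field names and the sampling ratio are hypothetical.

```python
import random

def make_training_example(context: str, question: str, answer: str, p_repeat: float = 0.2):
    """Build a (context, query, target) triple; with probability p_repeat (hypothetical
    mixing ratio), switch to the auxiliary context-repetition task."""
    if random.random() < p_repeat:
        # Auxiliary task: the target is the context itself, so the model must rely on
        # the cached context representations rather than memorized knowledge.
        return {"context": context, "query": "Repeat the context.", "target": context}
    # Main task: answer the question conditioned on the cached context, with no ICL prompt.
    return {"context": context, "query": question, "target": answer}

example = make_training_example(
    context="XC-Cache stores encoder outputs instead of per-layer key/value states.",
    question="What does XC-Cache store?",
    answer="Encoder outputs, rather than per-layer key/value states.",
)
print(example["query"], "->", example["target"])
```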

Performance Evaluation

Performance assessments reveal that models implementing XC-Cache principles perform on par with, if not slightly better than, their ICL counterparts:

  • Numerical Results: The required cache size drops substantially without a comparable loss in answer accuracy.
  • Comparative Analysis: Models trained with the XC-Cache framework show competitive QA performance against other leading LLM setups, as attested by their F1 and BERTScore results on diverse QA tasks (a token-level F1 sketch follows this list).
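
For reference, QA answers are commonly scored with a token-overlap F1 of the kind sketched below (simplified SQuAD-style scoring; normalization details vary by benchmark and may differ from the paper's exact evaluation script).

```python
import string
from collections import Counter

def normalize(text: str):
    """Lowercase, drop punctuation and articles (simplified SQuAD-style normalization)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in {"a", "an", "the"}]

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens, ref_tokens = normalize(prediction), normalize(reference)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(f"{token_f1('The Eiffel Tower, in Paris', 'Eiffel Tower (Paris)'):.2f}")  # 0.86
```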

Theoretical and Practical Implications

The XC-Cache methodology promotes a conceptual shift in large-scale model deployment, focusing on efficiency without losing the quality of output:

  • Cache Efficiency vs. Model Performance: The trade-off between reduced cache size and maintained performance provides practical benefits, particularly in environments where resource constraints are paramount.
  • Future Research Directions: Includes recommendations for integrating XC-Cache with other model compression and optimization techniques to further enhance inference speed and reduce computational demands.

Overall, XC-Cache emerges as an innovative solution to the inefficiencies observed in traditional LLM inference, offering a viable pathway to reducing operational costs and computational overhead in deploying advanced LLMs in real-world applications.
