
A2SF: Accumulative Attention Scoring with Forgetting Factor for Token Pruning in Transformer Decoder (2407.20485v2)

Published 30 Jul 2024 in cs.CL and cs.LG

Abstract: Recently, transformer-based LLMs have faced memory bottlenecks due to the KV cache, especially when handling long sequences. Previous research proposed KV cache compression techniques that identify insignificant tokens based on Accumulative Attention Scores and remove their entries from the KV cache, noting that only a few tokens play an important role in attention operations. However, we have observed that the existing Accumulative Attention Score is not suitable for the transformer decoder structure. In the decoder model, the number of times the Attention Score accumulates varies with the order of token appearance due to the effect of masking, causing an uneven comparison between tokens. To solve this, we propose the Accumulative Attention Score with Forgetting Factor (A2SF) technique, which introduces a Forgetting Factor into the Attention Score accumulation process. A2SF penalizes past Attention Scores generated by old tokens by repeatedly multiplying them by the Forgetting Factor over time. Older tokens therefore receive a larger penalty, providing fairness across tokens of different ages. Through this fair comparison among tokens, we can more effectively select important tokens. We verified the accuracy improvement of A2SF on the OPT and LLaMA models; A2SF improves the accuracy of LLaMA 2 by up to 7.8% and 5.1% in 1-shot and 0-shot settings, respectively.
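As a rough illustration of the idea described in the abstract (not the authors' implementation), the sketch below shows how a forgetting factor could be folded into accumulative attention scoring before evicting low-scoring tokens from a KV cache. The function names, the forgetting-factor value, and the eviction budget are illustrative assumptions.

```python
import numpy as np

def accumulate_scores(attention_rows, forgetting_factor=0.9):
    """Accumulate per-token attention scores over decoding steps,
    decaying previously accumulated scores by the forgetting factor
    at every step so older contributions are penalized more.

    attention_rows: list of 1-D arrays; row t holds the attention
    probabilities that the token generated at step t assigns to the
    first t+1 tokens (causal masking), so rows grow by one entry per step.
    """
    num_tokens = len(attention_rows)
    scores = np.zeros(num_tokens)
    for t, row in enumerate(attention_rows):
        # Decay everything accumulated so far, then add this step's scores.
        scores *= forgetting_factor
        scores[: t + 1] += row
    return scores

def select_tokens_to_keep(scores, budget):
    """Keep the `budget` tokens with the highest accumulated scores."""
    keep = np.argsort(scores)[-budget:]
    return np.sort(keep)

# Toy example: four decoding steps with softmax-normalized attention rows.
rows = [
    np.array([1.0]),
    np.array([0.6, 0.4]),
    np.array([0.5, 0.2, 0.3]),
    np.array([0.4, 0.1, 0.2, 0.3]),
]
scores = accumulate_scores(rows, forgetting_factor=0.9)
print(select_tokens_to_keep(scores, budget=2))
```

Setting the forgetting factor to 1.0 recovers plain accumulative attention scoring, which favors early tokens simply because they have had more steps to accumulate score; values below 1.0 discount older contributions and make the comparison fairer across token ages.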

Authors (2)
  1. Hyun-rae Jo
  2. Dongkun Shin