Efficient Sparse Attention needs Adaptive Token Release (2407.02328v1)
Abstract: In recent years, LLMs have demonstrated remarkable capabilities across a wide array of text-centric tasks. However, their 'large' scale introduces significant computational and storage challenges, particularly in managing the key-value states of the transformer, which limits their wider applicability. We therefore propose to adaptively release resources from the key-value cache and rebuild the necessary key-value states. Specifically, we accomplish this with a lightweight controller module that approximates an ideal top-$K$ sparse attention. This module retains the tokens with the top-$K$ highest attention weights and simultaneously rebuilds the key-value states of discarded tokens that may become essential for future decoding. Comprehensive experiments on natural language generation and modeling reveal that our method is not only competitive with full attention in terms of performance but also achieves a significant throughput improvement of up to 221.8%. The code for replication is available at https://github.com/WHUIR/ADORE.
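As a rough illustration of the idea (a minimal sketch, not the authors' implementation from the linked repository), the PyTorch snippet below keeps only a fixed budget of cached key-value pairs per decoding step: a lightweight scorer ranks the cached tokens given the current query, the top-$K$ entries stay in the cache, and the ids of evicted tokens are remembered so that their key-value states could be rebuilt later if they become relevant again. All names here (`LightweightController`, `release_and_keep_topk`, `budget`) are hypothetical.

```python
import torch
import torch.nn as nn

class LightweightController(nn.Module):
    """Hypothetical stand-in for a lightweight controller: scores cached tokens
    so that top-K selection can be approximated without full attention."""
    def __init__(self, head_dim):
        super().__init__()
        self.scorer = nn.Linear(2 * head_dim, 1)

    def forward(self, query, keys):
        # query: (head_dim,), keys: (num_cached, head_dim)
        q = query.expand(keys.size(0), -1)
        return self.scorer(torch.cat([q, keys], dim=-1)).squeeze(-1)  # (num_cached,)

def release_and_keep_topk(controller, query, k_cache, v_cache, token_ids, budget):
    """Retain the `budget` highest-scoring cached tokens and return the kept cache
    together with the ids of evicted tokens, whose key-value states could be
    recomputed later if the controller selects them again."""
    scores = controller(query, k_cache)
    budget = min(budget, scores.size(0))
    keep = torch.topk(scores, budget).indices.sort().values   # preserve positional order
    mask = torch.zeros(scores.size(0), dtype=torch.bool)
    mask[keep] = True
    evicted_ids = token_ids[~mask]                             # remember what was dropped
    return k_cache[keep], v_cache[keep], token_ids[keep], evicted_ids

# Toy usage with random tensors.
torch.manual_seed(0)
num_cached, head_dim, budget = 16, 8, 4
controller = LightweightController(head_dim)
query = torch.randn(head_dim)
k_cache, v_cache = torch.randn(num_cached, head_dim), torch.randn(num_cached, head_dim)
token_ids = torch.arange(num_cached)
k_kept, v_kept, kept_ids, evicted_ids = release_and_keep_topk(
    controller, query, k_cache, v_cache, token_ids, budget)
print(kept_ids.tolist(), evicted_ids.tolist())
```

The sketch only tracks evicted token ids; in the method described in the abstract, the key-value states of such tokens are rebuilt when they become necessary for future decoding, rather than being kept in the cache.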