
Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models

Published 22 Jul 2024 in cs.LG and cs.CL | (2407.15516v1)

Abstract: The inference demand for LLMs has skyrocketed in recent months, and serving models with low latencies remains challenging because the complexity of the attention layers is quadratic in the input length. In this work, we investigate the effect of dropping MLP and attention layers at inference time on the performance of Llama-v2 models. We find that dropping deeper attention layers only marginally decreases performance but, alongside dropping entire layers, leads to the best speedups. For example, removing 33% of attention layers in a 13B Llama2 model results in a 1.8% drop in average performance over the OpenLLM benchmark. We also observe that when the skipped layers are not limited to the latter layers, performance falls as more layers are skipped, except when only the attention layers are skipped.
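
As a rough illustration of what dropping the deeper attention layers at inference time can look like, the following is a minimal sketch in plain Python, not the authors' implementation. The helper names `deepest_layers` and `forward`, the toy sublayers, and the choice to round the number of skipped layers down are all assumptions made for this example. Each layer is modeled as an attention sublayer followed by an MLP sublayer, both added to a residual stream; skipping an attention sublayer simply leaves the residual stream unchanged at that point.

```python
from typing import Callable, List, Sequence, Set

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def deepest_layers(num_layers: int, fraction: float) -> Set[int]:
    """Return 0-based indices of the deepest `fraction` of layers.

    The count is rounded down (an assumption of this sketch).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    k = int(num_layers * fraction)
    return set(range(num_layers - k, num_layers))


def forward(
    x: Vector,
    attn_sublayers: Sequence[Sublayer],
    mlp_sublayers: Sequence[Sublayer],
    skip_attn: Set[int],
) -> Vector:
    """Run a toy residual stack, skipping attention in the given layers."""
    for i, (attn, mlp) in enumerate(zip(attn_sublayers, mlp_sublayers)):
        if i not in skip_attn:
            # Residual connection around the attention sublayer.
            x = [a + b for a, b in zip(x, attn(x))]
        # The MLP sublayer is always applied in this sketch.
        x = [a + b for a, b in zip(x, mlp(x))]
    return x


# Toy sublayers (hypothetical stand-ins for real attention and MLP blocks).
toy_attn: Sublayer = lambda v: [1.0 for _ in v]
toy_mlp: Sublayer = lambda v: [0.5 for _ in v]

num_layers = 4
skip = deepest_layers(num_layers, 0.5)  # skip attention in the last 2 layers
out = forward([0.0], [toy_attn] * num_layers, [toy_mlp] * num_layers, skip)
```

In a real model the skipped sublayers would be the attention blocks of the deepest transformer layers, and the benefit would come from not computing them at all; the toy functions here only make the control flow visible.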
