FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping (2404.03865v1)

Published 5 Apr 2024 in cs.CL and cs.LG

Abstract: Autoregressive LLMs (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges for autoregressive token-by-token generation. To mitigate the computation overhead incurred during generation, several early-exit and layer-dropping strategies have been proposed. Despite some promising success owing to the redundancy across LLM layers on metrics like ROUGE-L/BLEU, our careful knowledge-intensive evaluation unveils issues such as generation collapse, hallucination of wrong facts, and noticeable performance drops even at a trivial exit ratio of 10-15% of layers. We attribute these errors primarily to ineffective handling of the KV cache through state copying during early exit. In this work, we observe the saturation of the computationally expensive feed-forward blocks of LLM layers and propose FFN-SkipLLM, a novel fine-grained skip strategy for autoregressive LLMs. More specifically, FFN-SkipLLM is an input-adaptive feed-forward skipping strategy that can skip 25-30% of the FFN blocks of LLMs with marginal change in performance on knowledge-intensive generation tasks, without any requirement to handle the KV cache. Our extensive experiments and ablations across benchmarks like MT-Bench, Factoid-QA, and variable-length text summarization illustrate how our simple and easy-to-use method can facilitate faster autoregressive decoding.

FFN-SkipLLM: Adaptive Feed-Forward Skipping Strategy for Enhanced Autoregressive Decoding in LLMs

Introduction

The rapid growth in the capabilities of autoregressive LLMs has been accompanied by mounting deployment challenges, driven by the substantial computational demands these models entail. Several early-exit and layer-dropping strategies have been proposed to mitigate these costs, but they often suffer from generation collapse and hallucination caused by ineffective handling of the Key-Value (KV) cache. This paper introduces FFN-SkipLLM, a novel strategy that targets the computationally expensive Feed-Forward Network (FFN) blocks within LLM layers. By performing fine-grained, input-adaptive skipping of approximately 25-30% of FFN blocks, FFN-SkipLLM incurs only marginal performance changes on knowledge-intensive generation tasks while avoiding the KV cache issues that hamper existing approaches.

Motivation

Two observations motivate this work. First, significant redundancy exists in the computation performed by FFN blocks within LLMs, particularly in the middle layers. Second, the "attention sink" phenomenon, whereby attention concentrates disproportionately on the earliest tokens of a sequence, means a portion of the model's computation can be bypassed without substantially degrading performance. The proposed approach departs from traditional layer-skipping methodologies by skipping only FFN blocks, thereby circumventing the complexities of KV cache handling.

FFN-SkipLLM: An Approach to FFN Block Skipping

Preliminaries

Analysis reveals that FFN blocks, which account for approximately two-thirds of the parameters in a given layer (as demonstrated on LLaMa-7B), exhibit a high degree of computational redundancy. This redundancy is concentrated in the middle layers of LLMs, with cosine similarity analyses indicating that the hidden-state tensors before and after FFN blocks undergo minimal change. Consequently, FFN blocks within these "non-cold" regions emerge as prime candidates for skipping, promising substantial computational savings with negligible impact on output quality.
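As a rough illustration of this kind of redundancy probe, the sketch below (not the authors' code) hooks every per-layer FFN/MLP module of a Hugging Face causal LM and records the cosine similarity between the hidden state entering the FFN and that state plus the FFN's residual contribution. The model name ("gpt2") and module paths (model.transformer.h, .mlp) are illustrative assumptions; a LLaMA-style checkpoint would expose model.model.layers[i].mlp instead, and the measurement is only a proxy for the paper's before/after-FFN comparison.

```python
# Probe how much each FFN block actually changes the token representation.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a LLaMA-style model; swap in any causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

ffn_similarity = {}  # layer index -> mean cosine similarity

def make_hook(idx):
    def hook(module, inputs, output):
        x_in = inputs[0]          # hidden state fed into the FFN block
        x_out = x_in + output     # that state after adding the FFN's residual update
        ffn_similarity[idx] = F.cosine_similarity(x_in, x_out, dim=-1).mean().item()
    return hook

# Register a forward hook on every per-layer FFN/MLP sub-module.
handles = [layer.mlp.register_forward_hook(make_hook(i))
           for i, layer in enumerate(model.transformer.h)]

with torch.no_grad():
    batch = tok("The capital of France is", return_tensors="pt")
    model(**batch)

for h in handles:
    h.remove()

for i, sim in sorted(ffn_similarity.items()):
    print(f"layer {i:2d}: cos(pre-FFN, post-FFN) = {sim:.3f}")
```

Layers whose similarity stays near 1.0 are the "non-cold" candidates whose FFN blocks contribute little and can be considered for skipping.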

Methodology

FFN-SkipLLM employs a dynamic strategy that adapts FFN block skipping according to input-specific characteristics. This strategy is detailed in an algorithm that selectively bypasses FFN blocks within non-cold regions based on the cosine similarity between input and output tensors of these blocks. By maintaining the computation in the initial and final layers (cold regions) and employing a warm-up mechanism that temporarily foregoes skipping for the initial tokens, FFN-SkipLLM preserves the integrity of the KV cache and ensures a stable generation process.
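The sketch below is a minimal, self-contained illustration of this control flow and is not the authors' implementation: the toy layer, the cold-region boundaries (first_cold, last_cold), the warm-up length, the similarity threshold tau, and the skip budget are all assumed values chosen for illustration. It shows how a per-token decoding step can keep cold layers and warm-up tokens at full compute while skipping non-cold FFN blocks whose predecessor barely changed the hidden state.

```python
# Illustrative input-adaptive FFN skipping for one decoding step (assumed constants).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLayer(nn.Module):
    """Stand-in for a decoder layer; attention is omitted so only the FFN path matters."""
    def __init__(self, d):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x, run_ffn=True):
        if run_ffn:
            return x + self.ffn(x)
        return x  # FFN skipped: the hidden state passes through unchanged

def decode_step(layers, x, token_idx,
                first_cold=4, last_cold=28, warmup_tokens=8,
                tau=0.99, max_skips=10):
    """Forward pass for one token with input-adaptive FFN skipping (illustrative rule)."""
    skips = 0
    prev_sim = 0.0  # cosine similarity observed at the most recently executed FFN
    for i, layer in enumerate(layers):
        cold = i < first_cold or i >= last_cold      # always compute in cold regions
        warming_up = token_idx < warmup_tokens       # always compute for the first tokens
        skip = (not cold) and (not warming_up) and prev_sim > tau and skips < max_skips
        x_in = x
        x = layer(x, run_ffn=not skip)
        if skip:
            skips += 1
        else:
            # If this FFN was nearly a no-op, the next non-cold FFN becomes a skip candidate.
            prev_sim = F.cosine_similarity(x_in, x, dim=-1).mean().item()
    return x, skips

layers = nn.ModuleList(ToyLayer(64) for _ in range(32))
x = torch.randn(1, 1, 64)  # hidden state of the current token
for t in range(16):
    x, n_skipped = decode_step(layers, x, token_idx=t)
    print(f"token {t:2d}: skipped {n_skipped} FFN blocks")
```

Because attention is always computed and only the FFN sub-block is bypassed, the KV cache is populated exactly as in the dense model, which is the property that lets FFN-SkipLLM avoid the state-copying problems of early-exit methods.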

Experimental Evaluation

Extensive experiments across benchmarks such as MT-Bench, Factoid-QA, and variable-length text summarization demonstrate the efficacy of FFN-SkipLLM. Notably, the model can skip a significant portion of FFN blocks while retaining nearly full model performance across a range of knowledge-intensive tasks. This capability starkly contrasts with the performance drops and inaccuracies observed in existing layer-skipping approaches, affirming the potential of FFN-SkipLLM as a more robust and efficient alternative.

Implications and Future Directions

The introduction of FFN-SkipLLM opens up new avenues for enhancing the performance and efficiency of autoregressive LLMs. By sidestepping the challenges associated with KV cache management inherent in layer-skipping strategies, this approach paves the way for more sustainable and accessible deployment of LLMs across various applications. Moving forward, integrating FFN-SkipLLM with other model compression techniques, such as sparsity and quantization, may yield further improvements in computational efficiency. Additionally, addressing the current limitations related to the scaling of skip ratios beyond 35% without performance degradation remains an area ripe for future research.

Conclusion

FFN-SkipLLM represents a significant stride toward mitigating the computational demands of deploying state-of-the-art autoregressive LLMs. By leveraging insights into the redundancy of FFN blocks and the strategic skipping of these components, this approach achieves a delicate balance between computational efficiency and model performance, heralding a new era of more accessible and performant LLMs.

Authors (7)
  1. Ajay Jaiswal
  2. Bodun Hu
  3. Lu Yin
  4. Yeonju Ro
  5. Shiwei Liu
  6. Tianlong Chen
  7. Aditya Akella