FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping (2404.03865v1)
Abstract: Autoregressive LLMs (e.g., LLaMA, GPTs) are omnipresent, achieving remarkable success in language understanding and generation. However, this impressive capability typically comes with a substantial model size, which presents significant challenges for autoregressive token-by-token generation. To mitigate the computational overhead incurred during generation, several early-exit and layer-dropping strategies have been proposed. Despite some promising success owing to the redundancy across LLM layers on metrics like ROUGE-L and BLEU, our careful knowledge-intensive evaluation unveils issues such as generation collapse, hallucination of incorrect facts, and a noticeable performance drop even at a trivial exit ratio of 10-15% of layers. We attribute these errors primarily to the ineffective handling of the KV cache through state copying during early exit. In this work, we observe the saturation of the computationally expensive feed-forward blocks of LLM layers and propose FFN-SkipLLM, a novel fine-grained skipping strategy for autoregressive LLMs. More specifically, FFN-SkipLLM is an input-adaptive feed-forward skipping strategy that can skip 25-30% of the FFN blocks of LLMs with a marginal change in performance on knowledge-intensive generation tasks, without any need to handle the KV cache. Our extensive experiments and ablations across benchmarks such as MT-Bench, Factoid-QA, and variable-length text summarization illustrate how our simple and easy-to-use method can facilitate faster autoregressive decoding.
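Below is a minimal PyTorch sketch of the input-adaptive FFN-skipping idea described above, assuming a LLaMA-style decoder block. The saturation test (cosine similarity between the hidden state before and after the FFN residual), the 0.99 threshold, and the single warm-up monitoring step are illustrative assumptions rather than the paper's exact criterion; the sketch is only meant to show that the attention sub-block always runs, so the KV cache is written for every layer and no state copying is required.

```python
# Illustrative sketch of input-adaptive FFN skipping (not the paper's exact rule).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecoderLayerWithFFNSkip(nn.Module):
    """LLaMA-style block whose FFN sub-block can be skipped when it looks saturated."""

    def __init__(self, d_model: int = 64, d_ff: int = 256, threshold: float = 0.99):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.attn_norm = nn.LayerNorm(d_model)
        self.ffn_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )
        self.threshold = threshold        # assumed saturation threshold
        self.skip_ffn = False             # decided adaptively from observed saturation

    def forward(self, h: torch.Tensor, monitor: bool = False) -> torch.Tensor:
        # The attention sub-block always runs, so the KV cache is filled for
        # every layer and no early-exit-style state copying is needed.
        a = self.attn_norm(h)
        attn_out, _ = self.attn(a, a, a, need_weights=False)
        h = h + attn_out

        if self.skip_ffn and not monitor:
            return h  # identity shortcut: the FFN block is skipped entirely

        out = h + self.ffn(self.ffn_norm(h))
        if monitor:
            # Saturation proxy (assumption): if the FFN barely changes the
            # hidden state, mark this block as skippable for later tokens.
            cos = F.cosine_similarity(h, out, dim=-1).mean()
            self.skip_ffn = bool(cos > self.threshold)
        return out


# Usage sketch: monitor saturation on an early decoding step, then let
# saturated layers fall back to the identity shortcut for later tokens.
layer = DecoderLayerWithFFNSkip()
h = torch.randn(1, 1, 64)           # (batch, seq, d_model) for one token
_ = layer(h, monitor=True)          # warm-up step decides skip_ffn
out = layer(torch.randn(1, 1, 64))  # later tokens may skip the FFN
print("FFN skipped for later tokens:", layer.skip_ffn)
```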
Authors: Ajay Jaiswal, Bodun Hu, Lu Yin, Yeonju Ro, Shiwei Liu, Tianlong Chen, Aditya Akella