Not All Layers of LLMs Are Necessary During Inference (2403.02181v3)
Abstract: Due to their large number of parameters, the inference phase of LLMs is resource-intensive. However, not all requests posed to LLMs are equally difficult to handle. Through analysis, we show that for some tasks, LLMs can achieve results comparable to the final output at some intermediate layers; that is, not all layers of an LLM are necessary during inference. If we can predict the layer at which the intermediate result matches the final result (produced by evaluating all layers), we can significantly reduce the inference cost. To this end, we propose a simple yet effective algorithm named AdaInfer that adaptively terminates the inference process for each input instance. AdaInfer relies on easily obtainable statistical features and classic classifiers such as SVM. Experiments on well-known LLMs, including the Llama 2 series and OPT, show that AdaInfer achieves an average pruning ratio of 17.8%, and up to 43% on sentiment tasks, with nearly no performance drop (<1%). Because AdaInfer does not alter model parameters, LLMs incorporating AdaInfer retain their generalizability across tasks.
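The abstract's recipe, computing cheap statistical features at each layer and feeding them to a lightweight classifier such as an SVM that decides when to stop, can be illustrated with a short sketch. Everything below (the `extract_features` helper, the toy training data, and the stand-in per-layer logits) is hypothetical scaffolding for illustration, not the authors' released implementation.

```python
# Illustrative sketch of per-layer early-exit gating with an SVM.
# All names and data here are placeholders, not the AdaInfer codebase.
import numpy as np
from sklearn.svm import SVC

def extract_features(layer_logits):
    """Simple per-layer statistics: top-1 probability and top-1/top-2 gap."""
    probs = np.exp(layer_logits - layer_logits.max())
    probs /= probs.sum()
    top2 = np.sort(probs)[-2:]                       # [second-best, best]
    return np.array([top2[1], top2[1] - top2[0]])    # [top-1 prob, gap]

# --- Offline: train the stopping classifier on labeled per-layer features ---
# X: features per (instance, layer); y: 1 if that layer's prediction already
# matches the final-layer prediction, else 0. Toy data stands in for both.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] > 0).astype(int)
gate = SVC(kernel="rbf").fit(X_train, y_train)

# --- Online: walk the layers, exit as soon as the gate says "good enough" ---
def adaptive_infer(per_layer_logits, gate):
    for depth, logits in enumerate(per_layer_logits, start=1):
        feats = extract_features(logits)
        if gate.predict(feats.reshape(1, -1))[0] == 1:
            return int(logits.argmax()), depth       # early exit
    return int(per_layer_logits[-1].argmax()), depth # fall through to last layer

# Toy stand-in for the logits an LLM would produce at each of its layers.
fake_logits = [rng.normal(size=32) for _ in range(32)]
token, used_layers = adaptive_infer(fake_logits, gate)
print(f"predicted token id {token} using {used_layers}/32 layers")
```

In practice the gate would be trained on features collected from a real model's intermediate hidden states, with labels indicating whether the layer-wise prediction already matches the final-layer output; because the gate is external to the LLM, the model's parameters stay untouched.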
Authors: Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, Zhongyuan Wang