Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy (2404.06954v1)
Abstract: Recently, dynamic computation methods have shown notable acceleration for LLMs by skipping several layers of computation through elaborate heuristics or additional predictors. However, in the decoding process of existing approaches, different samples are assigned different computational budgets, which cannot guarantee a stable and precise acceleration effect. Furthermore, existing approaches generally skip multiple contiguous layers at the bottom or top of the model, leading to a drastic change in the model's layer-wise representations and a consequent performance degradation. Therefore, we propose a Unified Layer Skipping strategy, which selects the number of layers whose computation to skip based solely on the target speedup ratio, and then skips the corresponding number of intermediate layers in a balanced manner. Since the Unified Layer Skipping strategy is independent of the input samples, it naturally supports popular acceleration techniques such as batch decoding and KV caching, making it more practical for real-world applications. Experimental results on two common tasks, i.e., machine translation and text summarization, indicate that given a target speedup ratio, the Unified Layer Skipping strategy significantly improves both inference performance and actual model throughput over existing dynamic approaches.
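To make the abstract's description concrete, here is a minimal sketch (not the authors' code) of what a unified layer-skipping schedule could look like: given only the total layer count and the target speedup ratio, it decides how many layers to execute and spreads the retained layers evenly across the stack, independently of the input sample. All function names and the assumption that the first and last layers are always retained are illustrative, not taken from the paper.

```python
def unified_skip_schedule(num_layers: int, speedup_ratio: float) -> list[int]:
    """Return the indices of layers to KEEP for a given target speedup ratio.

    Depends only on the target speedup, not on the input sample, so the same
    schedule applies to every token and every sequence in a batch.
    """
    assert speedup_ratio >= 1.0 and num_layers >= 2
    # Number of layers we can afford to execute per token (at least 2).
    num_keep = min(num_layers, max(2, round(num_layers / speedup_ratio)))
    # Spread the kept layers evenly from the first layer (0) to the last
    # layer (num_layers - 1); intermediate layers in between are skipped.
    return [round(i * (num_layers - 1) / (num_keep - 1)) for i in range(num_keep)]


def forward_with_skipping(hidden, layers, kept_indices):
    """Run only the retained layers; skipped layers pass the hidden state through."""
    kept = set(kept_indices)
    for idx, layer in enumerate(layers):
        if idx in kept:
            hidden = layer(hidden)  # assumed transformer-block call signature
        # else: identity shortcut, i.e., this layer's computation is skipped
    return hidden


if __name__ == "__main__":
    # Example: a 32-layer model with a 2x target speedup retains ~16 layers.
    print(unified_skip_schedule(32, 2.0))
```

Because the schedule is fixed ahead of time, every sequence in a batch skips the same layers, which is what allows straightforward batch decoding and KV caching of the retained layers, as the abstract notes.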