LLM Inference Unveiled: Survey and Roofline Model Insights (2402.16363v6)
Abstract: The field of efficient LLM inference is rapidly evolving, presenting a unique blend of opportunities and challenges. Although the field has expanded and is vibrant, there has not been a concise framework that analyzes the various methods of LLM inference to provide a clear understanding of this domain. Our survey stands out from traditional literature reviews by not only summarizing the current state of research but also by introducing a framework based on the roofline model for the systematic analysis of LLM inference techniques. This framework identifies the bottlenecks when deploying LLMs on hardware devices and provides a clear understanding of practical problems, such as why LLMs are memory-bound, how much memory and computation they need, and how to choose the right hardware. We systematically collate the latest advancements in efficient LLM inference, covering crucial areas such as model compression (e.g., Knowledge Distillation and Quantization), algorithmic improvements (e.g., Early Exit and Mixture-of-Experts), and both hardware- and system-level enhancements. Our survey is distinctive in analyzing these methods with the roofline model, which helps clarify their impact on memory access and computation. This approach not only showcases the current research landscape but also delivers valuable insights for practical implementation, positioning our work as an indispensable resource both for researchers new to the field and for those seeking to deepen their understanding of efficient LLM deployment. The analysis tool, LLM-Viewer, is open-sourced.
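As a concrete illustration of the kind of analysis the roofline framework enables, the minimal sketch below estimates the arithmetic intensity of a single batch-1 decode-step weight matrix multiply and compares it to a hardware ridge point. The peak-FLOPS and bandwidth numbers, the FP16 weight assumption, and the function name are illustrative placeholders; this is not the LLM-Viewer implementation.

```python
# Minimal roofline-style estimate for one dense weight multiply in a
# batch-1 decode step (illustrative sketch, not LLM-Viewer code).

def roofline_decode_matmul(d_model: int,
                           bytes_per_weight: float = 2.0,    # assume FP16 weights
                           peak_flops: float = 312e12,       # assumed peak compute, FLOP/s
                           peak_bandwidth: float = 2.0e12):  # assumed memory bandwidth, B/s
    """Estimate whether a batch-1 GEMV x @ W (W is d_model x d_model)
    is memory-bound or compute-bound on the assumed hardware."""
    flops = 2 * d_model * d_model                        # multiply-accumulate operations
    bytes_moved = d_model * d_model * bytes_per_weight   # weight traffic dominates at batch 1
    intensity = flops / bytes_moved                      # arithmetic intensity, FLOP/byte
    ridge_point = peak_flops / peak_bandwidth            # hardware ridge point, FLOP/byte
    bound = "memory-bound" if intensity < ridge_point else "compute-bound"
    attainable = min(peak_flops, intensity * peak_bandwidth)  # roofline-limited throughput
    return intensity, ridge_point, bound, attainable

if __name__ == "__main__":
    ai, ridge, bound, perf = roofline_decode_matmul(d_model=4096)
    print(f"arithmetic intensity ~ {ai:.1f} FLOP/B, ridge ~ {ridge:.0f} FLOP/B -> {bound}")
    print(f"attainable throughput ~ {perf / 1e12:.2f} TFLOP/s")
```

With these assumed numbers, a 4096-dimensional decode-step GEMV has an arithmetic intensity of roughly 1 FLOP/byte, far below the ridge point, which is why single-token LLM decoding is typically memory-bound rather than compute-bound.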
Authors: Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, Zhe Zhou