
Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

(2312.15234)
Published Dec 23, 2023 in cs.LG, cs.AI, cs.DC, and cs.PF

Abstract

In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However, the computational intensity and memory consumption of deploying these models present substantial challenges in terms of serving efficiency, particularly in scenarios demanding low latency and high throughput. This survey addresses the imperative need for efficient LLM serving methodologies from a machine learning system (MLSys) research perspective, standing at the crux of advanced AI innovations and practical system optimizations. We provide in-depth analysis, covering a spectrum of solutions, ranging from cutting-edge algorithmic modifications to groundbreaking changes in system designs. The survey aims to provide a comprehensive understanding of the current state and future directions in efficient LLM serving, offering valuable insights for researchers and practitioners in overcoming the barriers of effective LLM deployment, thereby reshaping the future of AI.


  277. Baichuan 2: Open Large-scale Language Models
  278. Inference with Reference: Lossless Acceleration of Large Language Models
  279. Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding
  280. ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation
  281. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. Advances in Neural Information Processing Systems 35 (2022), 27168–27183.
  282. TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 5798–5809.
  283. SparseTIR: Composable abstractions for sparse compilation in deep learning. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. 660–678.
  284. Anil Yemme and Shayan Srinivasa Garani. 2023. A Scalable GPT-2 Inference Hardware Architecture on FPGA. In 2023 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8.
  285. Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 521–538.
  286. MetaFormer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10819–10829.
  287. RPTQ: Reorder-based Post-training Quantization for Large Language Models
  288. Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning
  289. Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems 33 (2020), 17283–17297.
  290. GLM-130B: An Open Bilingual Pre-trained Model
  291. Learning to Skip for Language Modeling
  292. An Attention Free Transformer
  293. ByteTransformer: A high-performance transformer boosted for variable-length inputs. In 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 344–355.
  294. DePA: Improving Non-autoregressive Translation with Dependency-Aware Decoder. In Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023). 478–490.
  295. MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving. In 2019 USENIX Annual Technical Conference (USENIX ATC 19). 1049–1062.
  296. Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding
  297. Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models
  298. LatticeGen: A Cooperative Framework which Hides Generated Text in a Lattice for Privacy-Aware Generation on Cloud
  299. Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer. In Findings of the Association for Computational Linguistics: EMNLP 2023. 2775–2786.
  300. OPT: Open Pre-trained Transformer Language Models
  301. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
  302. Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
  303. Alpa: Automating inter- and Intra-Operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559–578.
  304. EINNET: Optimizing Tensor Programs with Derivation-Based Transformations. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). 739–755.
  305. PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation. In Proceedings of the 29th Symposium on Operating Systems Principles. 331–347.
  306. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. 11106–11115.
  307. TransPIM: A memory-based acceleration via software-hardware co-design for transformer. In 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 1071–1085.
  308. BERT loses patience: Fast and robust inference with early exit. Advances in Neural Information Processing Systems 33 (2020), 18330–18341.
  309. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems 35 (2022), 7103–7114.
  310. DistillSpec: Improving Speculative Decoding via Knowledge Distillation
  311. PetS: A Unified Framework for Parameter-Efficient Transformers Serving. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). 489–504.
  312. On Optimal Caching and Model Multiplexing for Large Model Inference
  313. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
  314. A Survey on Model Compression for Large Language Models
  315. Falcon LLM: A New Frontier in Natural Language Processing. AC Investment Research Journal 220, 44 (2023).
