Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems (2312.15234v1)

Published 23 Dec 2023 in cs.LG, cs.AI, cs.DC, and cs.PF

Abstract: In the rapidly evolving landscape of AI, generative LLMs stand at the forefront, revolutionizing how we interact with our data. However, the computational intensity and memory consumption of deploying these models present substantial challenges in terms of serving efficiency, particularly in scenarios demanding low latency and high throughput. This survey addresses the imperative need for efficient LLM serving methodologies from a machine learning system (MLSys) research perspective, standing at the crux of advanced AI innovations and practical system optimizations. We provide in-depth analysis, covering a spectrum of solutions, ranging from cutting-edge algorithmic modifications to groundbreaking changes in system designs. The survey aims to provide a comprehensive understanding of the current state and future directions in efficient LLM serving, offering valuable insights for researchers and practitioners in overcoming the barriers of effective LLM deployment, thereby reshaping the future of AI.

Towards Efficient Generative LLM Serving: An Expert Overview

The paper "Towards Efficient Generative LLM Serving: A Survey from Algorithms to Systems," authored by Xupeng Miao et al., from Carnegie Mellon University, offers a detailed exploration of efficient serving methodologies for generative LLMs. This overview aims to summarize key insights, methodologies, and findings discussed in the paper, catering to an audience of experienced researchers in the domain.

Introduction

Generative LLMs built on Transformer architectures such as GPT and LLaMA have advanced rapidly, delivering strong performance across a wide range of NLP tasks. Despite this success, serving these models poses substantial computational and memory challenges, especially under the low-latency and high-throughput requirements of practical applications. The paper methodically addresses these challenges by examining both algorithmic modifications and system-level optimizations.

Taxonomy of LLM Serving Techniques

The paper categorizes the strategies for efficient LLM serving into two primary classes: Algorithmic Innovations and System Optimizations. This structured approach highlights the diverse methodologies aimed at optimizing LLM inference.

Algorithmic Innovations

  1. Decoding Algorithms:
    • Non-autoregressive Decoding: Techniques such as Parallel Decoding, which reframe the decoding process to allow multiple tokens to be generated in parallel, significantly reduce decoding latency but require careful management of token dependencies to maintain output quality.
    • Speculative Decoding: Methods such as SpecInfer increase decoding parallelism by drafting multiple tokens in advance and verifying them concurrently with the target model, improving throughput without compromising output quality (a minimal draft-then-verify sketch follows this list).
    • Early Exiting: Employs internal classifiers to output predictions at earlier layers of the model, reducing computation for simpler queries.
    • Cascade Inference: Utilizes a hierarchy of models to process queries selectively, deploying large models only when necessary for complex requests.
  2. Architecture Design:
    • Configuration Downsizing and Attention Simplification: Reducing model depth or width and simplifying attention mechanisms lowers computational intensity while preserving essential context understanding.
    • Activation Sharing and Conditional Computing: Multi-query attention (MQA) shares key and value projections across attention heads to shrink the KV cache, while Mixture-of-Experts (MoE) architectures reduce computation by activating only a subset of expert parameters for each input.
  3. Model Compression:
    • Knowledge Distillation: Training smaller models under the guidance of larger ones, achieving efficiency gains while retaining performance.
    • Network Pruning: Structured pruning techniques that selectively remove components of the model to reduce memory overhead and enhance inference speed without extensive retraining.
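
To make the draft-then-verify pattern concrete, the Python sketch below implements a minimal greedy variant of speculative decoding. It illustrates the general idea only and is not SpecInfer itself (which verifies a token tree produced by multiple draft models); the draft_model/target_model callables and the exact-match acceptance rule are simplifying assumptions.

```python
# Minimal greedy draft-then-verify sketch of speculative decoding.
# Real systems compare token distributions (e.g. via rejection sampling) and
# batch the verification step into a single forward pass of the target model.
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # toy stand-in for a model

def speculative_decode(prompt: List[Token],
                       draft_model: NextTokenFn,
                       target_model: NextTokenFn,
                       max_new_tokens: int = 32,
                       draft_len: int = 4) -> List[Token]:
    """Draft `draft_len` tokens with the cheap model, then verify them with
    the expensive model, keeping the agreeing prefix plus one corrected token."""
    tokens = list(prompt)
    generated = 0
    while generated < max_new_tokens:
        # 1) Draft: the small model proposes a short continuation.
        draft, ctx = [], list(tokens)
        for _ in range(draft_len):
            ctx.append(draft_model(ctx))
            draft.append(ctx[-1])

        # 2) Verify: the large model re-scores each drafted position.
        for i, t in enumerate(draft):
            target_t = target_model(tokens + draft[:i])
            if target_t != t:
                # Accept the agreeing prefix and substitute the target's token.
                tokens.extend(draft[:i] + [target_t])
                generated += i + 1
                break
        else:
            tokens.extend(draft)       # every drafted token was accepted
            generated += len(draft)
    return tokens[:len(prompt) + max_new_tokens]

# Toy usage: both "models" just count upward, so all drafts are accepted.
if __name__ == "__main__":
    toy = lambda seq: seq[-1] + 1
    print(speculative_decode([1, 2, 3], toy, toy, max_new_tokens=8))
```

When the draft model agrees with the target model on most positions, several tokens are committed per target-model step, which is the source of the latency reduction.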

System Optimizations

  1. Low-bit Quantization:
    • Employing techniques like Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) to reduce the precision of weights and activations, significantly decreasing memory consumption and accelerating inference on hardware optimized for low-bit formats (a toy PTQ sketch follows this list).
  2. Parallel Computation:
    • Model Parallelism: Leveraging strategies like tensor parallelism, pipeline parallelism, and sequence parallelism to distribute computational tasks across multiple GPUs or nodes.
    • Decentralized Inference: Distributed LLM inference over a network of voluntary nodes, enhancing resource utilization and scalability.
  3. Memory Management:
    • Sophisticated memory allocation strategies such as paged attention and tree attention manage the KV cache dynamically, optimizing memory usage and reducing redundancy during inference (see the paged KV-cache sketch after this list).
  4. Request Scheduling:
    • Iteration-level Scheduling: Scheduling inference work at the granularity of individual decoding iterations rather than whole requests, improving resource utilization and throughput.
    • Dynamic Batching and Preemption: Techniques to handle variable output lengths and prioritize shorter queries to balance load effectively.
  5. Kernel Optimization:
    • Kernel Fusion and Tailored Attention: Fusing multiple operations into singular high-performance kernels and customizing GPU kernels to optimize attention calculations.
    • Sampling Optimization: Efficiently handling large vocabularies and implementing hierarchical sampling strategies to accelerate token generation processes.
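
To illustrate the core PTQ operation named in item 1, here is a minimal NumPy sketch of per-output-channel symmetric int8 weight quantization. This is a round-to-nearest toy only; practical methods in this space (e.g. GPTQ, AWQ, SmoothQuant) add calibration data and error compensation, and the function names here are illustrative.

```python
# Toy per-channel symmetric int8 post-training quantization of a weight matrix.
import numpy as np

def quantize_int8_per_channel(w: np.ndarray):
    """Quantize a [out_features, in_features] matrix with one scale per row."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)            # guard all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Usage: the reconstruction error stays within half a quantization step per row.
w = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8_per_channel(w)
print(np.abs(w - dequantize(q, scale)).max())
```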
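
And to make the paged KV-cache idea in item 3 concrete, the sketch below keeps a per-sequence block table over a shared pool of fixed-size physical blocks, so cache memory grows with the tokens actually generated instead of a padded maximum length. The class and method names and the block size are illustrative assumptions; real systems such as vLLM additionally handle block sharing, swapping, and preemption.

```python
# Minimal block-table sketch in the spirit of paged attention (illustrative only).
class PagedKVCache:
    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # shared physical pool
        self.block_tables = {}                       # seq_id -> [block ids]
        self.seq_lens = {}                           # seq_id -> tokens stored

    def append_token(self, seq_id: int):
        """Reserve a (block, offset) slot for one new token's key/value entry."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:            # current block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or swap")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % self.block_size   # physical slot to write

    def free_sequence(self, seq_id: int):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

# Usage: six tokens of one sequence span two 4-token blocks, then are recycled.
cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):
    cache.append_token(seq_id=0)
print(cache.block_tables[0])     # two block ids drawn from the shared pool
cache.free_sequence(0)
print(len(cache.free_blocks))    # back to 8 free blocks
```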

Overview of Software Frameworks

The paper also presents a comparative analysis of several cutting-edge open-source LLM serving systems, such as FasterTransformer, vLLM, and TensorRT-LLM, along with their specific optimizations and areas of focus. These frameworks encapsulate various algorithmic and system-level techniques discussed, serving as practical implementations for efficient LLM deployment.
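
For a sense of how these frameworks are used in practice, below is a brief vLLM offline-inference sketch. The API shown (LLM, SamplingParams, generate) reflects vLLM releases around the survey's timeframe and may differ in later versions; the model name and sampling values are illustrative.

```python
# Minimal vLLM offline batched inference (paged KV cache and continuous
# batching are handled internally by the engine).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                       # small open model
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["Explain paged attention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)                             # first completion
```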

Future Directions

The paper acknowledges the ongoing evolution of LLM technologies and proposes several future research directions:

  • Hardware Accelerator Development: Emphasis on hardware-software co-design to fully exploit potential efficiency gains.
  • Advanced Decoding Algorithms: Further exploration of speculative and parallel decoding techniques to balance quality and performance.
  • Long-sequence Optimization: Innovations in handling longer contexts to meet the demands of sophisticated LLM applications.
  • Alternative Architectures: Investigation into non-Transformer architectures like MLP-based models or recurrent units for potential efficiency improvements.
  • Complex Deployment Environments: Strategies for deploying LLMs across diverse environments including edge, hybrid, and decentralized systems, addressing unique challenges associated with each.

Conclusion

This comprehensive survey by Miao et al. provides valuable insights into the current methodologies and future directions for efficient generative LLM serving. By systematically analyzing both algorithmic and system-level strategies, the paper offers a robust foundation for ongoing research and development aimed at overcoming the inherent challenges of deploying large-scale LLMs in real-world applications. Continued integration of these optimizations will be pivotal in improving system performance and in broadening the accessibility and practical use of advanced AI technologies.

Authors (7)
  1. Xupeng Miao
  2. Gabriele Oliaro
  3. Zhihao Zhang
  4. Xinhao Cheng
  5. Hongyi Jin
  6. Tianqi Chen
  7. Zhihao Jia