The Efficiency Spectrum of Large Language Models: An Algorithmic Survey (2312.00678v2)

Published 1 Dec 2023 in cs.CL

Abstract: The rapid growth of LLMs has been a driving force in transforming various domains, reshaping the artificial general intelligence landscape. However, the increasing computational and memory demands of these models present substantial challenges, hindering both academic research and practical applications. To address these issues, a wide array of methods, including both algorithmic and hardware solutions, have been developed to enhance the efficiency of LLMs. This survey delivers a comprehensive review of algorithmic advancements aimed at improving LLM efficiency. Unlike other surveys that typically focus on specific areas such as training or model compression, this paper examines the multi-faceted dimensions of efficiency essential for the end-to-end algorithmic development of LLMs. Specifically, it covers various topics related to efficiency, including scaling laws, data utilization, architectural innovations, training and tuning strategies, and inference techniques. This paper aims to serve as a valuable resource for researchers and practitioners, laying the groundwork for future innovations in this critical research area. Our repository of relevant references is maintained at https://github.com/tding1/Efficient-LLM-Survey.
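
As a brief illustration of the first efficiency dimension named above (a minimal sketch, not taken from the survey itself): neural scaling laws are typically expressed as power-law fits of pre-training loss against model and data scale. One widely used parametric form, with N the parameter count, D the number of training tokens, and E, A, B, α, β empirically fitted constants, is

$$ L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} $$

Minimizing such a fit under a fixed compute budget (roughly C ≈ 6ND for dense Transformers) is what yields the compute-optimal model-size/data trade-offs that work in this area studies.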

Authors (10)
  1. Tianyu Ding (36 papers)
  2. Tianyi Chen (139 papers)
  3. Haidong Zhu (15 papers)
  4. Jiachen Jiang (73 papers)
  5. Yiqi Zhong (19 papers)
  6. Jinxin Zhou (16 papers)
  7. Guangzhi Wang (17 papers)
  8. Zhihui Zhu (79 papers)
  9. Ilya Zharkov (25 papers)
  10. Luming Liang (27 papers)
Citations (17)