Model Compression and Efficient Inference for Large Language Models: A Survey (2402.09748v1)
Abstract: Transformer-based large language models (LLMs) have achieved tremendous success. However, the significant memory and computational costs incurred during inference make it challenging to deploy large models on resource-constrained devices. In this paper, we investigate compression and efficient inference methods for LLMs from an algorithmic perspective. Regarding taxonomy, as with smaller models, compression and acceleration algorithms for LLMs can still be categorized into quantization, pruning, distillation, compact architecture design, and dynamic networks. However, LLMs have two prominent characteristics compared with smaller models: (1) Most compression algorithms require finetuning or even retraining the model after compression, and the most notable aspect of large models is the very high cost of such finetuning or training. Therefore, many algorithms for large models, such as quantization and pruning, have started to explore tuning-free variants. (2) Large models emphasize versatility and generalization rather than performance on a single task. Hence, many algorithms, such as knowledge distillation, focus on preserving versatility and generalization after compression. Since these two characteristics were not very pronounced in early large models, we further distinguish LLMs into medium models and "real" large models. Additionally, we introduce some mature frameworks for efficient inference of large models, which support basic compression or acceleration algorithms and greatly facilitate model deployment for users.
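To make the "tuning-free" quantization idea in the abstract concrete, the sketch below shows the simplest such baseline: per-channel round-to-nearest weight-only quantization applied post-training, with no finetuning. It is a minimal illustration only (the function names and the toy weight matrix are ours, not from the survey); published methods such as GPTQ/OPTQ, AWQ, or SmoothQuant add calibration-driven corrections on top of this basic recipe.

```python
# Minimal sketch of tuning-free, post-training weight-only quantization:
# symmetric per-output-channel round-to-nearest (RTN) to INT8.
# Illustrative only; real LLM quantizers add calibration-aware corrections.
import numpy as np

def quantize_weight_rtn(w: np.ndarray, n_bits: int = 8):
    """Quantize each output channel (row) of w independently."""
    qmax = 2 ** (n_bits - 1) - 1                      # 127 for INT8
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)          # guard against all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 8)).astype(np.float32)    # toy weight matrix
    q, s = quantize_weight_rtn(w)
    err = np.abs(w - dequantize(q, s)).max()
    print(f"max absolute reconstruction error: {err:.4f}")
```

Because no gradient step is taken, such methods avoid the finetuning cost highlighted in the abstract, at the price of larger accuracy loss at very low bit widths.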
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
- A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018.
- A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” in Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., 2022.
- T. Lin, Y. Wang, X. Liu, and X. Qiu, “A survey of transformers,” CoRR, vol. abs/2106.04554, 2021.
- S. Islam, H. Elmekki, A. Elsebai, J. Bentahar, N. Drawel, G. Rjoub, and W. Pedrycz, “A comprehensive survey on applications of transformers for deep learning tasks,” CoRR, vol. abs/2306.07303, 2023.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
- Y. Wang, H. Chen, Y. Tang, T. Guo, K. Han, Y. Nie, X. Wang, H. Hu, Z. Bai, Y. Wang, F. Liu, Z. Liu, J. Guo, S. Zeng, Y. Zhang, Q. Xu, Q. Liu, J. Yao, C. Xu, and D. Tao, “Pangu-π: Enhancing language model architectures via nonlinearity compensation,” CoRR, vol. abs/2312.17276, 2023.
- Z. Zhang, X. Han, H. Zhou, P. Ke, Y. Gu, D. Ye, Y. Qin, Y. Su, H. Ji, J. Guan, F. Qi, X. Wang, Y. Zheng, G. Zeng, H. Cao, S. Chen, D. Li, Z. Sun, Z. Liu, M. Huang, W. Han, J. Tang, J. Li, X. Zhu, and M. Sun, “CPM: A large-scale generative chinese pre-trained language model,” AI Open, vol. 2, pp. 93–99, 2021.
- T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilic, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, J. Tow, A. M. Rush, S. Biderman, A. Webson, P. S. Ammanamanchi, T. Wang, B. Sagot, N. Muennighoff, A. V. del Moral, O. Ruwase, R. Bawden, S. Bekman, A. McMillan-Major, I. Beltagy, H. Nguyen, L. Saulnier, S. Tan, P. O. Suarez, V. Sanh, H. Laurençon, Y. Jernite, J. Launay, M. Mitchell, C. Raffel, A. Gokaslan, A. Simhi, A. Soroa, A. F. Aji, A. Alfassy, A. Rogers, A. K. Nitzav, C. Xu, C. Mou, C. Emezue, C. Klamm, C. Leong, D. van Strien, D. I. Adelani, and et al., “BLOOM: A 176b-parameter open-access multilingual language model,” CoRR, vol. abs/2211.05100, 2022.
- S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. T. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer, “OPT: open pre-trained transformer language models,” CoRR, vol. abs/2205.01068, 2022.
- A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia et al., “Glm-130b: An open bilingual pre-trained model,” arXiv preprint arXiv:2210.02414, 2022.
- A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, and N. Fiedel, “Palm: Scaling language modeling with pathways,” J. Mach. Learn. Res., vol. 24, pp. 240:1–240:113, 2023.
- J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang, B. Hui, L. Ji, M. Li, J. Lin, R. Lin, D. Liu, G. Liu, C. Lu, K. Lu, J. Ma, R. Men, X. Ren, X. Ren, C. Tan, S. Tan, J. Tu, P. Wang, S. Wang, W. Wang, S. Wu, B. Xu, J. Xu, A. Yang, H. Yang, J. Yang, S. Yang, Y. Yao, B. Yu, H. Yuan, Z. Yuan, J. Zhang, X. Zhang, Y. Zhang, Z. Zhang, C. Zhou, J. Zhou, X. Zhou, and T. Zhu, “Qwen technical report,” CoRR, vol. abs/2309.16609, 2023.
- Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu, “ERNIE: enhanced language representation with informative entities,” in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez, Eds. Association for Computational Linguistics, 2019, pp. 1441–1451.
- H. Touvron, L. Martin, K. R. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. M. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. S. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. M. Kloumann, A. V. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom, “Llama 2: Open foundation and fine-tuned chat models,” ArXiv, vol. abs/2307.09288, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:259950998
- J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus, “Emergent abilities of large language models,” Trans. Mach. Learn. Res., vol. 2022, 2022.
- G. Yang, D. Lo, R. Mullins, and Y. Zhao, “Dynamic stashing quantization for efficient transformer training,” arXiv preprint arXiv:2303.05295, 2023.
- Z. Yao, X. Wu, C. Li, S. Youn, and Y. He, “Zeroquant-v2: Exploring post-training quantization in llms from comprehensive study to low rank compensation,” 2023.
- Y. Bondarenko, M. Nagel, and T. Blankevoort, “Understanding and overcoming the challenges of efficient transformer quantization,” arXiv preprint arXiv:2109.12948, 2021.
- O. Zafrir, G. Boudoukh, P. Izsak, and M. Wasserblat, “Q8bert: Quantized 8bit bert,” in 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS). IEEE, 2019, pp. 36–39.
- B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2704–2713.
- S. Shen, Z. Dong, J. Ye, L. Ma, Z. Yao, A. Gholami, M. W. Mahoney, and K. Keutzer, “Q-bert: Hessian based ultra low precision quantization of bert,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, 2020, pp. 8815–8821.
- T. Piao, I. Cho, and U. Kang, “Sensimix: Sensitivity-aware 8-bit index & 1-bit value mixed precision quantization for bert compression,” PloS one, vol. 17, no. 4, p. e0265621, 2022.
- W. Zhang, L. Hou, Y. Yin, L. Shang, X. Chen, X. Jiang, and Q. Liu, “Ternarybert: Distillation-aware ultra-low bit bert,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 509–521.
- H. Qin, Y. Ding, M. Zhang, Q. Yan, A. Liu, Q. Dang, Z. Liu, and X. Liu, “Bibert: Accurate fully binarized bert,” in International Conference on Learning Representations, 2021.
- C. Zhao, T. Hua, Y. Shen, Q. Lou, and H. Jin, “Automatic mixed-precision quantization search of bert,” arXiv preprint arXiv:2112.14938, 2021.
- Z. Zhao, Y. Liu, L. Chen, Q. Liu, R. Ma, and K. Yu, “An investigation on different underlying quantization schemes for pre-trained language models,” in Natural Language Processing and Chinese Computing: 9th CCF International Conference, NLPCC 2020, Zhengzhou, China, October 14–18, 2020, Proceedings, Part I 9. Springer, 2020, pp. 359–371.
- B. Wang, Y. Ren, L. Shang, X. Jiang, and Q. Liu, “Exploring extreme parameter compression for pre-trained language models,” in International Conference on Learning Representations, 2021.
- H. Bai, W. Zhang, L. Hou, L. Shang, J. Jin, X. Jiang, Q. Liu, M. Lyu, and I. King, “Binarybert: Pushing the limit of bert quantization,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 4334–4348.
- A. H. Zadeh, I. Edo, O. M. Awad, and A. Moshovos, “Gobo: Quantizing attention-based nlp models for low latency and energy efficient inference,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 811–824.
- S. Kim, A. Gholami, Z. Yao, M. W. Mahoney, and K. Keutzer, “I-bert: Integer-only bert quantization,” in International conference on machine learning. PMLR, 2021, pp. 5506–5518.
- S. Dai, R. Venkatesan, M. Ren, B. Zimmer, W. Dally, and B. Khailany, “Vs-quant: Per-vector scaled quantization for accurate low-precision neural network inference,” Proceedings of Machine Learning and Systems, vol. 3, pp. 873–884, 2021.
- T. Li, Y. E. Mesbahi, I. Kobyzev, A. Rashid, A. Mahmud, N. Anchuri, H. Hajimolahoseini, Y. Liu, and M. Rezagholizadeh, “A short study on compressing decoder-based language models,” arXiv preprint arXiv:2110.08460, 2021.
- C. Tao, L. Hou, W. Zhang, L. Shang, X. Jiang, Q. Liu, P. Luo, and N. Wong, “Compression of generative pre-trained language models via quantization,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 4821–4836.
- Z. Li, Z. Wang, M. Tan, R. Nallapati, P. Bhatia, A. Arnold, B. Xiang, and D. Roth, “Dq-bart: Efficient sequence-to-sequence model via joint distillation and quantization,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2022, pp. 203–211.
- A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, “A survey of quantization methods for efficient neural network inference,” in Low-Power Computer Vision. Chapman and Hall/CRC, 2022, pp. 291–326.
- C. Xu and J. McAuley, “A survey on model compression and acceleration for pretrained language models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 9, 2023, pp. 10 566–10 575.
- J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, and S. Han, “Awq: Activation-aware weight quantization for llm compression and acceleration,” arXiv preprint arXiv:2306.00978, 2023.
- E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “Optq: Accurate quantization for generative pre-trained transformers,” in The Eleventh International Conference on Learning Representations, 2022.
- T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “LLM.int8(): 8-bit matrix multiplication for transformers at scale,” arXiv preprint arXiv:2208.07339, 2022.
- Z. Yao, R. Yazdani Aminabadi, M. Zhang, X. Wu, C. Li, and Y. He, “Zeroquant: Efficient and affordable post-training quantization for large-scale transformers,” Advances in Neural Information Processing Systems, vol. 35, pp. 27 168–27 183, 2022.
- G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “Smoothquant: Accurate and efficient post-training quantization for large language models,” in International Conference on Machine Learning. PMLR, 2023, pp. 38 087–38 099.
- Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Y. Mehdad, Y. Shi, R. Krishnamoorthi, and V. Chandra, “Llm-qat: Data-free quantization aware training for large language models,” arXiv preprint arXiv:2305.17888, 2023.
- Y. Chai, J. Gkountouras, G. G. Ko, D. Brooks, and G.-Y. Wei, “INT2.1: Towards fine-tunable quantized large language models with error correction through low-rank adaptation,” arXiv preprint arXiv:2306.08162, 2023.
- T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,” arXiv preprint arXiv:2305.14314, 2023.
- S. Kim, C. Hooper, A. Gholami, Z. Dong, X. Li, S. Shen, M. W. Mahoney, and K. Keutzer, “Squeezellm: Dense-and-sparse quantization,” arXiv preprint arXiv:2306.07629, 2023.
- Y. J. Kim, R. Henry, R. Fahim, and H. H. Awadalla, “Finequant: Unlocking efficiency with fine-grained weight-only quantization for llms,” arXiv preprint arXiv:2308.09723, 2023.
- G. Park, B. Park, S. J. Kwon, B. Kim, Y. Lee, and D. Lee, “nuqmm: Quantized matmul for efficient inference of large-scale generative language models,” arXiv preprint arXiv:2206.09557, 2022.
- M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in European conference on computer vision. Springer, 2016, pp. 525–542.
- T. Dettmers, M. Lewis, S. Shleifer, and L. Zettlemoyer, “8-bit optimizers via block-wise quantization,” in International Conference on Learning Representations, 2021.
- J. H. Lee, J. Kim, S. J. Kwon, and D. Lee, “Flexround: Learnable rounding based on element-wise division for post-training quantization,” arXiv preprint arXiv:2306.00317, 2023.
- J. Chee, Y. Cai, V. Kuleshov, and C. De Sa, “Quip: 2-bit quantization of large language models with guarantees,” arXiv preprint arXiv:2307.13304, 2023.
- E. Frantar and D. Alistarh, “Optimal brain compression: A framework for accurate post-training quantization and pruning,” Advances in Neural Information Processing Systems, vol. 35, pp. 4475–4488, 2022.
- C. Lee, J. Jin, T. Kim, H. Kim, and E. Park, “Owq: Lessons learned from activation outliers for weight quantization in large language models,” arXiv preprint arXiv:2306.02272, 2023.
- W. Cheng, W. Zhang, H. Shen, Y. Cai, X. He, and K. Lv, “Optimize weight rounding via signed gradient descent for the quantization of llms,” arXiv preprint arXiv:2309.05516, 2023.
- M. Nagel, R. A. Amjad, M. Van Baalen, C. Louizos, and T. Blankevoort, “Up or down? adaptive rounding for post-training quantization,” in International Conference on Machine Learning. PMLR, 2020, pp. 7197–7206.
- S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” in International Conference on Learning Representations, 2016.
- Y. Li, Y. Yu, C. Liang, P. He, N. Karampatziakis, W. Chen, and T. Zhao, “Loftq: Lora-fine-tuning-aware quantization for large language models,” arXiv preprint arXiv:2310.08659, 2023.
- Z. Yuan, L. Niu, J. Liu, W. Liu, X. Wang, Y. Shang, G. Sun, Q. Wu, J. Wu, and B. Wu, “Rptq: Reorder-based post-training quantization for large language models,” arXiv preprint arXiv:2304.01089, 2023.
- Y. Zhang, L. Zhao, S. Cao, W. Wang, T. Cao, F. Yang, M. Yang, S. Zhang, and N. Xu, “Integer or floating point? new outlooks for low-bit quantization on large language models,” arXiv preprint arXiv:2305.12356, 2023.
- X. Wu, Z. Yao, and Y. He, “Zeroquant-fp: A leap forward in llms post-training w4a8 quantization using floating-point formats,” arXiv preprint arXiv:2307.09782, 2023.
- X. Wei, Y. Zhang, X. Zhang, R. Gong, S. Zhang, Q. Zhang, F. Yu, and X. Liu, “Outlier suppression: Pushing the limit of low-bit transformer language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 17 402–17 414, 2022.
- X. Wei, Y. Zhang, Y. Li, X. Zhang, R. Gong, J. Guo, and X. Liu, “Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling,” arXiv preprint arXiv:2304.09145, 2023.
- Q. Li, Y. Zhang, L. Li, P. Yao, B. Zhang, X. Chu, Y. Sun, L. Du, and Y. Xie, “Fptq: Fine-grained post-training quantization for large language models,” arXiv preprint arXiv:2308.15987, 2023.
- W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y. Qiao, and P. Luo, “Omniquant: Omnidirectionally calibrated quantization for large language models,” arXiv preprint arXiv:2308.13137, 2023.
- J. Liu, R. Gong, X. Wei, Z. Dong, J. Cai, and B. Zhuang, “Qllm: Accurate and efficient low-bitwidth quantization for large language models,” arXiv preprint arXiv:2310.08041, 2023.
- E. Yvinec, A. Dapogny, M. Cord, and K. Bailly, “Rex: Data-free residual quantization error expansion,” arXiv preprint arXiv:2203.14645, 2022.
- C. Guo, J. Tang, W. Hu, J. Leng, C. Zhang, F. Yang, Y. Liu, M. Guo, and Y. Zhu, “Olive: Accelerating large language models via hardware-friendly outlier-victim pair quantization,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–15.
- M. Kim, S. Lee, S. Hong, D.-S. Chang, and J. Choi, “Understanding and improving knowledge distillation for quantization-aware training of large transformer encoders,” arXiv preprint arXiv:2211.11014, 2022.
- J. O. Neill and S. Dutta, “Self-distilled quantization: Achieving high compression rates in transformer-based language models,” arXiv preprint arXiv:2307.05972, 2023.
- W.-Y. Hua, B. Williams, and D. Shamsi, “Lacos-bloom: Low-rank adaptation with contrastive objective on 8 bits siamese-bloom,” arXiv preprint arXiv:2305.06404, 2023.
- A. Kaushal, T. Vaidhya, and I. Rish, “Lord: Low rank decomposition of monolingual code llms for one-shot compression,” arXiv preprint arXiv:2309.14021, 2023.
- Y. Xu, L. Xie, X. Gu, X. Chen, H. Chang, H. Zhang, Z. Chen, X. Zhang, and Q. Tian, “Qa-lora: Quantization-aware low-rank adaptation of large language models,” arXiv preprint arXiv:2309.14717, 2023.
- S. J. Kwon, J. Kim, J. Bae, K. M. Yoo, J.-H. Kim, B. Park, B. Kim, J.-W. Ha, N. Sung, and D. Lee, “Alphatuning: Quantization-aware parameter-efficient adaptation of large-scale pre-trained language models,” arXiv preprint arXiv:2210.03858, 2022.
- J. Kim, J. H. Lee, S. Kim, J. Park, K. M. Yoo, S. J. Kwon, and D. Lee, “Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization,” arXiv preprint arXiv:2305.14152, 2023.
- M. Park, J. You, M. Nagel, and S. Chang, “Quadapter: Adapter for gpt-2 quantization,” in Findings of the Association for Computational Linguistics: EMNLP 2022, 2022, pp. 2510–2517.
- Z. Xu, Z. Liu, B. Chen, Y. Tang, J. Wang, K. Zhou, X. Hu, and A. Shrivastava, “Compress, then prompt: Improving accuracy-efficiency trade-off of llm inference with transferable prompt,” arXiv preprint arXiv:2305.11186, 2023.
- H. Shen, H. Meng, B. Dong, Z. Wang, O. Zafrir, Y. Ding, Y. Luo, H. Chang, Q. Gao, Z. Wang et al., “An efficient sparse inference software accelerator for transformer-based language models on cpus,” arXiv preprint arXiv:2306.16601, 2023.
- T. Pegolotti, E. Frantar, D. Alistarh, and M. Püschel, “Generating efficient kernels for quantized inference on large language models,” in Workshop on Efficient Systems for Foundation Models@ ICML2023, 2023.
- K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han, “Haq: Hardware-aware automated quantization with mixed precision,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8612–8620.
- C. Yu, T. Chen, and Z. Gan, “Boost transformer-based language models with gpu-friendly sparsity and quantization,” in Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 218–235.
- Z. Lin, G. Qu, Q. Chen, X. Chen, Z. Chen, and K. Huang, “Pushing large language models to the 6g edge: Vision, challenges, and opportunities,” arXiv preprint arXiv:2309.16739, 2023.
- M. W. U. Rahman, M. M. Abrar, H. G. Copening, S. Hariri, S. Shao, P. Satam, and S. Salehi, “Quantized transformer language model implementations on edge devices,” arXiv preprint arXiv:2310.03971, 2023.
- E. Kurtic, D. Kuznedelev, E. Frantar, M. Goin, and D. Alistarh, “Sparse finetuning for inference acceleration of large language models,” arXiv preprint arXiv:2310.06927, 2023.
- B. Isik, H. Kumbong, W. Ning, X. Yao, S. Koyejo, and C. Zhang, “Gpt-zip: Deep compression of finetuned large language models,” in Workshop on Efficient Systems for Foundation Models@ ICML2023, 2023.
- X. Wei, S. Gonugondla, W. Ahmad, S. Wang, B. Ray, H. Qian, X. Li, V. Kumar, Z. Wang, Y. Tian et al., “Greener yet powerful: Taming large code generation models with quantization,” arXiv preprint arXiv:2303.05378, 2023.
- T. Hu, C. Meinel, and H. Yang, “Empirical evaluation of post-training quantization methods for language tasks,” arXiv preprint arXiv:2210.16621, 2022.
- T. Dettmers and L. Zettlemoyer, “The case for 4-bit precision: k-bit inference scaling laws,” in International Conference on Machine Learning. PMLR, 2023, pp. 7750–7774.
- P. Liu, Z. Liu, Z.-F. Gao, D. Gao, W. X. Zhao, Y. Li, B. Ding, and J.-R. Wen, “Do emergent abilities exist in quantized large language models: An empirical study,” arXiv preprint arXiv:2307.08072, 2023.
- Y. Bondarenko, M. Nagel, and T. Blankevoort, “Quantizable transformers: Removing outliers by helping attention heads do nothing,” arXiv preprint arXiv:2306.12929, 2023.
- A. Ahmadian, S. Dash, H. Chen, B. Venkitesh, S. Gou, P. Blunsom, A. Üstün, and S. Hooker, “Intriguing properties of quantization at scale,” arXiv preprint arXiv:2305.19268, 2023.
- W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” Advances in neural information processing systems, vol. 29, 2016.
- S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” Advances in neural information processing systems, vol. 28, 2015.
- S. Narang, E. Undersander, and G. Diamos, “Block-sparse recurrent neural networks,” arXiv preprint arXiv:1711.02782, 2017.
- M. A. Gordon, K. Duh, and N. Andrews, “Compressing bert: Studying the effects of weight pruning on transfer learning,” arXiv preprint arXiv:2002.08307, 2020.
- T. Chen, J. Frankle, S. Chang, S. Liu, Y. Zhang, Z. Wang, and M. Carbin, “The lottery ticket hypothesis for pre-trained bert networks,” Advances in neural information processing systems, vol. 33, pp. 15 834–15 846, 2020.
- S. Prasanna, A. Rogers, and A. Rumshisky, “When bert plays the lottery, all tickets are winning,” arXiv preprint arXiv:2005.00561, 2020.
- A. K. Jaiswal, S. Liu, T. Chen, Y. Ding, and Z. Wang, “Instant soup: Cheap pruning ensembles in a single pass can draw lottery tickets from large models,” in International Conference on Machine Learning. PMLR, 2023, pp. 14 691–14 701.
- O. Zafrir, A. Larey, G. Boudoukh, H. Shen, and M. Wasserblat, “Prune once for all: Sparse pre-trained language models,” arXiv preprint arXiv:2111.05754, 2021.
- M. Zhu and S. Gupta, “To prune, or not to prune: exploring the efficacy of pruning for model compression,” arXiv preprint arXiv:1710.01878, 2017.
- E. Kurtic and D. Alistarh, “Gmp*: Well-tuned global magnitude pruning can outperform most bert-pruning methods,” arXiv preprint arXiv:2210.06384, 2022.
- L. Yin, S. Liu, A. Jaiswal, S. Kundu, and Z. Wang, “Junk dna hypothesis: A task-centric angle of llm pre-trained weights through sparsity,” arXiv preprint arXiv:2310.02277, 2023.
- V. Sanh, T. Wolf, and A. Rush, “Movement pruning: Adaptive sparsity by fine-tuning,” Advances in Neural Information Processing Systems, vol. 33, pp. 20 378–20 389, 2020.
- T. Jiang, D. Wang, F. Zhuang, R. Xie, and F. Xia, “Pruning pre-trained language models without fine-tuning,” arXiv preprint arXiv:2210.06210, 2022.
- S. Ren and K. Q. Zhu, “Low-rank prune-and-factorize for language model compression,” arXiv preprint arXiv:2306.14152, 2023.
- Q. Zhang, S. Zuo, C. Liang, A. Bukharin, P. He, W. Chen, and T. Zhao, “Platon: Pruning large transformer models with upper confidence bound of weight importance,” in International Conference on Machine Learning. PMLR, 2022, pp. 26 809–26 823.
- Y. Li, F. Luo, C. Tan, M. Wang, S. Huang, S. Li, and J. Bai, “Parameter-efficient sparsity for large language models fine-tuning,” arXiv preprint arXiv:2205.11005, 2022.
- M. Zhang, C. Shen, Z. Yang, L. Ou, X. Yu, B. Zhuang et al., “Pruning meets low-rank parameter-efficient fine-tuning,” arXiv preprint arXiv:2305.18403, 2023.
- Y. LeCun, J. Denker, and S. Solla, “Optimal brain damage,” Advances in neural information processing systems, vol. 2, 1989.
- B. Hassibi, D. G. Stork, and G. J. Wolff, “Optimal brain surgeon and general network pruning,” in IEEE international conference on neural networks. IEEE, 1993, pp. 293–299.
- E. Kurtic, D. Campos, T. Nguyen, E. Frantar, M. Kurtz, B. Fineran, M. Goin, and D. Alistarh, “The optimal bert surgeon: Scalable and accurate second-order pruning for large language models,” arXiv preprint arXiv:2203.07259, 2022.
- C. Louizos, M. Welling, and D. P. Kingma, “Learning sparse neural networks through L_0 regularization,” arXiv preprint arXiv:1712.01312, 2017.
- F.-M. Guo, S. Liu, F. S. Mungall, X. Lin, and Y. Wang, “Reweighted proximal pruning for large-scale language representation,” arXiv preprint arXiv:1909.12486, 2019.
- A. Mishra, J. A. Latorre, J. Pool, D. Stosic, D. Stosic, G. Venkatesh, C. Yu, and P. Micikevicius, “Accelerating sparse deep neural networks,” arXiv preprint arXiv:2104.08378, 2021.
- A. Zhou, Y. Ma, J. Zhu, J. Liu, Z. Zhang, K. Yuan, W. Sun, and H. Li, “Learning N:M fine-grained structured sparse neural networks from scratch,” arXiv preprint arXiv:2102.04010, 2021.
- O. Nordström, “Unstructured pruning of pre-trained language models tuned for sentiment classification,” 2022.
- B. Cui, Y. Li, and Z. Zhang, “Joint structured pruning and dense knowledge distillation for efficient transformer model compression,” Neurocomputing, vol. 458, pp. 56–69, 2021.
- B. Li, Z. Kong, T. Zhang, J. Li, Z. Li, H. Liu, and C. Ding, “Efficient transformer-based large scale language representations using hardware-friendly block structured pruning,” arXiv preprint arXiv:2009.08065, 2020.
- P. Michel, O. Levy, and G. Neubig, “Are sixteen heads really better than one?” Advances in neural information processing systems, vol. 32, 2019.
- J. Li, R. Cotterell, and M. Sachan, “Differentiable subset pruning of transformer heads,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 1442–1459, 2021.
- Z. Yang, Y. Cui, X. Yao, and S. Wang, “Gradient-based intra-attention pruning on pre-trained language models,” arXiv preprint arXiv:2212.07634, 2022.
- G. Wang, Q. Cao, J. Yang, and Y. Sun, “Task-oriented memory-efficient pruning-adapter,” arXiv preprint arXiv:2303.14704, 2023.
- C. J. Maddison, A. Mnih, and Y. W. Teh, “The concrete distribution: A continuous relaxation of discrete random variables,” arXiv preprint arXiv:1611.00712, 2016.
- E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov, “Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned,” arXiv preprint arXiv:1905.09418, 2019.
- F. Lagunas, E. Charlaix, V. Sanh, and A. M. Rush, “Block pruning for faster transformers,” arXiv preprint arXiv:2109.04838, 2021.
- R. Xu, F. Luo, C. Wang, B. Chang, J. Huang, S. Huang, and F. Huang, “From dense to sparse: Contrastive pruning for better pre-trained language model compression,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 10, 2022, pp. 11 547–11 555.
- Z. Liu, F. Li, G. Li, and J. Cheng, “Ebert: Efficient bert inference with dynamic structured pruning,” in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 4814–4823.
- A. Khetan and Z. Karnin, “schubert: Optimizing elements of bert,” arXiv preprint arXiv:2005.06628, 2020.
- E. Kurtic, E. Frantar, and D. Alistarh, “Ziplm: Hardware-aware structured pruning of language models,” arXiv preprint arXiv:2302.04089, 2023.
- A. Klein, J. Golebiowski, X. Ma, V. Perrone, and C. Archambeau, “Structural pruning of large language models via neural architecture search,” in AutoML Conference 2023 (Workshop), 2023.
- S. Park, H. Choi, and U. Kang, “Knowledge-preserving pruning for pre-trained language models without retraining,” arXiv preprint arXiv:2308.03449, 2023.
- Y. Li, Y. Yu, Q. Zhang, C. Liang, P. He, W. Chen, and T. Zhao, “Losparse: Structured compression of large language models based on low-rank and sparse approximation,” arXiv preprint arXiv:2306.11222, 2023.
- M. Santacroce, Z. Wen, Y. Shen, and Y. Li, “What matters in the structured pruning of generative language models?” arXiv preprint arXiv:2302.03773, 2023.
- N. Yang, Y. Jang, H. Lee, S. Jeong, and K. Jung, “Task-specific compression for multi-task language models using attribution-based pruning,” in Findings of the Association for Computational Linguistics: EACL 2023, 2023, pp. 582–592.
- J. McCarley, R. Chakravarti, and A. Sil, “Structured pruning of a bert-based question answering model,” arXiv preprint arXiv:1910.06360, 2019.
- Z. Wang, J. Wohlwend, and T. Lei, “Structured pruning of large language models,” arXiv preprint arXiv:1910.04732, 2019.
- M. Xia, Z. Zhong, and D. Chen, “Structured pruning learns compact and accurate models,” arXiv preprint arXiv:2204.00408, 2022.
- C. Tao, L. Hou, H. Bai, J. Wei, X. Jiang, Q. Liu, P. Luo, and N. Wong, “Structured pruning for efficient generative pre-trained language models,” in Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 10 880–10 895.
- A. Fan, E. Grave, and A. Joulin, “Reducing transformer depth on demand with structured dropout,” arXiv preprint arXiv:1909.11556, 2019.
- M. Zhang and Y. He, “Accelerating training of transformer-based language models with progressive layer dropping,” Advances in Neural Information Processing Systems, vol. 33, pp. 14 011–14 023, 2020.
- H. Sajjad, F. Dalvi, N. Durrani, and P. Nakov, “On the effect of dropping layers of pre-trained transformer models,” Computer Speech & Language, vol. 77, p. 101429, 2023.
- S. Goyal, A. R. Choudhury, S. Raje, V. Chakaravarthy, Y. Sabharwal, and A. Verma, “Power-bert: Accelerating bert inference via progressive word-vector elimination,” in International Conference on Machine Learning. PMLR, 2020, pp. 3690–3699.
- S. Kim, S. Shen, D. Thorsley, A. Gholami, W. Kwon, J. Hassoun, and K. Keutzer, “Learned token pruning for transformers,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 784–794.
- H. Wang, Z. Zhang, and S. Han, “Spatten: Efficient sparse attention architecture with cascade token and head pruning,” in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2021, pp. 97–110.
- Z. Lin, J. Z. Liu, Z. Yang, N. Hua, and D. Roth, “Pruning redundant mappings in transformer models via spectral-normalized identity prior,” arXiv preprint arXiv:2010.01791, 2020.
- M. Sun, Z. Liu, A. Bair, and J. Z. Kolter, “A simple and effective pruning approach for large language models,” arXiv preprint arXiv:2306.11695, 2023.
- Y. Zhang, H. Bai, H. Lin, J. Zhao, L. Hou, and C. V. Cannistraci, “An efficient plug-and-play post-training pruning strategy in large language models,” 2023.
- Y. Li, L. Niu, X. Zhang, K. Liu, J. Zhu, and Z. Kang, “E-sparse: Boosting the large language model inference through entropy-based N:M sparsity,” arXiv preprint arXiv:2310.15929, 2023.
- E. Frantar and D. Alistarh, “Sparsegpt: Massive language models can be accurately pruned in one-shot,” 2023.
- H. Shao, B. Liu, and Y. Qian, “One-shot sensitivity-aware mixed sparsity pruning for large language models,” arXiv preprint arXiv:2310.09499, 2023.
- R. J. Das, L. Ma, and Z. Shen, “Beyond size: How gradients shape pruning decisions in large language models,” arXiv preprint arXiv:2311.04902, 2023.
- Anonymous, “Pushing gradient towards zero: A novel pruning method for large language models,” 2024. [Online]. Available: https://openreview.net/forum?id=IU4L7wiwxw
- Y. An, X. Zhao, T. Yu, M. Tang, and J. Wang, “Fluctuation-based adaptive structured pruning for large language models,” arXiv preprint arXiv:2312.11983, 2023.
- S. Ashkboos, M. L. Croci, M. G. d. Nascimento, T. Hoefler, and J. Hensman, “Slicegpt: Compress large language models by deleting rows and columns,” arXiv preprint arXiv:2401.15024, 2024.
- X. Ma, G. Fang, and X. Wang, “Llm-pruner: On the structural pruning of large language models,” arXiv preprint arXiv:2305.11627, 2023.
- T. Chen, T. Ding, B. Yadav, I. Zharkov, and L. Liang, “Lorashear: Efficient large language model structured pruning and knowledge recovery,” arXiv preprint arXiv:2310.18356, 2023.
- B. Zhao, H. Hajishirzi, and Q. Cao, “Apt: Adaptive pruning and tuning pretrained language models for efficient training and inference,” arXiv preprint arXiv:2401.12200, 2024.
- M. Xia, T. Gao, Z. Zeng, and D. Chen, “Sheared llama: Accelerating language model pre-training via structured pruning,” arXiv preprint arXiv:2310.06694, 2023.
- S. Guo, J. Xu, L. L. Zhang, and M. Yang, “Compresso: Structured pruning with collaborative prompting learns compact large language models,” arXiv preprint arXiv:2310.05015, 2023.
- T. F. van der Ouderaa, M. Nagel, M. van Baalen, Y. M. Asano, and T. Blankevoort, “The llm surgeon,” arXiv preprint arXiv:2312.17244, 2023.
- M. Williams and N. Aletras, “How does calibration data affect the post-training pruning and quantization of large language models?” arXiv preprint arXiv:2311.09755, 2023.
- M. Zimmer, M. Andoni, C. Spiegel, and S. Pokutta, “Perp: Rethinking the prune-retrain paradigm in the era of llms,” arXiv preprint arXiv:2312.15230, 2023.
- S. Gholami and M. Omar, “Can pruning make large language models more efficient?” arXiv preprint arXiv:2310.04573, 2023.
- T. Valicenti, J. Vidal, and R. Patnaik, “Mini-gpts: Efficient large language models through contextual pruning,” arXiv preprint arXiv:2312.12682, 2023.
- Y. Ji, Y. Cao, and J. Liu, “Pruning large language models via accuracy predictor,” arXiv preprint arXiv:2309.09507, 2023.
- Anonymous, “Outlier weighed layerwise sparsity (OWL): A missing secret sauce for pruning LLMs to high sparsity,” in Submitted to The Twelfth International Conference on Learning Representations, 2023, under review. [Online]. Available: https://openreview.net/forum?id=pOBvr1PxFd
- ——, “BESA: Pruning large language models with blockwise parameter-efficient sparsity allocation,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=gC6JTEU3jl
- A. Syed, P. H. Guo, and V. Sundarapandiyan, “Prune and tune: Improving efficient pruning techniques for massive language models,” 2023.
- Y. Zhang, L. Zhao, M. Lin, Y. Sun, Y. Yao, X. Han, J. Tanner, S. Liu, and R. Ji, “Dynamic sparse no training: Training-free fine-tuning for sparse llms,” arXiv preprint arXiv:2310.08915, 2023.
- V. Boža, “Fast and optimal weight update for pruned large language models,” 2024.
- H. Xia, Z. Zheng, Y. Li, D. Zhuang, Z. Zhou, X. Qiu, Y. Li, W. Lin, and S. L. Song, “Flash-llm: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity,” arXiv preprint arXiv:2309.10285, 2023.
- V. Srinivasan, D. Gandhi, U. Thakker, and R. Prabhakar, “Training large language models efficiently with sparsity and dataflow,” arXiv preprint arXiv:2304.05511, 2023.
- G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
- Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep mutual learning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4320–4328.
- A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.
- J. Yim, D. Joo, J.-H. Bae, and J. Kim, “A gift from knowledge distillation: Fast optimization, network minimization and transfer learning,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7130–7138, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:206596723
- R. Tang, Y. Lu, L. Liu, L. Mou, O. Vechtomova, and J. Lin, “Distilling task-specific knowledge from bert into simple neural networks,” arXiv preprint arXiv:1903.12136, 2019.
- S. Sun, Y. Cheng, Z. Gan, and J. Liu, “Patient knowledge distillation for bert model compression,” arXiv preprint arXiv:1908.09355, 2019.
- L. Hou, Z. Huang, L. Shang, X. Jiang, X. Chen, and Q. Liu, “Dynabert: Dynamic bert with adaptive width and depth,” Advances in Neural Information Processing Systems, vol. 33, pp. 9782–9793, 2020.
- W. Zhou, C. Xu, and J. McAuley, “Bert learns to teach: Knowledge distillation with meta learning,” arXiv preprint arXiv:2106.04570, 2021.
- S. Wu, H. Chen, X. Quan, Q. Wang, and R. Wang, “Ad-kd: Attribution-driven knowledge distillation for language model compression,” arXiv preprint arXiv:2305.10010, 2023.
- D. Chen, Y. Li, M. Qiu, Z. Wang, B. Li, B. Ding, H. Deng, J. Huang, W. Lin, and J. Zhou, “Adabert: Task-adaptive bert compression with differentiable neural architecture search,” arXiv preprint arXiv:2001.04246, 2020.
- K. J. Liang, W. Hao, D. Shen, Y. Zhou, W. Chen, C. Chen, and L. Carin, “Mixkd: Towards efficient distillation of large-scale language models,” arXiv preprint arXiv:2011.00593, 2020.
- H. Pan, C. Wang, M. Qiu, Y. Zhang, Y. Li, and J. Huang, “Meta-kd: A meta knowledge distillation framework for language model compression across domains,” arXiv preprint arXiv:2012.01266, 2020.
- J. Zhang, A. Muhamed, A. Anantharaman, G. Wang, C. Chen, K. Zhong, Q. Cui, Y. Xu, B. Zeng, T. M. Chilimbi, and Y. Chen, “Reaugkd: Retrieval-augmented knowledge distillation for pre-trained language models,” in Annual Meeting of the Association for Computational Linguistics, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:259370551
- V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter,” arXiv preprint arXiv:1910.01108, 2019.
- W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou, “Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers,” Advances in Neural Information Processing Systems, vol. 33, pp. 5776–5788, 2020.
- Z. Sun, H. Yu, X. Song, R. Liu, Y. Yang, and D. Zhou, “Mobilebert: a compact task-agnostic bert for resource-limited devices,” arXiv preprint arXiv:2004.02984, 2020.
- C. Liang, H. Jiang, Z. Li, X. Tang, B. Yin, and T. Zhao, “Homodistil: Homotopic task-agnostic distillation of pre-trained transformers,” arXiv preprint arXiv:2302.09632, 2023.
- X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu, “Tinybert: Distilling bert for natural language understanding,” arXiv preprint arXiv:1909.10351, 2019.
- C. Liang, S. Zuo, Q. Zhang, P. He, W. Chen, and T. Zhao, “Less is more: Task-aware layer-wise distillation for language model compression,” in International Conference on Machine Learning. PMLR, 2023, pp. 20 852–20 867.
- I. Turc, M.-W. Chang, K. Lee, and K. Toutanova, “Well-read students learn better: On the importance of pre-training compact models,” arXiv preprint arXiv:1908.08962, 2019.
- H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” arXiv preprint arXiv:1710.09412, 2017.
- S. Dasgupta, T. Cohn, and T. Baldwin, “Cost-effective distillation of large language models,” in Annual Meeting of the Association for Computational Linguistics, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:259858962
- S. I. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, and H. Ghasemzadeh, “Improved knowledge distillation via teacher assistant,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 04, 2020, pp. 5191–5198.
- Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, “Self-instruct: Aligning language model with self generated instructions,” arXiv preprint arXiv:2212.10560, 2022.
- B. Peng, C. Li, P. He, M. Galley, and J. Gao, “Instruction tuning with gpt-4,” arXiv preprint arXiv:2304.03277, 2023.
- M. Wu, A. Waheed, C. Zhang, M. Abdul-Mageed, and A. F. Aji, “Lamini-lm: A diverse herd of distilled models from large-scale instructions,” arXiv preprint arXiv:2304.14402, 2023.
- Y. Jiang, C. Chan, M. Chen, and W. Wang, “Lion: Adversarial distillation of closed-source large language model,” arXiv preprint arXiv:2305.12870, 2023.
- R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford alpaca: An instruction-following llama model,” https://github.com/tatsu-lab/stanford_alpaca, 2023.
- W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing, “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” March 2023. [Online]. Available: https://lmsys.org/blog/2023-03-30-vicuna/
- Y. Anand, Z. Nussbaum, B. Duderstadt, B. Schmidt, and A. Mulyar, “Gpt4all: Training an assistant-style chatbot with large scale data distillation from gpt-3.5-turbo,” GitHub, 2023.
- H. Chen, A. Saha, S. Hoi, and S. Joty, “Personalised distillation: Empowering open-sourced llms with adaptive learning for code generation,” arXiv preprint arXiv:2310.18628, 2023.
- W. Zhou, S. Zhang, Y. Gu, M. Chen, and H. Poon, “Universalner: Targeted distillation from large language models for open named entity recognition,” arXiv preprint arXiv:2308.03279, 2023.
- S. Li, J. Chen, Y. Shen, Z. Chen, X. Zhang, Z. Li, H. Wang, J. Qian, B. Peng, Y. Mao et al., “Explanations from large language models make small reasoners better,” arXiv preprint arXiv:2210.06726, 2022.
- L. C. Magister, J. Mallinson, J. Adamek, E. Malmi, and A. Severyn, “Teaching small language models to reason,” arXiv preprint arXiv:2212.08410, 2022.
- C.-Y. Hsieh, C.-L. Li, C.-K. Yeh, H. Nakhost, Y. Fujii, A. Ratner, R. Krishna, C.-Y. Lee, and T. Pfister, “Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes,” arXiv preprint arXiv:2305.02301, 2023.
- S. Wadhwa, S. Amir, and B. C. Wallace, “Revisiting relation extraction in the era of large language models,” arXiv preprint arXiv:2305.05003, 2023.
- N. Ho, L. Schmid, and S.-Y. Yun, “Large language models are reasoning teachers,” arXiv preprint arXiv:2212.10071, 2022.
- K. Shridhar, A. Stolfo, and M. Sachan, “Distilling reasoning capabilities into smaller language models,” in Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 7059–7073.
- P. Wang, Z. Wang, Z. Li, Y. Gao, B. Yin, and X. Ren, “Scott: Self-consistent chain-of-thought distillation,” arXiv preprint arXiv:2305.01879, 2023.
- M. Kang, S. Lee, J. Baek, K. Kawaguchi, and S. J. Hwang, “Knowledge-augmented reasoning distillation for small language models in knowledge-intensive tasks,” arXiv preprint arXiv:2305.18395, 2023.
- Z. Jie and W. Lu, “Leveraging training data in few-shot prompting for numerical reasoning,” arXiv preprint arXiv:2305.18170, 2023.
- X. Zhu, B. Qi, K. Zhang, X. Long, and B. Zhou, “Pad: Program-aided distillation specializes large models in reasoning,” arXiv preprint arXiv:2305.13888, 2023.
- L. H. Li, J. Hessel, Y. Yu, X. Ren, K.-W. Chang, and Y. Choi, “Symbolic chain-of-thought distillation: Small models can also ‘think’ step-by-step,” arXiv preprint arXiv:2306.14050, 2023.
- H. Chen, S. Wu, X. Quan, R. Wang, M. Yan, and J. Zhang, “Mcc-kd: Multi-cot consistent knowledge distillation,” arXiv preprint arXiv:2310.14747, 2023.
- H. Chae, Y. Song, K. T.-i. Ong, T. Kwon, M. Kim, Y. Yu, D. Lee, D. Kang, and J. Yeo, “Dialogue chain-of-thought distillation for commonsense-aware conversational agents,” arXiv preprint arXiv:2310.09343, 2023.
- Z. Wang, S. Huang, Y. Liu, J. Wang, M. Song, Z. Zhang, H. Huang, F. Wei, W. Deng, F. Sun et al., “Democratizing reasoning ability: Tailored learning from large language model,” arXiv preprint arXiv:2310.13332, 2023.
- Y. Ma, H. Jiang, and C. Fan, “Sci-cot: Leveraging large language models for enhanced knowledge distillation in small models for scientific qa,” arXiv preprint arXiv:2308.04679, 2023.
- Y. Fu, H. Peng, L. Ou, A. Sabharwal, and T. Khot, “Specializing smaller language models towards multi-step reasoning,” arXiv preprint arXiv:2301.12726, 2023.
- Y. Huang, Y. Chen, Z. Yu, and K. McKeown, “In-context learning distillation: Transferring few-shot learning ability of pre-trained language models,” arXiv preprint arXiv:2212.10670, 2022.
- L. Wang, N. Yang, and F. Wei, “Learning to retrieve in-context examples for large language models,” arXiv preprint arXiv:2307.07164, 2023.
- P. West, C. Bhagavatula, J. Hessel, J. D. Hwang, L. Jiang, R. L. Bras, X. Lu, S. Welleck, and Y. Choi, “Symbolic knowledge distillation: from general language models to commonsense models,” arXiv preprint arXiv:2110.07178, 2021.
- Z. Chen, Q. Gao, A. Bosselut, A. Sabharwal, and K. Richardson, “Disco: distilling counterfactuals with large language models,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 5514–5528.
- Y. Gu, S. Zhang, N. Usuyama, Y. Woldesenbet, C. Wong, P. Sanapathi, M. Wei, N. Valluri, E. Strandberg, T. Naumann et al., “Distilling large language models for biomedical knowledge extraction: A case study on adverse drug events,” arXiv preprint arXiv:2307.06439, 2023.
- G. Sahu, O. Vechtomova, D. Bahdanau, and I. H. Laradji, “Promptmix: A class boundary augmentation method for large language model distillation,” arXiv preprint arXiv:2310.14192, 2023.
- A. Gudibande, E. Wallace, C. Snell, X. Geng, H. Liu, P. Abbeel, S. Levine, and D. Song, “The false promise of imitating proprietary llms,” arXiv preprint arXiv:2305.15717, 2023.
- Y. Gu, L. Dong, F. Wei, and M. Huang, “Knowledge distillation of large language models,” arXiv preprint arXiv:2306.08543, 2023.
- R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. Ramos, M. Geist, and O. Bachem, “Generalized knowledge distillation for auto-regressive language models,” arXiv preprint arXiv:2306.13649, 2023.
- S. Padmanabhan, Y. Onoe, M. J. Zhang, G. Durrett, and E. Choi, “Propagating knowledge updates to lms through distillation,” arXiv preprint arXiv:2306.09306, 2023.
- M. Kim, S. Lee, J. Lee, S. Hong, D.-S. Chang, W. Sung, and J. Choi, “Token-scaled logit distillation for ternary weight generative language models,” arXiv preprint arXiv:2308.06744, 2023.
- C. Zhang, D. Song, Z. Ye, and Y. Gao, “Towards the law of capacity gap in distilling language models,” arXiv preprint arXiv:2311.07052, 2023.
- Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-xl: Attentive language models beyond a fixed-length context,” arXiv preprint arXiv:1901.02860, 2019.
- R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating long sequences with sparse transformers,” arXiv preprint arXiv:1904.10509, 2019.
- S. Sukhbaatar, E. Grave, P. Bojanowski, and A. Joulin, “Adaptive attention span in transformers,” arXiv preprint arXiv:1905.07799, 2019.
- G. M. Correia, V. Niculae, and A. F. Martins, “Adaptively sparse transformers,” arXiv preprint arXiv:1909.00015, 2019.
- Z. Ye, Q. Guo, Q. Gan, X. Qiu, and Z. Zhang, “Bp-transformer: Modelling long-range context via binary partitioning,” arXiv preprint arXiv:1911.04070, 2019.
- J. Qiu, H. Ma, O. Levy, S. W.-t. Yih, S. Wang, and J. Tang, “Blockwise self-attention for long document understanding,” arXiv preprint arXiv:1911.02972, 2019.
- I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-document transformer,” arXiv preprint arXiv:2004.05150, 2020.
- N. Kitaev, Ł. Kaiser, and A. Levskaya, “Reformer: The efficient transformer,” arXiv preprint arXiv:2001.04451, 2020.
- M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang et al., “Big bird: Transformers for longer sequences,” Advances in neural information processing systems, vol. 33, pp. 17 283–17 297, 2020.
- Y. Tay, D. Bahri, L. Yang, D. Metzler, and D.-C. Juan, “Sparse sinkhorn attention,” in International Conference on Machine Learning. PMLR, 2020, pp. 9438–9447.
- X. Li, Y. Meng, M. Zhou, Q. Han, F. Wu, and J. Li, “Sac: Accelerating and structuring self-attention via sparse adaptive connection,” Advances in Neural Information Processing Systems, vol. 33, pp. 16 997–17 008, 2020.
- H. Ren, H. Dai, Z. Dai, M. Yang, J. Leskovec, D. Schuurmans, and B. Dai, “Combiner: Full attention transformer with sparse computation cost,” Advances in Neural Information Processing Systems, vol. 34, pp. 22 470–22 482, 2021.
- A. Roy, M. Saffar, A. Vaswani, and D. Grangier, “Efficient content-based sparse attention with routing transformers,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 53–68, 2021.
- A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Transformers are rnns: Fast autoregressive transformers with linear attention,” in International conference on machine learning. PMLR, 2020, pp. 5156–5165.
- K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser et al., “Rethinking attention with performers,” arXiv preprint arXiv:2009.14794, 2020.
- Y. Xiong, Z. Zeng, R. Chakraborty, M. Tan, G. Fung, Y. Li, and V. Singh, “Nyströmformer: A nyström-based algorithm for approximating self-attention,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 16, 2021, pp. 14 138–14 148.
- W. Hua, Z. Dai, H. Liu, and Q. Le, “Transformer quality in linear time,” in International Conference on Machine Learning. PMLR, 2022, pp. 9099–9117.
- I. Han, R. Jayaram, A. Karbasi, V. Mirrokni, D. P. Woodruff, and A. Zandieh, “Hyperattention: Long-context attention in near-linear time,” arXiv preprint arXiv:2310.05869, 2023.
- Z. Shen, M. Zhang, H. Zhao, S. Yi, and H. Li, “Efficient attention: Attention with linear complexities,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp. 3531–3539.
- A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” Advances in neural information processing systems, vol. 20, 2007.
- S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-attention with linear complexity,” arXiv preprint arXiv:2006.04768, 2020.
- L. D. Lingle, “Transformer-vq: Linear-time transformers via vector quantization,” arXiv preprint arXiv:2309.16354, 2023.
- T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” Advances in Neural Information Processing Systems, vol. 35, pp. 16 344–16 359, 2022.
- T. Dao, “Flashattention-2: Faster attention with better parallelism and work partitioning,” arXiv preprint arXiv:2307.08691, 2023.
- H. Wang, Z. Wu, Z. Liu, H. Cai, L. Zhu, C. Gan, and S. Han, “Hat: Hardware-aware transformers for efficient natural language processing,” arXiv preprint arXiv:2005.14187, 2020.
- P. Ren, Y. Xiao, X. Chang, P.-Y. Huang, Z. Li, X. Chen, and X. Wang, “A comprehensive survey of neural architecture search: Challenges and solutions,” ACM Computing Surveys (CSUR), vol. 54, no. 4, pp. 1–34, 2021.
- Y. Liu, Y. Sun, B. Xue, M. Zhang, G. G. Yen, and K. C. Tan, “A survey on evolutionary neural architecture search,” IEEE transactions on neural networks and learning systems, 2021.
- T. Elsken, J. H. Metzen, and F. Hutter, “Neural architecture search: A survey,” The Journal of Machine Learning Research, vol. 20, no. 1, pp. 1997–2017, 2019.
- A. Wan, X. Dai, P. Zhang, Z. He, Y. Tian, S. Xie, B. Wu, M. Yu, T. Xu, K. Chen et al., “Fbnetv2: Differentiable neural architecture search for spatial and channel dimensions,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 12 965–12 974.
- B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer, “Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 10 734–10 742.
- D. So, Q. Le, and C. Liang, “The evolved transformer,” in International conference on machine learning. PMLR, 2019, pp. 5877–5886.
- Y. Zhao, L. Dong, Y. Shen, Z. Zhang, F. Wei, and W. Chen, “Memory-efficient differentiable transformer architecture search,” arXiv preprint arXiv:2105.14669, 2021.
- H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architecture search,” arXiv preprint arXiv:1806.09055, 2018.
- J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020.
- J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, M. M. A. Patwary, Y. Yang, and Y. Zhou, “Deep learning scaling is predictable, empirically,” arXiv preprint arXiv:1712.00409, 2017.
- C. Xu and J. McAuley, “A survey on dynamic neural networks for natural language processing,” arXiv preprint arXiv:2202.07101, 2022.
- Y.-S. Chuang, Y. Xie, H. Luo, Y. Kim, J. R. Glass, and P. He, “Dola: Decoding by contrasting layers improves factuality in large language models,” ArXiv, vol. abs/2309.03883, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:261582463
- J. Xin, R. Tang, J. Lee, Y. Yu, and J. J. Lin, “Deebert: Dynamic early exiting for accelerating bert inference,” in Annual Meeting of the Association for Computational Linguistics, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:216552850
- W. Liu, P. Zhou, Z. Zhao, Z. Wang, H. Deng, and Q. Ju, “Fastbert: a self-distilling bert with adaptive inference time,” ArXiv, vol. abs/2004.02178, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:214802887
- S. Geng, P. Gao, Z. Fu, and Y. Zhang, “Romebert: Robust training of multi-exit bert,” ArXiv, vol. abs/2101.09755, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:231698881
- J. Wang, K. Chen, G. Chen, L. Shou, and J. McAuley, “Skipbert: Efficient inference with shallow layer skipping,” in Annual Meeting of the Association for Computational Linguistics, 2022. [Online]. Available: https://api.semanticscholar.org/CorpusID:248780497
- W. Zhu, “Leebert: Learned early exit for bert with cross-level optimization,” in Annual Meeting of the Association for Computational Linguistics, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:236459809
- J. Kong, J. Wang, L.-C. Yu, and X. Zhang, “Accelerating inference for pretrained language models by unified multi-perspective early exiting,” in International Conference on Computational Linguistics, 2022. [Online]. Available: https://api.semanticscholar.org/CorpusID:252818912
- D. Ye, Y. Lin, Y. Huang, and M. Sun, “Tr-bert: Dynamic token reduction for accelerating bert inference,” in North American Chapter of the Association for Computational Linguistics, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:235097557
- D. Zeng, N. Du, T. Wang, Y. Xu, T. Lei, Z. Chen, and C. Cui, “Learning to skip for language modeling,” ArXiv, vol. abs/2311.15436, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:265456419
- Y. Wang, K. Chen, H. Tan, and K. Guo, “Tabi: An efficient multi-level inference system for large language models,” Proceedings of the Eighteenth European Conference on Computer Systems, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:258508784
- L. Chen, M. A. Zaharia, and J. Y. Zou, “Frugalgpt: How to use large language models while reducing cost and improving performance,” ArXiv, vol. abs/2305.05176, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:258564349
- J. Zhang, R. Krishna, A. H. Awadallah, and C. Wang, “Ecoassistant: Using llm assistant more affordably and accurately,” ArXiv, vol. abs/2310.03046, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:263671677
- B. Zhu, Y. Sheng, L. Zheng, C. W. Barrett, M. I. Jordan, and J. Jiao, “On optimal caching and model multiplexing for large model inference,” ArXiv, vol. abs/2306.02003, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:259075212
- M. Yue, J. Zhao, M. Zhang, L. Du, and Z. Yao, “Large language model cascades with mixture of thoughts representations for cost-efficient reasoning,” ArXiv, vol. abs/2310.03094, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:263671564
- D. Patel and G. Wong, “Gpt-4 architecture, infrastructure, training dataset, costs, vision, moe,” 2023, https://www.semianalysis.com/p/gpt-4-architecture-infrastructure.
- A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand et al., “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024.
- S. Rajbhandari, C. Li, Z. Yao, M. Zhang, R. Y. Aminabadi, A. A. Awan, J. Rasley, and Y. He, “Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale,” in International Conference on Machine Learning. PMLR, 2022, pp. 18 332–18 346.
- D. Dai, C. Deng, C. Zhao, R. X. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y. Wu, Z. Xie, Y. K. Li, P. Huang, F. Luo, C. Ruan, Z. Sui, and W. Liang, “Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models,” ArXiv, vol. abs/2401.06066, 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:266933338
- N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” arXiv preprint arXiv:1701.06538, 2017.
- D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen, “Gshard: Scaling giant models with conditional computation and automatic sharding,” arXiv preprint arXiv:2006.16668, 2020.
- W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” The Journal of Machine Learning Research, vol. 23, no. 1, pp. 5232–5270, 2022.
- Y. Zhou, T. Lei, H. Liu, N. Du, Y. Huang, V. Zhao, A. M. Dai, Q. V. Le, J. Laudon et al., “Mixture-of-experts with expert choice routing,” Advances in Neural Information Processing Systems, vol. 35, pp. 7103–7114, 2022.
- A. Yang, J. Lin, R. Men, C. Zhou, L. Jiang, X. Jia, A. Wang, J. Zhang, J. Wang, Y. Li et al., “M6-t: Exploring sparse expert models and beyond,” arXiv preprint arXiv:2105.15082, 2021.
- Y. Zhou, N. Du, Y. Huang, D. Peng, C. Lan, D. Huang, S. Shakeri, D. So, A. M. Dai, Y. Lu et al., “Brainformers: Trading simplicity for efficiency,” in International Conference on Machine Learning. PMLR, 2023, pp. 42 531–42 542.
- R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural computation, vol. 3, no. 1, pp. 79–87, 1991.
- M. I. Jordan and R. A. Jacobs, “Hierarchical mixtures of experts and the em algorithm,” Neural computation, vol. 6, no. 2, pp. 181–214, 1994.
- A. Graves, “Long short-term memory,” Supervised Sequence Labelling with Recurrent Neural Networks, pp. 37–45, 2012.
- Z. Chi, L. Dong, S. Huang, D. Dai, S. Ma, B. Patra, S. Singhal, P. Bajaj, X. Song, X.-L. Mao et al., “On the representation collapse of sparse mixture of experts,” Advances in Neural Information Processing Systems, vol. 35, pp. 34 600–34 613, 2022.
- Y. Xie, S. Huang, T. Chen, and F. Wei, “Moec: Mixture of expert clusters,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, 2023, pp. 13 807–13 815.
- M. Lewis, S. Bhosale, T. Dettmers, N. Goyal, and L. Zettlemoyer, “Base layers: Simplifying training of large, sparse models,” in International Conference on Machine Learning. PMLR, 2021, pp. 6265–6274.
- A. Clark, D. De Las Casas, A. Guy, A. Mensch, M. Paganini, J. Hoffmann, B. Damoc, B. Hechtman, T. Cai, S. Borgeaud et al., “Unified scaling laws for routed language models,” in International Conference on Machine Learning. PMLR, 2022, pp. 4057–4086.
- S. Roller, S. Sukhbaatar, J. Weston et al., “Hash layers for large sparse models,” Advances in Neural Information Processing Systems, vol. 34, pp. 17 555–17 566, 2021.
- C. N. dos Santos, J. Lee-Thorp, I. Noble, C.-C. Chang, and D. Uthus, “Memory augmented language models through mixture of word experts,” ArXiv, vol. abs/2311.10768, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:265295488
- X. Ren, P. Zhou, X. Meng, X. Huang, Y. Wang, W. Wang, P. Li, X. Zhang, A. V. Podolskiy, G. Arshinov, A. Bout, I. Piontkovskaya, J. Wei, X. Jiang, T. Su, Q. Liu, and J. Yao, “Pangu-Σ: Towards trillion parameter language model with sparse heterogeneous computing,” ArXiv, vol. abs/2303.10845, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:257666647
- J. Li, Z. Sun, X. He, L. Zeng, Y. Lin, E. Li, B. Zheng, R. Zhao, and X. Chen, “Locmoe: A low-overhead moe for large language model training,” 2024. [Online]. Available: https://api.semanticscholar.org/CorpusID:267212059
- S. Zuo, X. Liu, J. Jiao, Y. J. Kim, H. Hassan, R. Zhang, T. Zhao, and J. Gao, “Taming sparsely activated transformer with stochastic experts,” arXiv preprint arXiv:2110.04260, 2021.
- Y. J. Kim, A. A. Awan, A. Muzio, A. F. C. Salinas, L. Lu, A. Hendy, S. Rajbhandari, Y. He, and H. H. Awadalla, “Scalable and efficient moe training for multitask multilingual models,” arXiv preprint arXiv:2109.10465, 2021.
- C. Riquelme, J. Puigcerver, B. Mustafa, M. Neumann, R. Jenatton, A. Susano Pinto, D. Keysers, and N. Houlsby, “Scaling vision with sparse mixture of experts,” Advances in Neural Information Processing Systems, vol. 34, pp. 8583–8595, 2021.
- B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean, N. Shazeer, and W. Fedus, “St-moe: Designing stable and transferable sparse expert models,” arXiv preprint arXiv:2202.08906, 2022.
- H. Hazimeh, Z. Zhao, A. Chowdhery, M. Sathiamoorthy, Y. Chen, R. Mazumder, L. Hong, and E. Chi, “Dselect-k: Differentiable selection in the mixture of experts with applications to multi-task learning,” Advances in Neural Information Processing Systems, vol. 34, pp. 29 335–29 347, 2021.
- N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat et al., “Glam: Efficient scaling of language models with mixture-of-experts,” in International Conference on Machine Learning. PMLR, 2022, pp. 5547–5569.
- M. Artetxe, S. Bhosale, N. Goyal, T. Mihaylov, M. Ott, S. Shleifer, X. V. Lin, J. Du, S. Iyer, R. Pasunuru et al., “Efficient large scale language modeling with mixtures of experts,” arXiv preprint arXiv:2112.10684, 2021.
- D. Dai, L. Dong, S. Ma, B. Zheng, Z. Sui, B. Chang, and F. Wei, “Stablemoe: Stable routing strategy for mixture of experts,” arXiv preprint arXiv:2204.08396, 2022.
- M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard et al., “No language left behind: Scaling human-centered machine translation,” arXiv preprint arXiv:2207.04672, 2022.
- M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International conference on machine learning. PMLR, 2019, pp. 6105–6114.
- O. Press, N. A. Smith, and O. Levy, “Improving transformer models by reordering their sublayers,” arXiv preprint arXiv:1911.03864, 2019.
- D. So, W. Mańke, H. Liu, Z. Dai, N. Shazeer, and Q. V. Le, “Searching for efficient transformers for language modeling,” Advances in Neural Information Processing Systems, vol. 34, pp. 6010–6022, 2021.
- B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.
- X. Nie, X. Miao, S. Cao, L. Ma, Q. Liu, J. Xue, Y. Miao, Y. Liu, Z. Yang, and B. Cui, “Evomoe: An evolutional mixture-of-experts training framework via dense-to-sparse gate,” arXiv preprint arXiv:2112.14397, 2021.
- N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
- R. Liu, Y. J. Kim, A. Muzio, and H. Hassan, “Gating dropout: Communication-efficient regularization for sparsely activated transformers,” in International Conference on Machine Learning. PMLR, 2022, pp. 13 782–13 792.
- Z. Zhang, Y. Gu, X. Han, S. Chen, C. Xiao, Z. Sun, Y. Yao, F. Qi, J. Guan, P. Ke et al., “Cpm-2: Large-scale cost-effective pre-trained language models,” AI Open, vol. 2, pp. 216–224, 2021.
- F. Xue, Z. Zheng, Y. Fu, J. Ni, Z. Zheng, W. Zhou, and Y. You, “Openmoe: An early effort on open mixture-of-experts language models,” 2024.
- T. Chen, S. Huang, Y. Xie, B. Jiao, D. Jiang, H. Zhou, J. Li, and F. Wei, “Task-specific expert pruning for sparse mixture-of-experts,” arXiv preprint arXiv:2206.00277, 2022.
- Z.-F. Gao, P. Liu, W. X. Zhao, Z.-Y. Lu, and J.-R. Wen, “Parameter-efficient mixture-of-experts architecture for pre-trained language models,” arXiv preprint arXiv:2203.01104, 2022.
- S. Zuo, Q. Zhang, C. Liang, P. He, T. Zhao, and W. Chen, “Moebert: from bert to mixture-of-experts via importance-guided adaptation,” arXiv preprint arXiv:2204.07675, 2022.
- P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz, “Importance estimation for neural network pruning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 11 264–11 272.
- Z. Zhang, Y. Lin, Z. Liu, P. Li, M. Sun, and J. Zhou, “Moefication: Transformer feed-forward layers are mixtures of experts,” arXiv preprint arXiv:2110.01786, 2021.
- R. Csordás, K. Irie, and J. Schmidhuber, “Approximating two-layer feedforward networks for efficient transformers,” ArXiv, vol. abs/2310.10837, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:264172384
- R. Csordás, P. Piekos, K. Irie, and J. Schmidhuber, “Switchhead: Accelerating transformers with mixture-of-experts attention,” ArXiv, vol. abs/2312.07987, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:266191825
- I. O. Tolstikhin, N. Houlsby, A. Kolesnikov, L. Beyer, X. Zhai, T. Unterthiner, J. Yung, A. Steiner, D. Keysers, J. Uszkoreit et al., “Mlp-mixer: An all-mlp architecture for vision,” Advances in neural information processing systems, vol. 34, pp. 24 261–24 272, 2021.
- J. Lee-Thorp and J. Ainslie, “Sparse mixers: Combining moe and mixing to build a more efficient bert,” arXiv preprint arXiv:2205.12399, 2022.
- P. Yu, M. Artetxe, M. Ott, S. Shleifer, H. Gong, V. Stoyanov, and X. Li, “Efficient language modeling with sparse all-mlp,” arXiv preprint arXiv:2203.06850, 2022.
- Y. Wang, S. Agarwal, S. Mukherjee, X. Liu, J. Gao, A. H. Awadallah, and J. Gao, “Adamix: Mixture-of-adaptations for parameter-efficient model tuning,” arXiv preprint arXiv:2210.17451, 2022.
- N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for nlp,” in International Conference on Machine Learning. PMLR, 2019, pp. 2790–2799.
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
- S. Diao, T. Xu, R. Xu, J. Wang, and T. Zhang, “Mixture-of-domain-adapters: Decoupling and injecting domain knowledge to pre-trained language models’ memories,” in Annual Meeting of the Association for Computational Linguistics, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:259108831
- R. Li, G. Murray, and G. Carenini, “Mixture-of-linguistic-experts adapters for improving and interpreting pre-trained language models,” in Conference on Empirical Methods in Natural Language Processing, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:264487239
- Y. Zhu, N. Wichers, C.-C. Lin, X. Wang, T. Chen, L. Shu, H. Lu, C. Liu, L. Luo, J. Chen, and L. Meng, “Sira: Sparse mixture of low rank adaptation,” ArXiv, vol. abs/2311.09179, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:265213347
- S. Dou, E. Zhou, Y. Liu, S. Gao, J. Zhao, W. Shen, Y. Zhou, Z. Xi, X. Wang, X. Fan, S. Pu, J. Zhu, R. Zheng, T. Gui, Q. Zhang, and X. Huang, “Loramoe: Revolutionizing mixture of experts for maintaining world knowledge in language model alignment,” ArXiv, vol. abs/2312.09979, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:266335873
- Y. Gui, X. Yan, P. Yin, H. Yang, and J. Cheng, “Spt: Fine-tuning transformer-based language models efficiently with sparsification,” ArXiv, vol. abs/2312.10365, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:266348310
- W. Niu, J. Guan, Y. Wang, G. Agrawal, and B. Ren, “Dnnfusion: accelerating deep neural networks execution with advanced operator fusion,” in Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2021, pp. 883–898.
- R. Y. Aminabadi, S. Rajbhandari, A. A. Awan, C. Li, D. Li, E. Zheng, O. Ruwase, S. Smith, M. Zhang, J. Rasley et al., “Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale,” in SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2022, pp. 1–15.
- J. Fang, Y. Yu, C. Zhao, and J. Zhou, “Turbotransformers: an efficient gpu serving system for transformer models,” in Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021, pp. 389–402.
- Y. Zhai, C. Jiang, L. Wang, X. Jia, S. Zhang, Z. Chen, X. Liu, and Y. Zhu, “Bytetransformer: A high-performance transformer boosted for variable-length inputs,” in 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2023, pp. 344–355.
- Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin, B. Chen, P. Liang, C. Ré, I. Stoica, and C. Zhang, “Flexgen: High-throughput generative inference of large language models with a single gpu,” in International Conference on Machine Learning. PMLR, 2023, pp. 31 094–31 116.
- Y. Song, Z. Mi, H. Xie, and H. Chen, “Powerinfer: Fast large language model serving with a consumer-grade gpu,” arXiv preprint arXiv:2312.12456, 2023.
- L. Zheng, Z. Li, H. Zhang, Y. Zhuang, Z. Chen, Y. Huang, Y. Wang, Y. Xu, D. Zhuo, E. P. Xing et al., “Alpa: Automating inter- and intra-operator parallelism for distributed deep learning,” in 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022, pp. 559–578.
- M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-lm: Training multi-billion parameter language models using model parallelism,” arXiv preprint arXiv:1909.08053, 2019.
- S. Li, H. Liu, Z. Bian, J. Fang, H. Huang, Y. Liu, B. Wang, and Y. You, “Colossal-ai: A unified deep learning system for large-scale parallel training,” in Proceedings of the 52nd International Conference on Parallel Processing, 2023, pp. 766–775.
- M. Baines, S. Bhosale, V. Caggiano, N. Goyal, S. Goyal, M. Ott, B. Lefaudeux, V. Liptchinsky, M. Rabbat, S. Sheiffer et al., “Fairscale: A general purpose modular pytorch library for high performance and large scale training,” 2021, https://github.com/facebookresearch/fairscale.
- G. Lai, “Pax: A jax-based machine learning framework for large scale models.” https://github.com/google/paxml.
- J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, “Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 3505–3506.
- T. M. M. Team, “composer,” https://github.com/mosaicml/composer/, 2021.
- A. Pham, C. Yang, S. Sheng, S. Zhao, S. Lee, B. Jiang, F. Dong, X. Guan, and F. Ming, “Openllm: Operating llms in production,” 2023, https://github.com/bentoml/OpenLLM.
- T. A. B. Team, “Rayllm,” https://github.com/ray-project/ray-llm.
- M. team, “MLC-LLM,” 2023. [Online]. Available: https://github.com/mlc-ai/mlc-llm
- T. W. J. Team, “Saxml,” https://github.com/google/saxml.
- K. Yang, Z. Liu, and P. Cheng, “MOSEC: Model Serving made Efficient in the Cloud,” 2021. [Online]. Available: https://github.com/mosecorg/mosec
- T. D. K. Team, “Llm foundry,” https://github.com/mosaicml/llm-foundry.
- TensorFlow, “Tensorflow xla,” https://www.tensorflow.org/xla.
- T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze et al., “TVM: An automated end-to-end optimizing compiler for deep learning,” in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018, pp. 578–594.
- X. Jiang, H. Wang, Y. Chen, Z. Wu, L. Wang, B. Zou, Y. Yang, Z. Cui, Y. Cai, T. Yu et al., “Mnn: A universal and efficient inference engine,” Proceedings of Machine Learning and Systems, vol. 2, pp. 1–13, 2020.
- Pytorch, “Pytorch jit,” https://github.com/pytorch/torchdynamo.
- A. Rücklé, G. Geigle, M. Glockner, T. Beck, J. Pfeiffer, N. Reimers, and I. Gurevych, “Adapterdrop: On the efficiency of adapters in transformers,” arXiv preprint arXiv:2010.11918, 2020.
- J. Pfeiffer, A. Kamath, A. Rücklé, K. Cho, and I. Gurevych, “Adapterfusion: Non-destructive task composition for transfer learning,” arXiv preprint arXiv:2005.00247, 2020.
- S. He, L. Ding, D. Dong, M. Zhang, and D. Tao, “Sparseadapter: An easy approach for improving the parameter-efficiency of adapters,” arXiv preprint arXiv:2210.04284, 2022.
- X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” arXiv preprint arXiv:2101.00190, 2021.
- E. B. Zaken, S. Ravfogel, and Y. Goldberg, “Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models,” arXiv preprint arXiv:2106.10199, 2021.
- M. Valipour, M. Rezagholizadeh, I. Kobyzev, and A. Ghodsi, “Dylora: Parameter efficient tuning of pre-trained models using dynamic search-free low-rank adaptation,” arXiv preprint arXiv:2210.07558, 2022.
- Q. Zhang, M. Chen, A. Bukharin, P. He, Y. Cheng, W. Chen, and T. Zhao, “Adaptive budget allocation for parameter-efficient fine-tuning,” arXiv preprint arXiv:2303.10512, 2023.
- R. Karimi Mahabadi, J. Henderson, and S. Ruder, “Compacter: Efficient low-rank hypercomplex adapter layers,” Advances in Neural Information Processing Systems, vol. 34, pp. 1022–1035, 2021.
- V. Lialin, V. Deshpande, and A. Rumshisky, “Scaling down to scale up: A guide to parameter-efficient fine-tuning,” arXiv preprint arXiv:2303.15647, 2023.
Authors: Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He