Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations (2407.05690v1)
Abstract: Structured pruning fundamentally reduces the computational and memory overheads of LLMs and offers a feasible solution for on-device LLM deployment. Structurally pruned models remain dense and high-precision, making them highly compatible with further tuning and compression. However, because coarse-grained structured pruning inflicts substantial damage on the highly interconnected model, achieving a high compression ratio for scaled-up LLMs remains a challenge. In this paper, we introduce a task-agnostic structured pruning approach coupled with a compact Transformer architecture design. The proposed approach, named TransAct, reduces the transitional activations inside the multi-head attention (MHA) and multi-layer perceptron (MLP) modules, while preserving the inter-module activations that are sensitive to perturbation. The LLM is thus pruned into an intra-module low-rank architecture, which significantly reduces weights, the KV cache, and attention computation. TransAct is implemented on the LLaMA model and evaluated on downstream benchmarks. The results verify the optimality of our approach at high compression ratios with respect to both efficiency and performance. Furthermore, ablation studies reveal the strength of activation-guided iterative pruning and provide an experimental analysis of the redundancy of the MHA and MLP modules.
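To make the architectural idea concrete, below is a minimal, illustrative PyTorch sketch (not the authors' implementation) of an intra-module low-rank Transformer block: the inter-module hidden size `d_model` is preserved, while the transitional widths inside MHA and MLP are pruned to smaller ranks. The names `LowRankMHA`, `LowRankMLP`, `r_attn`, and `r_mlp` are assumptions chosen for this example.

```python
# Illustrative sketch of the intra-module low-rank idea: the hidden size
# d_model carried between modules is preserved, while the transitional
# activations inside MHA (per-head projections, KV cache) and the MLP
# (intermediate dimension) are pruned to smaller widths r_attn and r_mlp.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankMHA(nn.Module):
    """Attention whose internal (transitional) width r_attn < d_model."""

    def __init__(self, d_model: int, n_heads: int, r_attn: int):
        super().__init__()
        assert r_attn % n_heads == 0
        self.n_heads, self.d_head = n_heads, r_attn // n_heads
        # Q/K/V project d_model -> r_attn, so the KV cache and attention
        # computation shrink with r_attn.
        self.q = nn.Linear(d_model, r_attn, bias=False)
        self.k = nn.Linear(d_model, r_attn, bias=False)
        self.v = nn.Linear(d_model, r_attn, bias=False)
        # The output projection maps back to d_model, preserving the
        # inter-module activation size.
        self.o = nn.Linear(r_attn, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o(attn.transpose(1, 2).reshape(b, t, -1))


class LowRankMLP(nn.Module):
    """Gated MLP whose intermediate (transitional) width r_mlp is pruned."""

    def __init__(self, d_model: int, r_mlp: int):
        super().__init__()
        self.gate = nn.Linear(d_model, r_mlp, bias=False)
        self.up = nn.Linear(d_model, r_mlp, bias=False)
        self.down = nn.Linear(r_mlp, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```

Under these assumptions, choosing r_attn and r_mlp smaller than the original per-layer widths shrinks the weights, the KV cache, and the attention computation, while the output projections still map back to `d_model`, so the perturbation-sensitive inter-module activations keep their original width.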
Authors: Bowen Shen, Zheng Lin, Daren Zha, Wei Liu, Jian Luan, Bin Wang, Weiping Wang