Large Multimodal Model Compression via Efficient Pruning and Distillation at AntGroup (2312.05795v2)
Abstract: The deployment of Large Multimodal Models (LMMs) within AntGroup has significantly advanced multimodal tasks in payment, security, and advertising, notably enhancing advertisement audition tasks in Alipay. However, deploying such sizable models introduces challenges, particularly increased latency and carbon emissions, which run counter to the ideals of Green AI. This paper introduces a novel multi-stage compression strategy for our proprietary LMM, AntGMM. Our methodology centers on three main aspects: employing small training sample sizes, addressing multi-level redundancy through multi-stage pruning, and introducing an advanced distillation loss design. In our research, we constructed a dataset, the Multimodal Advertisement Audition Dataset (MAAD), from real-world scenarios within Alipay, and conducted experiments to validate the reliability of our proposed strategy. Furthermore, the effectiveness of our strategy is evident in its operational success in Alipay's real-world multimodal advertisement audition for three months starting in September 2023. Notably, our approach achieved a substantial reduction in latency, from 700 ms to 90 ms, while maintaining online performance with only a slight decrease. Moreover, our compressed model is estimated to reduce electricity consumption by approximately 75 million kWh annually compared to the direct deployment of AntGMM, demonstrating our commitment to Green AI initiatives. We will publicly release our code and the MAAD dataset after internal review (https://github.com/MorinW/AntGMM_Pruning).
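The abstract does not spell out the pruning criterion or the exact distillation objective, so the sketch below is only a generic illustration of the prune-then-distill pipeline it describes: magnitude-based structured pruning of hidden units followed by training the smaller student against the original model with a soft-label KL plus hard-label cross-entropy loss. The helper names (`prune_mlp_hidden`, `distillation_loss`), the `keep_ratio` parameter, and the loss weighting are assumptions for illustration, not AntGMM's actual implementation.

```python
# Minimal sketch (PyTorch), assuming magnitude-based structured pruning and a
# standard KD loss; NOT the paper's actual multi-stage procedure or loss design.
import torch
import torch.nn as nn
import torch.nn.functional as F


def prune_mlp_hidden(fc1: nn.Linear, fc2: nn.Linear, keep_ratio: float):
    """Drop the hidden units of fc1 with the smallest L2 weight norm and remove
    the matching input columns of fc2 (structured pruning of one MLP block)."""
    with torch.no_grad():
        scores = fc1.weight.norm(dim=1)                      # one score per hidden unit
        n_keep = max(1, int(keep_ratio * fc1.out_features))
        keep = torch.topk(scores, n_keep).indices.sort().values
        new_fc1 = nn.Linear(fc1.in_features, n_keep)
        new_fc1.weight.copy_(fc1.weight[keep])
        new_fc1.bias.copy_(fc1.bias[keep])
        new_fc2 = nn.Linear(n_keep, fc2.out_features)
        new_fc2.weight.copy_(fc2.weight[:, keep])
        new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend temperature-scaled KL against the teacher with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard


if __name__ == "__main__":
    torch.manual_seed(0)
    teacher = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
    fc1, fc2 = prune_mlp_hidden(teacher[0], teacher[2], keep_ratio=0.5)
    student = nn.Sequential(fc1, nn.ReLU(), fc2)   # pruned copy to be distilled

    x = torch.randn(8, 32)
    labels = torch.randint(0, 10, (8,))
    loss = distillation_loss(student(x), teacher(x).detach(), labels)
    loss.backward()
    print(f"distillation loss: {loss.item():.4f}")
```

In a multi-stage setting, a pruning pass like this would presumably be applied at several granularities (e.g., layers, attention heads, hidden units), with a distillation phase after each stage rather than a single one-shot cut.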