Rethinking Pruning for Vision-Language Models: Strategies for Effective Sparsity and Performance Restoration (2404.02424v2)

Published 3 Apr 2024 in cs.LG and cs.CV

Abstract: Vision-Language Models (VLMs) integrate information from multiple modalities and have shown remarkable success across various tasks. However, deploying large-scale VLMs in resource-constrained scenarios is challenging. Pruning followed by finetuning offers a potential solution but remains underexplored for VLMs. This study addresses two key questions: how to distribute sparsity across the modality-specific models, and how to restore the performance of pruned sparse VLMs. Our preliminary studies identify two effective pruning settings: applying the same sparsity to both the vision and language models, and pruning only the language model. While LoRA finetuning aims to restore sparse models, its dense low-rank updates are incompatible with the pruned weights and disrupt the learned sparsity. To overcome this, we propose SparseLoRA, which applies sparsity directly to the LoRA weights. Our experimental results demonstrate significant improvements, including an 11.3% boost under 2:4 sparsity and a 47.6% enhancement under unstructured 70% sparsity. Code is released at https://github.com/Shwai-He/VLM-Compression.
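
The core idea behind SparseLoRA, as stated in the abstract, is to apply the pruning-induced sparsity pattern to the LoRA update itself, so that merging the adapter back into the base weight cannot reactivate pruned connections. The sketch below is a minimal, hedged illustration of that idea in PyTorch; it is not the authors' released implementation (see the linked repository for that), and the class name `SparseLoRALinear`, the rank, and the scaling defaults are illustrative assumptions.

```python
# Minimal sketch (not the authors' released code) of the SparseLoRA idea:
# standard LoRA adds a dense low-rank update B @ A to a frozen weight, which
# would fill in entries that pruning had zeroed out. Here the pruning mask is
# also applied to the LoRA update, so the merged weight stays sparse.
import torch
import torch.nn as nn


class SparseLoRALinear(nn.Module):
    def __init__(self, pruned_weight: torch.Tensor, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        out_dim, in_dim = pruned_weight.shape
        # Frozen, already-pruned base weight and its binary sparsity mask.
        self.weight = nn.Parameter(pruned_weight, requires_grad=False)
        self.register_buffer("mask", (pruned_weight != 0).float())
        # Trainable low-rank factors, initialized as in standard LoRA.
        self.lora_A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_dim, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Mask the low-rank update so it cannot touch pruned positions.
        delta = (self.lora_B @ self.lora_A) * self.scaling * self.mask
        return x @ (self.weight + delta).t()

    @torch.no_grad()
    def merge(self) -> torch.Tensor:
        # Merged weight keeps exactly the original sparsity pattern.
        return self.weight + (self.lora_B @ self.lora_A) * self.scaling * self.mask


# Tiny usage example on a randomly pruned layer (~70% unstructured sparsity).
dense = torch.randn(64, 128)
pruned = dense * (torch.rand_like(dense) > 0.7).float()
layer = SparseLoRALinear(pruned)
y = layer(torch.randn(4, 128))
assert torch.equal(layer.merge() == 0, pruned == 0)  # sparsity preserved
```

The same masking applies whether the sparsity is unstructured or follows a 2:4 structured pattern; only the mask construction changes.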

Authors (3)
  1. Shwai He (23 papers)
  2. Tianlong Chen (202 papers)
  3. Ang Li (472 papers)