MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning (2404.05621v1)

Published 8 Apr 2024 in cs.CV

Abstract: While excellent in transfer learning, Vision-Language Models (VLMs) come with high computational costs due to their large number of parameters. To address this issue, removing parameters via model pruning is a viable solution. However, existing techniques for VLMs are task-specific, and thus require pruning the network from scratch for each new task of interest. In this work, we explore a new direction: Task-Agnostic Vision-Language Pruning (TA-VLP). Given a pretrained VLM, the goal is to find a unique pruned counterpart transferable to multiple unknown downstream tasks. In this challenging setting, the transferable representations already encoded in the pretrained model are a key aspect to preserve. Thus, we propose Multimodal Flow Pruning (MULTIFLOW), a first, gradient-free, pruning framework for TA-VLP where: (i) the importance of a parameter is expressed in terms of its magnitude and its information flow, by incorporating the saliency of the neurons it connects; and (ii) pruning is driven by the emergent (multimodal) distribution of the VLM parameters after pretraining. We benchmark eight state-of-the-art pruning algorithms in the context of TA-VLP, experimenting with two VLMs, three vision-language tasks, and three pruning ratios. Our experimental results show that MULTIFLOW outperforms recent sophisticated, combinatorial competitors in the vast majority of the cases, paving the way towards addressing TA-VLP. The code is publicly available at https://github.com/FarinaMatteo/multiflow.

Task-Agnostic Vision-Language Pruning: A Critical Exploration of Multimodal Flow Pruning

In recent years, Vision-Language Models (VLMs) have demonstrated remarkable transfer learning capabilities across various tasks, often achieving state-of-the-art performance. However, these models are inherently parameter-heavy and computationally intensive, which poses significant challenges for deployment in resource-constrained environments. In the paper "MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning," the authors address this issue with a novel pruning framework that preserves the transferability of VLMs across multiple tasks without the need to re-prune for each new task.

The paper introduces Multimodal Flow Pruning (MULTIFLOW), a gradient-free method for Task-Agnostic Vision-Language Pruning (TA-VLP). The primary objective of TA-VLP is to produce a single pruned version of a VLM that retains its efficacy across diverse downstream tasks without requiring recalibration for each specific one. This departs from traditional pruning methods, which rely on task-specific knowledge and therefore require re-pruning the network for every new task, an approach that is both inefficient and impractical.
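To make the setting concrete, the TA-VLP objective can be written schematically as follows. This is an illustrative formulation rather than the paper's exact notation: theta denotes the pretrained VLM parameters, m a binary pruning mask, s the target pruning ratio, T the distribution of downstream tasks (unknown at pruning time), and P_t the performance on task t after task-specific fine-tuning.

```latex
\max_{m \in \{0,1\}^{|\theta|}} \;
  \mathbb{E}_{t \sim \mathcal{T}} \big[ P_t(\theta \odot m) \big]
\quad \text{subject to} \quad
  \|m\|_0 \le (1 - s)\,|\theta|
```

The defining constraint of TA-VLP is that the mask m is chosen once, before any of the tasks in T are observed.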

Core Contributions and Methodology

  1. Task-Agnostic Pruning Formalization: The authors formalize the concept of Task-Agnostic Vision-Language Pruning. They define TA-VLP as pruning a VLM in such a way that it remains adaptable to various unforeseen tasks, so that a single pruning phase suffices for generalized use.
  2. Multimodal Flow Pruning Algorithm: MULTIFLOW stands out by focusing on the information flow within the model, which is represented as a bipartite graph. The saliency of a parameter is determined jointly by the magnitude of its weight and the flow of information through the network: each parameter's importance combines its edge weight with the saliency of the input and output nodes it connects, thus balancing local node importance with the overall informational transfer within the network.
  3. Incorporating Multimodal Priors: By considering the pretraining distribution of parameters and respecting the multimodal character of the learned representations, MULTIFLOW mitigates biases that could arise from pruning without these considerations. This is especially significant in TA-VLP, given the different roles the modalities play in learning and representation within VLMs. A schematic code sketch of both the flow-based scoring and the modality-aware masking follows this list.
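The sketch below illustrates how such a scheme could be implemented. It is a minimal PyTorch example under explicit assumptions: the function names (flow_scores, multimodal_masks) are invented for illustration, node saliency is taken as the mean absolute weight of a neuron's incident connections, and the multimodal prior is approximated by thresholding each modality separately at the same target sparsity. The paper's exact aggregation and budget allocation may differ.

```python
import torch


def flow_scores(weight: torch.Tensor) -> torch.Tensor:
    """Score the parameters of one linear layer (shape: out_features x in_features).

    Illustrative assumption: a neuron's saliency is the mean absolute weight of
    its incident connections, and a parameter's score is its own magnitude
    modulated by the saliency of the two neurons it connects.
    """
    w = weight.abs()
    in_saliency = w.mean(dim=0)    # one value per input neuron (column)
    out_saliency = w.mean(dim=1)   # one value per output neuron (row)
    return w * out_saliency.unsqueeze(1) * in_saliency.unsqueeze(0)


def multimodal_masks(layers: dict, sparsity: float) -> dict:
    """Build binary masks at a target sparsity, thresholding each modality
    (e.g. 'vision', 'text', 'fusion') separately so that the modality with
    larger weight magnitudes cannot monopolize the parameter budget.

    `layers` maps a layer name to a (weight, modality) pair; illustrative only.
    """
    per_modality: dict = {}
    for name, (weight, modality) in layers.items():
        per_modality.setdefault(modality, []).append((name, flow_scores(weight)))

    masks = {}
    for modality, items in per_modality.items():
        flat = torch.cat([scores.flatten() for _, scores in items])
        k = int(sparsity * flat.numel())  # number of weights to drop in this modality
        threshold = torch.kthvalue(flat, k).values.item() if k > 0 else -float("inf")
        for name, scores in items:
            masks[name] = (scores > threshold).float()
    return masks
```

The design point mirrored here is that scores are never compared through one global threshold across modalities; each modality keeps roughly the same fraction of its own weights, which is one simple way to respect the emergent multimodal distribution the paper emphasizes.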

Experimental Evaluation

To substantiate their claims, the authors conduct thorough evaluations on two state-of-the-art VLM architectures (BLIP and XVLM) across three tasks: Image-Text Retrieval, Image Captioning, and Visual Question Answering. MULTIFLOW consistently outperforms eight alternative pruning algorithms, notably in maintaining robust performance at high sparsity levels.

Notably, MULTIFLOW performs well even at an extreme sparsity level of 90%, which is significant given the difficulty of retaining task-generalization capabilities under such stringent constraints. The algorithm is also resilient to the severe performance drop commonly associated with high degrees of parameter pruning, marking an important step towards making TA-VLP feasible in practical settings.

Implications and Future Directions

The paper's findings open several avenues for future research. The idea of leveraging multimodal information flow for pruning could extend to multimodal settings beyond vision and language. In addition, structured pruning methods that exploit the insights behind MULTIFLOW could further improve deployability and efficiency, particularly in real-world scenarios with both computational and memory constraints.

The research underscores the need to revisit existing pruning strategies, especially those reliant on task-specific optimization, through a more modular and task-agnostic lens. MULTIFLOW's contribution thus marks a significant step towards the efficient and universal application of VLMs across diverse domains.

In summary, this paper contributes a robust and practical pruning method, MULTIFLOW, that not only tackles the inefficiencies of task-specific pruning but also pioneers an approach grounded in the intrinsic multimodal nature of VLMs, fostering continued exploration in AI model optimization.

Authors (6)
  1. Matteo Farina (6 papers)
  2. Massimiliano Mancini (66 papers)
  3. Elia Cunegatti (8 papers)
  4. Gaowen Liu (60 papers)
  5. Giovanni Iacca (44 papers)
  6. Elisa Ricci (137 papers)