MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning (2404.05621v1)

Published 8 Apr 2024 in cs.CV

Abstract: While excellent in transfer learning, Vision-Language Models (VLMs) come with high computational costs due to their large number of parameters. To address this issue, removing parameters via model pruning is a viable solution. However, existing techniques for VLMs are task-specific, and thus require pruning the network from scratch for each new task of interest. In this work, we explore a new direction: Task-Agnostic Vision-Language Pruning (TA-VLP). Given a pretrained VLM, the goal is to find a unique pruned counterpart transferable to multiple unknown downstream tasks. In this challenging setting, the transferable representations already encoded in the pretrained model are a key aspect to preserve. Thus, we propose Multimodal Flow Pruning (MULTIFLOW), a first, gradient-free, pruning framework for TA-VLP where: (i) the importance of a parameter is expressed in terms of its magnitude and its information flow, by incorporating the saliency of the neurons it connects; and (ii) pruning is driven by the emergent (multimodal) distribution of the VLM parameters after pretraining. We benchmark eight state-of-the-art pruning algorithms in the context of TA-VLP, experimenting with two VLMs, three vision-language tasks, and three pruning ratios. Our experimental results show that MULTIFLOW outperforms recent sophisticated, combinatorial competitors in the vast majority of the cases, paving the way towards addressing TA-VLP. The code is publicly available at https://github.com/FarinaMatteo/multiflow.

Task-Agnostic Vision-Language Pruning: A Critical Exploration of Multimodal Flow Pruning

In recent years, Vision-Language Models (VLMs) have demonstrated remarkable transfer learning capabilities across various tasks, often achieving state-of-the-art performance. However, these models are inherently parameter-heavy and computationally intensive, which poses significant challenges for deployment in resource-constrained environments. In the paper "MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning," the authors address this issue with a novel pruning framework that preserves the transferability of VLMs across multiple tasks without the need to re-prune for each new task.

The paper introduces Multimodal Flow Pruning (MULTIFLOW), a gradient-free method for Task-Agnostic Vision-Language Pruning (TA-VLP). The primary objective of TA-VLP is to produce a single pruned version of a VLM that retains its efficacy across diverse downstream tasks without requiring recalibration for each specific one. This departs from traditional pruning methods, which rely on task-specific knowledge and therefore require re-pruning the network for every new task, an approach that is both inefficient and impractical.
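To make the setting concrete, the TA-VLP objective can be written schematically as follows. This is an illustrative formulation rather than the paper's exact notation: theta denotes the pretrained VLM parameters, m a binary pruning mask, s the target pruning ratio, T the distribution of downstream tasks (unknown at pruning time), and P_t the performance on task t after task-specific fine-tuning.

```latex
\max_{m \in \{0,1\}^{|\theta|}} \;
  \mathbb{E}_{t \sim \mathcal{T}} \big[ P_t(\theta \odot m) \big]
\quad \text{subject to} \quad
  \|m\|_0 \le (1 - s)\,|\theta|
```

The defining constraint of TA-VLP is that the mask m is chosen once, before any of the tasks in T are observed.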

Core Contributions and Methodology

  1. Task-Agnostic Pruning Formalization: The authors formalize the concept of Task-Agnostic Vision-Language Pruning. They define TA-VLP as pruning a VLM in such a way that it remains adaptable to various unforeseen tasks, so that a single pruning phase suffices for generalized use.
  2. Multimodal Flow Pruning Algorithm: MULTIFLOW stands out by focusing on the information flow within the model, which is represented as a bipartite graph. The saliency of a parameter is determined jointly by the magnitude of its weight and the flow of information through the network: each parameter's importance combines its edge weight with the saliency of the input and output nodes it connects, thus balancing local node importance with the overall informational transfer within the network.
  3. Incorporating Multimodal Priors: By considering the pretraining distribution of parameters and respecting the multimodal character of the learned representations, MULTIFLOW mitigates biases that could arise from pruning without these considerations. This is especially significant in TA-VLP, given the different roles the modalities play in learning and representation within VLMs. A schematic code sketch of both the flow-based scoring and the modality-aware masking follows this list.
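The sketch below illustrates how such a scheme could be implemented. It is a minimal PyTorch example under explicit assumptions: the function names (flow_scores, multimodal_masks) are invented for illustration, node saliency is taken as the mean absolute weight of a neuron's incident connections, and the multimodal prior is approximated by thresholding each modality separately at the same target sparsity. The paper's exact aggregation and budget allocation may differ.

```python
import torch


def flow_scores(weight: torch.Tensor) -> torch.Tensor:
    """Score the parameters of one linear layer (shape: out_features x in_features).

    Illustrative assumption: a neuron's saliency is the mean absolute weight of
    its incident connections, and a parameter's score is its own magnitude
    modulated by the saliency of the two neurons it connects.
    """
    w = weight.abs()
    in_saliency = w.mean(dim=0)    # one value per input neuron (column)
    out_saliency = w.mean(dim=1)   # one value per output neuron (row)
    return w * out_saliency.unsqueeze(1) * in_saliency.unsqueeze(0)


def multimodal_masks(layers: dict, sparsity: float) -> dict:
    """Build binary masks at a target sparsity, thresholding each modality
    (e.g. 'vision', 'text', 'fusion') separately so that the modality with
    larger weight magnitudes cannot monopolize the parameter budget.

    `layers` maps a layer name to a (weight, modality) pair; illustrative only.
    """
    per_modality: dict = {}
    for name, (weight, modality) in layers.items():
        per_modality.setdefault(modality, []).append((name, flow_scores(weight)))

    masks = {}
    for modality, items in per_modality.items():
        flat = torch.cat([scores.flatten() for _, scores in items])
        k = int(sparsity * flat.numel())  # number of weights to drop in this modality
        threshold = torch.kthvalue(flat, k).values.item() if k > 0 else -float("inf")
        for name, scores in items:
            masks[name] = (scores > threshold).float()
    return masks
```

The design point mirrored here is that scores are never compared through one global threshold across modalities; each modality keeps roughly the same fraction of its own weights, which is one simple way to respect the emergent multimodal distribution the paper emphasizes.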

Experimental Evaluation

To substantiate their claims, the authors conduct thorough evaluations on two state-of-the-art VLM architectures (BLIP and XVLM) across three tasks: Image-Text Retrieval, Image Captioning, and Visual Question Answering. MULTIFLOW consistently outperforms eight alternative pruning algorithms, notably in maintaining robust performance at high sparsity levels.

Notably, MULTIFLOW performs well even at an extreme sparsity level of 90%, which is significant given the difficulty of retaining task-generalization capabilities under such stringent constraints. The algorithm is also resilient to the severe performance drop commonly associated with high degrees of parameter pruning, marking an important step towards making TA-VLP feasible in practical settings.

Implications and Future Directions

The paper's findings open several avenues for future research. The idea of leveraging multimodal information flow for pruning could extend to multimodal settings beyond vision and language. In addition, structured pruning methods that exploit the insights behind MULTIFLOW could further improve deployability and efficiency, particularly in real-world scenarios with both computational and memory constraints.

The research underscores the need to revisit existing pruning strategies, especially those reliant on task-specific optimization, through a more modular and task-agnostic lens. MULTIFLOW's contribution thus marks a significant step towards the efficient and universal application of VLMs across diverse domains.

In summary, this paper contributes a robust and practical pruning method, MULTIFLOW, that not only tackles the inefficiencies of task-specific pruning but also pioneers an approach grounded in the intrinsic multimodal nature of VLMs, fostering continued exploration in AI model optimization.

Authors (6)
  1. Matteo Farina (6 papers)
  2. Massimiliano Mancini (66 papers)
  3. Elia Cunegatti (8 papers)
  4. Gaowen Liu (60 papers)
  5. Giovanni Iacca (44 papers)
  6. Elisa Ricci (137 papers)