An Overview of ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models
The paper introduces Efficient Coarse-to-Fine Layer-Wise Pruning (ECoFLaP), an approach designed to make pruning of large vision-language models (LVLMs) more efficient. The method addresses two current challenges in pruning: the high computational cost of global pruning methods and the suboptimal performance of layer-wise methods, which lack a global perspective. ECoFLaP combines the computational efficiency of layer-wise pruning with the global context awareness of global methods, aiming to deliver high-performing compressed LVLMs at a modest resource cost.
Methodology
ECoFLaP is a two-stage pruning framework consisting of a coarse step and a fine step. The coarse step determines a sparsity ratio for each layer from a global importance score, computed efficiently through a zeroth-order approximation of the model's global gradients, i.e., using forward passes only. This step identifies which layers can be pruned more aggressively without significantly degrading the model's overall performance. Once these layer-specific sparsity ratios are established, the fine step performs unstructured pruning within each layer, removing weights according to the globally informed sparsity budget assigned to that layer.
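To make the two steps concrete, the sketch below outlines one way this pipeline could look in PyTorch. It is a minimal illustration, not the authors' implementation: the names `model`, `loss_fn`, and `calib_batch` are assumed placeholders, the sensitivity score uses a simple two-point zeroth-order estimate, and the sparsity-allocation heuristic is deliberately simplified relative to the paper's rule.

```python
import torch

@torch.no_grad()
def zeroth_order_layer_importance(model, loss_fn, calib_batch, n_samples=8, mu=1e-3):
    """Estimate a global importance score for each weight matrix without backprop.

    Each layer is perturbed with Gaussian noise and the change in the calibration
    loss is measured (a two-point zeroth-order estimate), so the score reflects
    how sensitive the global objective is to that layer.
    """
    importance = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:                      # skip biases and norm parameters
            continue
        score = 0.0
        for _ in range(n_samples):
            z = torch.randn_like(param)
            param.add_(mu * z)
            loss_plus = float(loss_fn(model, calib_batch))
            param.sub_(2 * mu * z)
            loss_minus = float(loss_fn(model, calib_batch))
            param.add_(mu * z)                   # restore the original weights
            score += ((loss_plus - loss_minus) / (2 * mu)) ** 2
        importance[name] = score / n_samples
    return importance


def allocate_sparsity(importance, target_sparsity=0.5, spread=0.2):
    """Coarse step: assign higher sparsity to less important layers while keeping
    the average near the global target (a simple heuristic, not the paper's rule)."""
    ranked = sorted(importance, key=importance.get)      # least important first
    ratios = {}
    for i, name in enumerate(ranked):
        offset = spread * (1 - 2 * i / max(len(ranked) - 1, 1))
        ratios[name] = min(max(target_sparsity + offset, 0.0), 1.0)
    return ratios


@torch.no_grad()
def prune_layerwise(model, ratios):
    """Fine step: unstructured magnitude pruning inside each layer at the
    layer-specific ratio chosen by the coarse step."""
    for name, param in model.named_parameters():
        k = int(ratios.get(name, 0.0) * param.numel())
        if k == 0:
            continue
        threshold = param.abs().flatten().kthvalue(k).values
        param.mul_((param.abs() > threshold).to(param.dtype))
```

Running the three functions in sequence reproduces the coarse-to-fine flow: a cheap global pass to budget sparsity per layer, followed by purely local weight removal.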
This combination of computational efficiency and global awareness represents a methodological advance in model pruning. Previous methods operate at one level or the other: layer-wise approaches such as SparseGPT and Wanda prune each layer using only local information and can therefore yield suboptimal compression, while global approaches that rely on full gradients or Hessians scale poorly to models of this size. ECoFLaP attempts to overcome these pitfalls by merging the strengths of both strategies.
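For contrast, the sketch below shows a simplified Wanda-style layer-wise score (weight magnitude scaled by the input activation norm), which relies only on local statistics from a calibration set; the tensor names and shapes are illustrative assumptions rather than Wanda's actual code.

```python
import torch

@torch.no_grad()
def wanda_style_prune(weight, calib_inputs, sparsity=0.5):
    """Simplified Wanda-style pruning of one linear layer.

    weight:        (out_features, in_features) weight matrix
    calib_inputs:  (n_tokens, in_features) activations from a calibration set
    """
    act_norm = calib_inputs.norm(p=2, dim=0)             # per-input-channel L2 norm
    score = weight.abs() * act_norm.unsqueeze(0)          # local importance of each weight
    k = int(sparsity * weight.shape[1])
    if k == 0:
        return weight
    # zero the lowest-scoring weights within each output row
    prune_idx = torch.argsort(score, dim=1)[:, :k]
    weight.scatter_(1, prune_idx, 0.0)
    return weight
```

Because every quantity here is computed from a single layer's weights and inputs, no signal about the layer's contribution to the global objective enters the decision, which is precisely the gap the coarse step of ECoFLaP is meant to fill.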
Numerical Results
ECoFLaP demonstrates consistent performance improvements across diverse datasets and models. Notably, at a 50% sparsity level, it outperformed state-of-the-art pruning techniques including SparseGPT and Wanda, with relative improvements of up to 5% across numerous vision-language tasks. It also proved superior to UPop, a method specifically designed for vision-language transformers, with gains of 1.8% and 2.6% on the NLVR and COCO caption datasets, respectively. Furthermore, it showed robustness by achieving comparable gains on unimodal tasks with FlanT5, LLaMA, and EVA-ViT architectures, and was particularly effective in high-sparsity regimes.
Practical and Theoretical Implications
Practically, the adoption of ECoFLaP can lead to significant reductions in computational requirements and energy consumption, making it viable for deployment in resource-constrained environments. This can widen the applicability of high-capacity models, which previously were limited by their resource demands.
Theoretically, the method sets a precedent for integrating zeroth-order optimization techniques into the pruning process, which can be further explored in future research. This integration shows promise in striking a balance between computational efficiency and performance fidelity in large-scale models.
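For concreteness, one common form of such an estimator is the two-point Gaussian-smoothing estimate shown below, which requires only forward evaluations of the loss; the exact variant used in ECoFLaP may differ in its sampling distribution and number of samples.

```latex
% Generic two-point zeroth-order gradient estimate (not necessarily ECoFLaP's exact variant)
\hat{\nabla}_{\theta}\mathcal{L}(\theta)
  = \frac{1}{n}\sum_{i=1}^{n}
    \frac{\mathcal{L}(\theta + \mu z_i) - \mathcal{L}(\theta - \mu z_i)}{2\mu}\, z_i,
  \qquad z_i \sim \mathcal{N}(0, I)
```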
The success of ECoFLaP across multiple datasets and model architectures suggests robust generalization and potential applicability to domains beyond vision-language models. Future work could refine the approximation of global gradients and extend the framework to more sophisticated architectures, potentially including those with emerging modalities.
Overall, ECoFLaP is a meaningful step toward making LVLMs more accessible and sustainable without substantial sacrifices in performance, and a practically relevant advance in the field of model pruning.