Structural Pruning of Large Vision Language Models: A Comprehensive Study on Pruning Dynamics, Recovery, and Data Efficiency

Published 27 Apr 2026 in cs.CL | (2604.24380v1)

Abstract: While Large Vision Language Models (LVLMs) demonstrate impressive capabilities, their substantial computational and memory requirements pose deployment challenges on resource-constrained edge devices. Current parameter reduction techniques primarily involve training LVLMs from small LLMs, but these methods offer limited flexibility and remain computationally intensive. We study a complementary route: compressing existing LVLMs by applying structured pruning to the LLM backbone, followed by lightweight recovery training. Specifically, we investigate two structural pruning paradigms: layerwise and widthwise pruning, and pair them with supervised finetuning and knowledge distillation on logits and hidden states. Additionally, we assess the feasibility of conducting recovery training with only a small fraction of the available data. Our results show that widthwise pruning generally maintains better performance in low-resource scenarios, where computational resources are limited or there is insufficient finetuning data. As for the recovery training, finetuning only the multimodal projector is sufficient at small compression levels. Furthermore, a combination of supervised finetuning and hidden-state distillation yields optimal recovery across various pruning levels. Notably, effective recovery can be achieved using just 5% of the original data, while retaining over 95% of the original performance. Through empirical study on three representative LVLM families ranging from 3B to 7B parameters, this study offers actionable insights for practitioners to compress LVLMs without extensive computation resources or sufficient data. The code base is available at https://github.com/YiranHuangIrene/VLMCompression.git.


Authors (5)

Summary

The paper introduces a structural pruning framework that uses both layerwise and widthwise strategies to reduce model size while aiming to preserve core reasoning abilities.
The paper demonstrates that supervised finetuning, especially when combined with L2 hidden-state loss, recovers over 95% of original performance at mild compression ratios.
The paper reveals that widthwise pruning preserves multimodal reasoning more effectively than layerwise methods, achieving data-efficient recovery with as little as 5% of the training data.

Structural Pruning and Recovery Training for Large Vision Language Models

Motivation and Scope

Large Vision Language Models (LVLMs) demonstrate substantial capacity on visual, linguistic, and multimodal reasoning tasks but impose significant computational and memory burdens, complicating their deployment on resource-constrained hardware. Existing size reduction techniques primarily train LVLMs on top of smaller LLMs, but these strategies lack flexibility and remain computationally expensive. This paper (2604.24380) proposes a systematic approach for compressing LVLMs by applying structured pruning (either layerwise or widthwise) to the language backbone, followed by data- and compute-efficient recovery training, including supervised finetuning and knowledge distillation.

Structured Pruning Paradigms

The study operationalizes two distinct pruning schemes:

Layerwise pruning: Removes transformer blocks based on block influence metrics, leveraging the redundancy of deep layers.
Widthwise pruning: Eliminates attention heads and MLP neurons via dependency graph-based importance assessments, ensuring structural integrity of layerwise transformations.
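
The block influence metric used for layerwise pruning is not defined in this summary. A minimal sketch, assuming the common definition of one minus the mean cosine similarity between the hidden states entering and leaving a block (a block that barely changes its input scores near zero and is a removal candidate). The function names and the toy hidden states below are hypothetical and are not taken from the paper's code base.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def block_influence(inputs, outputs):
    """1 - mean cosine similarity between per-token input and output states."""
    sims = [cosine(x, y) for x, y in zip(inputs, outputs)]
    return 1.0 - sum(sims) / len(sims)

def blocks_to_remove(hidden_states, num_remove):
    """hidden_states[i] holds the token states entering block i;
    hidden_states[-1] holds the output of the last block."""
    scores = [
        block_influence(hidden_states[i], hidden_states[i + 1])
        for i in range(len(hidden_states) - 1)
    ]
    # Lowest-influence blocks are removed first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:num_remove]), scores

# Toy example: 3 blocks, 2 tokens, 2-dim hidden states (made-up numbers).
toy_states = [
    [[1.0, 0.0], [0.0, 1.0]],  # input to block 0
    [[0.0, 1.0], [1.0, 0.0]],  # block 0 rotates every state by 90 degrees
    [[0.0, 2.0], [2.0, 0.0]],  # block 1 only rescales its input
    [[1.0, 1.0], [1.0, 1.0]],  # block 2 changes direction by 45 degrees
]
removed, scores = blocks_to_remove(toy_states, num_remove=1)
print(removed, [round(s, 3) for s in scores])
```

In this toy case the rescaling-only block is selected, since cosine similarity ignores magnitude. In practice the hidden states would come from a small calibration set passed through the unpruned model.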

A decision framework is articulated for selecting pruning and recovery strategies tailored to resource availability and desired compression ratios.

Figure 1: Decision flow for selecting LVLM pruning and recovery strategies according to hardware and fidelity requirements.

The empirical evaluation demonstrates that, in low-resource and zero-shot settings, widthwise pruning preserves model competency more effectively than layerwise pruning. Layerwise removal disproportionately impairs performance on benchmarks with complex free-form generation and reasoning, revealing the criticality of architectural depth for such tasks.

Figure 2: Widthwise pruning offers superior performance on reasoning-intensive benchmarks compared to layerwise approaches, especially at moderate sparsity ratios.
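
To illustrate what dependency-aware widthwise pruning has to keep consistent, the sketch below removes intermediate neurons from a gated MLP of the kind used in LLaMA-style backbones: dropping neuron j means deleting row j of the gate and up projections and column j of the down projection together. The importance score here (sum of squared weights in the group) is a stand-in chosen for simplicity, not the score used in the paper, and all names and weights are hypothetical.

```python
def group_importance(gate, up, down, j):
    """Sum of squared weights touching intermediate neuron j (stand-in score)."""
    row_g = sum(w * w for w in gate[j])
    row_u = sum(w * w for w in up[j])
    col_d = sum(row[j] * row[j] for row in down)
    return row_g + row_u + col_d

def prune_mlp_neurons(gate, up, down, num_remove):
    """Drop the least important neurons consistently from all three matrices.

    gate, up: [intermediate x hidden]; down: [hidden x intermediate].
    """
    d_ff = len(gate)
    ranked = sorted(range(d_ff), key=lambda j: group_importance(gate, up, down, j))
    drop = set(ranked[:num_remove])
    keep = [j for j in range(d_ff) if j not in drop]
    new_gate = [gate[j] for j in keep]                    # remove rows
    new_up = [up[j] for j in keep]                        # remove the same rows
    new_down = [[row[j] for j in keep] for row in down]   # remove matching columns
    return new_gate, new_up, new_down, keep

# Toy MLP: hidden size 2, intermediate size 3 (made-up weights).
gate = [[1.0, 0.0], [0.1, 0.0], [0.0, 1.0]]
up = [[1.0, 1.0], [0.0, 0.1], [1.0, 0.0]]
down = [[1.0, 0.1, 0.0],
        [0.0, 0.0, 1.0]]
new_gate, new_up, new_down, kept = prune_mlp_neurons(gate, up, down, num_remove=1)
print(kept, new_down)
```

The hidden size is untouched, so the pruned block still composes with its neighbours; only the inner dimension shrinks. Attention heads can be handled the same way by grouping the per-head slices of the query, key, value, and output projections.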

Recovery Training: Supervised Finetuning and Knowledge Distillation

Following pruning, performance recovery is essential to compensate for disrupted cross-modal alignment and degraded linguistic capabilities. The paper analyzes the efficacy of several recovery strategies:

Supervised Finetuning (SFT): Retraining on ground-truth responses restores modality alignment, particularly via finetuning the multimodal projector.
Knowledge Distillation (KD): Transfers output distributions or hidden-state representations from the unpruned teacher to the pruned student, utilizing KL, Reverse KL, and L2 objectives.
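
The three distillation objectives named above can be written down for a single token position as follows. This is a minimal sketch with hypothetical function names; temperature scaling, batching, and the choice of which layers to match are left out because this summary does not specify them. The hidden-state term assumes teacher and student states have the same dimensionality.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def forward_kl(teacher_logits, student_logits):
    """KL(teacher || student): the student is penalized for missing teacher mass."""
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))

def reverse_kl(teacher_logits, student_logits):
    """KL(student || teacher): the student is penalized for mass the teacher lacks."""
    return kl_divergence(softmax(student_logits), softmax(teacher_logits))

def hidden_l2(teacher_hidden, student_hidden):
    """Mean squared difference between two hidden-state vectors."""
    diffs = [(t - s) ** 2 for t, s in zip(teacher_hidden, student_hidden)]
    return sum(diffs) / len(diffs)

# Made-up logits over a 3-token vocabulary.
teacher = [2.0, 0.0, 0.0]
student = [0.0, 0.0, 0.0]
fwd = forward_kl(teacher, student)
rev = reverse_kl(teacher, student)
l2 = hidden_l2([1.0, 2.0], [1.0, 4.0])
print(round(fwd, 4), round(rev, 4), l2)
```

Forward and reverse KL give different values on the same pair of distributions, which is why they are treated as separate objectives.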

Key findings include:

SFT on the projector is sufficient for mild pruning (≤15%), recapturing >95% of original performance with minimal compute.
Beyond 30%, joint SFT (projector and LLM) becomes necessary as pruning damages both multimodal alignment and core language reasoning.
KD alone is less impactful at high pruning ratios; however, combining SFT and hidden-state L2 loss yields maximal robustness and recovery.

Figure 3: SFT of only the multimodal projector recovers alignment at mild compression, but joint SFT is required for higher pruning ratios.

Figure 4: KD performs comparably to SFT at low pruning ratios but is suboptimal at high ratios; SFT+L2 hidden-state matching is optimal for robust recovery.

Figure 5: SFT+L2 loss outperforms other combinations in recovery training, consistently restoring performance across compression levels.
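
One way to read the SFT+L2 combination is as a weighted sum of a token-level cross-entropy on the ground-truth response and an L2 term pulling student hidden states toward the teacher's. The sketch below assumes that form; the weight `l2_weight` and the function names are hypothetical, and the paper's actual weighting and layer selection are not given in this summary.

```python
import math

def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under softmax(logits)."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[target_index]

def mean_squared_diff(a, b):
    """Mean squared difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def recovery_loss(student_logits, target_ids, student_hidden, teacher_hidden,
                  l2_weight=1.0):
    """SFT cross-entropy plus weighted hidden-state L2, averaged over tokens."""
    n = len(target_ids)
    sft = sum(cross_entropy(l, t) for l, t in zip(student_logits, target_ids)) / n
    l2 = sum(mean_squared_diff(s, t)
             for s, t in zip(student_hidden, teacher_hidden)) / n
    return sft + l2_weight * l2, sft, l2

# One token, 2-word vocabulary, 2-dim hidden state (made-up values).
total, sft, l2 = recovery_loss(
    student_logits=[[0.0, 0.0]],
    target_ids=[0],
    student_hidden=[[1.0, 1.0]],
    teacher_hidden=[[0.0, 0.0]],
    l2_weight=0.5,
)
print(round(total, 4), round(sft, 4), l2)
```

Setting `l2_weight` to zero recovers plain SFT, which makes it easy to compare the two regimes with the same training loop.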

Data Efficiency and Calibration Sensitivity

The study systematically evaluates the data efficiency of recovery training, revealing that high-fidelity models (>90% performance retained) can be recovered using only 5% of the dataset for compression ratios up to 45%. For severe pruning (≥60%), larger fractions are required, but full data is only necessary in that extreme regime. Similarly, importance score calibration for pruning converges rapidly; as few as 10 samples suffice for reliable structural pruning.

Figure 6: Even minimal data (≥5%) enables effective recovery after moderate pruning; higher ratios demand proportionally more data.

Figure 7: Calibration sensitivity is minimal; model performance stabilizes after a small number of samples for importance score estimation.
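
Reproducing the data-efficiency setting mostly comes down to drawing a fixed, seeded subset of the recovery data and a small calibration set. A minimal sketch, assuming uniform random sampling; the paper's actual sampling procedure is not described in this summary, and the dataset and helper name below are placeholders.

```python
import random

def subsample(examples, fraction, seed=0):
    """Return a reproducible uniform random subset of the given fraction."""
    rng = random.Random(seed)
    k = max(1, round(len(examples) * fraction))
    return rng.sample(examples, k)

# Placeholder dataset of 1000 items standing in for the recovery training data.
dataset = [f"example_{i}" for i in range(1000)]

recovery_subset = subsample(dataset, fraction=0.05)    # 5% for recovery training
calibration_set = random.Random(1).sample(dataset, 10)  # 10 samples for importance scores

print(len(recovery_subset), len(calibration_set))
```

Fixing the seed keeps comparisons across pruning ratios on the same subset, so differences in recovered performance are not confounded by which examples were drawn.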

Hardware and Latency Implications

Both layerwise and widthwise pruning confer equivalent gains in GPU memory and theoretical FLOPs at equal sparsity, but latency benefits differ. Layerwise pruning yields lower inference time by reducing network depth and kernel launches. However, this must be balanced against its inferior retention of reasoning capacity relative to widthwise pruning at comparable compression ratios.
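
A short parameter count makes the equal-sparsity comparison concrete. The sketch assumes dimensions typical of a 7B LLaMA-style backbone (hidden size 4096, 32 heads of dimension 128, intermediate size 11008, 32 blocks) and ignores embeddings, biases, and normalization weights; these figures are illustrative assumptions, not numbers reported by the paper.

```python
def block_params(hidden, num_heads, head_dim, intermediate):
    """Weights in one LLaMA-style block: Q, K, V, O projections plus a gated MLP."""
    attention = 4 * hidden * num_heads * head_dim
    mlp = 3 * hidden * intermediate
    return attention + mlp

HIDDEN, HEADS, HEAD_DIM, INTERMEDIATE, LAYERS = 4096, 32, 128, 11008, 32

dense = LAYERS * block_params(HIDDEN, HEADS, HEAD_DIM, INTERMEDIATE)

# Layerwise, 25%: drop 8 of 32 blocks, keep every block at full width.
layerwise = 24 * block_params(HIDDEN, HEADS, HEAD_DIM, INTERMEDIATE)

# Widthwise, 25%: keep all 32 blocks, drop 8 heads and a quarter of MLP neurons.
widthwise = LAYERS * block_params(HIDDEN, 24, HEAD_DIM, 8256)

print(dense, layerwise, widthwise)
```

Both pruned variants end up with the same number of backbone weights, but the layerwise model runs 24 sequential blocks while the widthwise model still runs 32 narrower ones, which is consistent with the latency difference described above.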

Practical Guidelines and Implications

The paper distills actionable insights for practitioners:

Widthwise pruning should be preferred for standalone compression, especially where reasoning and open-ended generation are critical.
Layerwise pruning, combined with SFT, is effective for modest pruning (≤30%) when latency is the primary concern.
Combining SFT and L2 hidden-state KD achieves optimal recovery; output KD provides diminishing returns.
Data-efficient calibration and recovery are feasible, minimizing both compute and data requirements.
Structural pruning integrates readily with quantization (e.g., LLM.int8()), enabling compounded memory reduction without substantial loss in performance.
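
The compounding effect of pruning and quantization on weight memory can be estimated with simple arithmetic. The sketch below assumes a 7B-parameter backbone, a 30% pruning ratio, and 2 bytes per weight in FP16 versus 1 byte in INT8; it counts weights only and ignores activations, the KV cache, the vision encoder, and any layers a quantization scheme keeps in higher precision.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

params = 7_000_000_000
pruned_params = params * 70 // 100  # 30% of weights removed (illustrative ratio)

dense_fp16 = weight_memory_gb(params, 2)
pruned_fp16 = weight_memory_gb(pruned_params, 2)
pruned_int8 = weight_memory_gb(pruned_params, 1)

print(dense_fp16, pruned_fp16, pruned_int8)
```

Under these assumptions the two reductions multiply: pruning scales the parameter count, and quantization scales the bytes per remaining parameter.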

Theoretical Significance and Future Directions

Findings reinforce the architectural vulnerability of transformer-based LVLMs to layer removal, with widthwise redundancy enabling safe compression without disrupting sequential processing chains essential for complex multimodal reasoning. These results generalize across representative LVLMs (LLaVA-v1.5-7B, Mini-InternVL-Chat-4B, Bunny-v1-3B), providing clarity on balancing structural fidelity, efficiency, and deployment constraints.

Future directions include:

Extending pruning and recovery techniques to vision encoders and multi-modal fusion modules.
Exploring unstructured and semi-structured pruning in the context of commodity hardware.
Scaling up to increasingly diverse datasets and benchmark suites, e.g., for AGI-oriented multimodal tasks.

Conclusion

This study provides a systematic empirical account of structured pruning and lightweight recovery for LVLMs, delineating trade-offs between compression ratio, performance, and resource efficiency. Its results support widthwise pruning and combined SFT+L2 training as the most robust options for data-efficient compression and deployment. These insights help researchers and practitioners tailor LVLM architectures to real-world constraints.
