Turbo: Informativity-Driven Acceleration Plug-In for Vision-Language Models (2312.07408v1)
Abstract: Vision-Language Large Models (VLMs) have become a primary backbone of AI due to their impressive performance. However, their expensive computation costs, i.e., low throughput and high latency, impede their potential in real-world scenarios. To accelerate VLMs, most existing methods focus on the model perspective (pruning, distillation, quantization) but completely overlook redundancy from the data perspective. To fill this gap, this paper highlights the severity of data redundancy and designs a plug-and-play Turbo module, guided by information degree, to prune inefficient tokens from visual or textual data. In pursuit of efficiency-performance trade-offs, information degree takes two key factors into account: mutual redundancy and semantic value. Concretely, the former evaluates the data duplication between sequential tokens, while the latter evaluates each token by its contribution to the overall semantics. As a result, tokens with a high information degree carry less redundancy and stronger semantics. During VLM computation, Turbo works as a user-friendly plug-in that sorts data by information degree and uses only the top-ranked tokens to save costs. Its advantages are multifaceted: it is generally compatible with various VLMs across understanding and generation, and simple to use without retraining or nontrivial engineering effort. Extensive experiments on multiple public VLM benchmarks reveal the gratifying acceleration of Turbo, with a negligible performance drop.
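The abstract describes the scoring pipeline (mutual redundancy plus semantic value, then top-k selection) but gives no equations. The sketch below is a minimal PyTorch illustration of one plausible reading, assuming mutual redundancy is the cosine similarity between neighboring tokens and semantic value is the attention each token receives from the [CLS] token; the function name `turbo_prune` and the parameters `alpha` and `keep_ratio` are hypothetical, not from the paper.

```python
import torch

def turbo_prune(tokens, cls_attn, keep_ratio=0.5, alpha=1.0):
    """Rank tokens by an information degree combining mutual redundancy
    and semantic value, then keep only the top-scoring ones.

    tokens:   (B, N, D) token embeddings from a VLM layer
    cls_attn: (B, N) attention from the [CLS] token, used here as a
              proxy for each token's semantic value (an assumption;
              the paper's exact scoring is not given in the abstract)
    """
    # Mutual redundancy: cosine similarity of each token to its successor.
    normed = torch.nn.functional.normalize(tokens, dim=-1)
    sim = (normed[:, :-1] * normed[:, 1:]).sum(dim=-1)   # (B, N-1)
    redundancy = torch.cat([sim, sim[:, -1:]], dim=1)    # pad last token -> (B, N)

    # Information degree: high semantic value, low redundancy scores higher.
    info_degree = alpha * cls_attn - redundancy          # (B, N)

    # Keep the top-k tokens per sample, preserving their original order.
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = info_degree.topk(k, dim=1).indices.sort(dim=1).values   # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])      # (B, k, D)
    return torch.gather(tokens, 1, idx)
```

Under these assumptions, calling `turbo_prune(x, cls_attn, keep_ratio=0.5)` would halve the token sequence before it is fed to the next transformer block, which is where the throughput savings would come from.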
Authors: Chen Ju, Haicheng Wang, Zeqian Li, Xu Chen, Zhonghua Zhai, Weilin Huang, Shuai Xiao