CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers (2305.17455v4)
Abstract: Recent vision-language models have achieved tremendous advances. However, their computational costs are also escalating dramatically, making model acceleration exceedingly critical. To pursue more efficient vision-language Transformers, this paper introduces Cross-Guided Ensemble of Tokens (CrossGET), a general acceleration framework for vision-language Transformers. This framework adaptively combines tokens in real time during inference, significantly reducing computational costs while maintaining high performance. CrossGET features two primary innovations: 1) Cross-Guided Matching and Ensemble. CrossGET leverages cross-modal guided token matching and ensemble to effectively utilize cross-modal information, achieving wider applicability across both modality-independent models, e.g., CLIP, and modality-dependent ones, e.g., BLIP2. 2) Complete-Graph Soft Matching. CrossGET introduces an algorithm for the token-matching mechanism, ensuring reliable matching results while facilitating parallelizability and high efficiency. Extensive experiments have been conducted on various vision-language tasks, such as image-text retrieval, visual reasoning, image captioning, and visual question answering. The performance on both classic multimodal architectures and emerging multimodal LLMs demonstrates the framework's effectiveness and versatility. The code is available at https://github.com/sdc17/CrossGET.
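The two ideas in the abstract can be made concrete with a small sketch. The code below is a minimal, hedged illustration, not the paper's implementation (see the GitHub repository for that): it merges redundant tokens of one modality by repeatedly matching the most similar surviving pair on a complete similarity graph, while a projected token from the other modality down-weights pairs that the guiding modality considers important. The function name `cross_guided_merge`, the averaging "ensemble", the 0.5 guidance weight, and the sequential greedy loop are all illustrative assumptions; the actual Complete-Graph Soft Matching is designed to be parallelizable rather than iterative.

```python
import torch
import torch.nn.functional as F


def cross_guided_merge(tokens: torch.Tensor, guide: torch.Tensor,
                       num_merge: int) -> torch.Tensor:
    """Hedged sketch of cross-guided token matching and ensemble.

    tokens: (N, D) tokens of one modality (e.g. vision patch tokens).
    guide:  (D,)   a token from the other modality (e.g. a text [CLS]),
                   assumed to be already projected into the same space.
    Returns (N - num_merge, D) tokens after merging.
    """
    x = F.normalize(tokens, dim=-1)
    g = F.normalize(guide, dim=-1)

    sim = x @ x.t()                        # complete graph of pairwise similarities
    sim.fill_diagonal_(float("-inf"))      # a token cannot be matched with itself
    importance = x @ g                     # cross-modal importance of each token

    # Prefer merging pairs that are mutually similar AND unimportant to the
    # other modality. The 0.5 weight is arbitrary, purely for illustration.
    score = sim - 0.5 * (importance[:, None] + importance[None, :])

    out = tokens.clone()
    alive = torch.ones(tokens.size(0), dtype=torch.bool)
    for _ in range(num_merge):             # greedy stand-in for the paper's
        mask = alive[:, None] & alive[None, :]  # parallel matching scheme
        flat = score.masked_fill(~mask, float("-inf")).argmax().item()
        i, j = flat // tokens.size(0), flat % tokens.size(0)
        out[i] = (out[i] + out[j]) / 2     # "ensemble" the matched pair by averaging
        alive[j] = False                   # similarities of the merged token are
                                           # left stale; tolerable for a sketch
    return out[alive]
```

Applied to, say, the patch tokens of a CLIP vision encoder with the text [CLS] token as `guide`, each call removes `num_merge` tokens, so inserting such a step between Transformer layers shrinks the sequence progressively, which is the regime in which token-ensemble methods obtain their speedups.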
Authors: Dachuan Shi, Chaofan Tao, Anyi Rao, Zhendong Yang, Chun Yuan, Jiaqi Wang