
MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer (2403.02991v1)

Published 5 Mar 2024 in cs.CV

Abstract: Vision-Language Transformers (VLTs) have achieved great success recently, but at the cost of heavy computation, largely attributable to the large number of visual and language tokens. Existing token pruning research for compressing VLTs mainly follows a single-modality scheme and ignores the critical role of aligning different modalities to guide the pruning process, so tokens important to one modality may be falsely pruned in the other modality's branch. Existing VLT pruning works also lack the flexibility to dynamically compress each layer based on different input samples. To this end, we propose a novel framework named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for accelerating various VLTs. Specifically, we first introduce a well-designed Multi-modality Alignment Guidance (MAG) module that aligns features of the same semantic concept across modalities, ensuring that pruned tokens are unimportant to all modalities. We further design a novel Dynamic Token Pruning (DTP) module that adaptively adjusts the token compression ratio in each layer based on the input instance. Extensive experiments on various benchmarks demonstrate that MADTP significantly reduces the computational complexity of various multimodal models while preserving competitive performance. Notably, when applied to the BLIP model on the NLVR2 dataset, MADTP reduces GFLOPs by 80% with less than 4% performance degradation.
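The abstract describes two mechanisms: cross-modal alignment as a pruning criterion (the MAG module) and an input-dependent compression ratio (the DTP module). The following is a minimal PyTorch sketch of that general idea, not the authors' implementation; all function names, shapes, and the mean-score threshold are illustrative assumptions. Cross-modal cosine similarity stands in for MAG's alignment signal, and a per-sample threshold stands in for DTP's dynamic ratio.

```python
import torch
import torch.nn.functional as F

def alignment_guided_importance(vision_tokens, text_tokens):
    # Hypothetical stand-in for the MAG module: score each visual
    # token by its strongest cosine similarity to any text token,
    # so tokens that matter to the other modality score highly.
    v = F.normalize(vision_tokens, dim=-1)   # (Nv, D), unit-norm
    t = F.normalize(text_tokens, dim=-1)     # (Nt, D), unit-norm
    sim = v @ t.T                            # (Nv, Nt) cosine similarities
    return sim.max(dim=-1).values            # (Nv,) best cross-modal match

def dynamic_prune(tokens, importance):
    # Hypothetical stand-in for the DTP module: keep tokens scoring
    # above this sample's mean importance, so the number of surviving
    # tokens (the compression ratio) varies with the input.
    keep = importance >= importance.mean()   # boolean mask, (Nv,)
    return tokens[keep], keep

# Usage with dummy tokens (shapes loosely follow a ViT/BERT pairing):
vision = torch.randn(196, 768)               # patch tokens for one image
text = torch.randn(32, 768)                  # word tokens for one caption
scores = alignment_guided_importance(vision, text)
kept, mask = dynamic_prune(vision, scores)
print(kept.shape)                            # (<=196, 768): fewer tokens survive
```

Applied per layer and to both branches, pruning this way removes only tokens that neither modality aligns with strongly, which is the intuition behind the reported 80% GFLOPs reduction at a small accuracy cost.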
