Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference (2405.05803v2)

Published 9 May 2024 in cs.CV and cs.AI

Abstract: Multimodal LLMs (MLLMs) demand considerable computations for inference due to the extensive parameters and the additional input tokens needed for visual information representation. Herein, we introduce Visual Tokens Withdrawal (VTW), a plug-and-play module to boost MLLMs for rapid inference. Our approach is inspired by two intriguing phenomena we have observed: (1) the attention sink phenomenon that is prevalent in LLMs also persists in MLLMs, suggesting that initial tokens and nearest tokens receive the majority of attention, while middle vision tokens garner minimal attention in deep layers; (2) the presence of information migration, which implies that visual information is transferred to subsequent text tokens within the first few layers of MLLMs. As per our findings, we conclude that vision tokens are unnecessary in the deep layers of MLLMs. Thus, we strategically withdraw them at a certain layer, enabling only text tokens to engage in subsequent layers. To pinpoint the ideal layer for VTW, we initially analyze a limited set of tiny datasets and choose the first layer that meets the Kullback-Leibler divergence criterion. Our VTW approach can cut computational overhead by over 40% across diverse multimodal tasks while maintaining performance. Our code is released at https://github.com/lzhxmu/VTW.
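The abstract describes two mechanisms concretely enough to sketch in code: dropping the vision-token slice of the hidden states at a chosen decoder layer, and selecting that layer as the earliest one whose next-token distribution on a small calibration set stays within a KL-divergence threshold of the unpruned run. The sketch below is a minimal, hypothetical PyTorch illustration, not the paper's released implementation; the function names (`forward_with_vtw`, `select_withdraw_layer`), the assumption that vision tokens occupy one contiguous slice `[vision_start, vision_end)`, the generic `layers` callables, and the `threshold` value are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def forward_with_vtw(layers, hidden_states, vision_start, vision_end, withdraw_layer):
    """Run decoder layers, withdrawing vision tokens at `withdraw_layer`.

    hidden_states: (batch, seq_len, dim) with vision tokens occupying
    positions [vision_start, vision_end). Each entry of `layers` is assumed
    to map (batch, seq, dim) -> (batch, seq, dim).
    """
    for idx, layer in enumerate(layers):
        if idx == withdraw_layer:
            # Withdraw the vision tokens so only text tokens reach deep layers.
            hidden_states = torch.cat(
                [hidden_states[:, :vision_start], hidden_states[:, vision_end:]],
                dim=1,
            )
        hidden_states = layer(hidden_states)
    return hidden_states


def select_withdraw_layer(logits_full, logits_per_layer, threshold=1e-3):
    """Pick the first layer whose withdrawn-token predictions stay close,
    in KL divergence, to the full-token run on a small calibration set.

    logits_full: (num_samples, vocab) next-token logits with all tokens kept.
    logits_per_layer: list over candidate layers; element k holds the logits
    obtained when vision tokens are withdrawn at layer k.
    """
    log_p = F.log_softmax(logits_full, dim=-1)
    for k, logits_k in enumerate(logits_per_layer):
        log_q = F.log_softmax(logits_k, dim=-1)
        # KL(p || q), averaged over calibration samples.
        kl = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
        if kl.item() < threshold:
            return k
    # Fall back to the last candidate layer if no layer meets the criterion.
    return len(logits_per_layer) - 1
```

A full implementation would also have to adjust the KV cache and positional indices when tokens are withdrawn; those details are omitted here for brevity.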

Authors (4)
  1. Zhihang Lin (13 papers)
  2. Mingbao Lin (78 papers)
  3. Luxi Lin (2 papers)
  4. Rongrong Ji (315 papers)
Citations (6)