An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models (2403.06764v3)

Published 11 Mar 2024 in cs.CV, cs.AI, and cs.CL

Abstract: In this study, we identify the inefficient attention phenomena in Large Vision-Language Models (LVLMs), notably within prominent models like LLaVA-1.5, QwenVL-Chat and Video-LLaVA. We find out that the attention computation over visual tokens is of extreme inefficiency in the deep layers of popular LVLMs, suggesting a need for a sparser approach compared to textual data handling. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV's ability to dramatically reduce computational costs (e.g., a 45% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance in a wide range of image and video understanding tasks. The computational efficiency and performance trade-off of FastV is highly customizable and Pareto-efficient. It can compress the FLOPs of a 13B-parameter model to achieve a lower budget than that of a 7B-parameter model, while still maintaining superior performance. We believe FastV has practical values for deployment of LVLMs in edge devices and commercial models. Code is released at https://github.com/pkunlp-icler/FastV.

Plug-and-Play Inference Acceleration for Large Vision-Language Models: Introducing FastV

Efficient Processing of Visual Tokens in Large Vision-Language Models

The paper addresses a key inefficiency in how Large Vision-Language Models (LVLMs) handle visual information, focusing on widely used models such as LLaVA-1.5, QwenVL-Chat, and Video-LLaVA. Extensive analysis reveals that these models exhibit a markedly inefficient attention pattern toward visual tokens in their deeper layers, with visual tokens receiving disproportionately lower attention scores than their textual counterparts. This inefficiency signals a need to optimize how LVLMs process visual data, motivating a shift toward a sparser, more efficient approach.
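
A concrete way to see this pattern is sketched below, under stated assumptions: given the per-layer attention maps from a forward pass (for example, the tuple a HuggingFace-style model returns when called with output_attentions=True) and a boolean mask marking which sequence positions are image tokens, the function reports the average attention mass that text tokens allocate to visual tokens at each layer. The tensor shapes and the visual mask are illustrative assumptions, not the paper's own analysis code.

```python
# Minimal sketch (not the paper's analysis code): measure how much attention
# text tokens allocate to visual tokens at each decoder layer.
# `attentions` is assumed to be the per-layer tuple a HuggingFace-style model
# returns with output_attentions=True (each tensor: [batch, heads, seq, seq]);
# `visual_mask` is a hypothetical boolean mask over sequence positions.
import torch

def attention_to_visual_per_layer(attentions, visual_mask):
    """For each layer, average attention mass text queries assign to visual keys."""
    text_mask = ~visual_mask
    fractions = []
    for layer_attn in attentions:                    # [B, H, S, S]
        attn = layer_attn.mean(dim=1)                # average over heads -> [B, S, S]
        to_visual = attn[:, text_mask][:, :, visual_mask].sum(dim=-1)  # [B, n_text]
        fractions.append(to_visual.mean().item())
    return fractions

# Shape-only demo with random attention maps (4 layers, 16 tokens, 6 visual).
if __name__ == "__main__":
    B, H, S, L = 1, 8, 16, 4
    attns = tuple(torch.softmax(torch.randn(B, H, S, S), dim=-1) for _ in range(L))
    vmask = torch.zeros(S, dtype=torch.bool)
    vmask[2:8] = True
    print(attention_to_visual_per_layer(attns, vmask))
```

Run on an actual LVLM, the per-layer fractions would be expected to drop sharply after the first few layers, which is precisely the inefficiency the paper documents.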

Introducing FastV: A Plug-and-Play Solution

The proposed FastV is a plug-and-play solution aimed at enhancing the computational efficiency of LVLMs. By dynamically learning adaptive attention patterns in early layers and then selectively pruning visual tokens in subsequent layers, FastV significantly lowers computational costs. The method achieves a 45% reduction in floating-point operations (FLOPs) for the LLaVA-1.5-13B model without compromising task performance across a broad spectrum of image and video understanding tasks. This balance between computational efficiency and performance makes FastV an invaluable tool, especially for deploying LVLMs in resource-constrained environments like edge devices.
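
The core pruning step can be sketched as follows. This is a simplified illustration under assumed tensor shapes, not the authors' released implementation (see the linked repository for that): at a chosen early layer K, each visual token is scored by the average attention it receives, and only the top fraction of visual tokens is carried into the remaining layers.

```python
# Simplified sketch of a FastV-style pruning step, not the authors' released
# implementation (https://github.com/pkunlp-icler/FastV). At a chosen early
# layer K, visual tokens are ranked by the attention they receive and only the
# top fraction survives into the remaining layers. Shapes are assumptions.
import torch

def prune_visual_tokens(hidden_states, layer_attn, visual_idx, keep_ratio=0.5):
    """Drop the least-attended visual tokens before the next decoder layer.

    hidden_states: [batch, seq, dim] activations entering layer K+1
    layer_attn:    [batch, heads, seq, seq] attention weights from layer K
    visual_idx:    1-D LongTensor of image-token positions
    keep_ratio:    fraction of visual tokens to retain (0.5 ~ "1/2 tokens")
    """
    device = hidden_states.device
    visual_idx = visual_idx.to(device)

    # Average attention each visual token receives, over heads and query positions.
    scores = layer_attn.mean(dim=1).mean(dim=1)[:, visual_idx]      # [B, n_visual]
    n_keep = max(1, int(keep_ratio * visual_idx.numel()))
    top = scores.topk(n_keep, dim=-1).indices                       # [B, n_keep]

    batch_size, seq_len, _ = hidden_states.shape
    keep_mask = torch.ones(batch_size, seq_len, dtype=torch.bool, device=device)
    keep_mask[:, visual_idx] = False                                # drop all visual tokens...
    rows = torch.arange(batch_size, device=device).unsqueeze(1).expand_as(top)
    keep_mask[rows, visual_idx[top]] = True                         # ...then re-admit the top-k

    # Surviving tokens keep their original order; later layers see a shorter sequence.
    return hidden_states[keep_mask].view(batch_size, -1, hidden_states.size(-1))
```

Because the surviving hidden states keep their original order, the later layers run unchanged on a shorter sequence, which is where the FLOPs savings come from.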

Theoretical and Practical Implications

From a practical standpoint, FastV opens up new avenues for deploying state-of-the-art LVLMs in scenarios where computational resources are limited. The solution’s scalability and flexibility, demonstrated by its capacity to adjust the trade-off between efficiency and performance based on specific needs, present a significant step forward in making advanced vision-language understanding models more accessible.
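
To see how that trade-off translates into compute, the sketch below estimates the FLOPs saved as a function of the pruning layer and the fraction of visual tokens kept. Both the per-layer cost model and the LLaVA-1.5-13B-like dimensions are illustrative assumptions rather than the paper's exact accounting.

```python
# Back-of-the-envelope estimate of the compute saved by pruning visual tokens
# after an early layer. The per-layer cost model (4*n*d^2 + 2*n^2*d + 2*n*d*m
# for attention + feed-forward) is a common transformer approximation, and the
# LLaVA-1.5-13B-like dimensions below are illustrative assumptions.

def layer_flops(n, d=5120, m=13824):
    """Approximate FLOPs of one decoder layer for a sequence of n tokens."""
    return 4 * n * d**2 + 2 * n**2 * d + 2 * n * d * m

def flops_reduction(n_text=64, n_visual=576, n_layers=40, k=2, keep_ratio=0.5):
    """Fraction of total prefill FLOPs saved when visual tokens are pruned after layer k."""
    full = n_text + n_visual
    pruned = n_text + int(keep_ratio * n_visual)
    baseline = n_layers * layer_flops(full)
    fastv = k * layer_flops(full) + (n_layers - k) * layer_flops(pruned)
    return 1 - fastv / baseline

print(f"Estimated FLOPs reduction: {flops_reduction():.1%}")
```

With these assumed numbers (576 visual tokens, half of them pruned after layer 2 of 40), the estimate lands in the same ballpark as the roughly 45% reduction reported for LLaVA-1.5-13B; the exact figure depends on prompt length and the chosen pruning parameters.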

Theoretically, FastV contributes to the ongoing discourse on how LVLMs process multimodal information. By uncovering the inefficiencies in attention mechanisms of LVLMs and addressing them through token pruning, FastV sheds light on the underlying dynamics of visual data processing within these models. This insight is not only crucial for improving model efficiency but also for enhancing our understanding of the cognitive processes LVLMs employ when integrating visual and textual information.

A Look into the Future

As the field of artificial intelligence continues to evolve towards more integrated multimodal systems, FastV positions itself as a pivotal contribution that aligns with the trajectory towards more efficient and scalable vision-language models. Future developments could explore the extension of FastV’s principles to other types of multimodal data beyond visual tokens, potentially opening new frontiers in the quest for computationally efficient AI models that do not sacrifice performance. Moreover, the adaptability of FastV suggests exciting possibilities for customizing models to specific operational constraints, heralding a new era of personalized AI systems that can deliver top-tier performance tailored to individual needs.

In conclusion, FastV marks a significant advancement in the optimization of LVLMs, offering a promising path towards overcoming the computational bottlenecks that have hindered the wider deployment of these models. By striking a delicate balance between efficiency and performance, FastV not only enhances the practical applicability of LVLMs but also provides a novel perspective on their operational dynamics, laying the groundwork for future innovations in the field of artificial intelligence.

Authors (7)
  1. Liang Chen
  2. Haozhe Zhao
  3. Tianyu Liu
  4. Shuai Bai
  5. Junyang Lin
  6. Chang Zhou
  7. Baobao Chang