Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models (2410.19732v2)

Published 25 Oct 2024 in cs.CL and cs.CV

Abstract: Large Vision-Language Models (LVLMs) excel in cross-modal tasks but experience performance declines in long-context reasoning due to overreliance on textual information and reduced visual dependency. In this study, we empirically analyze LVLMs in long-context reasoning, revealing that increased context length leads to a higher dependence on language at the expense of visual dependency. To address this issue, we propose a novel training-free context pruning method that selectively removes less critical textual information. Our approach enhances visual dependency and reduces textual noise, thereby improving LVLM performance in long-context reasoning. We validate our method by constructing a long-context dataset, demonstrating its effectiveness across various LVLMs. Moreover, further analysis confirms the robustness of different token pruning strategies and preliminarily explores scaling laws between pruning rates and context length.
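The abstract describes a training-free context pruning step that removes less critical textual tokens so the model leans more on visual evidence. The sketch below is a rough illustration only: it scores text tokens by the attention they receive from visual tokens and drops the lowest-scoring fraction. The scoring criterion, the function name prune_text_context, and the prune_rate parameter are assumptions made for this example, not details taken from the paper.

import numpy as np

def prune_text_context(text_tokens, attn_visual_to_text, prune_rate=0.3):
    """Drop the least visually-attended text tokens (illustrative sketch).

    text_tokens:          list of str, length T
    attn_visual_to_text:  array of shape (V, T), attention weights from each
                          visual token to each text token
    prune_rate:           fraction of text tokens to remove
    """
    # Aggregate the attention each text token receives from all visual tokens.
    scores = attn_visual_to_text.sum(axis=0)            # shape (T,)
    keep_count = max(1, int(len(text_tokens) * (1 - prune_rate)))
    # Keep the highest-scoring tokens, preserving their original order.
    keep_idx = np.sort(np.argsort(scores)[-keep_count:])
    return [text_tokens[i] for i in keep_idx]

# Toy usage: 4 visual tokens attending over 6 text tokens.
rng = np.random.default_rng(0)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
attn = rng.random((4, len(tokens)))
print(prune_text_context(tokens, attn, prune_rate=0.5))

In practice, a higher prune_rate would be paired with longer contexts, which is the pruning-rate/context-length trade-off the abstract refers to as a preliminary scaling law.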

References (25)
  1. Training-free long-context scaling of large language models. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net.
  2. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966.
  3. Vip-llava: Making large multimodal models understand arbitrary visual prompts. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 12914–12923. IEEE.
  4. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. CoRR, abs/2310.09478.
  5. Sharegpt4v: Improving large multi-modal models with better captions. CoRR, abs/2311.12793.
  6. Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. CoRR, abs/2406.07476.
  7. Instructblip: Towards general-purpose vision-language models with instruction tuning. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
  8. Longrope: Extending LLM context window beyond 2 million tokens. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net.
  9. Contextual position encoding: Learning to count what’s important. CoRR, abs/2405.18719.
  10. Llmlingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 13358–13376. Association for Computational Linguistics.
  11. LLM maybe longlm: Self-extend LLM context window without tuning. CoRR, abs/2401.01325.
  12. Llava-next: Improved reasoning, ocr, and world knowledge.
  13. Visual instruction tuning. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
  14. Lost in the middle: How language models use long contexts. Trans. Assoc. Comput. Linguistics, 12:157–173.
  15. Deepseek-vl: Towards real-world vision-language understanding. CoRR, abs/2403.05525.
  16. OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.
  17. Look, compare, decide: Alleviating hallucination in large vision-language models via multi-view multi-path reasoning. CoRR, abs/2408.17150.
  18. Parallel context windows for large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 6383–6402. Association for Computational Linguistics.
  19. Surf: Teaching large vision-language models to selectively utilize retrieved information. arXiv preprint arXiv:2409.14083.
  20. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
  21. Llava-next: A strong zero-shot video understanding model.
  22. SVIT: scaling up visual instruction tuning. CoRR, abs/2307.04087.
  23. Thread of thought unraveling chaotic contexts. CoRR, abs/2311.08734.
  24. Visual in-context learning for large vision-language models. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 15890–15902. Association for Computational Linguistics.
  25. Minigpt-4: Enhancing vision-language understanding with advanced large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.