Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models (2410.19732v2)
Abstract: Large Vision-Language Models (LVLMs) excel at cross-modal tasks but suffer performance declines in long-context reasoning due to an overreliance on textual information and reduced visual dependency. In this study, we empirically analyze LVLMs in long-context reasoning, revealing that increased context length shifts the models toward greater dependence on language at the expense of visual grounding. To address this issue, we propose a novel training-free context pruning method that selectively removes less critical textual information. Our approach enhances visual dependency and reduces textual noise, thereby improving LVLM performance in long-context reasoning. We validate our method by constructing a long-context dataset, demonstrating its effectiveness across various LVLMs. Further analysis confirms the robustness of different token pruning strategies and preliminarily explores scaling laws between pruning rates and context length.
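To make the core idea concrete, below is a minimal sketch of what training-free text-token pruning for an LVLM could look like. The function name `prune_text_tokens`, the cosine-similarity scoring against a pooled visual representation, and the `keep_ratio` parameter are illustrative assumptions for exposition, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def prune_text_tokens(text_hidden: torch.Tensor,
                      image_hidden: torch.Tensor,
                      keep_ratio: float = 0.5):
    """Hypothetical training-free context pruning sketch.

    Scores each textual context token by its similarity to the mean
    visual representation and keeps only the top `keep_ratio` fraction,
    so the retained text stays grounded in the image. The actual
    criterion used in the paper may differ.

    text_hidden:  (T, d) hidden states of the textual context tokens
    image_hidden: (V, d) hidden states of the visual tokens
    """
    # Summarize the image as a single query vector (assumption: mean pooling).
    visual_query = image_hidden.mean(dim=0)                  # (d,)

    # Relevance of each text token to the visual content.
    scores = F.cosine_similarity(
        text_hidden, visual_query.unsqueeze(0), dim=-1
    )                                                        # (T,)

    # Keep the highest-scoring tokens, preserving their original order.
    k = max(1, int(keep_ratio * text_hidden.size(0)))
    keep_idx = scores.topk(k).indices.sort().values
    return text_hidden[keep_idx], keep_idx

# Usage sketch with random stand-in features:
# text = torch.randn(2048, 4096)   # long textual context
# image = torch.randn(576, 4096)   # visual tokens
# pruned, idx = prune_text_tokens(text, image, keep_ratio=0.25)
```

Because the method is training-free, such a pruning step would sit between encoding and generation without touching model weights; the keep ratio is the knob whose interaction with context length the paper's scaling-law analysis examines.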