Enhancing Perception Capabilities of Multimodal LLMs with Training-Free Fusion (2412.01289v2)

Published 2 Dec 2024 in cs.CV and cs.AI

Abstract: Multimodal LLMs (MLLMs) equip LLMs with visual capabilities by aligning vision encoders with LLMs. Existing methods to enhance the visual perception of MLLMs often involve designing more powerful vision encoders, which requires exploring a vast design space and re-aligning each potential encoder with the LLM, resulting in prohibitively high training costs. In this paper, we introduce VisionFuse, a novel integration framework that efficiently utilizes multiple vision encoders from off-the-shelf MLLMs to enhance visual perception without requiring additional training. Our approach is motivated by the observation that different MLLMs tend to focus on distinct regions given the same query and image. Moreover, we find that the feature distributions of vision encoders within an MLLM family, a group of MLLMs sharing the same pretrained LLM, are highly aligned. Building on these insights, VisionFuse enriches the visual context by concatenating the tokens generated by the vision encoders of selected MLLMs within a family. By merging the parameters of LLMs from these MLLMs, VisionFuse allows a single LLM to align with various vision encoders, significantly reducing deployment overhead. We conduct comprehensive evaluations across multiple multimodal benchmarks using various MLLM combinations, demonstrating substantial improvements in multimodal tasks. Notably, when integrating MiniGemini-8B and SLIME-8B, VisionFuse achieves an average performance increase of over 4%.
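
Below is a minimal sketch of the two operations described in the abstract: concatenating vision tokens from two encoders and merging the parameters of two LLMs from the same family. It assumes both encoders' outputs are already projected into the shared LLM embedding space and uses simple linear interpolation for the merge; the paper's actual merging strategy and projection details may differ. All tensor shapes and names are illustrative.

```python
import torch

def merge_llm_parameters(state_dict_a, state_dict_b, alpha=0.5):
    """Interpolate two LLMs that share the same pretrained backbone.

    Simple weight averaging is assumed here for illustration; the paper
    may use a different parameter-merging scheme.
    """
    return {
        name: alpha * state_dict_a[name] + (1.0 - alpha) * state_dict_b[name]
        for name in state_dict_a
    }

def fuse_visual_tokens(tokens_a, tokens_b):
    """Concatenate vision tokens from two encoders along the sequence axis.

    tokens_a: (batch, n_a, hidden) -- e.g. from MLLM A's vision encoder
    tokens_b: (batch, n_b, hidden) -- e.g. from MLLM B's vision encoder
    """
    return torch.cat([tokens_a, tokens_b], dim=1)

# Usage sketch with dummy tensors standing in for real encoder outputs.
tokens_a = torch.randn(1, 576, 4096)
tokens_b = torch.randn(1, 256, 4096)
visual_context = fuse_visual_tokens(tokens_a, tokens_b)   # (1, 832, 4096)

# Toy parameter merge on placeholder state dicts.
llm_a = {"w": torch.ones(2, 2)}
llm_b = {"w": torch.zeros(2, 2)}
merged = merge_llm_parameters(llm_a, llm_b, alpha=0.5)     # {"w": 0.5 * ones}
```

The enriched visual context would then be fed, together with the text tokens, to the single merged LLM, so only one language model needs to be deployed for all participating vision encoders.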

Authors (6)
  1. Zhuokun Chen (2 papers)
  2. Jinwu Hu (8 papers)
  3. Zeshuai Deng (5 papers)
  4. Yufeng Wang (43 papers)
  5. Bohan Zhuang (79 papers)
  6. Mingkui Tan (124 papers)