MouSi: Poly-Visual-Expert Vision-Language Models (2401.17221v1)

Published 30 Jan 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Current large vision-language models (VLMs) often encounter challenges such as the insufficient capability of a single visual component and excessively long visual token sequences. These issues can limit a model's effectiveness in accurately interpreting complex visual information and overly long contextual information. Addressing these challenges is crucial for enhancing the performance and applicability of VLMs. This paper proposes an ensemble-of-experts technique that synergizes the capabilities of individual visual encoders, including those skilled in image-text matching, OCR, and image segmentation. The technique introduces a fusion network to unify the processing of outputs from the different visual experts, while bridging the gap between image encoders and pre-trained LLMs. In addition, we explore different positional encoding schemes to alleviate the waste of positional encodings caused by lengthy image feature sequences, effectively addressing position overflow and length limitations. For instance, in our implementation this technique significantly reduces the positional occupancy of models like SAM, from a substantial 4096 down to a more manageable 64, or even to 1. Experimental results demonstrate that VLMs with multiple experts consistently outperform those with isolated visual encoders and show a significant performance boost as more experts are integrated. We have open-sourced the training code used in this report; all of these resources can be found on our project website.

Introduction

Vision-language models (VLMs) have made notable advances, enabling machines to process and interpret complex visual and textual data. However, these multimodal systems often face limitations, notably the suboptimal performance of their visual components and the difficulty of handling lengthy visual token sequences. MouSi addresses these limitations with an ensemble-of-experts approach that builds poly-visual-expert VLMs, drawing on the specialized skills of various visual encoders to enrich the model's visual understanding.

Architecture and Methodology

The paper begins by evaluating six pre-trained visual experts (CLIP, DINOv2, LayoutLMv3, ConvNeXt, SAM, and MAE), each with distinct capabilities ranging from image-text matching to object segmentation. It then devises an integration technique that uses multi-expert fusion networks to merge the individual strengths of these encoders. The researchers focus on two fusion methods, MLP projection and Q-Former, and investigate the benefits of each for multi-channel signal transmission, as sketched below for the MLP-projection variant.
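To make the MLP-projection variant concrete, the following is a minimal PyTorch sketch of multi-expert fusion. The class name, two-layer projector design, and all dimensions are illustrative assumptions, not the paper's actual implementation (which is available in its open-sourced training code): each expert's token sequence is projected into the LLM embedding space by its own MLP, and the projected sequences are concatenated into a single visual prefix.

```python
import torch
import torch.nn as nn


class PolyExpertMLPFusion(nn.Module):
    """Toy multi-expert fusion via MLP projection (illustrative sketch only).

    Each visual expert emits a token sequence with its own hidden size; a
    per-expert MLP maps every sequence into the LLM embedding space, and the
    projected sequences are concatenated along the token axis.
    """

    def __init__(self, expert_dims, llm_dim=4096):
        super().__init__()
        # One small projector per expert (hypothetical 2-layer design).
        self.projectors = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )
            for d in expert_dims
        )

    def forward(self, expert_features):
        # expert_features[i]: (batch, n_tokens_i, expert_dims[i])
        projected = [proj(f) for proj, f in zip(self.projectors, expert_features)]
        # Concatenate all experts' tokens into one visual prefix for the LLM.
        return torch.cat(projected, dim=1)


# Example with three hypothetical experts (CLIP-, DINOv2-, and SAM-like outputs).
fusion = PolyExpertMLPFusion(expert_dims=[1024, 1536, 256], llm_dim=4096)
feats = [torch.randn(2, 576, 1024), torch.randn(2, 256, 1536), torch.randn(2, 64, 256)]
visual_prefix = fusion(feats)  # shape: (2, 576 + 256 + 64, 4096)
print(visual_prefix.shape)
```

A Q-Former-style fusion would instead use a fixed set of learnable queries cross-attending to the expert features, trading the simple concatenation above for a fixed-length output.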

To further improve efficiency, the paper tackles excessive vision-token generation with two strategies: a multi-patch-one-token projection that compresses visual information, and alternative positional encoding schemes that sharply reduce the number of position embeddings consumed by visual tokens, an important saving given the limited context positions available to VLMs. For a SAM-like encoder, for example, positional occupancy drops from 4096 positions to 64, or even to a single position.
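As a rough illustration of the multi-patch-one-token idea, the sketch below (again with hypothetical names and dimensions, not the paper's code) concatenates groups of adjacent patch features and projects each group to a single LLM-space token, so a 4096-patch feature map occupies only 64 context positions, or 1 if the group size equals the patch count.

```python
import torch
import torch.nn as nn


class MultiPatchOneToken(nn.Module):
    """Toy multi-patch-one-token projection (illustrative sketch only).

    Groups of `patches_per_token` adjacent patch features are concatenated and
    projected into a single LLM-space token, so a 4096-patch feature map (as a
    SAM-like encoder might produce) occupies 4096 / patches_per_token positions
    in the language model's context instead of 4096.
    """

    def __init__(self, patch_dim, llm_dim=4096, patches_per_token=64):
        super().__init__()
        self.patches_per_token = patches_per_token
        self.proj = nn.Linear(patch_dim * patches_per_token, llm_dim)

    def forward(self, patch_features):
        # patch_features: (batch, n_patches, patch_dim); n_patches must be
        # divisible by patches_per_token.
        b, n, d = patch_features.shape
        grouped = patch_features.reshape(
            b, n // self.patches_per_token, self.patches_per_token * d
        )
        return self.proj(grouped)  # (batch, n // patches_per_token, llm_dim)


# 4096 SAM-like patch tokens -> 64 fused tokens (or 1, with patches_per_token=4096).
compress = MultiPatchOneToken(patch_dim=256, llm_dim=4096, patches_per_token=64)
tokens = compress(torch.randn(2, 4096, 256))
print(tokens.shape)  # torch.Size([2, 64, 4096])
```

The positional encoding schemes the paper explores pursue the same goal from the other direction, for example by letting many visual tokens share position indices rather than assigning one index per token.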

Experimental Results

The empirical results underscore the effectiveness of the poly-visual-expert approach. Across an extensive set of benchmarks, VLMs with multiple experts consistently outperform those with a single visual encoder, and their multimodal capabilities improve further as more experts are integrated.

Contributions and Conclusion

The paper's contributions include the integration of diverse visual encoders into a cohesive model that better handles multimodal tasks, the introduction of efficient methods for encoding visual information, and the empirical validation of the model's superiority over existing models that rely on a single visual channel.

The evolutionary design and merging strategies take inspiration from biological visual systems, thus bringing VLMs a step closer to the complex and nuanced human-like understanding of multimodal information. The researchers believe that the potential of poly-visual-expert VLMs remains untapped, and with further data enhancement, these models can exhibit even greater performance, thereby consolidating the poly-visual-expert design as a promising direction in the development of advanced VLMs.

Authors (24)
  1. Xiaoran Fan
  2. Tao Ji
  3. Changhao Jiang
  4. Shuo Li
  5. Senjie Jin
  6. Sirui Song
  7. Junke Wang
  8. Boyang Hong
  9. Lu Chen
  10. Guodong Zheng
  11. Ming Zhang
  12. Caishuang Huang
  13. Rui Zheng
  14. Zhiheng Xi
  15. Yuhao Zhou
  16. Shihan Dou
  17. Junjie Ye
  18. Hang Yan
  19. Tao Gui
  20. Qi Zhang