VCoder: Versatile Vision Encoders for Multimodal Large Language Models (2312.14233v1)

Published 21 Dec 2023 in cs.CV

Abstract: Humans possess the remarkable skill of Visual Perception, the ability to see and understand the seen, helping them make sense of the visual world and, in turn, reason. Multimodal LLMs (MLLM) have recently achieved impressive performance on vision-language tasks ranging from visual question-answering and image captioning to visual reasoning and image generation. However, when prompted to identify or count (perceive) the entities in a given image, existing MLLM systems fail. Working towards developing an accurate MLLM system for perception and reasoning, we propose using Versatile vision enCoders (VCoder) as perception eyes for Multimodal LLMs. We feed the VCoder with perception modalities such as segmentation or depth maps, improving the MLLM's perception abilities. Secondly, we leverage the images from COCO and outputs from off-the-shelf vision perception models to create our COCO Segmentation Text (COST) dataset for training and evaluating MLLMs on the object perception task. Thirdly, we introduce metrics to assess the object perception abilities in MLLMs on our COST dataset. Lastly, we provide extensive experimental evidence proving the VCoder's improved object-level perception skills over existing Multimodal LLMs, including GPT-4V. We open-source our dataset, code, and models to promote research; our code is available at https://github.com/SHI-Labs/VCoder

An Evaluation of "VCoder: Versatile Vision Encoders for Multimodal LLMs"

The paper entitled "VCoder: Versatile Vision Encoders for Multimodal LLMs" addresses limitations in present-day Multimodal LLMs (MLLMs) regarding their object perception capabilities. The research identifies a critical gap: while these models excel at visual reasoning and question-answering, they falter on simple yet essential tasks such as object identification and counting. This discrepancy echoes Moravec's Paradox, the observation that machines often handle complex reasoning more readily than the basic sensory tasks humans find effortless.

The principal contribution of this work is the proposal of Versatile vision enCoders (VCoder), which act as auxiliary modules to enhance the perception abilities of MLLMs. VCoder encodes perception modalities such as segmentation and depth maps into embeddings that sharpen the model's understanding of visual inputs. By interfacing a VCoder module with an existing MLLM, specifically LLaVA-1.5, the paper demonstrates significant gains on object-level perception tasks without degrading the model's reasoning performance.
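To make the adapter idea concrete, the following is a minimal PyTorch-style sketch of how an auxiliary perception encoder could feed extra tokens into a LLaVA-style MLLM. The module names, dimensions, and token ordering here are illustrative assumptions, not the authors' exact implementation; the released code should be consulted for the real architecture.

```python
# Sketch of the VCoder idea: an extra vision encoder consumes a perception map
# (e.g., a segmentation map rendered as an image), and its projected tokens are
# concatenated with the regular image tokens before being fed to the LLM.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class PerceptionAdapter(nn.Module):
    def __init__(self, vision_encoder: nn.Module, vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder          # frozen encoder, e.g., a CLIP ViT
        self.proj = nn.Sequential(                    # small trainable projector
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, perception_map: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                         # keep the encoder frozen
            feats = self.vision_encoder(perception_map)   # assumed shape (B, N, vision_dim)
        return self.proj(feats)                       # (B, N, llm_dim)

def build_llm_inputs(perception_tokens, image_tokens, text_embeds):
    # Prepend perception and image tokens to the text embeddings, mirroring how
    # LLaVA-style models splice visual tokens into the prompt sequence.
    return torch.cat([perception_tokens, image_tokens, text_embeds], dim=1)
```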

A novel dataset, the COCO Segmentation Text (COST) dataset, is also introduced for training and evaluating MLLMs on object perception tasks. It pairs COCO images and the outputs of off-the-shelf perception models with questions about the objects in each image, targeting precisely the areas where MLLMs exhibit weaknesses. By providing both the perception-modality inputs and a suite of perception-focused queries, the COST dataset also supports more robust evaluation of object perception, as sketched below.
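As a rough illustration of how such a dataset can be assembled, the snippet below turns the label list produced by an off-the-shelf segmentation model into a question-answer pair. The prompt template, answer wording, and helper name are assumptions for illustration; the released COST data may phrase and structure its samples differently.

```python
# Hedged sketch: build a COST-style question-answer sample from the instance
# labels returned by a panoptic/instance segmentation model.
from collections import Counter

def make_cost_sample(image_id: str, detected_labels: list[str]) -> dict:
    """detected_labels holds one entry per segmented instance, e.g. from a panoptic model."""
    counts = Counter(detected_labels)
    answer = ", ".join(
        f"{n} {label}{'s' if n > 1 else ''}" for label, n in sorted(counts.items())
    )
    return {
        "image_id": image_id,
        "question": "What objects can be seen in the image?",
        "answer": f"The image contains {answer}.",
    }

# Example: a segmentation model that found 2 people, 1 dog, and 1 frisbee.
sample = make_cost_sample("000000001234", ["person", "person", "dog", "frisbee"])
```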

To evaluate object perception quantitatively, the paper introduces metrics such as the count score (CS) and hallucination score (HS). These metrics assess how accurately MLLMs identify and count objects, penalizing hallucinated objects as well as miscounted ones.
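The sketch below illustrates the spirit of such metrics: reward accurate per-class counts and penalize objects that appear in the prediction but not in the ground truth. The formulas and function names here are assumptions and are not guaranteed to match the paper's exact definitions of CS and HS.

```python
# Illustrative count-score / hallucination-score style metrics for one image,
# computed from dictionaries of {object label: count}. These are a sketch of
# the idea only, not the paper's exact formulas.
def count_score(pred: dict[str, int], gt: dict[str, int]) -> float:
    if not gt:
        return 1.0 if not pred else 0.0
    per_class = [
        max(0.0, 1.0 - abs(pred.get(label, 0) - n) / n) for label, n in gt.items()
    ]
    return sum(per_class) / len(per_class)

def hallucination_score(pred: dict[str, int], gt: dict[str, int]) -> float:
    if not pred:
        return 0.0
    hallucinated = sum(n for label, n in pred.items() if label not in gt)
    return hallucinated / sum(pred.values())  # fraction of predicted objects absent from GT

cs = count_score({"person": 2, "dog": 1}, {"person": 2, "dog": 1, "frisbee": 1})  # ~0.67
hs = hallucination_score({"person": 2, "cat": 1}, {"person": 2, "dog": 1})        # ~0.33
```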

The empirical results show that models enhanced with VCoder surpass existing MLLMs, including MiniGPT-4, InstructBLIP, and LLaVA-1.5, on the COST dataset across semantic, instance, and panoptic segmentation settings. Particular emphasis is placed on the qualitative improvements observed when the segmentation map serves as the control input. Moreover, the VCoder-augmented framework demonstrates competitive performance on object order perception tasks, pointing to a path forward for integrating additional sensory modalities.

The implications of this research are manifold. Practically, the work highlights the potential of combining multiple perception modalities to strengthen perceptual acuity in machine learning models, which can be significantly beneficial in real-world applications demanding high precision in complex visual environments. Theoretically, the paper sheds light on the limitations of current vision-language datasets and suggests a need for more comprehensive data collection that covers a wider array of objects and a more diverse vocabulary for robust perception training.

Future research based on these findings could branch in several directions, such as integrating VCoder with modalities beyond segmentation and depth, including motion or audio, for multimodal reasoning. Additionally, scaling up datasets like COST to cover more object classes and more cluttered scenes than current constraints allow could help further improve these perception models.

In conclusion, this paper provides a critical evaluation toolset and a methodology for advancing the field of MLLMs by enhancing basic sensory perception capabilities, thus aligning with the ultimate goal of creating systems that mirror human-like perception and understanding.

Authors (3)
  1. Jitesh Jain (11 papers)
  2. Jianwei Yang (93 papers)
  3. Humphrey Shi (97 papers)
Citations (17)