
V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs (2312.14135v2)

Published 21 Dec 2023 in cs.CV

Abstract: When we look around and perform complex tasks, how we see and selectively process what we see is crucial. However, the lack of this visual search mechanism in current multimodal LLMs (MLLMs) hinders their ability to focus on important visual details, especially when handling high-resolution and visually crowded images. To address this, we introduce V*, an LLM-guided visual search mechanism that employs the world knowledge in LLMs for efficient visual querying. When combined with an MLLM, this mechanism enhances collaborative reasoning, contextual understanding, and precise targeting of specific visual elements. This integration results in a new MLLM meta-architecture, named Show, sEArch, and TelL (SEAL). We further create V*Bench, a benchmark specifically designed to evaluate MLLMs in their ability to process high-resolution images and focus on visual details. Our study highlights the necessity of incorporating visual search capabilities into multimodal systems. The code is available https://github.com/penghao-wu/vstar.


Summary

  • The paper introduces the SEAL framework that integrates a guided visual search (V*) mechanism to accurately locate missing visual details in high-resolution images.
  • It presents a novel, hierarchical search algorithm leveraging common-sense cues to improve visual grounding in complex multimodal tasks.
  • Experimental results on V*Bench show that the approach outperforms existing models, reaching an overall accuracy of 75.39% on the benchmark.

V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs

Penghao Wu and Saining Xie introduce V* in their paper, "V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs," focusing on enhancing Multimodal LLMs (MLLMs) through a guided visual search mechanism. MLLMs, which integrate visual and textual information to perform complex reasoning tasks, often struggle with precise visual grounding, especially in high-resolution and visually dense images. This paper proposes the SEAL (Show, sEArch, and TelL) meta-architecture to bridge this gap, leveraging a visual search capability akin to human cognition.

Overview

Current MLLMs utilize pre-trained vision encoders, like CLIP, which face significant limitations when processing high-resolution images, often missing critical visual details. The paper argues that the lack of a visual search mechanism impedes MLLMs' ability to accurately localize and recognize key objects within an image, thereby affecting their performance in detailed visual tasks.
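
To make this limitation concrete, here is a small back-of-the-envelope illustration (the image and object sizes are assumptions chosen for the example, not figures from the paper) of how much of a small object survives when an entire high-resolution image is resized to a fixed encoder input resolution such as CLIP's 224×224:

```python
# Illustrative only: approximate how many pixels a small object retains after
# the full image is resized to a fixed-resolution encoder input (e.g., 224x224).

def effective_object_size(image_hw, object_hw, encoder_input=224):
    """Object size (in pixels) after uniformly downscaling the whole image
    so that its longer side matches the encoder's input resolution."""
    img_h, img_w = image_hw
    obj_h, obj_w = object_hw
    scale = encoder_input / max(img_h, img_w)
    return obj_h * scale, obj_w * scale

# A 60x60-pixel object in a 4K frame (2160x3840) shrinks to roughly 3.5x3.5
# pixels, smaller than a single 14x14 ViT patch, so its details are lost.
print(effective_object_size((2160, 3840), (60, 60)))
```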

The SEAL Framework

The SEAL framework incorporates a guided visual search mechanism, named V*, which operates in conjunction with an MLLM. This integration allows the system to proactively search for and incorporate missing visual details into a Visual Working Memory (VWM). The SEAL framework consists of two main components (a sketch of their interaction follows the list):

  1. VQA LLM: This model evaluates whether the initial visual features from the encoder suffice to answer a question. If not, it explicitly lists the missing details and initializes a VWM to store target objects and their coordinates.
  2. V* Guided Visual Search: Using an LLM enriched with common-sense knowledge, this component searches for the specified targets in an image. It works by generating contextual cues and localizing the targets through a hierarchical search process.
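
To make the division of labor between these two components concrete, the following is a minimal sketch of the control flow described above; the class and method names (VisualWorkingMemory, answer_or_list_missing, answer_with_memory, search) are hypothetical stand-ins rather than the authors' actual API:

```python
from dataclasses import dataclass, field

@dataclass
class VisualWorkingMemory:
    """Hypothetical container mirroring the VWM described above."""
    question: str
    global_image: object
    targets: list = field(default_factory=list)      # cropped target patches
    coordinates: list = field(default_factory=list)  # their bounding boxes

def seal_answer(vqa_llm, vstar, image, question):
    # 1. The VQA LLM first tries to answer from the global image features
    #    and, if it cannot, lists the visual details it is missing.
    answer, missing = vqa_llm.answer_or_list_missing(image, question)
    if not missing:
        return answer

    # 2. A visual working memory is initialized and the V* search model is
    #    asked to localize each missing target in the image.
    vwm = VisualWorkingMemory(question=question, global_image=image)
    for target_name in missing:
        patch, box = vstar.search(image, target_name)
        vwm.targets.append(patch)
        vwm.coordinates.append(box)

    # 3. The VQA LLM answers again, conditioned on the retrieved details.
    return vqa_llm.answer_with_memory(vwm)
```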

V* Algorithm

The V* visual search model is designed to mimic human visual search processes, employing both top-down feature guidance and contextual scene guidance. The algorithm selectively searches high-resolution images to locate objects specified by the VQA LLM, thereby enhancing visual grounding accuracy. The steps, sketched in pseudocode after the list, include:

  • Attempting to locate the target directly using the entire image.
  • If unsuccessful, generating search cue heatmaps for efficient patch-based search.
  • Dividing the image into patches guided by context and recursively searching these patches.
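
The following pseudocode sketches this recursive strategy under simplifying assumptions; the helper functions (detect, cue_heatmap, split_into_patches, score_patch) are hypothetical placeholders, and the actual V* implementation in the authors' repository differs in its details:

```python
def vstar_search(image, target, min_size=224, threshold=0.5):
    # Step 1: try to detect the target directly in the current view.
    box, confidence = detect(image, target)
    if confidence >= threshold:
        return box

    # Stop recursing once the view is too small to subdivide usefully.
    if min(image.height, image.width) <= min_size:
        return None

    # Step 2: produce an LLM-guided search-cue heatmap indicating where the
    # target is likely to be within the current view.
    heatmap = cue_heatmap(image, target)

    # Step 3: split the view into patches and visit them in order of
    # decreasing cue score, recursing into each until the target is found.
    patches = split_into_patches(image)
    for patch in sorted(patches, key=lambda p: score_patch(heatmap, p), reverse=True):
        box = vstar_search(patch.crop(image), target, min_size, threshold)
        if box is not None:
            return patch.to_global(box)  # map local box back to full-image coordinates
    return None
```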

Contributions

The paper makes three key contributions:

  1. SEAL Framework: Establishes a new meta-architecture to integrate active visual search into MLLMs, improving their capability to handle vision-intensive tasks.
  2. V* Algorithm: Introduces a novel visual search algorithm leveraging the LLM's common-sense knowledge for efficient and informed searches.
  3. V*Bench: Presents a benchmark for evaluating MLLMs on tasks requiring detailed visual grounding in high-resolution images.

Results

The paper demonstrates the effectiveness of the SEAL framework through extensive experiments. SEAL outperforms existing MLLMs, including GPT-4V, on V*Bench, showcasing a significant improvement in handling visually dense and high-resolution images. The detailed numerical results highlight the efficacy of the V* guided visual search mechanism, with SEAL achieving an overall accuracy of 75.39% on the benchmark.

Implications

The introduction of a guided visual search mechanism in MLLMs has profound implications for both practical applications and theoretical developments in AI. Practically, it enhances the precision and reliability of MLLMs in complex visual tasks, making them more suitable for real-world applications like medical imaging and autonomous driving. Theoretically, it underscores the importance of active visual search capabilities in multimodal systems, paving the way for more sophisticated AI systems that closely mimic human cognitive processes.

Future Directions

Future research could explore extending the V* visual search capability to document analysis, videos, and open-world scenarios. Additionally, integrating more efficient computational models, such as convolution-based architectures, could further optimize the search process. The development of adaptive search strategies that dynamically adjust based on the complexity and type of visual input could also be a promising direction.

In conclusion, the V* guided visual search mechanism marks a significant step forward in the evolution of multimodal LLMs, addressing a critical bottleneck in visual information processing. The SEAL framework not only improves the precision of visual grounding but also sets the stage for future advancements in AI-driven visual reasoning and understanding.
