V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs (2312.14135v2)
Abstract: When we look around and perform complex tasks, how we see and selectively process what we see is crucial. However, the lack of this visual search mechanism in current multimodal LLMs (MLLMs) hinders their ability to focus on important visual details, especially when handling high-resolution and visually crowded images. To address this, we introduce V*, an LLM-guided visual search mechanism that employs the world knowledge in LLMs for efficient visual querying. When combined with an MLLM, this mechanism enhances collaborative reasoning, contextual understanding, and precise targeting of specific visual elements. This integration results in a new MLLM meta-architecture, named Show, sEArch, and TelL (SEAL). We further create V*Bench, a benchmark specifically designed to evaluate MLLMs in their ability to process high-resolution images and focus on visual details. Our study highlights the necessity of incorporating visual search capabilities into multimodal systems. The code is available at https://github.com/penghao-wu/vstar.
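To make the guided-search idea concrete, below is a minimal sketch of how such a loop could be organized: the MLLM is queried on the full image first, and only when it cannot confidently localize the target does the search recursively zoom into sub-regions, visiting the most promising ones first. The `locate` callable, the quadrant split, and the confidence threshold are all illustrative assumptions, not the paper's actual algorithm or API.

```python
# Hypothetical sketch of an LLM-guided visual search loop (not the SEAL
# implementation). The MLLM is abstracted as a single callable:
#   locate(region, target) -> (confidence, bounding box within that region)
# The search recursively narrows the region until the target is localized
# with sufficient confidence or the region becomes too small to split.

from dataclasses import dataclass
from typing import Callable, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class SearchResult:
    found: bool
    region: Box        # region of the original image containing the target
    confidence: float


def guided_visual_search(
    image_size: Tuple[int, int],                      # (width, height)
    locate: Callable[[Box, str], Tuple[float, Box]],  # MLLM localization call
    target: str,                                      # detail named by the LLM
    threshold: float = 0.5,
    min_side: int = 224,
) -> SearchResult:
    """Recursively narrow the search region until `target` is localized."""

    def search(region: Box) -> SearchResult:
        conf, box = locate(region, target)
        if conf >= threshold:
            return SearchResult(True, box, conf)

        left, top, right, bottom = region
        if min(right - left, bottom - top) <= min_side:
            return SearchResult(False, region, conf)

        # Split the region into four quadrants and visit them in order of
        # the localizer's confidence, a stand-in for the contextual cue an
        # LLM could provide about where the target is likely to appear.
        cx, cy = (left + right) // 2, (top + bottom) // 2
        quadrants = [
            (left, top, cx, cy), (cx, top, right, cy),
            (left, cy, cx, bottom), (cx, cy, right, bottom),
        ]
        ranked = sorted(quadrants, key=lambda q: locate(q, target)[0], reverse=True)

        best = SearchResult(False, region, conf)
        for quad in ranked:
            result = search(quad)
            if result.found:
                return result
            if result.confidence > best.confidence:
                best = result
        return best

    width, height = image_size
    return search((0, 0, width, height))
```

In use, the returned region would be cropped from the high-resolution image and handed back to the MLLM together with the original question, so the answer is grounded in the localized detail rather than a downsampled global view.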