Can Large Vision-Language Models Correct Semantic Grounding Errors By Themselves? (2404.06510v2)
Abstract: Enhancing semantic grounding abilities in Vision-Language Models (VLMs) often involves collecting domain-specific training data, refining the network architectures, or modifying the training recipes. In this work, we take an orthogonal direction and explore whether VLMs can improve their semantic grounding by "receiving" feedback, without requiring in-domain data, fine-tuning, or modifications to the network architectures. We systematically analyze this hypothesis using a feedback mechanism based on a binary (correct/incorrect) signal. We find that, if prompted appropriately, VLMs can utilize feedback both in a single step and iteratively, showcasing the potential of feedback as an alternative technique for improving grounding in internet-scale VLMs. Furthermore, VLMs, like LLMs, struggle to self-correct errors out of the box. However, we find that this issue can be mitigated via a binary verification mechanism. Finally, we explore the potential and limitations of combining these findings and applying them iteratively to automatically enhance VLMs' grounding performance, showing that grounding accuracy consistently improves with automated feedback across all models in all settings investigated. Overall, our iterative framework improves semantic grounding in VLMs by more than 15 accuracy points under noise-free feedback and by up to 5 accuracy points under a simple automated binary verification mechanism. The project website is hosted at https://andrewliao11.github.io/vlms_feedback
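The abstract describes an iterative loop: the VLM proposes a grounding answer, a binary verifier accepts or rejects it, and rejected answers are fed back so the model can revise. Below is a minimal Python sketch of such a loop. The helpers `vlm_ground` and `vlm_verify`, the `max_rounds` parameter, and the prompt structure are illustrative assumptions, not the paper's actual interface; swapping `vlm_verify` between ground-truth labels and a yes/no self-query corresponds to the noise-free versus automated-feedback settings the abstract contrasts.

```python
def vlm_ground(image, query, rejected):
    """Hypothetical placeholder: prompt the VLM for a grounding answer,
    including previously rejected answers as binary feedback."""
    raise NotImplementedError("replace with a call to your VLM")


def vlm_verify(image, query, answer):
    """Hypothetical placeholder: binary verification of `answer`.
    Oracle (noise-free) feedback checks against ground truth; the
    automated variant asks the VLM itself a yes/no question and
    parses the reply."""
    raise NotImplementedError("replace with oracle or VLM self-verification")


def iterative_grounding(image, query, max_rounds=3):
    """Propose a grounding answer, then re-prompt with binary feedback
    whenever the verification step flags the current answer as wrong."""
    rejected = []  # answers flagged incorrect, surfaced on later rounds
    answer = vlm_ground(image, query, rejected)
    for _ in range(max_rounds):
        if vlm_verify(image, query, answer):
            break  # verified correct; stop iterating
        rejected.append(answer)
        answer = vlm_ground(image, query, rejected)
    return answer
```

With an oracle verifier this sketch mirrors the noise-free setting (reported gains of 15+ accuracy points); replacing the oracle with the model's own yes/no verdict mirrors the automated setting (up to 5 points), at the cost of verification errors ending or prolonging the loop incorrectly.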