Evaluation and Enhancement of Semantic Grounding in Large Vision-Language Models (2309.04041v2)
Abstract: Large Vision-Language Models (LVLMs) offer remarkable benefits for a variety of vision-language tasks. However, a challenge hindering their application in real-world scenarios, particularly with regard to safety, robustness, and reliability, is their constrained semantic grounding ability: the ability to connect language to the physical-world entities or concepts referenced in images. A comprehensive study assessing the semantic grounding ability of widely used LVLMs is therefore needed, yet investigation in this direction remains insufficient. Our work bridges this gap by designing a pipeline for generating large-scale evaluation datasets covering fine-grained semantic information, such as color, number, and material, along with a thorough assessment of seven popular LVLMs' semantic grounding ability. Results highlight prevalent misgrounding across various aspects and degrees. To address this issue, we propose a data-centric enhancement method that improves LVLMs' semantic grounding ability through multimodal instruction tuning on fine-grained conversations. Experiments on the enhanced LVLMs demonstrate notable improvements in addressing misgrounding issues.
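To make the described evaluation pipeline and enhancement data concrete, the sketch below illustrates how fine-grained attribute probes (color, number, material) might be built from object-level image annotations and then reused as instruction-tuning conversations. This is a minimal, hypothetical illustration, not the paper's actual pipeline; the annotation fields, distractor pools, and conversation format are assumptions chosen for readability.

```python
# Minimal sketch (hypothetical): build fine-grained multiple-choice grounding probes
# from object-level annotations, then format them as instruction-tuning conversations.
import random

ATTRIBUTE_POOLS = {  # hypothetical distractor pools per attribute type
    "color": ["red", "blue", "green", "yellow", "black", "white"],
    "material": ["wood", "metal", "plastic", "glass", "fabric"],
    "number": [str(n) for n in range(1, 7)],
}

def build_probe(annotation, attribute, n_options=4, seed=0):
    """Turn one annotated object into a multiple-choice grounding probe."""
    rng = random.Random(seed)
    answer = str(annotation[attribute])
    distractors = [v for v in ATTRIBUTE_POOLS[attribute] if v != answer]
    options = rng.sample(distractors, n_options - 1) + [answer]
    rng.shuffle(options)
    question = (
        f"How many {annotation['category']}s are in the image?"
        if attribute == "number"
        else f"What is the {attribute} of the {annotation['category']} in the image?"
    )
    return {"image_id": annotation["image_id"], "question": question,
            "options": options, "answer": answer}

def to_conversation(probe):
    """Reformat a probe as one fine-grained instruction-tuning conversation turn."""
    prompt = probe["question"] + " Options: " + ", ".join(probe["options"]) + "."
    return [
        {"role": "user", "content": f"<image:{probe['image_id']}> {prompt}"},
        {"role": "assistant", "content": probe["answer"]},
    ]

if __name__ == "__main__":
    ann = {"image_id": "0001", "category": "chair",
           "color": "red", "material": "wood", "number": 3}
    probe = build_probe(ann, "color")
    print(probe)
    print(to_conversation(probe))
```

In this sketch, evaluation and enhancement share the same probe generator: evaluation scores a model's chosen option against `answer`, while `to_conversation` turns the same items into the fine-grained conversations used for tuning.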
Authors: Jiaying Lu, Jinmeng Rao, Kezhen Chen, Xiaoyuan Guo, Yawen Zhang, Baochen Sun, Carl Yang, Jie Yang