Evaluation and Enhancement of Semantic Grounding in Large Vision-Language Models (2309.04041v2)

Published 7 Sep 2023 in cs.CV and cs.CL

Abstract: Large Vision-Language Models (LVLMs) offer remarkable benefits for a variety of vision-language tasks. However, a challenge hindering their application in real-world scenarios, particularly regarding safety, robustness, and reliability, is their constrained semantic grounding ability, which pertains to connecting language to the physical-world entities or concepts referenced in images. Therefore, a crucial need arises for a comprehensive study to assess the semantic grounding ability of widely used LVLMs. Despite the significance, sufficient investigation in this direction is currently lacking. Our work bridges this gap by designing a pipeline for generating large-scale evaluation datasets covering fine-grained semantic information, such as color, number, material, etc., along with a thorough assessment of seven popular LVLMs' semantic grounding ability. Results highlight prevalent misgrounding across various aspects and degrees. To address this issue, we propose a data-centric enhancement method that aims to improve LVLMs' semantic grounding ability through multimodal instruction tuning on fine-grained conversations. Experiments on enhanced LVLMs demonstrate notable improvements in addressing misgrounding issues.
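
The abstract describes a pipeline that converts fine-grained image annotations (color, number, material, etc.) into large-scale evaluation probes for semantic grounding. As a rough illustrative sketch only, not the paper's actual pipeline, the snippet below shows how attribute annotations of this kind could be turned into multiple-choice grounding questions; the annotation schema, field names, and helper function are all hypothetical.

```python
import random
from dataclasses import dataclass

# Hypothetical annotation record; the paper's actual data schema may differ.
@dataclass
class ObjectAnnotation:
    image_id: str
    label: str      # e.g. "car"
    color: str      # e.g. "red"
    count: int      # number of such objects in the image
    material: str   # e.g. "metal"

def make_color_probe(ann: ObjectAnnotation, distractor_colors: list[str]) -> dict:
    """Build one multiple-choice question probing color grounding."""
    # Mix the ground-truth color with three distractor colors.
    options = random.sample(distractor_colors, 3) + [ann.color]
    random.shuffle(options)
    return {
        "image_id": ann.image_id,
        "question": f"What color is the {ann.label} in the image?",
        "options": options,
        "answer": ann.color,
    }

if __name__ == "__main__":
    ann = ObjectAnnotation("img_0001", "car", "red", 2, "metal")
    probe = make_color_probe(ann, ["blue", "green", "yellow", "black"])
    print(probe)
```

Analogous probes for number and material could be generated from the same annotation record, which is what makes such a pipeline scale to large evaluation sets.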

Authors (8)
  1. Jiaying Lu (22 papers)
  2. Jinmeng Rao (19 papers)
  3. Kezhen Chen (12 papers)
  4. Xiaoyuan Guo (14 papers)
  5. Yawen Zhang (23 papers)
  6. Baochen Sun (11 papers)
  7. Carl Yang (130 papers)
  8. Jie Yang (516 papers)
Citations (8)
