Mitigating Hallucination in Visual Language Models with Visual Supervision (2311.16479v1)

Published 27 Nov 2023 in cs.CV

Abstract: Large vision-language models (LVLMs) suffer heavily from hallucination, occasionally generating responses that plainly contradict the image content. The key problem lies in their weak ability to comprehend detailed content in a multi-modal context, which can be attributed mainly to two factors: the training data and the loss function. The vision instruction dataset focuses primarily on global description, and the auto-regressive loss function favors text modeling over image understanding. In this paper, we bring more detailed vision annotations and more discriminative vision models to facilitate LVLM training, so that the models can generate more precise responses without hallucinating. On one hand, we generate image-text pairs with detailed relationship annotations from the panoptic scene graph dataset (PSG). These conversations emphasize detailed facts in the image, encouraging the model to answer questions based on the multi-modal context. On the other hand, we integrate SAM and a mask prediction loss as auxiliary supervision, forcing the LVLM to identify context-related objects so that it generates more accurate responses and hallucinates less. Moreover, to evaluate hallucination in LVLMs more deeply, we propose a new benchmark, RAH-Bench. It divides vision hallucination into three types that contradict the image through wrong categories, attributes, or relations, and introduces the False Positive Rate as a detailed sub-metric for each type. On this benchmark, our approach yields a +8.4% improvement over the original LLaVA and achieves widespread performance gains across other models.
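The auxiliary supervision described in the abstract amounts to adding a mask-prediction term to the usual auto-regressive language-modeling objective. The following is a minimal sketch of that idea, not the authors' implementation; the function name training_loss, the tensor shapes, and the mask_weight coefficient are assumptions made for illustration.

    import torch.nn.functional as F

    def training_loss(lm_logits, target_ids, pred_masks, gt_masks, mask_weight=1.0):
        # Standard auto-regressive next-token loss over the text response.
        lm_loss = F.cross_entropy(
            lm_logits.view(-1, lm_logits.size(-1)),  # (batch * seq_len, vocab)
            target_ids.view(-1),                     # (batch * seq_len,)
            ignore_index=-100,                       # skip prompt / padding tokens
        )
        # Auxiliary supervision: predict binary masks for objects referenced in the
        # answer, supervised by SAM-derived masks built from the PSG annotations.
        mask_loss = F.binary_cross_entropy_with_logits(pred_masks, gt_masks)
        return lm_loss + mask_weight * mask_loss

For the RAH-Bench sub-metric, one natural reading of the per-type False Positive Rate is the fraction of image-contradicting questions the model nonetheless affirms, i.e. FPR = FP / (FP + TN), computed separately over the category, attribute, and relation question subsets.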

Authors (7)
  1. Zhiyang Chen (27 papers)
  2. Yousong Zhu (19 papers)
  3. Yufei Zhan (10 papers)
  4. Zhaowen Li (7 papers)
  5. Chaoyang Zhao (14 papers)
  6. Jinqiao Wang (76 papers)
  7. Ming Tang (199 papers)
Citations (18)