
Pelican: Correcting Hallucination in Vision-LLMs via Claim Decomposition and Program of Thought Verification (2407.02352v2)

Published 2 Jul 2024 in cs.CL

Abstract: Large Visual Language Models (LVLMs) struggle with hallucinations in visual instruction-following tasks, limiting their trustworthiness and real-world applicability. We propose Pelican -- a novel framework designed to detect and mitigate hallucinations through claim verification. Pelican first decomposes the visual claim into a chain of sub-claims based on first-order predicates. These sub-claims consist of (predicate, question) pairs and can be conceptualized as nodes of a computational graph. We then use Program-of-Thought prompting to generate Python code for answering these questions through flexible composition of external tools. Pelican improves over prior work by introducing (1) intermediate variables for precise grounding of object instances, and (2) shared computation for answering sub-questions, which enables adaptive corrections and inconsistency identification. Finally, we use the reasoning abilities of LLMs to verify the correctness of the claim by considering the consistency and confidence of the (question, answer) pairs from each sub-claim. Our experiments show a drop in hallucination rate of roughly 8%-32% across various baseline LVLMs, and a 27% drop compared to prior hallucination-mitigation approaches on MMHal-Bench. Results on two other benchmarks further corroborate these findings.
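To make the pipeline in the abstract concrete, the sketch below shows how a visual claim might be decomposed into (predicate, question) sub-claims and answered by a Program-of-Thought-style program that composes tool calls through intermediate variables. This is a minimal illustration, not the paper's actual interface: the claim, the sub-claim decomposition, and the tool wrappers `detect` and `vqa` are hypothetical stand-ins for the external vision tools the generated code would call.

```python
# Minimal sketch of Pelican-style claim verification (illustrative only).
# `detect` and `vqa` are hypothetical stand-ins for external vision tools.

def detect(image, label):
    """Stand-in object detector: returns grounded instances with scores."""
    return [{"label": label, "box": (0, 0, 10, 10), "score": 0.9}]  # mock output

def vqa(image, question, region=None):
    """Stand-in visual question answering over the image or a grounded region."""
    return {"answer": "yes", "confidence": 0.8}  # mock output

def verify_claim(image, claim):
    # 1. Decompose the claim into (predicate, question) sub-claims.
    #    For "a dog is sitting on a red couch" this might yield:
    sub_claims = [
        ("exists(dog)",       "Is there a dog in the image?"),
        ("exists(couch)",     "Is there a couch in the image?"),
        ("color(couch, red)", "Is the couch red?"),
        ("on(dog, couch)",    "Is the dog sitting on the couch?"),
    ]

    # 2. Intermediate variables ground specific object instances so that
    #    later sub-questions reuse the same regions (shared computation).
    dogs = detect(image, "dog")
    couches = detect(image, "couch")

    # 3. Answer each sub-question, using the grounded regions where possible.
    answers = {}
    answers["exists(dog)"] = {
        "answer": "yes" if dogs else "no",
        "confidence": max((d["score"] for d in dogs), default=0.0),
    }
    answers["exists(couch)"] = {
        "answer": "yes" if couches else "no",
        "confidence": max((c["score"] for c in couches), default=0.0),
    }
    answers["color(couch, red)"] = vqa(
        image, sub_claims[2][1], region=couches[0]["box"] if couches else None
    )
    answers["on(dog, couch)"] = vqa(image, sub_claims[3][1])

    # 4. In the paper, an LLM reasons over the (question, answer) pairs,
    #    their consistency, and their confidences to accept or reject the
    #    claim; a simple conjunction serves as a placeholder here.
    supported = all(a["answer"] == "yes" for a in answers.values())
    return supported, answers

if __name__ == "__main__":
    ok, evidence = verify_claim(image=None, claim="a dog is sitting on a red couch")
    print("claim supported:", ok)
```

Under this reading, step 2 is what the abstract calls "intermediate variables for precise grounding", and reusing those variables across sub-questions is the "shared computation" that lets inconsistent answers be detected and corrected.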

