Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models (2402.15721v2)

Published 24 Feb 2024 in cs.AI and cs.CL

Abstract: Large Vision Language Models (LVLMs) exhibit remarkable capabilities but struggle with hallucinations: inconsistencies between images and their descriptions. Previous hallucination evaluation studies on LVLMs have identified hallucinations in terms of objects, attributes, and relations but overlooked complex hallucinations that create an entire narrative around a fictional entity. In this paper, we introduce a refined taxonomy of hallucinations, featuring a new category: Event Hallucination. We then utilize advanced LLMs to generate and filter fine-grained hallucinatory data consisting of various types of hallucinations, with a particular focus on event hallucinations, laying the groundwork for integrating discriminative and generative evaluation methods within our universal evaluation framework. The proposed benchmark distinctively assesses LVLMs' ability to tackle a broad spectrum of hallucinations, making it a reliable and comprehensive tool for gauging LVLMs' efficacy in handling hallucinations. We will release our code and data.
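
The abstract describes Hal-Eval only at a high level, so the sketch below is illustrative rather than the authors' released implementation. It shows how the discriminative half of such a benchmark could be scored across the four-way taxonomy, including the new Event category, assuming a hypothetical `lvlm_answer(image_path, question)` callable and a yes/no item format (both are placeholder assumptions, not the paper's API).

```python
# Minimal sketch of a discriminative hallucination check in the spirit of
# Hal-Eval's taxonomy (object, attribute, relation, event hallucinations).
# `lvlm_answer` and the item format are hypothetical placeholders, not the
# authors' released interface.
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class HallucinationType(Enum):
    OBJECT = "object"
    ATTRIBUTE = "attribute"
    RELATION = "relation"
    EVENT = "event"  # the new category introduced by Hal-Eval


@dataclass
class DiscriminativeItem:
    image_path: str
    question: str            # yes/no question probing a possibly fictional entity or event
    hallucination_type: HallucinationType
    label: bool               # True if the questioned content is actually present in the image


def evaluate_discriminative(
    items: Iterable[DiscriminativeItem],
    lvlm_answer: Callable[[str, str], str],  # hypothetical: (image_path, question) -> free-form answer
) -> dict:
    """Return per-type accuracy: an answer is correct if it agrees with the ground-truth label."""
    correct = {t: 0 for t in HallucinationType}
    total = {t: 0 for t in HallucinationType}
    for item in items:
        answer = lvlm_answer(item.image_path, item.question).strip().lower()
        predicted_yes = answer.startswith("yes")
        total[item.hallucination_type] += 1
        if predicted_yes == item.label:
            correct[item.hallucination_type] += 1
    return {
        t.value: correct[t] / total[t] if total[t] else float("nan")
        for t in HallucinationType
    }
```

The generative side of the framework, which judges free-form captions for fabricated entities and events, would additionally require an LLM- or rule-based judge and is omitted from this sketch.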

Authors (8)
  1. Chaoya Jiang (15 papers)
  2. Wei Ye (110 papers)
  3. Mengfan Dong (5 papers)
  4. Hongrui Jia (4 papers)
  5. Haiyang Xu (67 papers)
  6. Ming Yan (190 papers)
  7. Ji Zhang (176 papers)
  8. Shikun Zhang (82 papers)
Citations (8)