HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data (2311.13614v2)

Published 22 Nov 2023 in cs.CV and cs.AI

Abstract: Multi-modal Large Language Models (MLLMs) tuned on machine-generated instruction-following data have demonstrated remarkable performance in various multi-modal understanding and generation tasks. However, the hallucinations inherent in machine-generated data, which could lead to hallucinatory outputs in MLLMs, remain under-explored. This work aims to investigate various hallucinations (i.e., object, relation, attribute hallucinations) and mitigate those hallucinatory toxicities in large-scale machine-generated visual instruction datasets. Drawing on the human ability to identify factual errors, we present a novel hallucination detection and elimination framework, HalluciDoctor, based on the cross-checking paradigm. We use our framework to identify and eliminate hallucinations in the training data automatically. Interestingly, HalluciDoctor also indicates that spurious correlations arising from long-tail object co-occurrences contribute to hallucinations. Based on that, we execute counterfactual visual instruction expansion to balance the data distribution, thereby enhancing MLLMs' resistance to hallucinations. Comprehensive experiments on hallucination evaluation benchmarks show that our method reduces hallucinations by a relative 44.6% while maintaining competitive performance compared to LLaVA. The data and code for this paper are publicly available. \url{https://github.com/Yuqifan1117/HalluciDoctor}.
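
The abstract describes the method only at a high level. As a rough, hedged illustration of the two ideas it names, the standard-library sketch below (a) cross-checks objects mentioned in a machine-generated caption against a separately grounded object set and flags unsupported mentions, and (b) counts object co-occurrences to surface long-tail pairs as candidates for counterfactual expansion. The function names, toy data, and simple set matching are illustrative assumptions; the paper's actual pipeline relies on answer chunking and cross-checking with detection/VQA models, not this string-level comparison.

```python
# Minimal sketch (assumptions, not the paper's implementation):
# 1) flag caption objects that no grounded object supports (object hallucination check)
# 2) find long-tail object co-occurrence pairs (candidates for counterfactual expansion)
from collections import Counter
from itertools import combinations


def flag_unsupported_objects(caption_objects, grounded_objects):
    """Return caption objects absent from the independently grounded object set."""
    grounded = {obj.lower() for obj in grounded_objects}
    return [obj for obj in caption_objects if obj.lower() not in grounded]


def long_tail_cooccurrences(samples, min_support=2):
    """Count object pairs across samples; pairs below `min_support` are long-tail."""
    pair_counts = Counter()
    for objects in samples:
        for a, b in combinations(sorted(set(objects)), 2):
            pair_counts[(a, b)] += 1
    return {pair: n for pair, n in pair_counts.items() if n < min_support}


if __name__ == "__main__":
    # Toy example: the caption mentions a "frisbee" that was never grounded in the image.
    print(flag_unsupported_objects(["dog", "frisbee", "grass"], ["dog", "grass", "tree"]))
    # -> ['frisbee']

    # Toy co-occurrence statistics over three instruction samples.
    samples = [["dog", "grass"], ["dog", "grass", "ball"], ["cat", "sofa"]]
    print(long_tail_cooccurrences(samples, min_support=2))
    # -> pairs seen only once, e.g. ('ball', 'dog'), ('cat', 'sofa'), ...
```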

Authors (9)
  1. Qifan Yu (14 papers)
  2. Juncheng Li (121 papers)
  3. Longhui Wei (40 papers)
  4. Liang Pang (94 papers)
  5. Wentao Ye (15 papers)
  6. Bosheng Qin (4 papers)
  7. Siliang Tang (116 papers)
  8. Qi Tian (314 papers)
  9. Yueting Zhuang (164 papers)
Citations (39)