Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning (2306.14565v4)

Published 26 Jun 2023 in cs.CV, cs.AI, cs.CE, cs.CL, and cs.MM

Abstract: Despite the promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating inconsistent descriptions with respect to the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at three semantic levels: (i) Nonexistent Object Manipulation, (ii) Existent Object Manipulation and (iii) Knowledge Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach to evaluate visual instruction tuning like human experts. GAVIE does not require human-annotated groundtruth answers and can adapt to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate existing LMMs exhibit significant hallucinations when presented with our negative instructions, particularly Existent Object and Knowledge Manipulation instructions. Moreover, we successfully mitigate hallucination by finetuning MiniGPT4 and mPLUG-Owl on LRV-Instruction while improving performance on several public datasets compared to state-of-the-art methods. Additionally, we observed that a balanced ratio of positive and negative instances in the training data leads to a more robust model. Code and data are available at https://github.com/FuxiaoLiu/LRV-Instruction.

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

The paper addresses hallucination in large multi-modal models (LMMs), specifically the models' tendency to produce descriptions that are inconsistent with the associated image and human instructions. It introduces a robust visual instruction tuning dataset, Large-scale Robust Visual (LRV)-Instruction, comprising 400,000 GPT-4-generated instructions that cover 16 vision-and-language tasks. The dataset contains both positive and negative samples, with negatives constructed at three semantic levels: Nonexistent Object Manipulation, Existent Object Manipulation, and Knowledge Manipulation.
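
As a rough illustration only (the field names here are assumptions, not the schema released at https://github.com/FuxiaoLiu/LRV-Instruction), an LRV-Instruction training record might pair an image with either a positive instruction grounded in the image or a negative instruction drawn from one of the three manipulation categories:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record layout for illustration; not the authors' released format.
@dataclass
class LRVInstruction:
    image_id: str
    instruction: str                 # open-ended question or declarative statement
    answer: str                      # target response, e.g. a correction for negatives
    polarity: str                    # "positive" or "negative"
    negative_type: Optional[str] = None  # "nonexistent_object", "existent_object", or "knowledge"

# A negative sample: the instruction presumes an object absent from the image,
# and the target answer teaches the model to deny it instead of hallucinating.
sample = LRVInstruction(
    image_id="000123",
    instruction="Is the red umbrella next to the bench open or closed?",
    answer="There is no red umbrella in the image.",
    polarity="negative",
    negative_type="nonexistent_object",
)
```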

The paper highlights why hallucination in LMMs matters: it raises ethical concerns and can lead users to over-rely on models, treating them as accurate sources of information. LMMs tend to generate incorrect visual descriptions, such as "a dog playing with a ball" when neither appears in the image. The authors hypothesize that contributing factors include reliance on language priors and on synthetic instruction data that frequently mentions objects or details not present in the given image.

The paper introduces LRV-Instruction to combat these issues by incorporating both positive and negative instructions for robust training. The negative instructions are constructed at three semantic levels and in two formats, declarative and interrogative, giving a broader set of scenarios for probing LMMs. To measure hallucination without relying on human-annotated ground-truth answers, the authors propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), which they show aligns closely with human expert judgments.
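
A minimal sketch of how a GAVIE-style check could be wired up is shown below, assuming access to an OpenAI-compatible chat endpoint. The idea, following the paper, is to represent the image as textual annotations and ask GPT-4 to judge the candidate answer; the exact prompt wording and scoring rubric here are assumptions, not the authors' prompt.

```python
from openai import OpenAI  # assumes the openai Python client is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def gavie_score(dense_captions: str, instruction: str, model_answer: str) -> str:
    """Ask GPT-4 to grade an answer without human-annotated ground truth.

    The image is represented as text (dense captions / object annotations), and
    the judge rates relevancy to the instruction and accuracy with respect to
    the image content. The 0-10 rubric below is an assumption for illustration.
    """
    prompt = (
        "You are grading a vision-language model's answer.\n"
        f"Image content (textual annotations): {dense_captions}\n"
        f"Instruction: {instruction}\n"
        f"Answer: {model_answer}\n"
        "Rate RELEVANCY (does the answer address the instruction?) and "
        "ACCURACY (is it consistent with the image content?) from 0 to 10, "
        "then briefly justify each score."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```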

The key findings indicate that existing LMMs, such as MiniGPT4 and mPLUG-Owl, hallucinate heavily under negative instructions, particularly Existent Object and Knowledge Manipulation. When fine-tuned on LRV-Instruction, these models become markedly more robust, exhibiting less hallucination and improved performance on several public benchmarks compared with previous state-of-the-art methods. The results also show that a balanced ratio of positive and negative samples in the training data yields the most robust models.
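
For instance, one simple way to enforce the balance the authors recommend is to downsample the larger split before fine-tuning. The 1:1 target below is an assumption standing in for whatever ratio proves best in practice, and the helper is purely illustrative:

```python
import random

def balance_instructions(positives: list, negatives: list, seed: int = 0) -> list:
    """Return a shuffled training mix with equal numbers of positive and
    negative instructions, downsampling whichever split is larger."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    mix = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(mix)
    return mix
```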

Implications of this research extend to both theoretical and practical domains. Theoretically, it challenges current assumptions about LMM training by proposing a paradigm that incorporates diverse datasets and explicitly accounts for negative examples. Practically, it advances multimedia applications where accurate visual interpretation is vital, such as automated image description and human-computer interaction. The release of LRV-Instruction provides a resource that may support future development of more general, robust LMMs.

Future research could further broaden the diversity of visual instruction datasets and integrate stronger vision encoders to improve fine-grained visual understanding. Investigating and mitigating other biases, including those beyond vision-and-language tasks, could also enhance the overall robustness of LMMs, ultimately leading to more reliable, trustworthy AI systems.

Authors (6)
  1. Fuxiao Liu (17 papers)
  2. Kevin Lin (98 papers)
  3. Linjie Li (89 papers)
  4. Jianfeng Wang (149 papers)
  5. Yaser Yacoob (11 papers)
  6. Lijuan Wang (133 papers)
Citations (170)