Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
Abstract: Despite the promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating descriptions that are inconsistent with the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at three semantic levels: (i) Nonexistent Object Manipulation, (ii) Existent Object Manipulation, and (iii) Knowledge Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach that evaluates visual instruction tuning the way human experts would. GAVIE does not require human-annotated ground-truth answers and can adapt to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate that existing LMMs exhibit significant hallucinations when presented with our negative instructions, particularly Existent Object and Knowledge Manipulation instructions. Moreover, we successfully mitigate hallucination by finetuning MiniGPT4 and mPLUG-Owl on LRV-Instruction while improving performance on several public datasets compared to state-of-the-art methods. Additionally, we observe that a balanced ratio of positive and negative instances in the training data leads to a more robust model. Code and data are available at https://github.com/FuxiaoLiu/LRV-Instruction.
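To make the two components described above more concrete, the sketch below gives a minimal, hypothetical illustration: a record layout that an LRV-Instruction sample could take (covering positive instructions and the three negative-manipulation levels) and a GAVIE-style evaluation prompt in which GPT-4 grades a model response against a textual description of the image rather than a human-annotated ground-truth answer. All field names, the relevancy/accuracy wording, and the `call_gpt4` helper are assumptions made for illustration; they are not taken from the released dataset or code.

```python
from dataclasses import dataclass


@dataclass
class LRVSample:
    """Hypothetical layout of one LRV-Instruction record (all field names assumed)."""
    image_id: str
    instruction: str            # open-ended instruction generated by GPT-4
    answer: str                 # reference answer paired with the instruction
    polarity: str               # "positive" or "negative"
    manipulation: str = "none"  # "nonexistent_object", "existent_object", or "knowledge"


def build_gavie_prompt(image_description: str, instruction: str, response: str) -> str:
    """Assemble a GAVIE-style prompt: GPT-4 judges a model response against a textual
    description of the image, so no human-annotated ground-truth answer is needed."""
    return (
        "You are evaluating the output of a large multi-modal model.\n"
        f"Image content (in text form): {image_description}\n"
        f"Instruction: {instruction}\n"
        f"Model response: {response}\n"
        "Score the response for relevancy (does it follow the instruction?) and "
        "accuracy (is it consistent with the image, without hallucinated objects, "
        "attributes, or facts?), then briefly explain each score."
    )


# Usage sketch with a negative (Nonexistent Object Manipulation) sample.
sample = LRVSample(
    image_id="example_0001",
    instruction="Describe the red bicycle parked next to the bench.",
    answer="There is no red bicycle in the image.",
    polarity="negative",
    manipulation="nonexistent_object",
)
prompt = build_gavie_prompt(
    image_description="A wooden bench in a park; no bicycles are visible.",
    instruction=sample.instruction,
    response="The red bicycle leans against the wooden bench.",
)
# score = call_gpt4(prompt)  # hypothetical GPT-4 client call; not part of the paper's release
print(prompt)
```

A response that invents the bicycle would be penalized on accuracy, which is exactly the behavior the negative instructions are meant to probe; how the textual image description is obtained is not specified in the abstract.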
- Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- SPICE: Semantic propositional image caption evaluation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, pages 382–398. Springer, 2016.
- OpenFlamingo, March 2023.
- A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
- Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, 2023.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
- MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
- ChatGPT outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056, 2023.
- MultiModal-GPT: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023.
- GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
- M. V. Koroteev. BERT: A review of applications in natural language processing and understanding. arXiv preprint arXiv:2103.11943, 2021.
- Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73, 2017.
- Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
- BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
- VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
- Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
- Towards understanding in-context learning with contrastive demonstrations and saliency maps. arXiv preprint arXiv:2307.05052, 2023.
- Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
- Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023.
- DocumentCLIP: Linking figures and main body text in reflowed documents. arXiv preprint arXiv:2306.06306, 2023.
- Visual News: Benchmark and challenges in news image captioning. arXiv preprint arXiv:2010.03743, 2020.
- COVID-VTS: Fact extraction and verification on short video platforms. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 178–188, 2023.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
- GPTEval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634, 2023.
- RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
- OpenAI. GPT-4 technical report, 2023.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- Object hallucination in image captioning. arXiv preprint arXiv:1809.02156, 2018.
- Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
- WIT: Wikipedia-based image text dataset for multimodal multilingual machine learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2443–2449, 2021.
- VideoBERT: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7464–7473, 2019.
- VisText: A benchmark for semantically rich chart captioning. arXiv preprint arXiv:2307.05356, 2023.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.
- GIT: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022.
- Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.
- GRiT: A generative region-to-text transformer for object understanding. arXiv preprint arXiv:2212.00280, 2022.
- mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
- A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.