Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
The research paper, titled "Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning," addresses hallucination in large multi-modal models (LMMs): the models' propensity to generate descriptions that are inconsistent with the associated image and the human instruction. The work introduces a robust visual instruction tuning dataset, Large-scale Robust Visual (LRV)-Instruction, comprising 400,000 GPT-4-generated instructions covering 16 vision-and-language tasks. The dataset contains both positive and negative instructions, with the negative instructions designed at three semantic levels: Nonexistent Object Manipulation, Existent Object Manipulation, and Knowledge Manipulation.
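For concreteness, the sketch below shows how one such instruction record might be represented in code; the field names, category labels, and example values are illustrative assumptions rather than the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical record layout for an LRV-Instruction-style example.
# Field names and category labels are illustrative assumptions,
# not the dataset's published schema.
@dataclass
class VisualInstruction:
    image_id: str
    instruction: str        # GPT-4-generated instruction or question
    answer: str             # expected image-grounded response
    is_negative: bool       # True if the instruction is deliberately misleading
    manipulation: Literal[
        "none",                 # positive sample
        "nonexistent_object",   # asks about an object absent from the image
        "existent_object",      # distorts attributes of an object that is present
        "knowledge",            # injects false external knowledge
    ] = "none"

# A hypothetical negative example at the Nonexistent Object Manipulation level
sample = VisualInstruction(
    image_id="coco_000000123456",
    instruction="Describe the red bicycle leaning against the fence.",
    answer="There is no red bicycle in the image.",
    is_negative=True,
    manipulation="nonexistent_object",
)
```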
The paper highlights the significance of addressing hallucination in LMMs, an issue that raises ethical concerns and can lead users to over-rely on models, falsely treating them as accurate sources of information. LMMs tend to generate inconsistent or incorrect visual descriptions, such as "a dog playing with a ball" when neither object appears in the image. The hypothesized contributing factors include over-reliance on language priors and synthetic instruction data that often references objects or details not actually present in the given image.
The paper introduces LRV-Instruction to combat these issues by incorporating both positive and negative instructions for robust training. Notably, the negative instructions are constructed at the three semantic levels above and in two formats, declarative and interrogative, providing a more comprehensive set of scenarios for testing LMMs. The authors also propose GPT-4-Assisted Visual Instruction Evaluation (GAVIE) to measure hallucination without relying on human-annotated groundtruth answers, a method shown to align closely with assessments by human experts.
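The sketch below illustrates how a GAVIE-style check could be wired up in principle: the evaluator (GPT-4) receives a textual surrogate of the image together with the instruction and the model's answer, and is asked to score the answer. The prompt wording, the scoring criteria phrasing, and the caller-supplied `query_gpt4` helper are assumptions made for illustration, not the authors' exact protocol.

```python
# Sketch of a GAVIE-style evaluation step. The prompt text and the
# `query_gpt4` callable are illustrative assumptions; no specific API
# client is assumed here.

def build_gavie_prompt(image_annotations: str, instruction: str, model_response: str) -> str:
    return (
        "You are evaluating a vision-language model's answer.\n"
        f"Image content (ground-truth annotations):\n{image_annotations}\n\n"
        f"Instruction: {instruction}\n"
        f"Model response: {model_response}\n\n"
        "Rate the response from 0 to 10 on two criteria:\n"
        "1. Relevancy: does it directly address the instruction?\n"
        "2. Accuracy: is it consistent with the image content, without "
        "mentioning objects or facts that are not supported?\n"
        "Return the two scores with a brief justification."
    )

def gavie_score(query_gpt4, image_annotations: str, instruction: str, model_response: str) -> str:
    """query_gpt4 is a caller-supplied function that sends a prompt to GPT-4
    and returns its text reply."""
    return query_gpt4(build_gavie_prompt(image_annotations, instruction, model_response))
```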
The key findings indicate that existing LMMs such as MiniGPT4 and mPLUG-Owl suffer significant hallucination under negative instructions, particularly Existent Object Manipulation and Knowledge Manipulation. When finetuned on LRV-Instruction, however, these models become noticeably more robust, hallucinating less and outperforming previous state-of-the-art methods across a range of tasks. The results also show that a balanced ratio of positive to negative samples matters: models trained on such a balanced mix generate more accurate responses.
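As a minimal sketch of what balancing positive and negative samples might look like in practice, the snippet below trims a pool of instruction records to a roughly even mix; the 1:1 target and the `is_negative` flag are illustrative assumptions rather than the paper's exact recipe.

```python
import random

# Minimal sketch: trim a pool of instruction records (dicts with an
# `is_negative` flag) to a roughly 1:1 mix of positive and negative samples
# before instruction tuning. The 1:1 target is an illustrative assumption.
def balance_instructions(records, seed=0):
    rng = random.Random(seed)
    pos = [r for r in records if not r["is_negative"]]
    neg = [r for r in records if r["is_negative"]]
    n = min(len(pos), len(neg))          # size of the smaller group
    mixed = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(mixed)
    return mixed
```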
The implications of this research extend to both theory and practice. Theoretically, it challenges current assumptions about LMM training by proposing a paradigm that draws on more diverse data and explicitly accounts for negative examples. Practically, it supports advances in multimedia applications where accurate model interpretation is vital, such as automated image description and human-computer interaction, and the release of LRV-Instruction provides a resource for building more generalized, robust LMMs.
Future research could further extend the diversity of visual instruction datasets and integrate stronger vision encoders to improve fine-grained visual understanding. Investigating and mitigating biases beyond those covered by the included vision-and-language tasks could further enhance the robustness of LMMs, ultimately leading to more reliable, trustworthy AI systems.