Dynamic Correction Decoding for Hallucination Mitigation in Multimodal LLMs
Overview
The paper "MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation" addresses the prevalent issue of hallucination in Multimodal LLMs (MLLMs). MLLMs frequently generate inaccurate depictions of nonexistent objects, which poses risks across various applications, particularly in sensitive areas like medical imaging and autonomous systems. The research presented in this paper explores the underlying mechanisms of such hallucinations, offering new insights and methodologies for mitigating these errors.
Key Findings
The authors find that MLLMs, contrary to a common assumption, do recognize visual objects in their intermediate layers: the probabilities assigned to ground-truth object tokens are relatively high there. Hallucination arises when the strong language priors of the LLM suppress this visual recognition in the later layers, so that nonexistent objects end up in the output. In other words, earlier layers reflect the visual input more faithfully than the final layers, where linguistic biases take precedence.
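This layer-wise behavior can be probed in a "logit lens" style: each intermediate hidden state is projected through the model's final normalization and language-model head to obtain per-layer next-token probabilities. The sketch below is a simplification that assumes a Hugging Face LLaMA-style causal backbone (the attributes `model.model.norm` and `model.lm_head` are assumptions about that structure) and omits the image inputs an MLLM would additionally require.

```python
# Minimal sketch of layer-wise probing ("logit lens" style), assuming a
# Hugging Face LLaMA-style causal LM. Attribute names (model.model.norm,
# model.lm_head) are assumptions about that architecture.
import torch

def layerwise_token_probs(model, input_ids, token_id):
    """Probability assigned to `token_id` at the next position, per layer."""
    with torch.no_grad():
        out = model(input_ids=input_ids, output_hidden_states=True)
    probs = []
    for hidden in out.hidden_states:              # one entry per layer (plus embeddings)
        last = hidden[:, -1, :]                   # hidden state at the final position
        logits = model.lm_head(model.model.norm(last))  # reuse final norm + LM head
        probs.append(torch.softmax(logits, dim=-1)[0, token_id].item())
    return probs
```

Plotting these probabilities for a ground-truth object token typically shows the pattern the authors describe: confidence rises in intermediate layers and then drops toward the final layer.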
Methodology
The paper introduces Dynamic Correction Decoding (DeCo), which reduces hallucinations by leveraging the representations from preceding layers of the MLLM. DeCo builds on the observation that earlier layers assign higher confidence to ground-truth tokens, so integrating their outputs with the final-layer outputs can correct the model's predictions. The approach is model-agnostic and can be combined with standard decoding strategies such as greedy search and beam search.
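A minimal sketch of this core idea follows: the next-token distribution from a preceding "anchor" layer is blended into the final-layer distribution before decoding. The mixing rule and the `alpha` coefficient here are illustrative simplifications, not the paper's exact formulation.

```python
# Hedged sketch of the core DeCo idea: blend the anchor-layer distribution
# into the final-layer distribution before decoding. The exact correction
# rule in the paper (candidate-token handling, modulation schedule) may differ.
import torch

def corrected_next_token_logits(final_logits, anchor_logits, alpha=0.5):
    """Blend anchor-layer and final-layer predictions for the next token.

    final_logits, anchor_logits: tensors of shape (vocab_size,)
    alpha: mixing weight; higher values trust the earlier layer more.
    """
    final_probs = torch.softmax(final_logits, dim=-1)
    anchor_probs = torch.softmax(anchor_logits, dim=-1)
    blended = (1.0 - alpha) * final_probs + alpha * anchor_probs
    return torch.log(blended + 1e-12)   # back to log space for standard decoding
```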
DeCo dynamically selects a preceding layer as an "anchor" to guide the final prediction layer. Through empirical analysis, the authors determine the most effective range of layers to draw from, ensuring the dynamic correction process is both accurate and computationally efficient. The integration of dynamic soft modulation further refines the predictions, maintaining the stylistic nuances while correcting for factual accuracy.
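The anchor-selection step can be sketched as follows, assuming `layer_logits` maps each candidate layer index to next-token logits obtained as in the probing example above. Choosing the most confident layer by top-1 probability is an illustrative criterion; the paper's dynamic selection and soft-modulation details may differ.

```python
# Hedged sketch of dynamic anchor selection: within a candidate layer range,
# pick the layer whose next-token distribution is most confident. The
# selection criterion shown (top-1 probability) is an assumption for
# illustration, not necessarily the paper's exact rule.
import torch

def select_anchor_layer(layer_logits, layer_range):
    """Return the index of the most confident layer in `layer_range`."""
    best_layer, best_conf = None, -1.0
    for layer in layer_range:
        probs = torch.softmax(layer_logits[layer], dim=-1)
        conf = probs.max().item()            # confidence = top-1 probability
        if conf > best_conf:
            best_layer, best_conf = layer, conf
    return best_layer
```

The selected layer's logits would then be passed to the blending step shown earlier, giving a per-token dynamic correction during generation.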
Results
The experimental validation of DeCo spans multiple benchmarks, including image captioning and visual question answering (VQA) datasets. The reported results show an average hallucination suppression rate of approximately 10.8% relative to baseline decoding. The paper evaluates models such as InstructBLIP, MiniGPT-4, LLaVA, and Qwen-VL, demonstrating DeCo's applicability across different architectures.
In addition to reducing hallucinations, DeCo remains computationally practical: it adds only about 1.2x decoding latency, significantly less than other correction methods such as VCD and OPERA, underscoring its suitability for real-world applications.
Implications and Future Directions
The findings and the proposed method advance both our understanding of hallucination in MLLMs and our ability to mitigate it. The authors argue that addressing the language priors of the underlying LLM is crucial for improving visual faithfulness. These insights could guide further work on layer-specific information retention and on deeper integration of visual evidence into the LLM.
Future research might extend this exploration into larger and more complex MLLM architectures, assessing whether the layer dynamics in hallucination observed in this paper persist in larger models. Additionally, integrating DeCo with other existing hallucination mitigation techniques could yield further improvements, enhancing the overall robustness of MLLMs in various application domains.
In summary, this paper illuminates a path towards more accurate and reliable MLLMs by leveraging dynamic layer-specific knowledge correction, effectively enhancing model outputs where image-text associations are critical.