HALC: A Novel Approach to Mitigate Object Hallucination in Vision-Language Models
Introduction
The development of vision-language models (VLMs) stands as a significant advance at the intersection of natural language processing (NLP) and computer vision (CV), enabling comprehensive interpretation of multimodal data. However, object hallucination (OH) remains a profound challenge in this domain: models describe objects inaccurately or mention objects that do not exist in the image. The issue persists even in large vision-language models (LVLMs), despite their enhanced capabilities. The paper introduces HALC (Object Hallucination Reduction through Adaptive FocaL-Contrast decoding), a decoding strategy designed to address all three types of OH (existence, attribute, and relationship hallucinations) while maintaining text generation quality. HALC distinguishes itself by leveraging fine-grained visual information effectively and by balancing OH mitigation against the preservation of narrative coherence.
Related Work
Existing strategies for confronting OH concentrate predominantly on object existence hallucinations, often neglecting the attribute and relationship levels. Approaches such as post-hoc correction, self-correction pipelines, and alternative decoding strategies aim to reduce OH by harnessing stronger textual or visual priors. However, these methods require additional data, depend on external, more powerful LVLMs, or entail complex adaptation processes that limit their applicability. The importance of addressing OH, combined with these limitations in current methodologies, underscores the need for novel solutions like HALC.
Methodology
HALC operates by identifying tokens related to potential OH sources and applying an adaptive focal-contrast grounding mechanism to process fine-grained visual information. This dual-level approach, addressing both local and global context, enables the algorithm to correct hallucinated tokens dynamically during text generation. HALC incorporates the following components; brief illustrative sketches of the key steps follow the list:
- Object-related Token Identification: This step pinpoints tokens likely to induce OH, based on their syntactic categories, and flags them for visual grounding.
- Visual Context Retrieval: Using a zero-shot object detector, HALC locates the visual region associated with the token currently being generated, even when that token may refer to a hallucinated element.
- Adaptive Focal-contrast Grounding: Through a novel mechanism, HALC samples candidate fields of view (FOVs) and selects contrasting ones based on their influence on the token distribution, aiming to approximate the optimal visual context for token generation.
- Matching-based Beam Search: On a global level, HALC employs a beam search algorithm guided by a visual matching score, ensuring that selected text sequences closely align with the original visual input.
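To make the token identification step concrete, here is a minimal, hypothetical sketch using part-of-speech tagging with spaCy. The paper does not prescribe this implementation; the noun-based heuristic, the function name, and the model choice are assumptions for illustration.

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def is_object_related(token_text: str, context: str) -> bool:
    """Illustrative heuristic: flag the newly decoded token for visual
    grounding if it parses as a noun in the running context."""
    doc = nlp(f"{context} {token_text}".strip())
    return doc[-1].pos_ in ("NOUN", "PROPN")

# e.g., is_object_related("dog", "A photo of a") -> True
```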
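For the local focal-contrast step, the following PyTorch sketch assumes the LVLM has already been re-run once per candidate FOV to obtain next-token logits (`logits_per_fov`, a hypothetical input). It selects the most divergent FOV pair and combines their logits contrastively; the Jensen-Shannon divergence, the contrast direction, and the `alpha` weight are simplifications rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def focal_contrast_select(logits_per_fov: torch.Tensor, alpha: float = 1.0) -> int:
    """Illustrative sketch of focal-contrast token selection.

    logits_per_fov: (n_fovs, vocab_size) next-token logits, obtained by
    re-running the LVLM once per candidate FOV (model calls omitted).
    """
    probs = F.softmax(logits_per_fov, dim=-1)
    n = probs.size(0)

    # Choose the pair of FOVs whose output distributions diverge most;
    # Jensen-Shannon divergence stands in for the paper's contrast measure.
    best_i, best_j, best_jsd = 0, 1, -1.0
    for i in range(n):
        for j in range(i + 1, n):
            m = 0.5 * (probs[i] + probs[j])
            jsd = 0.5 * (F.kl_div(m.log(), probs[i], reduction="sum")
                         + F.kl_div(m.log(), probs[j], reduction="sum"))
            if jsd > best_jsd:
                best_i, best_j, best_jsd = i, j, float(jsd)

    # Contrastive combination: boost tokens the first view supports
    # relative to the second (direction and alpha are simplifications).
    contrast = (1 + alpha) * logits_per_fov[best_i] - alpha * logits_per_fov[best_j]
    return int(contrast.argmax())
```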
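At the global level, the matching-based beam search can be approximated by reranking candidate sequences with an image-text similarity score, as in this hedged sketch; `encode_text` and `image_embed` stand in for a CLIP/BLIP-style joint encoder and are assumptions, not the paper's exact matching function.

```python
import torch.nn.functional as F

def rerank_beams(beams, image_embed, encode_text, k: int = 2):
    """Illustrative sketch of matching-based beam selection: keep the k
    candidate sequences whose text embeddings best match the image
    embedding under a CLIP/BLIP-style encoder (assumed inputs)."""
    scored = []
    for text in beams:
        t = encode_text(text)  # (d,) text embedding, assumed helper
        score = F.cosine_similarity(t, image_embed, dim=0).item()
        scored.append((score, text))
    scored.sort(key=lambda p: p[0], reverse=True)
    return [text for _, text in scored[:k]]
```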
Theoretical Analysis
The paper provides a theoretical framework for HALC's FOV sampling strategy, demonstrating its effectiveness in approximating the optimal visual context for reducing OH. The analysis shows that dynamically selecting among sampled visual contexts suppresses hallucinated content more reliably than relying on any single fixed FOV.
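As a concrete illustration of the sampling strategy, the sketch below generates exponentially expanding FOVs around the detector's bounding box, reflecting the idea that a geometric sweep of window sizes can approximate the optimal visual context; the expansion ratio and crop count are assumed parameters, not values from the paper.

```python
def sample_fovs(box, image_size, n: int = 4, ratio: float = 1.6):
    """Illustrative sketch: starting from the detector's bounding box
    for the current object-related token, emit n progressively larger
    crops (sides scaled by `ratio` each step), clipped to the image.
    The specific ratio and count here are assumed values."""
    x0, y0, x1, y1 = box
    img_w, img_h = image_size
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    bw, bh = x1 - x0, y1 - y0
    fovs = []
    for i in range(n):
        scale = ratio ** i
        hw, hh = scale * bw / 2.0, scale * bh / 2.0
        fovs.append((max(0.0, cx - hw), max(0.0, cy - hh),
                     min(float(img_w), cx + hw), min(float(img_h), cy + hh)))
    return fovs

# e.g., sample_fovs((40, 40, 80, 80), (640, 480)) yields 4 nested crops
```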
Experimental Analysis
Extensive testing across benchmarks including MSCOCO, MME, and LLaVA-Bench demonstrates HALC's efficacy in significantly reducing OH across all three types. HALC consistently outperforms existing state-of-the-art and baseline methods in these evaluations, offering a robust solution to the object hallucination problem without compromising text generation quality.
Conclusion
HALC presents a groundbreaking strategy for reducing OH in LVLMs by effectively balancing the use of fine-grained visual information against textual generation quality. Its comprehensive approach, applicability to a broad range of LVLMs, and superior performance underscore its potential to advance vision-language model development. The open-source availability of HALC, combined with a unified benchmarking platform, further facilitates future research and application in this critical area.