
HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding (2403.00425v2)

Published 1 Mar 2024 in cs.CV, cs.AI, and cs.LG

Abstract: While large vision-language models (LVLMs) have demonstrated impressive capabilities in interpreting multi-modal contexts, they invariably suffer from object hallucinations (OH). We introduce HALC, a novel decoding algorithm designed to mitigate OH in LVLMs. HALC leverages distinct fine-grained optimal visual information in vision-language tasks and operates on both local and global contexts simultaneously. Specifically, HALC integrates a robust auto-focal grounding mechanism (locally) to correct hallucinated tokens on the fly, and a specialized beam search algorithm (globally) to significantly reduce OH while preserving text generation quality. Additionally, HALC can be integrated into any LVLMs as a plug-and-play module without extra training. Extensive experimental studies demonstrate the effectiveness of HALC in reducing OH, outperforming state-of-the-art methods across four benchmarks.

HALC: A Novel Approach to Mitigate Object Hallucination in Vision-Language Models

Introduction

The development of vision-language models (VLMs) stands as a significant advancement at the intersection of natural language processing (NLP) and computer vision (CV), facilitating comprehensive interpretation of multimodal data. However, object hallucination (OH), in which a model describes objects inaccurately or mentions objects that are not present in the image, remains a profound challenge in this domain, and it persists even in large vision-language models (LVLMs) despite their enhanced capabilities. The paper introduces HALC (Object Hallucination Reduction through Adaptive FocaL-Contrast decoding), a decoding strategy designed to address OH across all its types (existence, attribute, and relationship hallucinations) while maintaining text generation quality. HALC distinguishes itself by effectively leveraging fine-grained visual information and balancing the mitigation of OH with the preservation of narrative coherence.

Related Work

Existing strategies for confronting OH predominantly concentrate on object existence hallucinations, often neglecting the attribute and relationship levels. Approaches such as post-hoc correction, self-correction pipelines, and various decoding strategies aim to reduce OH by harnessing better textual or visual priors. However, these methods either require additional data or powerful external LVLMs, or involve complex adaptation processes that hinder their applicability. The importance of addressing OH, coupled with the limitations of current methodologies, underscores the need for novel solutions like HALC.

Methodology

HALC operates by identifying tokens related to potential OH sources and applying an adaptive focal-contrast grounding mechanism to process fine-grained visual information. This dual-level approach, addressing both local and global contexts, enables the algorithm to correct hallucinated tokens dynamically during text generation. HALC incorporates the following components (a code sketch of how they fit together follows the list):

  • Object-related Token Identification: This step pinpoints tokens likely to induce OH, based on their syntactic categories, for subsequent processing.
  • Visual Context Retrieval: Utilizing a zero-shot detector, HALC retrieves the visual region associated with the currently generated token, even when that token refers to a potentially hallucinated object.
  • Adaptive Focal-contrast Grounding: Through a novel mechanism, HALC samples and selects contrasting fields of view (FOVs) based on their influence on token output, aiming to approximate optimal visual contexts for token generation.
  • Matching-based Beam Search: On a global level, HALC employs a beam search algorithm guided by a visual matching score, ensuring that selected text sequences closely align with the original visual input.
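
To make the pipeline concrete, the following Python sketch shows how these four components could fit together for a single decoding step, plus a global beam score. It is a minimal illustration under stated assumptions: the toy vocabulary, the stubbed `lvlm_logits`, `ground_token`, and `beam_score` functions, and the fixed FOV-selection heuristic are all hypothetical stand-ins rather than the authors' released implementation.

```python
# Minimal sketch of HALC-style adaptive focal-contrast decoding (one decoding step
# plus a matching-based beam score). All model/detector calls are hypothetical
# stand-ins, not the authors' API; shapes, names, and heuristics are assumptions.
import numpy as np

VOCAB = ["a", "dog", "frisbee", "cat", "on", "the", "grass"]  # toy vocabulary


def lvlm_logits(image_crop: np.ndarray, prefix: list[str]) -> np.ndarray:
    """Hypothetical LVLM forward pass: next-token logits given a visual crop and text prefix."""
    rng = np.random.default_rng(abs(hash((image_crop.shape, tuple(prefix)))) % 2**32)
    return rng.normal(size=len(VOCAB))


def ground_token(image: np.ndarray, token: str) -> tuple[int, int, int, int]:
    """Hypothetical zero-shot grounding detector: returns a box for `token` (default box here)."""
    h, w = image.shape[:2]
    return (w // 4, h // 4, 3 * w // 4, 3 * h // 4)


def sample_fovs(image, box, n_fovs=4, ratio=1.6):
    """Sample nested fields of view (FOVs) by geometrically expanding the grounded box."""
    x0, y0, x1, y1 = box
    cx, cy, bw, bh = (x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0
    crops = []
    for k in range(n_fovs):
        s = ratio ** k
        nx0, ny0 = int(max(cx - s * bw / 2, 0)), int(max(cy - s * bh / 2, 0))
        nx1, ny1 = int(min(cx + s * bw / 2, image.shape[1])), int(min(cy + s * bh / 2, image.shape[0]))
        crops.append(image[ny0:ny1, nx0:nx1])
    return crops


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def focal_contrast_step(image, prefix, token, alpha=1.0):
    """Contrast the pair of FOVs whose next-token distributions disagree most (simplified rule)."""
    crops = sample_fovs(image, ground_token(image, token))
    dists = [softmax(lvlm_logits(c, prefix)) for c in crops]
    best = max(((i, j) for i in range(len(dists)) for j in range(i + 1, len(dists))),
               key=lambda ij: np.abs(dists[ij[0]] - dists[ij[1]]).sum())
    p_d, p_s = dists[best[0]], dists[best[1]]
    # contrastive re-weighting in log space, amplifying the focal view and penalizing the other
    contrast = softmax((1 + alpha) * np.log(p_d + 1e-9) - alpha * np.log(p_s + 1e-9))
    return VOCAB[int(contrast.argmax())], contrast


def beam_score(lm_logprob: float, visual_match: float, lam=0.5):
    """Global re-ranking: blend language-model log-probability with a visual matching score."""
    return lm_logprob + lam * visual_match


if __name__ == "__main__":
    image = np.zeros((224, 224, 3))
    token, dist = focal_contrast_step(image, ["a"], "dog")
    print("corrected token:", token, "confidence:", float(dist.max()))
    print("example beam score:", beam_score(lm_logprob=-2.3, visual_match=0.71))
```

In the actual method, both the FOV sampling and the contrast weighting are adaptive, and the visual matching score used for beam re-ranking comes from a vision-language matcher rather than the placeholder blend shown here.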

Theoretical Analysis

The paper provides a theoretical framework for HALC's FOV sampling strategy, showing that the sampled FOVs approximate the optimal visual context for reducing OH. Complementary empirical analysis shows that HALC's dynamic selection of visual contexts is superior in mitigating hallucinated content.
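
For intuition, the contrast step can be expressed in the general form used by visual contrastive decoding; the formulation below is an illustrative rendering under assumed notation (an amplification factor \alpha, a focal field of view v_d, and a contrasting field of view v_s), not necessarily the paper's exact equation:

\[
p(x_t \mid v_d, v_s, x_{<t}) \;=\; \mathrm{softmax}\big[(1+\alpha)\,\mathrm{logit}_\theta(x_t \mid v_d, x_{<t}) \;-\; \alpha\,\mathrm{logit}_\theta(x_t \mid v_s, x_{<t})\big]
\]

Intuitively, tokens whose likelihood collapses once the grounded focal region is in view are suppressed as probable hallucinations, while tokens supported by the focal evidence are amplified.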

Experimental Analysis

Extensive testing on the MSCOCO, MME, and LLaVA-Bench benchmarks demonstrates HALC's efficacy in significantly reducing OH across all hallucination types. HALC consistently outperforms existing state-of-the-art and baseline methods in these evaluations, offering a robust solution to the object hallucination problem without compromising text generation quality.

Conclusion

HALC presents an effective strategy for reducing OH in LVLMs by balancing the use of fine-grained visual information against text generation quality. Its comprehensive approach, applicability to a broad range of LVLMs, and strong empirical performance underscore its potential to advance vision-language model development. The open-source availability of HALC, combined with a unified benchmarking platform, further facilitates future research and application in this critical area of study.

Authors (6)
  1. Zhaorun Chen
  2. Zhuokai Zhao
  3. Hongyin Luo
  4. Huaxiu Yao
  5. Bo Li
  6. Jiawei Zhou