Object Hallucination in Image Captioning (1809.02156v2)

Published 6 Sep 2018 in cs.CL and cs.CV

Abstract: Despite continuously improving performance, contemporary image captioning models are prone to "hallucinating" objects that are not actually in a scene. One problem is that standard metrics only measure similarity to ground truth captions and may not fully capture image relevance. In this work, we propose a new image relevance metric to evaluate current models with veridical visual labels and assess their rate of object hallucination. We analyze how captioning model architectures and learning objectives contribute to object hallucination, explore when hallucination is likely due to image misclassification or language priors, and assess how well current sentence metrics capture object hallucination. We investigate these questions on the standard image captioning benchmark, MSCOCO, using a diverse set of models. Our analysis yields several interesting findings, including that models which score best on standard sentence metrics do not always have lower hallucination and that models which hallucinate more tend to make errors driven by language priors.

Analysis of Object Hallucination in Image Captioning Models

The paper "Object Hallucination in Image Captioning" addresses a critical shortcoming in current image captioning models, namely their tendency to hallucinate objects that are not present in the scene. This phenomenon raises concerns about the reliability of image captioning systems, especially in scenarios where accuracy is paramount, such as assistive technologies for the visually impaired.

Data and Metric Analysis

To rigorously evaluate object hallucination, the authors introduce a novel metric, CHAIR (Caption Hallucination Assessment with Image Relevance), which measures how well generated captions reflect the actual image content. The metric has two variants: CHAIRi, the per-instance rate (the fraction of mentioned objects that do not appear in the image), and CHAIRs, the per-sentence rate (the fraction of captions that contain at least one hallucinated object). These metrics are applied to a diverse set of captioning models on the MSCOCO benchmark.
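As a concrete illustration, the sketch below computes both variants from pre-extracted object sets. The function name chair_scores and the toy captions are illustrative assumptions; the actual evaluation additionally maps caption words to MSCOCO object categories using the synonym and plural lists released with the paper.

```python
from typing import List, Set, Tuple

def chair_scores(caption_objects: List[Set[str]],
                 gt_objects: List[Set[str]]) -> Tuple[float, float]:
    """Compute CHAIRi and CHAIRs over a set of captions.

    caption_objects[k]: object categories mentioned in caption k
                        (after synonym/plural normalization).
    gt_objects[k]:      object categories actually present in image k
                        (from segmentation labels and reference captions).
    """
    total_mentioned = 0       # all object mentions across captions
    hallucinated = 0          # mentions not supported by the image
    captions_with_halluc = 0  # captions with >= 1 hallucinated object

    for mentioned, present in zip(caption_objects, gt_objects):
        bad = mentioned - present
        total_mentioned += len(mentioned)
        hallucinated += len(bad)
        captions_with_halluc += bool(bad)

    chair_i = hallucinated / max(total_mentioned, 1)
    chair_s = captions_with_halluc / max(len(caption_objects), 1)
    return chair_i, chair_s

# Toy example: two captions, the second hallucinates a "dog".
print(chair_scores(
    [{"person", "surfboard"}, {"person", "dog"}],
    [{"person", "surfboard", "wave"}, {"person", "frisbee"}],
))  # -> (0.25, 0.5)
```

Lower values are better for both scores; CHAIRs counts a caption as hallucinated as soon as it mentions even one unsupported object.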

The analysis reveals that models optimized for standard sentence metrics such as CIDEr do not necessarily exhibit lower hallucination rates; in some cases, optimizing for CIDEr actually exacerbates hallucination. Quantitative results show that models with high CIDEr scores can simultaneously exhibit high hallucination rates under the CHAIR metrics. Moreover, the results suggest that hallucinated objects often reflect errors driven by language priors rather than image content.

Models Prone to Hallucination

The paper reviews several architectures, including attention-based models such as TopDown and Neural Baby Talk, emphasizing the role of attention and object detection in reducing hallucination. Models without attention mechanisms, such as FC and LRCN, tend to hallucinate more. Additionally, models trained with the self-critical loss, which directly optimizes the CIDEr score, often suffer increased hallucination.

Influence of Visual and Linguistic Features

A key contribution of this paper is the analysis of visual and linguistic consistency in captioning errors. Models whose hallucination errors are consistent with image-derived predictions (i.e., the mistaken object is one the visual features plausibly support) tend to hallucinate less overall. In contrast, hallucination errors strongly aligned with a language model's predictions indicate over-reliance on language priors.
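The sketch below illustrates one way to quantify this consistency analysis. It assumes each hallucinated mention has already been paired with the top predictions of an image classifier and of a caption-only language model; the simple top-k membership test and the field names are assumptions for illustration, and the paper's own scoring may weight predictions differently.

```python
from typing import List, Dict

def consistency_rates(hallucinations: List[Dict]) -> Dict[str, float]:
    """Estimate how often hallucinated objects agree with an image model
    versus a caption-only language model.

    Each entry describes one hallucinated object mention:
      {"object": str,
       "image_topk": objects an image classifier predicts for the image,
       "lm_topk":    objects a caption LM predicts at that position}
    """
    n = max(len(hallucinations), 1)
    im_consistent = sum(h["object"] in h["image_topk"] for h in hallucinations)
    lm_consistent = sum(h["object"] in h["lm_topk"] for h in hallucinations)
    return {"image_consistency": im_consistent / n,
            "language_consistency": lm_consistent / n}

# Toy example: one hallucination explained by the language prior only.
print(consistency_rates([
    {"object": "fork",
     "image_topk": {"pizza", "table"},
     "lm_topk": {"fork", "knife"}},
]))  # -> {'image_consistency': 0.0, 'language_consistency': 1.0}
```

A high language consistency paired with low image consistency, as in the toy example, is the signature of a hallucination driven by language priors rather than by what the model sees.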

The research also provides a deconstruction of the TopDown architecture to reveal how specific components influence model behavior. When spatial features or attention mechanisms are removed, hallucination rates tend to increase, highlighting the importance of these features in maintaining image relevance in captions.

Implications and Future Insights

This paper offers significant implications for the development of captioning models. It suggests that visual grounding, such as incorporating bounding box attention and object detection, should be prioritized to mitigate hallucination risks. Moreover, reliance on sentence-level metrics for model evaluation should be balanced with image relevance metrics like CHAIR to ensure models genuinely understand the image context.

The theoretical implications suggest future research should focus on designing models that balance language fluency with robust visual comprehension. Furthermore, integrating auxiliary image-understanding tasks, such as object detection and segmentation, may be a promising avenue for improving captioning performance.

In summary, this work provides a comprehensive examination of object hallucination in image captioning, offering insights that bridge the gap between numerical evaluation metrics and the real-world applicability of image captioning models. By introducing the CHAIR metric, the authors present a tool that can critically assess a model's ability to maintain image relevance, encouraging the development of more robust and reliable captioning systems.

Authors (5)
  1. Anna Rohrbach (53 papers)
  2. Lisa Anne Hendricks (37 papers)
  3. Kaylee Burns (14 papers)
  4. Trevor Darrell (324 papers)
  5. Kate Saenko (178 papers)
Citations (334)