Object Hallucination in Image Captioning

Updated 28 April 2026

Object hallucination in image captioning is the phenomenon where models generate references to objects that are absent from the input image due to over-reliance on language priors.
Metrics such as CHAIR, ALOHa, and DENEB quantify hallucination by checking mentioned objects against image ground truth, by semantic similarity matching, or by regression to human judgments.
Mitigation strategies include decoding-time corrections, training-time alignment, and hybrid prompting that collectively enhance visual grounding while balancing caption richness.

Object hallucination in image captioning is the phenomenon wherein models generate references to objects absent from the input image, resulting in spurious or unfaithful textual descriptions. This issue persists across state-of-the-art Large Vision-Language Models (LVLMs), undermining reliability, factual grounding, and downstream utility in safety-critical domains. Object hallucination is closely linked to model architecture, decoding protocol, training data bias, and multimodal integration.

1. Definitions, Metrics, and Taxonomy

Object hallucination is defined as the occurrence of mentions of objects, entities, or object attributes in a caption that do not appear in the input image. The seminal CHAIR metrics (“Caption Hallucination Assessment with Image Relevance”) formalize this at two levels: instance-level (CHAIR $_i$), measuring the fraction of hallucinated object mentions, and sentence-level (CHAIR $_s$), measuring the fraction of generated captions containing at least one hallucinated object (Rohrbach et al., 2018). Formally: $\text{CHAIR}_i = \frac{\#\{\text{hallucinated object mentions}\}}{\#\{\text{all object mentions}\}};\quad \text{CHAIR}_s = \frac{\#\{\text{captions with} \geq 1 \text{ hallucinated object}\}}{\#\{\text{captions}\}}$ Subsequent work extends this taxonomy to open-vocabulary (rare and unseen) objects (Ben-Kish et al., 2023), and distinguishes fine-grained hallucination at the attribute/action level as well as object existence (Wang et al., 2023). Contemporary metrics integrate LLM-based noun-phrase extraction and semantic similarity (ALOHa, CAOS, DENEB) to move beyond strict string-matching, supporting evaluation across arbitrary vocabularies and nuanced contexts (Petryk et al., 2024, Datta et al., 25 Jan 2025, Matsuda et al., 2024).
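
The two CHAIR ratios can be illustrated with a minimal Python sketch. It assumes that object mentions have already been extracted from each caption and mapped to the same vocabulary as the ground-truth object sets; the function name `chair_scores` and the toy data are hypothetical and are not taken from the reference implementation.

```python
def chair_scores(caption_mentions, image_objects):
    """Compute (CHAIR_i, CHAIR_s) for a batch of captions.

    caption_mentions: list of lists, the object mentions found in each caption
                      (assumed to be pre-extracted and synonym-normalized).
    image_objects:    list of sets, the objects actually present in each image.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0

    for mentions, present in zip(caption_mentions, image_objects):
        hallucinated = [m for m in mentions if m not in present]
        total_mentions += len(mentions)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            captions_with_hallucination += 1

    # Guard against empty inputs so the ratios stay well defined.
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = (captions_with_hallucination / len(caption_mentions)
               if caption_mentions else 0.0)
    return chair_i, chair_s


# Toy data (illustrative only): "frisbee" is mentioned but not in the image.
captions = [["dog", "frisbee"], ["cat", "sofa", "remote"], ["person"]]
truth = [{"dog"}, {"cat", "sofa", "remote"}, {"person", "bicycle"}]

chair_i, chair_s = chair_scores(captions, truth)
print(f"CHAIR_i = {chair_i:.3f}, CHAIR_s = {chair_s:.3f}")
```

In the toy batch, one of six mentions is hallucinated and one of three captions contains a hallucination, so the two metrics diverge: CHAIR $_s$ penalizes a caption equally whether it contains one or many hallucinated objects.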

2. Mechanisms and Causes of Hallucination

A consensus across empirical analyses is that object hallucination typically arises from an over-reliance on language priors in the decoding process—whereby generative models, as caption generation progresses, gradually “forget” or disregard the image signal and default to statistically likely object continuations (Sun et al., 24 Feb 2025, Favero et al., 2024, Feng et al., 2024). Empirical findings show a monotonic increase in token-level hallucination rate with sequential position: hallucinated tokens are rare in early decoding but dominate among the final ~25% of generated tokens, indicating a “snowball effect” where early errors anchor later factual drift (Sun et al., 24 Feb 2025). This drift is measurable: Jensen-Shannon divergence, or related prompt dependency measures, between token distributions with and without visual input sharply decreases as generation advances—concretely, >50% of tokens in typical LVLMs show JSD $<$ 0.1 in the latter half of the caption, and object token probabilities are increasingly determined by autoregressive context, not visual features (Sun et al., 24 Feb 2025, Favero et al., 2024).
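
The image-dependence measurement described above can be sketched as follows. This is a toy illustration, assuming that next-token distributions with and without the image are already available as probability lists; the base-2 logarithm (which bounds the divergence in [0, 1]), the helper names, and the example distributions are assumptions, not details from the cited papers.

```python
import math


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


def low_dependence_fraction(with_image, without_image, threshold=0.1):
    """Fraction of decoding steps whose JSD falls below the threshold."""
    scores = [jensen_shannon_divergence(p, q)
              for p, q in zip(with_image, without_image)]
    return sum(s < threshold for s in scores) / len(scores) if scores else 0.0


# Toy per-step distributions over a 3-token vocabulary (illustrative only).
# Later steps are constructed so the image barely changes the prediction.
with_image = [[0.8, 0.1, 0.1], [0.5, 0.3, 0.2], [0.3, 0.4, 0.3], [0.2, 0.5, 0.3]]
without_image = [[0.2, 0.5, 0.3], [0.3, 0.4, 0.3], [0.3, 0.4, 0.3], [0.2, 0.5, 0.3]]

fraction = low_dependence_fraction(with_image, without_image)
print(f"steps with JSD < 0.1: {fraction:.2f}")
```

A low divergence at a given step means the prediction would be nearly the same without the image, which is the signature of language-prior-driven generation that the paragraph describes.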

Additional factors exacerbating hallucination include modality bias (dominance of LLM prior over vision backbone, especially in cross-domain settings) (Fei et al., 2023), limited spatial resolution or weak patch–word alignment in the vision encoder (Dai et al., 2022), and ground-truth mismatches or under-specified human reference texts that fail to penalize plausible but unsupported mentions (Feng et al., 2024). Fixed response templates learnt during instruction tuning also contribute to repetitive, weakly grounded hallucinations (Sun et al., 24 Feb 2025).

3. Evaluation Methodologies and Their Limitations

Standard metrics including BLEU, ROUGE, SPICE, and CIDEr offer limited sensitivity to hallucination, as they assess n-gram or scene-graph overlap between prediction and reference without accessing the image. This allows hallucinated objects to “disappear in the average” if supported by references. CHAIR mitigates some of this by comparing to segmentation/classification ground-truth; however, its restricted object list (e.g., 80 MSCOCO classes), coarse synonym rules, and lack of context/sequential modeling limit recall and precision, especially for novel entities or indirect attributions (Rohrbach et al., 2018, Petryk et al., 2024).

To address this, several open-vocabulary and embedding-based metrics have emerged. ALOHa uses LLMs to extract groundable noun phrases, computes sentence-BERT or CLIP-based semantic similarity, and applies bipartite Hungarian matching for robust object-level hallucination scoring, resulting in improved detection on expert-annotated benchmarks (+13.6% on HAT, +30.8% on nocaps-FOIL compared to CHAIR) (Petryk et al., 2024). CAOS introduces context-aware cosine-similarity metrics analyzing both in- and out-of-domain nouns, sequential dynamics, and semantic proximity to frequent dataset classes, revealing whether hallucinations stem from language priors, prior caption context, or genuine visual confusion (Datta et al., 25 Jan 2025).
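
The matching step in an ALOHa-style metric can be sketched as below. This is a simplified illustration under several assumptions: similarities are supplied as a precomputed matrix rather than produced by sentence-BERT or CLIP, an exhaustive search over assignments stands in for the Hungarian algorithm (practical only for small object sets), and taking the minimum matched similarity as the caption-level score is an assumption of this sketch. The function name is hypothetical.

```python
from itertools import permutations


def match_objects(similarity):
    """Return the matched similarity for each candidate object.

    similarity[i][j] is the similarity between candidate object i (from the
    caption) and reference object j (from annotations or detections).
    Exhaustive search replaces the Hungarian algorithm in this sketch.
    """
    n = len(similarity)
    m = len(similarity[0]) if n else 0
    width = max(n, m)
    # Pad with zero-similarity dummy references so every candidate gets a slot.
    padded = [row + [0.0] * (width - m) for row in similarity]

    best_total, best_assignment = -1.0, ()
    for assignment in permutations(range(width), n):
        total = sum(padded[i][j] for i, j in enumerate(assignment))
        if total > best_total:
            best_total, best_assignment = total, assignment
    return [padded[i][j] for i, j in enumerate(best_assignment)]


# Toy similarities: candidates ["dog", "frisbee"] vs references ["puppy", "grass"].
similarity = [
    [0.9, 0.1],  # "dog" is close to "puppy"
    [0.2, 0.3],  # "frisbee" has no good counterpart
]
matched = match_objects(similarity)
# Assumption: the least-supported object determines the caption-level score.
caption_score = min(matched) if matched else 1.0
print(matched, caption_score)
```

One-to-one matching prevents a single reference object from "vouching" for several candidate mentions, which is why an assignment solver is used instead of a per-candidate maximum.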

Recent metrics such as DENEB propose a trained, reference- and image-similarity transformer (Sim-Vec), directly regressed to human judgments, providing strong hallucination detection and ranking accuracy across multiple captioning datasets (Matsuda et al., 2024). These new metrics systematically outperform classical approaches and facilitate open-domain, fine-grained hallucination quantification.

4. Mitigation Strategies and Decoding Protocols

Mitigation of object hallucination in image captioning centers around three paradigms:

A. Decoding-time Correction:

Methods such as Multi-Modal Mutual Information Decoding (M3ID) actively amplify the logit difference between vision-conditioned and unconditioned language distributions at each decoding step, thus maximizing the mutual information between caption and image (Favero et al., 2024). This reduces CHAIR $_i$ by 25–28% (e.g., from 7.4% to 5.3% for LLaVA-13B), and similarly boosts factual accuracy in visual question answering (Favero et al., 2024). Other approaches deploy token-level classifiers to detect low image-dependence via parallel decoding (with/without vision features), then select candidate sentences with high predicted “accuracy” measured by token-level image dependency (Sun et al., 24 Feb 2025). Caption-Sensitive Attention Intervention (CAI) exploits differences in attention activation between caption and non-caption queries, applying residual-state shifts to key attention heads during inference to boost visual grounding with minimal extra cost (Li et al., 30 Jun 2025).
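
The idea of amplifying the gap between vision-conditioned and unconditioned distributions can be sketched as a generic contrastive adjustment. This is not the exact M3ID update rule: the constant weight `alpha`, the function names, and the toy logits are assumptions chosen for illustration only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def amplify_visual_evidence(cond_logits, uncond_logits, alpha=1.0):
    """Boost tokens whose probability rises when the image is present.

    cond_logits:   next-token logits given image + text prefix.
    uncond_logits: next-token logits given the text prefix only.
    alpha:         hypothetical weight on the conditioned/unconditioned gap.
    """
    log_cond = log_softmax(cond_logits)
    log_uncond = log_softmax(uncond_logits)
    return [c + alpha * (c - u) for c, u in zip(log_cond, log_uncond)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Toy vocabulary: the language prior strongly favors "frisbee" after "dog".
vocab = ["dog", "frisbee", "grass"]
cond = [2.0, 2.1, 0.5]    # with the image, "frisbee" still narrowly wins
uncond = [1.0, 2.5, 0.5]  # without the image, "frisbee" dominates

adjusted = amplify_visual_evidence(cond, uncond)
print("plain decoding picks:", vocab[argmax(cond)])
print("adjusted decoding picks:", vocab[argmax(adjusted)])
```

The token whose likelihood owes most to the image gains the most, so a continuation that is plausible only because of the language prior is demoted. The cost is a second forward pass without visual input at each step.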

B. Training-time Alignment:

Fine-grained token-level alignment losses, such as ObjMLM (object-masked language modeling), directly force models to recover masked object words from the image context, reducing sentence-level hallucination by up to 17.4% (Dai et al., 2022). Consensus reasoning frameworks, which align language and vision scene graphs into a fused representation, improve grounding and lower hallucination rates (Zhang et al., 2021). Reinforcement learning with multi-objective reward functions that explicitly balance fidelity and adequacy, as in MOCHa, further targets open-vocabulary settings with strong reported empirical performance (Ben-Kish et al., 2023). Caption rewriting and fine-tuning on diverse, controlled captions (ReCaption) target fine-grained hallucination at the attribute/behavior level, with demonstrated improvements in precision and factual alignment (Wang et al., 2023).

C. Hybrid Prompting and Decoding:

Entity-aware prompting (e.g., ViECap) injects explicit object/entity lists derived from CLIP or DETR into the decoding prefix, anchoring LLM attention to detected visual entities and mitigating cross-domain hallucination (Fei et al., 2023). Self-validation frameworks rely on the model’s own vision priors: an inference-time “language-prior-free verification” of candidate objects, using object-only prompts, enables selection or aggregation of hallucination-free captions, yielding up to 65.6% CHAIR $_i$ reduction (Liu et al., 30 Jan 2026). Differentiated Beam Decoding parallelizes “unit fact” generation, balancing coverage and hallucination through diversity-oriented search and CLIP-based precision/recall scoring (Feng et al., 2024).
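
A best-of-N selection step driven by object verification might look like the sketch below. Everything here is hypothetical: the `verify` callable stands in for a model query with an object-only prompt, the candidate format is invented for illustration, and the scoring rule (lowest fraction of rejected objects, ties broken by more verified objects) is an assumption rather than the published procedure.

```python
def select_caption(candidates, verify):
    """Pick the candidate caption whose objects are best supported.

    candidates: list of (caption_text, object_list) pairs.
    verify:     callable(object_name) -> bool; a placeholder for an
                image-grounded check that ignores the caption context.
    """
    def score(item):
        _, objects = item
        verified = sum(1 for obj in objects if verify(obj))
        rejected_fraction = (len(objects) - verified) / len(objects) if objects else 0.0
        # Lower rejected fraction first; among ties, prefer richer captions.
        return (rejected_fraction, -verified)

    return min(candidates, key=score)[0]


# Stub verifier: pretend the image contains only these objects.
objects_in_image = {"dog", "grass", "ball"}


def stub_verify(obj):
    return obj in objects_in_image


candidates = [
    ("A dog catches a frisbee on the grass.", ["dog", "frisbee", "grass"]),
    ("A dog plays with a ball on the grass.", ["dog", "ball", "grass"]),
    ("A dog outside.", ["dog"]),
]
best = select_caption(candidates, stub_verify)
print(best)
```

The tie-break matters: without it, a very short caption with a single verified object would score as well as a detailed one, which illustrates the trade-off between hallucination and caption richness.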

A comparison of representative approaches is given below:

Methodology	Core Principle	Reported Improvement (CHAIR $_i$ reduction in % unless noted)	Open-Vocabulary Support
M3ID/DPO (Favero et al., 2024)	Decoding-time MI maximization	~25–28	Yes
Token-dep classifier (Sun et al., 24 Feb 2025)	Decoding with image-dependency filter	61.1 (high accuracy)	Yes
MOCHa (Ben-Kish et al., 2023)	RL with NLI & BERTScore rewards	-0.3 absolute (relative for BLIP-2); 0.4 OpenCHAIR	Yes
ReCaption (Wang et al., 2023)	Caption rewrites + finetuning	3–10 F1 points improvement (fine-grained)	Yes
Self-validation (Liu et al., 30 Jan 2026)	LPFV, BoN/FtA selection	~50–80% (varied baselines)	Yes
Entity prompts (Fei et al., 2023)	CLIP-detected hard prompts	4–5 points gain (entity precision)	Yes
CAI (Li et al., 30 Jun 2025)	Attention intervention	12–13% (MMHal, POPE F1)	Yes
CCA (Xing et al., 2024)	Concentric position, causal mask	1–2 absolute points (CHAIR $_i$ )	Yes

5. Architectural and Data-level Factors

Architectural enhancements strongly modulate hallucination propensity. Explicit object- or region-level alignment (TopDown-BB attention, Neural Baby Talk, consensus graph integration) consistently reduces CHAIR metrics by ~2–4% compared to purely global or sequence-pooling approaches (Rohrbach et al., 2018, Zhang et al., 2021). Patch-based vision backbones (ViT-B/16, ViT-L/14) with smaller patch sizes exhibit systematically lower hallucination rates than region/CNN-grid encoders (Dai et al., 2022). Concentric Causal Attention (CCA) reduces RoPE-induced long-term decay by flattening 2D grids concentrically, thus decreasing visual–text token distance and minimizing position-induced hallucination (Xing et al., 2024).
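
The effect of concentric flattening on positional distance can be illustrated with a small sketch. It assumes that each visual token receives the index of the ring it belongs to, counted from the image border inward, so that tokens in the same ring share a position; this indexing scheme and the function name are assumptions made for illustration and may differ from the published CCA design.

```python
def concentric_positions(height, width):
    """Assign each grid cell the index of its concentric ring.

    Ring 0 is the outer border; the index grows toward the center
    (an assumed convention for this sketch).
    """
    return [
        [min(r, c, height - 1 - r, width - 1 - c) for c in range(width)]
        for r in range(height)
    ]


def max_position(grid):
    return max(max(row) for row in grid)


# Small grid, printed row by row to show the ring structure.
for row in concentric_positions(4, 4):
    print(row)

# Compare the position range with raster-scan flattening on a 24x24 grid.
side = 24
raster_max = side * side - 1
concentric_max = max_position(concentric_positions(side, side))
print(f"raster max position: {raster_max}, concentric max position: {concentric_max}")
```

Under this assumed scheme a 24x24 grid spans only 12 distinct positions instead of 576, so no visual token sits far from the text tokens that follow, which is the property the paragraph attributes to CCA.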

Dataset contamination, especially when evaluating on splits (e.g., MSCOCO) also seen by models during pretraining/fine-tuning, underestimates hallucination: out-of-distribution evaluation (e.g., Objects365, nocaps) reveals CHAIR $_i$ increases by 2–3x, and open-vocabulary testbeds (OpenCHAIR, HAT, nocaps-FOIL) expose additional hallucination modes missed by fixed lists (Geigle et al., 2024, Ben-Kish et al., 2023, Petryk et al., 2024). Longer captions, increased granularity, or exhaustive instruction tuning can increase hallucination if not accompanied by enhanced visual grounding (as shown by HallE-Control, where control of the imagination/contextual ratio can reduce CHAIR $_i$ by up to 60%) (Zhai et al., 2023).

6. Implications, Limitations, and Future Directions

Object hallucination remains the primary factuality bottleneck for deployment of captioning models in trust-critical domains. While rule-based and early attention/grounding strategies reduced hallucination rates to ~5–10% on in-domain data, open-vocabulary and fine-grained benchmarks reveal persistent errors at higher rates. Most established mitigation strategies yield substantial, but not uniform, improvements and often entail trade-offs in recall or caption richness.

Persistent challenges include: (1) extending hallucination detection and mitigation to attributes, behaviors, and inter-object relationships; (2) minimizing reduction in object/attribute coverage when lowering hallucinations; (3) scaling mitigation methods for real-time or resource-constrained deployments; (4) engineering metrics and frameworks robust to evaluation prompt, domain, and object vocabulary drift (Ben-Kish et al., 2023, Petryk et al., 2024, Matsuda et al., 2024).

Ongoing research avenues encompass reference-free and region-level hallucination detection, adversarial and contrastive pre-training objectives, tighter integration of LLM-based object verifiers and fusion modules, and the design of architectural interventions (e.g., CCA, CAI) that can generalize across multi-modal tasks and architectures (Li et al., 30 Jun 2025, Xing et al., 2024).

Object hallucination in image captioning thus remains a dynamic field, requiring continued methodological and theoretical innovation at the intersection of vision, language, and machine learning.