ALOHa: A New Measure for Hallucination in Captioning Models (2404.02904v1)

Published 3 Apr 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Despite recent advances in multimodal pre-training for visual description, state-of-the-art models still produce captions containing errors, such as hallucinating objects not present in a scene. The existing prominent metric for object hallucination, CHAIR, is limited to a fixed set of MS COCO objects and synonyms. In this work, we propose a modernized open-vocabulary metric, ALOHa, which leverages LLMs to measure object hallucinations. Specifically, we use an LLM to extract groundable objects from a candidate caption, measure their semantic similarity to reference objects from captions and object detections, and use Hungarian matching to produce a final hallucination score. We show that ALOHa correctly identifies 13.6% more hallucinated objects than CHAIR on HAT, a new gold-standard subset of MS COCO Captions annotated for hallucinations, and 30.8% more on nocaps, where objects extend beyond MS COCO categories. Our code is available at https://davidmchan.github.io/aloha/.

ALOHa: Advancing Object Hallucination Detection in Image Captioning with LLMs

Introduction to ALOHa and Its Contributions

Recent advances in vision-language models have pushed the boundaries of image caption generation. Despite these advancements, generated captions still contain object hallucinations, that is, descriptions of objects absent from the image, and this remains a significant challenge. The paper presents ALOHa (A New Measure for Hallucination in Captioning Models), a novel approach that leverages LLMs to identify and quantify object hallucinations in captions more effectively than existing metrics. By incorporating semantic understanding and flexible, open-vocabulary object matching, ALOHa represents a step forward in evaluating and improving the reliability of automated caption generation systems.

Understanding Object Hallucination Metrics

Prior methods for detecting object hallucinations in image captions, such as CHAIR, rely on a fixed set of objects and their synonyms drawn from existing datasets such as MS COCO. While effective within that domain, this approach does not generalize to captions that mention objects outside the predefined set. ALOHa removes this limitation with an open-vocabulary metric that uses LLMs for object extraction and semantic similarity for matching, thereby accommodating a much wider range of objects and scenarios.
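
To make the contrast concrete, here is a small Python sketch under stated assumptions: the synonym table, example mentions, similarity threshold, and the all-MiniLM-L6-v2 embedding model are illustrative choices, not the paper's implementation. It shows how a closed-vocabulary lookup in the style of CHAIR misses an out-of-vocabulary mention that an embedding-based check can still match to a reference object.

```python
from sentence_transformers import SentenceTransformer, util

# A tiny, made-up closed vocabulary of "known" objects and their synonyms.
FIXED_SYNONYMS = {"dog": {"dog", "puppy"}, "bicycle": {"bicycle", "bike"}}

def closed_vocab_match(mention: str) -> bool:
    """CHAIR-style check: the mention must appear in the fixed synonym sets."""
    return any(mention in synonyms for synonyms in FIXED_SYNONYMS.values())

# Assumed off-the-shelf sentence-embedding model (illustrative choice).
model = SentenceTransformer("all-MiniLM-L6-v2")

def open_vocab_match(mention: str, reference_objects: list[str], threshold: float = 0.5) -> bool:
    """Open-vocabulary check: the mention only needs to be semantically
    close to some reference object."""
    embeddings = model.encode([mention] + reference_objects, convert_to_tensor=True)
    similarities = util.cos_sim(embeddings[0], embeddings[1:])
    return bool(similarities.max() >= threshold)

reference_objects = ["dog", "bicycle", "park"]
print(closed_vocab_match("dalmatian"))                    # False: not in the fixed set
print(open_vocab_match("dalmatian", reference_objects))   # likely True: close to "dog"
```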

Methodological Innovations of ALOHa

The ALOHa method involves several key stages:

  • Object Extraction: It employs an LLM to parse visually grounded objects from both the candidate caption and reference materials, adjusting for context, ambiguity, and uncertain language.
  • Object Set Refinement and Semantic Representation: The method refines the extracted object sets, accounting for uncertainty and conjunctions in the captions, and computes a semantic representation for each object.
  • Object Matching and Hallucination Scoring: It uses Hungarian matching to assign each object in the candidate caption a score based on its semantic similarity to the reference objects. ALOHa produces both object-level and caption-level hallucination scores, providing fine-grained insight into the presence and extent of hallucinations (see the sketch following this list).
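
A minimal sketch of the matching-and-scoring stage is shown below. It assumes the candidate and reference objects have already been extracted (for example by an LLM), uses an off-the-shelf sentence-embedding model for semantic similarity, and applies Hungarian matching via SciPy's linear_sum_assignment; the model choice and the minimum-score aggregation are assumptions for illustration, not the paper's exact formulation.

```python
from scipy.optimize import linear_sum_assignment
from sentence_transformers import SentenceTransformer

# Assumed off-the-shelf embedding model used to represent objects.
model = SentenceTransformer("all-MiniLM-L6-v2")

def hallucination_scores(candidate_objects, reference_objects):
    """Match candidate objects to reference objects and score them by similarity.

    Returns per-object scores (higher = better supported by the references) and
    a caption-level score, taken here as the minimum object score (an
    illustrative aggregation, not necessarily the paper's exact formulation).
    """
    cand_emb = model.encode(candidate_objects, normalize_embeddings=True)
    ref_emb = model.encode(reference_objects, normalize_embeddings=True)
    similarity = cand_emb @ ref_emb.T  # cosine similarities (candidates x references)

    # Hungarian matching: maximize total similarity by minimizing its negation.
    rows, cols = linear_sum_assignment(-similarity)
    object_scores = {candidate_objects[r]: float(similarity[r, c]) for r, c in zip(rows, cols)}
    caption_score = min(object_scores.values()) if object_scores else 1.0
    return object_scores, caption_score

# Hypothetical example: "surfboard" has no supporting reference object.
candidates = ["man", "surfboard", "beach"]
references = ["person", "sand", "ocean", "umbrella"]
per_object, overall = hallucination_scores(candidates, references)
print(per_object)   # "surfboard" should receive a comparatively low score
print(overall)      # a low caption-level score flags a likely hallucination
```

In this setup, a candidate object with no semantically close reference receives a low matched similarity, which pulls down the caption-level score and flags a likely hallucination.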

Evaluating ALOHa

ALOHa's efficacy is demonstrated through extensive evaluations. Compared with CHAIR and other existing metrics such as CLIPScore, ALOHa detects hallucinated objects more reliably: it correctly identifies 13.6% more hallucinated objects than CHAIR on HAT (a new gold-standard subset of MS COCO Captions annotated for hallucinations, introduced alongside ALOHa) and 30.8% more on nocaps, where objects extend beyond the MS COCO categories. These results underline ALOHa's generalizability and adaptability to different contexts.

Implications and Future Directions

The introduction of ALOHa has several significant implications for the field:

  • It highlights the potential of using LLMs not just for content generation but also for evaluative and analytical tasks in multimodal contexts.
  • ALOHa's open-vocabulary approach opens new avenues for caption evaluation across more diverse datasets, which is crucial for developing systems intended for broad application.
  • The nuanced detection of hallucinations that ALOHa provides can be vital for improving the reliability and trustworthiness of automated captioning systems, especially in settings where accuracy is paramount.

Future research could explore extending ALOHa's methodology for detecting other types of inaccuracies in generated content, such as factual inaccuracies or incorrect object relations, thus broadening its applicability. Additionally, integrating LLMs with more sophisticated object detection frameworks could further enhance hallucination detection capabilities.

Conclusion

ALOHa represents a meaningful advance in addressing object hallucination in automated image captioning. By leveraging the contextual understanding of LLMs and introducing a nuanced, flexible approach to hallucination detection, ALOHa sets a new standard for evaluating the accuracy and reliability of image captions. It offers a promising path forward both for improving caption generation models and for developing more sophisticated evaluation metrics in the vision-language domain.

References (41)
  1. nocaps: novel object captioning at scale. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pages 8947–8956. IEEE.
  2. Spice: Semantic propositional image caption evaluation. In European conference on computer vision, pages 382–398. Springer.
  3. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610–623.
  4. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  5. Sparks of artificial general intelligence: Early experiments with gpt-4. ArXiv preprint, abs/2303.12712.
  6. End-to-end object detection with transformers. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 213–229. Springer.
  7. Ic3: Image captioning by committee consensus. ArXiv preprint, abs/2302.01328.
  8. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023).
  9. Plausible may not be faithful: Probing object hallucination in vision-language pre-training. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2136–2148, Dubrovnik, Croatia. Association for Computational Linguistics.
  10. Handling divergent reference texts when evaluating table-to-text generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4884–4895, Florence, Italy. Association for Computational Linguistics.
  11. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070, Online. Association for Computational Linguistics.
  12. Koala: A dialogue model for academic research. Blog post.
  13. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  14. spacy: Industrial-strength natural language processing in python.
  15. spacy: Industrial-strength natural language processing in python, zenodo, 2020.
  16. Ed-faith: Evaluating dialogue summarization on faithfulness. ArXiv preprint, abs/2211.08464.
  17. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332–9346, Online. Association for Computational Linguistics.
  18. Harold W Kuhn. 1955. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97.
  19. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128(7):1956–1981.
  20. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 12888–12900. PMLR.
  21. Evaluating object hallucination in large vision-language models.
  22. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online. Association for Computational Linguistics.
  23. Advances in pre-training distributed word representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
  24. MetaICL: Learning to learn in context. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2791–2809, Seattle, United States. Association for Computational Linguistics.
  25. OpenAI. 2022. Introducing chatgpt.
  26. OpenAI. 2023. Gpt-4 technical report.
  27. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.
  28. Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
  29. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045, Brussels, Belgium. Association for Computational Linguistics.
  30. FOIL it! find one mismatch between image and language caption. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 255–265, Vancouver, Canada. Association for Computational Linguistics.
  31. Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3784–3803, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  32. Harim+: Evaluating summary quality with hallucination risk. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, pages 895–924.
  33. Arvind Krishna Sridhar and Erik Visser. 2022. Improved beam search for hallucination mitigation in abstractive summarization. ArXiv preprint, abs/2212.02712.
  34. David Wan and Mohit Bansal. 2022. Evaluating and improving factuality in multimodal abstractive summarization. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9632–9648, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  35. OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 23318–23340. PMLR.
  36. MSR-VTT: A large video description dataset for bridging video and language. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 5288–5296. IEEE Computer Society.
  37. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.
  38. Bartscore: Evaluating generated text as text generation. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 27263–27277.
  39. Bertscore: Evaluating text generation with BERT. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
  40. Judging llm-as-a-judge with mt-bench and chatbot arena. ArXiv preprint, abs/2306.05685.
  41. Chatgpt asks, blip-2 answers: Automatic questioning towards enriched visual descriptions. ArXiv preprint, abs/2303.06594.
Authors (7)
  1. Suzanne Petryk (12 papers)
  2. David M. Chan (30 papers)
  3. Anish Kachinthaya (2 papers)
  4. Haodi Zou (3 papers)
  5. John Canny (44 papers)
  6. Joseph E. Gonzalez (167 papers)
  7. Trevor Darrell (324 papers)
Citations (6)