
Mitigating Open-Vocabulary Caption Hallucinations

Published 6 Dec 2023 in cs.CV and cs.AI (arXiv:2312.03631v4)

Abstract: While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue of hallucinations, namely, the generation of spurious details that cannot be inferred from the given image. Existing methods largely use closed-vocabulary object lists to mitigate or evaluate hallucinations in image captioning, ignoring the long-tailed nature of hallucinations that occur in practice. To this end, we propose a framework for addressing hallucinations in image captioning in the open-vocabulary setting. Our framework includes a new benchmark, OpenCHAIR, that leverages generative foundation models to evaluate open-vocabulary object hallucinations for image captioning, surpassing the popular and similarly-sized CHAIR benchmark in both diversity and accuracy. Furthermore, to mitigate open-vocabulary hallucinations without using a closed object list, we propose MOCHa, an approach harnessing advancements in reinforcement learning. Our multi-objective reward function explicitly targets the trade-off between fidelity and adequacy in generations without requiring any strong supervision. MOCHa improves a large variety of image captioning models, as captured by our OpenCHAIR benchmark and other existing metrics. Code and models can be found at: https://github.com/assafbk/mocha_code
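The fidelity/adequacy trade-off mentioned in the abstract can be illustrated with a minimal sketch. This is not MOCHa's actual reward: the weighted sum, the `fidelity_weight` parameter, the `CaptionScores` type, and the example scores are all hypothetical, chosen only to show how a single scalar reward might balance a caption that is safe but sparse against one that is detailed but partly unsupported by the image.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionScores:
    """Hypothetical per-caption scores, both assumed to lie in [0, 1]."""
    caption: str
    fidelity: float  # how well the caption is supported by the image
    adequacy: float  # how much of the relevant detail the caption covers


def combined_reward(scores: CaptionScores, fidelity_weight: float = 0.5) -> float:
    """Weighted sum of fidelity and adequacy (an illustrative assumption)."""
    if not 0.0 <= fidelity_weight <= 1.0:
        raise ValueError("fidelity_weight must be in [0, 1]")
    return (fidelity_weight * scores.fidelity
            + (1.0 - fidelity_weight) * scores.adequacy)


# Made-up scores: the second caption is more detailed but less faithful.
candidates = [
    CaptionScores("a dog on a beach", fidelity=0.75, adequacy=0.5),
    CaptionScores("a dog chasing a red frisbee on a beach",
                  fidelity=0.25, adequacy=1.0),
]

# Weighting fidelity heavily favors the sparse, faithful caption.
prefers_fidelity = max(candidates, key=lambda s: combined_reward(s, 0.75))
# Weighting adequacy heavily favors the detailed caption.
prefers_adequacy = max(candidates, key=lambda s: combined_reward(s, 0.25))

print(prefers_fidelity.caption)
print(prefers_adequacy.caption)
```

In this toy setup, optimizing either objective alone picks an extreme, which is why a reward that scores both at once is needed to steer a captioner toward detailed but grounded output.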

