
MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification (2405.19186v1)

Published 29 May 2024 in cs.CV, cs.CL, and cs.LG

Abstract: Large Vision Language Models (LVLMs) have shown remarkable capabilities in multimodal tasks like visual question answering or image captioning. However, inconsistencies between the visual information and the generated text, a phenomenon referred to as hallucination, remain an unsolved problem with regard to the trustworthiness of LVLMs. To address this problem, recent works proposed to incorporate computationally costly Large (Vision) Language Models in order to detect hallucinations on a sentence- or subsentence-level. In this work, we introduce MetaToken, a lightweight binary classifier to detect hallucinations on the token level at negligible cost. Based on a statistical analysis, we reveal key factors of hallucinations in LVLMs which have been overlooked in previous works. MetaToken can be applied to any open-source LVLM without any knowledge about ground truth data, providing reliable detection of hallucinations. We evaluate our method on four state-of-the-art LVLMs, demonstrating the effectiveness of our approach.
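The core idea of meta classification, as the abstract describes it, is to train a lightweight binary classifier on per-token statistics of the model's output distribution rather than running a second large model. A minimal sketch of this setup is shown below; the feature choices (top-1 probability, entropy, margin to the runner-up token), the logistic-regression meta classifier, and the stand-in labels are illustrative assumptions, not the paper's exact feature set or training data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def token_features(probs):
    """Per-token uncertainty features from softmax distributions.

    probs: array of shape (num_tokens, vocab_size), one softmax
    distribution per generated token. Feature choices are illustrative.
    """
    p_max = probs.max(axis=1)                                # confidence of top token
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)   # dispersion of the distribution
    margin = p_max - np.sort(probs, axis=1)[:, -2]           # gap to the runner-up token
    return np.stack([p_max, entropy, margin], axis=1)

# Stand-in training data: random softmax outputs with random binary
# hallucination labels; in practice these come from an annotated caption set.
rng = np.random.default_rng(0)
probs_train = rng.dirichlet(np.ones(50), size=200)
labels = rng.integers(0, 2, size=200)

# The "meta classifier": a cheap model mapping token features to a
# hallucination score in [0, 1].
meta_clf = LogisticRegression().fit(token_features(probs_train), labels)
scores = meta_clf.predict_proba(token_features(probs_train))[:, 1]
```

At inference time, only the feature extraction and a logistic-regression forward pass are needed per token, which is what makes this kind of detector negligible in cost compared to querying a second large model.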

Authors (3)
  1. Laura Fieback
  2. Jakob Spiegelberg
  3. Hanno Gottschalk
Citations (1)