How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts (2402.13220v2)

Published 20 Feb 2024 in cs.CV and cs.CL

Abstract: The remarkable advancements in Multimodal LLMs (MLLMs) have not rendered them immune to challenges, particularly in the context of handling deceptive information in prompts, thus producing hallucinated responses under such conditions. To quantitatively assess this vulnerability, we present MAD-Bench, a carefully curated benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship. We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, Gemini-Pro, to open-sourced models, such as LLaVA-NeXT and MiniCPM-Llama3. Empirically, we observe significant performance gaps between GPT-4o and other models; and previous robust instruction-tuned models are not effective on this new benchmark. While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%. We further propose a remedy that adds an additional paragraph to the deceptive prompts to encourage models to think twice before answering the question. Surprisingly, this simple method can even double the accuracy; however, the absolute numbers are still too low to be satisfactory. We hope MAD-Bench can serve as a valuable benchmark to stimulate further research to enhance model resilience against deceptive prompts.

Summary

  • The paper introduces MAD-Bench, a benchmark of 1,000 image-prompt pairs designed to assess MLLMs' susceptibility to hallucinations induced by deceptive prompts.
  • The paper reports that GPT-4o achieves 82.82% accuracy on MAD-Bench while the other evaluated models range from 9% to 50%, highlighting a critical performance gap.
  • The paper proposes a context-augmentation strategy that improves model responses, though absolute performance remains unsatisfactory.

Evaluating Multimodal LLMs with MAD-Bench

The paper “How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts” presents an empirical analysis of the robustness of Multimodal LLMs (MLLMs) when exposed to deceptive prompts. This research fills a notable gap by examining how these sophisticated AI models manage inconsistencies between textual and visual information, a challenge not comprehensively studied before.

Research Objective and Methodology

The primary aim of the paper is to evaluate the susceptibility of various MLLMs to hallucinations induced by deceptive prompts. To this end, the researchers developed MAD-Bench, a benchmark designed explicitly for this purpose, comprising 1,000 image-prompt pairs spanning deception categories including non-existent objects, object count, object attribute, scene understanding, spatial relationship, and visual confusion. The benchmark is intended to quantify how reliably these models ground their answers in the image rather than in a prompt's false premises.
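To make this structure concrete, the Python sketch below shows one plausible way to represent a MAD-Bench-style sample. The field names and category labels are illustrative assumptions drawn from the description above, not the authors' released schema.

```python
from dataclasses import dataclass

# Hypothetical category labels, following the list described above.
DECEPTION_CATEGORIES = {
    "non_existent_object",
    "object_count",
    "object_attribute",
    "scene_understanding",
    "spatial_relationship",
    "visual_confusion",
}

@dataclass
class MADBenchSample:
    """One image-prompt pair in a MAD-Bench-style benchmark (illustrative schema)."""
    image_path: str        # path to the source image
    deceptive_prompt: str  # question containing a false premise about the image
    category: str          # one of DECEPTION_CATEGORIES
    ground_truth: str      # what is actually true in the image, used for judging

    def __post_init__(self) -> None:
        if self.category not in DECEPTION_CATEGORIES:
            raise ValueError(f"unknown deception category: {self.category}")
```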

This benchmark is employed to evaluate several state-of-the-art MLLMs, including GPT-4V, Gemini-Pro, and several open-sourced models such as LLaVA-1.5 and CogVLM. The models' responses were scrutinized for accuracy and resilience against misleading information. Additionally, a novel mitigation strategy was proposed and tested, involving the addition of an introductory paragraph to deceptive prompts, encouraging the models to reassess their responses.
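As a rough illustration of this evaluation setup, the sketch below scores a model on such a benchmark and reports per-category accuracy. It builds on the sample schema sketched above; `query_mllm` and `judge_response` are hypothetical placeholders for the model under test and an automated response judge, not APIs described in the paper.

```python
from collections import defaultdict
from typing import Callable

def evaluate(samples: list[MADBenchSample],
             query_mllm: Callable[[str, str], str],
             judge_response: Callable[[MADBenchSample, str], bool]) -> dict[str, float]:
    """Return per-category accuracy: the fraction of deceptive prompts the model resists."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for sample in samples:
        answer = query_mllm(sample.image_path, sample.deceptive_prompt)
        if judge_response(sample, answer):  # True if the model rejects the false premise
            correct[sample.category] += 1
        total[sample.category] += 1
    return {cat: correct[cat] / total[cat] for cat in total}
```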

Key Findings

The findings reveal a significant disparity in performance across models: GPT-4o achieves an accuracy of 82.82% on MAD-Bench, while the accuracy of the other evaluated models ranges from 9% to 50%. This gap highlights GPT-4o's relative robustness in handling deceptive information compared to its contemporaries. Notably, models previously instruction-tuned for robustness, such as LRV-Instruction and LLaVA-RLHF, failed to perform effectively on this new benchmark, underscoring that existing training paradigms are insufficient for mitigating hallucinations induced by deceptive prompts.

One particularly insightful aspect of the paper was the proposed remedial strategy, which involved augmenting prompts with additional context to encourage model deliberation. While this approach significantly improved accuracy, sometimes doubling it, the absolute performance remained unsatisfactory, indicating the complexity of the challenge posed by deceptive prompts.
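The mechanism behind this remedy is simple: a cautionary paragraph is prepended to each deceptive prompt before it is sent to the model. The snippet below illustrates the idea with placeholder wording; the exact paragraph used in the paper is not reproduced here.

```python
# Illustrative only: the cautionary wording below is a paraphrase of the idea,
# not the exact prompt published in the paper.
CAUTION_PREFIX = (
    "Before answering, carefully check whether the question's premises are "
    "consistent with what is actually shown in the image. If the question "
    "refers to objects, counts, attributes, or relationships that are not "
    "present, point out the inconsistency instead of answering as asked.\n\n"
)

def add_caution(deceptive_prompt: str) -> str:
    """Prepend the cautionary paragraph to a deceptive prompt."""
    return CAUTION_PREFIX + deceptive_prompt
```

In the evaluation sketch shown earlier, the augmented prompt `add_caution(sample.deceptive_prompt)` would simply replace the raw prompt when querying the model.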

Implications and Future Directions

The implications of this research are twofold. Practically, it provides a new measurement tool for assessing and improving the reliability of MLLMs in real-world applications. Theoretically, it underscores the need for systematic investigation of how MLLMs reconcile conflicting textual and visual information, particularly for developing mechanisms that keep models accountable and trustworthy.

For future research, the paper suggests several promising avenues: augmenting training data with deceptive prompts to build more resilient models, improving cross-modal consistency checks, and refining the models' attention and reasoning faculties to prioritize factual alignment over speculative assumptions.

In conclusion, this paper by Yusu Qian et al. exemplifies a rigorous investigation into the pitfalls of MLLMs under deceptive conditions and sets a foundation for developing more robust and reliable multimodal AI systems. The introduction of MAD-Bench is a substantial contribution to the field, promising to catalyze further research into enhancing the resilience of emerging AI technologies.
