How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts (2402.13220v2)
Abstract: The remarkable advancements in Multimodal LLMs (MLLMs) have not rendered them immune to challenges; in particular, they tend to produce hallucinated responses when prompts contain deceptive information. To quantitatively assess this vulnerability, we present MAD-Bench, a carefully curated benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, object counts, and spatial relationships. We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, and Gemini-Pro to open-source models such as LLaVA-NeXT and MiniCPM-Llama3. Empirically, we observe a significant performance gap between GPT-4o and the other models, and find that models trained with previous robust instruction tuning are not effective on this new benchmark. While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of every other model in our experiments ranges from 9% to 50%. We further propose a remedy that prepends an additional paragraph to the deceptive prompts to encourage models to think twice before answering the question. Surprisingly, this simple method can even double the accuracy; however, the absolute numbers are still too low to be satisfactory. We hope MAD-Bench can serve as a valuable benchmark to stimulate further research to enhance model resilience against deceptive prompts.
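The remedy described in the abstract amounts to prepending a cautionary paragraph to each deceptive prompt before querying the model. Below is a minimal sketch of that idea using the OpenAI chat API as one possible backend; the exact wording of the prefix and the helper names (`THINK_TWICE_PREFIX`, `ask_with_caution`) are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of the prompt-level remedy: prepend a paragraph asking the model to
# verify the question's premises against the image before answering.
# The prefix wording below is an assumption; the paper reports that adding
# such a paragraph can roughly double accuracy on MAD-Bench.

import base64
from openai import OpenAI  # assumes the official openai Python package

# Hypothetical cautionary paragraph (illustrative, not the authors' exact text).
THINK_TWICE_PREFIX = (
    "Before answering, carefully check whether the premises in the question "
    "actually match the image (objects mentioned, their counts, colors, and "
    "spatial relations). If any premise is false, point out the mismatch "
    "instead of answering as if it were true.\n\n"
)

def ask_with_caution(client: OpenAI, image_path: str, deceptive_prompt: str) -> str:
    """Send an image plus a deceptive prompt, guarded by the cautionary paragraph."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable chat model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": THINK_TWICE_PREFIX + deceptive_prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example usage (hypothetical image and prompt):
# client = OpenAI()
# print(ask_with_caution(client, "kitchen.jpg",
#                        "What color is the cat sitting on the table?"))
```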