Adversarial Robustness for Visual Grounding of Multimodal Large Language Models (2405.09981v1)

Published 16 May 2024 in cs.CV

Abstract: Multimodal LLMs (MLLMs) have recently achieved enhanced performance across various vision-language tasks, including visual grounding. However, the adversarial robustness of visual grounding in MLLMs remains unexplored. To fill this gap, we use referring expression comprehension (REC) as an example visual grounding task and propose three adversarial attack paradigms. First, untargeted adversarial attacks induce MLLMs to generate an incorrect bounding box for each object. Second, exclusive targeted adversarial attacks force all generated outputs onto the same target bounding box. Third, permuted targeted adversarial attacks permute the bounding boxes among different objects within a single image. Extensive experiments demonstrate that the proposed methods can successfully attack the visual grounding capabilities of MLLMs. Our methods not only provide a new perspective for designing novel attacks but also serve as a strong baseline for improving the adversarial robustness of visual grounding in MLLMs.
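
To make the three paradigms concrete, the minimal sketch below shows a PGD-style formulation of the untargeted variant. It is an assumption-laden illustration, not the paper's method: the abstract does not give the loss or optimization details, so the `predict_box` interface, the L1 box loss, and the eps/alpha/steps values are all hypothetical placeholders.

```python
# Hypothetical PGD-style sketch of an untargeted attack on an MLLM's REC output.
# Assumptions (not from the paper): the model exposes a differentiable
# `predict_box(image, expression)` returning normalized box coordinates;
# real MLLMs emit boxes as text tokens, so the paper's actual objective is
# likely defined over output tokens instead.
import torch
import torch.nn.functional as F

def untargeted_rec_attack(model, image, expression, gt_box,
                          eps=8 / 255, alpha=1 / 255, steps=40):
    """Perturb `image` within an L_inf ball so the box predicted for
    `expression` moves away from the ground-truth box `gt_box`."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        pred_box = model.predict_box(adv, expression)     # hypothetical API
        loss = F.l1_loss(pred_box, gt_box)                # distance to the true box
        (grad,) = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv + alpha * grad.sign()               # ascend: push the box away
            adv = image + (adv - image).clamp(-eps, eps)  # project to the L_inf ball
            adv = adv.clamp(0, 1)                         # keep a valid image
    return adv.detach()

# The two targeted variants would instead descend on a loss toward a chosen box:
# a single shared target box for the exclusive targeted attack, or the box of a
# different object in the same image for the permuted targeted attack, e.g.
# loss = F.l1_loss(pred_box, target_box) and adv = adv - alpha * grad.sign().
```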

Authors (5)
  1. Kuofeng Gao
  2. Yang Bai
  3. Jiawang Bai
  4. Yong Yang
  5. Shu-Tao Xia