The Wolf Within: Covert Injection of Malice into MLLM Societies via an MLLM Operative (2402.14859v2)

Published 20 Feb 2024 in cs.CR, cs.AI, cs.CY, and cs.LG

Abstract: Due to their unprecedented ability to process and respond to various types of data, Multimodal LLMs (MLLMs) are constantly defining the new boundary of AGI. As these advanced generative models increasingly form collaborative networks for complex tasks, the integrity and security of these systems are crucial. Our paper, "The Wolf Within", explores a novel vulnerability in MLLM societies: the indirect propagation of malicious content. Unlike attacks that directly elicit harmful output from an MLLM, our research demonstrates how a single MLLM agent can be subtly influenced to generate prompts that, in turn, induce other MLLM agents in the society to output malicious content. Our findings reveal that an MLLM agent, when manipulated to produce specific prompts or instructions, can effectively "infect" other agents within a society of MLLMs. This infection leads to the generation and circulation of harmful outputs, such as dangerous instructions or misinformation, across the society. We also show the transferability of these indirectly generated prompts, highlighting their potential to propagate malice through inter-agent communication. This research provides critical insight into a new dimension of threat posed by MLLMs, where a single agent can act as a catalyst for widespread malevolent influence. Our work underscores the urgent need for robust mechanisms to detect and mitigate such covert manipulations within MLLM societies, ensuring their safe and ethical use in societal applications.
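The propagation pattern the abstract describes can be pictured with a small, purely illustrative simulation. This is not the paper's code: the agent class, the payload string, and the sequential message-passing scheme are all assumptions made for the sketch, which only shows how a payload emitted by one compromised agent can circulate through otherwise benign peers.

```python
# Hypothetical sketch of indirect prompt propagation in an MLLM "society".
# Names (Agent, JAILBREAK_PAYLOAD, run_round) are illustrative, not from the paper.
from dataclasses import dataclass, field

JAILBREAK_PAYLOAD = "<crafted instruction that elicits harmful output>"

@dataclass
class Agent:
    name: str
    compromised: bool = False
    inbox: list = field(default_factory=list)

    def respond(self, message: str) -> str:
        self.inbox.append(message)
        if self.compromised:
            # The "wolf": its adversarially perturbed input makes it embed the
            # payload inside an otherwise ordinary-looking reply.
            return f"Task summary for the team. {JAILBREAK_PAYLOAD}"
        if JAILBREAK_PAYLOAD in message:
            # A benign agent acts on the relayed instruction and re-emits it,
            # so the payload keeps circulating to downstream agents.
            return f"[unsafe output induced by relayed payload] {JAILBREAK_PAYLOAD}"
        return "[normal output]"

def run_round(society: list, message: str) -> list:
    """Pass each agent's reply on as the next agent's input."""
    outputs = []
    for agent in society:
        message = agent.respond(message)
        outputs.append((agent.name, message))
    return outputs

if __name__ == "__main__":
    society = [Agent("wolf", compromised=True), Agent("sheep_1"), Agent("sheep_2")]
    for name, out in run_round(society, "benign user request"):
        print(f"{name}: {out}")
```

In this toy chain, only the first agent is compromised, yet every downstream agent produces unsafe output, mirroring the "single agent as a catalyst" threat model; the actual attack in the paper operates on real MLLM agents rather than scripted stand-ins.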

Authors (7)
  1. Zhen Tan
  2. Chengshuai Zhao
  3. Raha Moraffah
  4. Yifan Li
  5. Yu Kong
  6. Tianlong Chen
  7. Huan Liu
Citations (10)