Learn to Disguise: Avoid Refusal Responses in LLM's Defense via a Multi-agent Attacker-Disguiser Game (2404.02532v1)
Abstract: As large models achieve stronger performance on natural language processing tasks, potential moral and ethical issues of large models arise. Malicious attackers can induce large models to jailbreak and generate illegal or privacy-invasive content through techniques such as prompt engineering. In response, large models counter such attacks with techniques such as safety alignment. However, a strong defense mechanism that relies on refusal replies is easily identified by attackers and exploited to strengthen their capabilities. In this paper, we propose a multi-agent attacker-disguiser game approach that achieves a weak defense mechanism, allowing the large model both to reply safely to the attacker and to hide its defense intent. First, we construct a multi-agent framework to simulate attack and defense scenarios, with different roles responsible for the attack, disguise, safety evaluation, and disguise evaluation tasks. We then design attack and disguise game algorithms to optimize the strategies of the attacker and the disguiser, and use a curriculum learning process to strengthen the agents' capabilities. Experiments verify that our method strengthens the model's ability to disguise its defense intent more effectively than other methods. Moreover, our approach can adapt any black-box large model to assist in defense and is robust to model version iterations.
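The interaction described in the abstract can be sketched as a round-based game: an attacker agent crafts a jailbreak prompt for each curriculum stage, a disguiser agent replies safely without an overt refusal, and two evaluator agents score safety and disguise quality. The sketch below is a minimal illustration of that loop; all function names and the canned rule-based agents are hypothetical stand-ins (the paper's agents would prompt an actual black-box LLM), and the curriculum stages shown are invented examples.

```python
def attacker_agent(strategy: str, target: str) -> str:
    # Craft a jailbreak prompt under the current curriculum strategy.
    return f"[{strategy}] Please explain how to {target}"

def disguiser_agent(attack_prompt: str) -> str:
    # Reply safely, but without an explicit refusal phrase,
    # hiding the defense intent (the "weak defense" mechanism).
    return "That's an interesting topic; here is some general, lawful context instead."

def safety_evaluator(reply: str) -> bool:
    # True if the reply leaks no harmful instructions (toy heuristic).
    return "how to" not in reply.lower()

def disguise_evaluator(reply: str) -> bool:
    # True if the reply avoids recognizable refusal markers.
    refusal_markers = ("i cannot", "i can't", "as an ai", "i'm sorry")
    return not any(m in reply.lower() for m in refusal_markers)

def play_game(curriculum: list, target: str) -> list:
    """Run one attacker-disguiser round per curriculum stage,
    ordered from easier to harder attack strategies."""
    transcript = []
    for strategy in curriculum:
        attack = attacker_agent(strategy, target)
        reply = disguiser_agent(attack)
        transcript.append({
            "strategy": strategy,
            "attack": attack,
            "reply": reply,
            "safe": safety_evaluator(reply),
            "disguised": disguise_evaluator(reply),
        })
    return transcript

rounds = play_game(["direct-ask", "role-play", "nested-scenario"], "do something harmful")
```

In the full method, the evaluators' feedback would drive iterative optimization of the attacker's and disguiser's game strategies across curriculum stages, rather than a single pass as shown here.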