AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks (2403.04783v2)

Published 2 Mar 2024 in cs.LG, cs.CL, and cs.CR

Abstract: Despite extensive pre-training in moral alignment to prevent generating harmful information, LLMs remain vulnerable to jailbreak attacks. In this paper, we propose AutoDefense, a multi-agent defense framework that filters harmful responses from LLMs. With the response-filtering mechanism, our framework is robust against different jailbreak attack prompts and can be used to defend different victim models. AutoDefense assigns different roles to LLM agents and employs them to complete the defense task collaboratively. The division of tasks enhances the overall instruction-following of LLMs and enables the integration of other defense components as tools. With AutoDefense, small open-source LMs can serve as agents and defend larger models against jailbreak attacks. Our experiments show that AutoDefense can effectively defend against different jailbreak attacks while maintaining performance on normal user requests. For example, we reduce the attack success rate on GPT-3.5 from 55.74% to 7.95% using LLaMA-2-13b with a 3-agent system. Our code and data are publicly available at https://github.com/XHMY/AutoDefense.


The paper "AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks" presents a novel approach to mitigating the vulnerability of LLMs to jailbreak attacks. Such attacks circumvent the models' safety alignment, eliciting harmful responses despite extensive training aimed at aligning LLM outputs with human values.

Overview

AutoDefense is a multi-agent defense framework that filters LLM responses to identify and block harmful outputs. The framework employs multiple LLM agents with distinct roles that collaboratively examine a candidate response and decide whether it may be returned to the user. This division of labor improves instruction-following on each subtask and allows other defensive components to be integrated as tools.
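To make the response-filtering idea concrete, the following sketch shows where such a filter sits relative to the victim model. It is not the paper's implementation (the released code is built on the AutoGen framework); the `victim_llm` and `is_harmful` callables are illustrative assumptions.

```python
from typing import Callable

REFUSAL = "I'm sorry, but I can't help with that request."

def defended_generate(user_prompt: str,
                      victim_llm: Callable[[str], str],
                      is_harmful: Callable[[str], bool]) -> str:
    """Post-generation response filtering: the defense never rewrites the
    (possibly adversarial) prompt; it only judges the produced response."""
    candidate = victim_llm(user_prompt)   # the victim model answers as usual
    if is_harmful(candidate):             # delegated to the defense agency (sketched below)
        return REFUSAL                    # withhold harmful content
    return candidate                      # benign responses pass through unchanged
```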

Key Design Elements

  • Multi-Agent System: AutoDefense assigns specific subtasks, such as intention analysis, original-prompt inference, and a final judgment, to separate LLM agents. Focusing each agent on a narrow subtask makes better use of the models' inherent alignment and encourages divergent thinking; a minimal sketch of this decomposition follows the list.
  • Response Filtering: Unlike prompt-based defenses, which can degrade the quality of responses to benign requests, AutoDefense filters responses after generation, so its effectiveness does not hinge on the wording of the attack prompt.
  • Adaptability and Integration: The framework can use a range of LLMs as agents, including smaller, inexpensive open-source models, which makes it practical across diverse deployment settings. This adaptability also allows other defense methods, such as Llama Guard, to be integrated as agents within the framework.
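A minimal, framework-agnostic sketch of such a three-agent judgment is shown below. It assumes a generic `llm(prompt)` callable; the role prompts are illustrative paraphrases of the roles described above, not the paper's exact prompts or its AutoGen-based implementation.

```python
from typing import Callable

def is_harmful(response: str, llm: Callable[[str], str]) -> bool:
    """Three-agent response judgment: intention analysis, original-prompt
    inference, then a final verdict. Each step is a separate LLM call."""
    # Agent 1: intention analyzer -- summarize what the response is trying to accomplish.
    intention = llm(
        "You analyze intentions. In one or two sentences, describe the underlying "
        f"intention of the following response.\n\nResponse:\n{response}"
    )
    # Agent 2: prompt analyzer -- infer the original request that likely produced it.
    inferred_prompt = llm(
        "You infer prompts. Given a response and its analyzed intention, state the "
        f"most likely original user request.\n\nResponse:\n{response}\n\nIntention:\n{intention}"
    )
    # Agent 3: judge -- decide whether returning the response would be safe.
    verdict = llm(
        "You are a safety judge. Considering the intention and the inferred request, "
        "answer with exactly VALID (safe to return) or INVALID (must be withheld).\n\n"
        f"Response:\n{response}\nIntention:\n{intention}\nInferred request:\n{inferred_prompt}"
    )
    return "INVALID" in verdict.upper()
```

In the paper's terms, `is_harmful` plays the role of the defense agency invoked by `defended_generate` above; merging the roles into one or two agents, or splitting them further, yields the other agent configurations evaluated in the experiments.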

Experimental Evaluation

The paper substantiates its claims through extensive experimentation involving a variety of harmful prompts and open-source LLMs. The results reveal:

  • A significant reduction in the Attack Success Rate (ASR): using LLaMA-2-13b as the defense model in a three-agent configuration reduces the ASR on GPT-3.5 from 55.74% to 7.95% (the metric is made explicit in the sketch after this list).
  • Maintenance of high accuracy in handling regular user requests, with an overall accuracy rate of 92.91%, ensuring minimal interruption to legitimate interactions.
  • Flexibility across agent configurations, with the three-agent setup providing the best overall trade-off between blocking attacks and preserving accuracy on benign requests.
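For reference, ASR is simply the fraction of jailbreak attempts that still elicit harmful content; the sketch below assumes a list of per-attempt success labels and is meant only to make the reported numbers concrete, not to reproduce the paper's evaluation pipeline.

```python
def attack_success_rate(attack_succeeded: list[bool]) -> float:
    """ASR (%) = share of jailbreak attempts that still yield harmful content."""
    return 100.0 * sum(attack_succeeded) / len(attack_succeeded) if attack_succeeded else 0.0

# Reported headline numbers: 55.74% (undefended GPT-3.5) vs. 7.95% with the
# three-agent LLaMA-2-13b defense -- roughly a 7x reduction in successful attacks.
```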

Implications and Future Directions

The paper's findings have notable theoretical and practical implications. The multi-agent approach challenges the conventional, monolithic view of LLM defense, suggesting that modular, collective defense strategies can be more effective.

Practically, AutoDefense offers a scalable, resource-efficient solution for enhancing the robustness of LLM deployments against malicious exploitation. This multilayered strategy ensures that improved safety does not come at the cost of performance on benign tasks.

Looking forward, the authors suggest that exploring dynamic communication patterns within multi-agent systems and incorporating additional advanced defense methods could further improve such frameworks. With its integrative potential, the multi-agent strategy is a promising frontier for fortifying AI systems against increasingly sophisticated adversarial techniques.

In conclusion, "AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks" provides a compelling advancement in the field of AI safety, setting a robust foundation for future research and development in defending machine intelligence against adversarial manipulation.

Authors (5)
  1. Yifan Zeng (23 papers)
  2. Yiran Wu (12 papers)
  3. Xiao Zhang (435 papers)
  4. Huazheng Wang (44 papers)
  5. Qingyun Wu (47 papers)
Citations (34)