AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks (2403.04783v2)

Published 2 Mar 2024 in cs.LG, cs.CL, and cs.CR

Abstract: Despite extensive pre-training in moral alignment to prevent generating harmful information, LLMs remain vulnerable to jailbreak attacks. In this paper, we propose AutoDefense, a multi-agent defense framework that filters harmful responses from LLMs. With the response-filtering mechanism, our framework is robust against different jailbreak attack prompts and can be used to defend different victim models. AutoDefense assigns different roles to LLM agents and employs them to complete the defense task collaboratively. The division of tasks enhances the overall instruction-following of LLMs and enables the integration of other defense components as tools. With AutoDefense, small open-source LMs can serve as agents and defend larger models against jailbreak attacks. Our experiments show that AutoDefense can effectively defend against different jailbreak attacks while maintaining performance on normal user requests. For example, we reduce the attack success rate on GPT-3.5 from 55.74% to 7.95% using LLaMA-2-13b with a 3-agent system. Our code and data are publicly available at https://github.com/XHMY/AutoDefense.


The paper "AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks" presents a novel approach to mitigating the vulnerability of LLMs to jailbreak attacks. Such attacks circumvent the models' safety alignment, eliciting harmful responses despite extensive training aimed at aligning LLM outputs with human values.

Overview

AutoDefense is a multi-agent defense framework that filters LLM responses to identify and block harmful outputs. The framework employs multiple LLM agents with distinct roles that collaboratively examine a candidate response and decide whether it may be returned to the user. This division of labor improves instruction-following on each subtask and allows other defensive components to be integrated as tools.
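To make the response-filtering idea concrete, the following sketch shows where such a filter sits relative to the victim model. It is not the paper's implementation (the released code is built on the AutoGen framework); the `victim_llm` and `is_harmful` callables are illustrative assumptions.

```python
from typing import Callable

REFUSAL = "I'm sorry, but I can't help with that request."

def defended_generate(user_prompt: str,
                      victim_llm: Callable[[str], str],
                      is_harmful: Callable[[str], bool]) -> str:
    """Post-generation response filtering: the defense never rewrites the
    (possibly adversarial) prompt; it only judges the produced response."""
    candidate = victim_llm(user_prompt)   # the victim model answers as usual
    if is_harmful(candidate):             # delegated to the defense agency (sketched below)
        return REFUSAL                    # withhold harmful content
    return candidate                      # benign responses pass through unchanged
```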

Key Design Elements

  • Multi-Agent System: AutoDefense assigns specific subtasks, such as intention analysis, original-prompt inference, and a final judgment, to separate LLM agents. Focusing each agent on a narrow subtask makes better use of the models' inherent alignment and encourages divergent thinking; a minimal sketch of this decomposition follows the list.
  • Response Filtering: Unlike prompt-based defenses, which can degrade the quality of responses to benign requests, AutoDefense filters responses after generation, so its effectiveness does not hinge on the wording of the attack prompt.
  • Adaptability and Integration: The framework can use a range of LLMs as agents, including smaller, inexpensive open-source models, which makes it practical across diverse deployment settings. This adaptability also allows other defense methods, such as Llama Guard, to be integrated as agents within the framework.
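A minimal, framework-agnostic sketch of such a three-agent judgment is shown below. It assumes a generic `llm(prompt)` callable; the role prompts are illustrative paraphrases of the roles described above, not the paper's exact prompts or its AutoGen-based implementation.

```python
from typing import Callable

def is_harmful(response: str, llm: Callable[[str], str]) -> bool:
    """Three-agent response judgment: intention analysis, original-prompt
    inference, then a final verdict. Each step is a separate LLM call."""
    # Agent 1: intention analyzer -- summarize what the response is trying to accomplish.
    intention = llm(
        "You analyze intentions. In one or two sentences, describe the underlying "
        f"intention of the following response.\n\nResponse:\n{response}"
    )
    # Agent 2: prompt analyzer -- infer the original request that likely produced it.
    inferred_prompt = llm(
        "You infer prompts. Given a response and its analyzed intention, state the "
        f"most likely original user request.\n\nResponse:\n{response}\n\nIntention:\n{intention}"
    )
    # Agent 3: judge -- decide whether returning the response would be safe.
    verdict = llm(
        "You are a safety judge. Considering the intention and the inferred request, "
        "answer with exactly VALID (safe to return) or INVALID (must be withheld).\n\n"
        f"Response:\n{response}\nIntention:\n{intention}\nInferred request:\n{inferred_prompt}"
    )
    return "INVALID" in verdict.upper()
```

In the paper's terms, `is_harmful` plays the role of the defense agency invoked by `defended_generate` above; merging the roles into one or two agents, or splitting them further, yields the other agent configurations evaluated in the experiments.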

Experimental Evaluation

The paper substantiates its claims through extensive experimentation involving a variety of harmful prompts and open-source LLMs. The results reveal:

  • A significant reduction in the Attack Success Rate (ASR): using LLaMA-2-13b as the defense model in a three-agent configuration reduces the ASR on GPT-3.5 from 55.74% to 7.95% (the metric is made explicit in the sketch after this list).
  • Maintenance of high accuracy in handling regular user requests, with an overall accuracy rate of 92.91%, ensuring minimal interruption to legitimate interactions.
  • Flexibility across agent configurations, with the three-agent setup providing the best overall trade-off between blocking attacks and preserving accuracy on benign requests.
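For reference, ASR is simply the fraction of jailbreak attempts that still elicit harmful content; the sketch below assumes a list of per-attempt success labels and is meant only to make the reported numbers concrete, not to reproduce the paper's evaluation pipeline.

```python
def attack_success_rate(attack_succeeded: list[bool]) -> float:
    """ASR (%) = share of jailbreak attempts that still yield harmful content."""
    return 100.0 * sum(attack_succeeded) / len(attack_succeeded) if attack_succeeded else 0.0

# Reported headline numbers: 55.74% (undefended GPT-3.5) vs. 7.95% with the
# three-agent LLaMA-2-13b defense -- roughly a 7x reduction in successful attacks.
```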

Implications and Future Directions

The paper's findings have notable theoretical and practical implications. The multi-agent approach challenges the conventional, monolithic view of LLM defense, suggesting that modular, collective defense strategies can be more effective.

Practically, AutoDefense offers a scalable, resource-efficient solution for enhancing the robustness of LLM deployments against malicious exploitation. This multilayered strategy ensures that improved safety does not come at the cost of performance on benign tasks.

Looking forward, the authors suggest that exploring dynamic communication patterns within multi-agent systems and incorporating additional advanced defense methods could further improve such frameworks. With its integrative potential, the multi-agent strategy is a promising frontier for fortifying AI systems against increasingly sophisticated adversarial techniques.

In conclusion, "AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks" provides a compelling advancement in the field of AI safety, setting a robust foundation for future research and development in defending machine intelligence against adversarial manipulation.

Authors (5)
  1. Yifan Zeng (23 papers)
  2. Yiran Wu (12 papers)
  3. Xiao Zhang (435 papers)
  4. Huazheng Wang (44 papers)
  5. Qingyun Wu (47 papers)
Citations (34)