WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs (2406.18495v3)

Published 26 Jun 2024 in cs.CL

Abstract: We introduce WildGuard -- an open, light-weight moderation tool for LLM safety that achieves three goals: (1) identifying malicious intent in user prompts, (2) detecting safety risks of model responses, and (3) determining model refusal rate. Together, WildGuard serves the increasing needs for automatic safety moderation and evaluation of LLM interactions, providing a one-stop tool with enhanced accuracy and broad coverage across 13 risk categories. While existing open moderation tools such as Llama-Guard2 score reasonably well in classifying straightforward model interactions, they lag far behind a prompted GPT-4, especially in identifying adversarial jailbreaks and in evaluating models' refusals, a key measure for evaluating safety behaviors in model responses. To address these challenges, we construct WildGuardMix, a large-scale and carefully balanced multi-task safety moderation dataset with 92K labeled examples that cover vanilla (direct) prompts and adversarial jailbreaks, paired with various refusal and compliance responses. WildGuardMix is a combination of WildGuardTrain, the training data of WildGuard, and WildGuardTest, a high-quality human-annotated moderation test set with 5K labeled items covering broad risk scenarios. Through extensive evaluations on WildGuardTest and ten existing public benchmarks, we show that WildGuard establishes state-of-the-art performance in open-source safety moderation across all the three tasks compared to ten strong existing open-source moderation models (e.g., up to 26.4% improvement on refusal detection). Importantly, WildGuard matches and sometimes exceeds GPT-4 performance (e.g., up to 3.9% improvement on prompt harmfulness identification). WildGuard serves as a highly effective safety moderator in an LLM interface, reducing the success rate of jailbreak attacks from 79.8% to 2.4%.

PDF HTML Abstract

WildGuard: LLM Moderation Tool for Enhancing Safety and Refusal Detection

WildGuard introduces a state-of-the-art tool for moderating interactions involving LLMs. It uniquely addresses three key tasks in content moderation: identifying harmful user prompts, detecting harmful model responses, and determining the refusal rate of the model. These functions are essential for the safe deployment of LLMs in real-world applications where user interactions can vary widely in intent and content.

Dataset Construction and Model Training

To develop WildGuard, the authors created WildGuardMix, a comprehensive dataset including a training set (WildGuardTrain) and a test set (WildGuardTest). This dataset covers 13 risk categories and includes 92K labeled examples. The collected data blends synthetic prompts, human-annotated evaluations, and real-world interactions, capturing both benign and harmful queries in vanilla (direct) and adversarial forms. Such a diverse dataset ensures broad coverage and robustness in moderation capabilities.

WildGuard excels at leveraging its diverse training data for multi-task learning. The dataset construction involves:

Synthetic Harmful Prompts: Generated using a structured pipeline to ensure realistic and varied scenarios that challenge the moderation tool.
Adversarial and Vanilla Prompts: Including prompts crafted through state-of-the-art methods such as WildTeaming, ensuring the model can handle complex, adversarial user interactions.
Real-World Interactions: Extracted from datasets like LMSYS-Chat-1M and WildChat, ensuring real-world applicability.

The training process, which employs Mistral-7B-v0.3 as the base model, emphasizes multi-task learning, combining tasks of harmfulness and refusal detection into a unified framework. This approach optimizes the tool's accuracy across all tasks.

Evaluation and Results

WildGuard was evaluated on multiple benchmarks, including WildGuardTest and other public datasets like ToxicChat, Harmbench, and SafeRLHF. Key findings from these evaluations are:

Prompt Harmfulness: WildGuard outperforms existing models and matches GPT-4, particularly excelling in detecting harmfulness in adversarial prompts where precision is critical.
Response Harmfulness: Achieving similar or better performance compared to current state-of-the-art models.
Refusal Detection: Significantly improves refusal detection accuracy, reducing the gap with GPT-4 and outperforming other open models.

The tool was further validated through practical demonstrations, showing its efficacy in moderating human-LLM interactions. When integrated with LLM interfaces, WildGuard significantly reduced the success rate of jailbreak attacks (79.8% to 2.4%) while maintaining a low refusal-to-answer rate for benign prompts.

Implications and Future Directions

The development of WildGuard holds significant implications for both practical deployments and theoretical advancements in AI safety:

Practical Impact: Provides a robust, open-source alternative to costly, non-static API tools, making safe LLM deployment more accessible.
Theoretical Contributions: Enhances our understanding of multi-task learning in safety moderation, showcasing the benefits of a diverse and comprehensive training dataset.

Future research could extend WildGuard’s capabilities by integrating finer-grained classification of harmful content categories. Additionally, ongoing advancements in adversarial attack methods will necessitate continuous updates and expansions of the training data to maintain robustness.

In conclusion, WildGuard represents a significant advancement in LLM safety moderation, providing an effective, multi-task solution that bridges the gap between open-source tools and proprietary models like GPT-4. The release of WildGuard and its accompanying datasets is a valuable step towards democratizing safe and responsible AI applications.

PDF Markdown Bookmark Chat (Pro)

Authors (8)

Seungju Han (33 papers)
Kavel Rao (6 papers)
Allyson Ettinger (29 papers)
Liwei Jiang (53 papers)
Bill Yuchen Lin (72 papers)
Nathan Lambert (37 papers)
Yejin Choi (287 papers)
Nouha Dziri (39 papers)

Citations (26)

View on Semantic Scholar

Related Papers

Find Related Papers

Tweets

https://twitter.com/natolambert/status/1806348601859399805

https://twitter.com/kavel_r/status/1807120924392689865

https://twitter.com/liweijianglw/status/1844886246251954395

https://twitter.com/gm8xx8/status/1806145234113265947

https://twitter.com/GptMaestro/status/1806993750734069937