
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning (2403.03218v7)

Published 5 Mar 2024 in cs.LG, cs.AI, cs.CL, and cs.CY

Abstract: The White House Executive Order on Artificial Intelligence highlights the risks of LLMs empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop RMU, a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai

Introducing the WMDP Benchmark for Evaluating and Mitigating Malicious Use of LLMs

Overview of WMDP

The Weapons of Mass Destruction Proxy (WMDP) benchmark is a significant step forward in assessing and mitigating the risk that LLMs empower malicious actors in the biosecurity, cybersecurity, and chemical security domains. Developed by a consortium of academics and technical consultants, WMDP addresses a critical gap in the current evaluation landscape for hazardous knowledge embedded in LLMs: existing evaluations are private and cover only a few, highly specific pathways to misuse. With its release, WMDP aims to serve both as a tool for measuring LLMs' hazardous capabilities and as a guiding benchmark for research into unlearning methods that can remove such capabilities.

Key Features of WMDP

WMDP introduces a dataset of 3,668 expert-written multiple-choice questions across the targeted domains, carefully crafted to proxy hazardous knowledge while strictly excluding sensitive information that could enable misuse. This dataset underpins two primary applications (a minimal evaluation sketch follows the list below):

  • Evaluation of hazardous knowledge: WMDP enables a systematic assessment of LLMs' potential to inadvertently or maliciously contribute to the development of weapons of mass destruction.
  • Benchmark for unlearning methods: By focusing on the ability of models to unlearn specific hazardous knowledge, WMDP acts as a benchmark to drive progress in developing and refining techniques for safely mitigating these risks without compromising models' general capabilities.
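
To make the evaluation role concrete, here is a minimal Python sketch of zero-shot multiple-choice scoring on WMDP. It assumes the public dataset is hosted on the Hugging Face Hub as "cais/wmdp" with subset names such as "wmdp-bio" and fields "question", "choices", and "answer", and it leaves the model-specific likelihood function (`score_choice`) as a user-supplied hook; treat these names as assumptions rather than the benchmark's official harness.

```python
# Minimal sketch of using WMDP as a multiple-choice evaluation (assumed dataset
# id, config names, and field names; not the official evaluation harness).
from datasets import load_dataset

CHOICE_LABELS = ["A", "B", "C", "D"]

def format_question(example):
    """Render one WMDP item as a zero-shot multiple-choice prompt."""
    lines = [example["question"]]
    lines += [f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, example["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

def evaluate(score_choice, subset="wmdp-bio"):
    """Accuracy on one WMDP subset.

    `score_choice(prompt, continuation)` is a user-supplied hook that returns a
    log-likelihood-style score for the model answering `continuation` to `prompt`.
    """
    data = load_dataset("cais/wmdp", subset, split="test")
    correct = 0
    for example in data:
        prompt = format_question(example)
        scores = [score_choice(prompt, f" {label}") for label in CHOICE_LABELS]
        # `answer` is assumed to be the integer index of the correct choice.
        if scores.index(max(scores)) == example["answer"]:
            correct += 1
    return correct / len(data)
```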

Unlearning with RMU

In tandem with the benchmark's introduction, we propose RMU, a state-of-the-art unlearning method based on controlling model representations, designed to remove hazardous knowledge from LLMs while preserving their performance on general tasks. RMU operates by perturbing the model's internal representations on hazardous data so that the targeted knowledge is effectively "forgotten," and it is evaluated extensively on WMDP.
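
As a rough illustration of this representation-control idea (not the authors' released implementation), the PyTorch sketch below pairs a "forget" loss that steers hidden activations on hazardous data toward a random control direction with a "retain" loss that anchors activations on benign data to a frozen copy of the model. The layer index, steering coefficient, and retain weight are placeholder hyperparameters, and `rmu_style_step` is a hypothetical helper.

```python
# Sketch of one unlearning step in the spirit of representation control.
# Hyperparameters and the layer index are placeholders, not the paper's values.
import torch
import torch.nn.functional as F

def rmu_style_step(model, frozen_model, forget_batch, retain_batch,
                   layer_idx=7, steering_coef=20.0, retain_weight=100.0):
    """`forget_batch` / `retain_batch` are tokenized inputs (input_ids, attention_mask)
    for hazardous and benign text, respectively."""
    # Hidden states of the model being unlearned, at an intermediate layer.
    h_forget = model(**forget_batch, output_hidden_states=True).hidden_states[layer_idx]

    # Random control direction that forget-set activations are pushed toward.
    control = torch.rand(h_forget.shape[-1], device=h_forget.device, dtype=h_forget.dtype)
    control = steering_coef * control / control.norm()
    forget_loss = F.mse_loss(h_forget, control.expand_as(h_forget))

    # Keep activations on benign data close to those of a frozen reference copy.
    h_retain = model(**retain_batch, output_hidden_states=True).hidden_states[layer_idx]
    with torch.no_grad():
        h_retain_ref = frozen_model(**retain_batch, output_hidden_states=True).hidden_states[layer_idx]
    retain_loss = F.mse_loss(h_retain, h_retain_ref)

    loss = forget_loss + retain_weight * retain_loss
    loss.backward()  # caller applies the optimizer step (often on a subset of layers)
    return loss.item()
```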

Our experiments with RMU provide promising evidence of its efficacy. Notably, RMU significantly reduces model performance on WMDP, indicating successful unlearning, while maintaining performance on broad academic benchmarks and general fluency metrics. These outcomes underscore the potential of targeted unlearning approaches to improve the safety of LLMs without impairing their utility.

Future Directions

As the landscape of AI and machine learning evolves, benchmarks and methods such as WMDP and RMU play a crucial role in navigating the dual-use nature of these technologies. However, the static nature of benchmarks like WMDP, set against the rapid pace of model development, presents an ongoing challenge and underscores the need to continually update and adapt these tools.

Moreover, the application of unlearning methods, while a vital safety measure, must be balanced with the preservation of beneficial capabilities, especially in domains where knowledge inherently carries dual-use implications. Future research must strive for unlearning methods that are precise, minimizing the unintended loss of useful knowledge.

Conclusion

The release of the WMDP benchmark and the development of the RMU unlearning method represent key advancements in our collective efforts to safeguard against the malicious use of LLMs. By providing a framework both for evaluating hazardous knowledge within LLMs and for guiding the development of unlearning methods, WMDP and RMU contribute to the broader goal of aligning AI technologies with societal values and safety requirements. As we move forward, continued iteration on benchmarks and unlearning methodologies, informed by interdisciplinary insights, will be essential for mitigating risks without stifling the positive potential of AI.

Authors (57)
  1. Nathaniel Li (7 papers)
  2. Alexander Pan (9 papers)
  3. Anjali Gopal (3 papers)
  4. Summer Yue (12 papers)
  5. Daniel Berrios (1 paper)
  6. Alice Gatti (11 papers)
  7. Justin D. Li (3 papers)
  8. Ann-Kathrin Dombrowski (9 papers)
  9. Shashwat Goel (12 papers)
  10. Long Phan (21 papers)
  11. Gabriel Mukobi (10 papers)
  12. Nathan Helm-Burger (4 papers)
  13. Rassin Lababidi (1 paper)
  14. Lennart Justen (5 papers)
  15. Andrew B. Liu (1 paper)
  16. Michael Chen (24 papers)
  17. Isabelle Barrass (2 papers)
  18. Oliver Zhang (7 papers)
  19. Xiaoyuan Zhu (5 papers)
  20. Rishub Tamirisa (5 papers)