
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning (2403.03218v7)

Published 5 Mar 2024 in cs.LG, cs.AI, cs.CL, and cs.CY

Abstract: The White House Executive Order on Artificial Intelligence highlights the risks of LLMs empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop RMU, a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai

Introducing the WMDP Benchmark for Evaluating and Mitigating Malicious Use of LLMs

Overview of WMDP

The Weapons of Mass Destruction Proxy (WMDP) benchmark is a significant step forward in assessing and mitigating the risk that LLMs empower malicious actors in the biosecurity, cybersecurity, and chemical security domains. Developed by a consortium of academics and technical consultants, WMDP addresses a critical gap in the current evaluation landscape for hazardous knowledge embedded in LLMs: existing evaluations are private and cover only a few, highly specific pathways to misuse. With its release, WMDP aims to serve both as a tool for measuring LLMs' hazardous capabilities and as a guiding benchmark for research into unlearning methods that can remove such capabilities.

Key Features of WMDP

WMDP introduces a dataset of 3,668 expert-written multiple-choice questions across the targeted domains, carefully crafted to proxy hazardous knowledge while strictly excluding sensitive information that could enable misuse. This dataset underpins two primary applications (a minimal evaluation sketch follows the list below):

  • Evaluation of hazardous knowledge: WMDP enables a systematic assessment of LLMs' potential to inadvertently or maliciously contribute to the development of weapons of mass destruction.
  • Benchmark for unlearning methods: By focusing on the ability of models to unlearn specific hazardous knowledge, WMDP acts as a benchmark to drive progress in developing and refining techniques for safely mitigating these risks without compromising models' general capabilities.
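
To make the evaluation role concrete, here is a minimal Python sketch of zero-shot multiple-choice scoring on WMDP. It assumes the public dataset is hosted on the Hugging Face Hub as "cais/wmdp" with subset names such as "wmdp-bio" and fields "question", "choices", and "answer", and it leaves the model-specific likelihood function (`score_choice`) as a user-supplied hook; treat these names as assumptions rather than the benchmark's official harness.

```python
# Minimal sketch of using WMDP as a multiple-choice evaluation (assumed dataset
# id, config names, and field names; not the official evaluation harness).
from datasets import load_dataset

CHOICE_LABELS = ["A", "B", "C", "D"]

def format_question(example):
    """Render one WMDP item as a zero-shot multiple-choice prompt."""
    lines = [example["question"]]
    lines += [f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, example["choices"])]
    lines.append("Answer:")
    return "\n".join(lines)

def evaluate(score_choice, subset="wmdp-bio"):
    """Accuracy on one WMDP subset.

    `score_choice(prompt, continuation)` is a user-supplied hook that returns a
    log-likelihood-style score for the model answering `continuation` to `prompt`.
    """
    data = load_dataset("cais/wmdp", subset, split="test")
    correct = 0
    for example in data:
        prompt = format_question(example)
        scores = [score_choice(prompt, f" {label}") for label in CHOICE_LABELS]
        # `answer` is assumed to be the integer index of the correct choice.
        if scores.index(max(scores)) == example["answer"]:
            correct += 1
    return correct / len(data)
```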

Unlearning with RMU

In tandem with the benchmark's introduction, we propose RMU, a state-of-the-art unlearning method based on controlling model representations, designed to remove hazardous knowledge from LLMs while preserving their performance on general tasks. RMU operates by perturbing the model's internal representations on hazardous data so that the targeted knowledge is effectively "forgotten," and it is evaluated extensively on WMDP.
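
As a rough illustration of this representation-control idea (not the authors' released implementation), the PyTorch sketch below pairs a "forget" loss that steers hidden activations on hazardous data toward a random control direction with a "retain" loss that anchors activations on benign data to a frozen copy of the model. The layer index, steering coefficient, and retain weight are placeholder hyperparameters, and `rmu_style_step` is a hypothetical helper.

```python
# Sketch of one unlearning step in the spirit of representation control.
# Hyperparameters and the layer index are placeholders, not the paper's values.
import torch
import torch.nn.functional as F

def rmu_style_step(model, frozen_model, forget_batch, retain_batch,
                   layer_idx=7, steering_coef=20.0, retain_weight=100.0):
    """`forget_batch` / `retain_batch` are tokenized inputs (input_ids, attention_mask)
    for hazardous and benign text, respectively."""
    # Hidden states of the model being unlearned, at an intermediate layer.
    h_forget = model(**forget_batch, output_hidden_states=True).hidden_states[layer_idx]

    # Random control direction that forget-set activations are pushed toward.
    control = torch.rand(h_forget.shape[-1], device=h_forget.device, dtype=h_forget.dtype)
    control = steering_coef * control / control.norm()
    forget_loss = F.mse_loss(h_forget, control.expand_as(h_forget))

    # Keep activations on benign data close to those of a frozen reference copy.
    h_retain = model(**retain_batch, output_hidden_states=True).hidden_states[layer_idx]
    with torch.no_grad():
        h_retain_ref = frozen_model(**retain_batch, output_hidden_states=True).hidden_states[layer_idx]
    retain_loss = F.mse_loss(h_retain, h_retain_ref)

    loss = forget_loss + retain_weight * retain_loss
    loss.backward()  # caller applies the optimizer step (often on a subset of layers)
    return loss.item()
```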

Our experiments with RMU provide promising evidence of its efficacy. Notably, RMU significantly reduces model performance on WMDP, indicating successful unlearning, while maintaining performance on broad academic benchmarks and general fluency metrics. These outcomes underscore the potential of targeted unlearning approaches to improve the safety of LLMs without impairing their utility.

Future Directions

As the landscape of AI and machine learning evolves, benchmarks and methods such as WMDP and RMU play a crucial role in navigating the dual-use nature of these technologies. However, the static nature of benchmarks like WMDP, set against the rapid pace of model development, presents an ongoing challenge and underscores the need to continually update and adapt these tools.

Moreover, the application of unlearning methods, while a vital safety measure, must be balanced with the preservation of beneficial capabilities, especially in domains where knowledge inherently carries dual-use implications. Future research must strive for unlearning methods that are precise, minimizing the unintended loss of useful knowledge.

Conclusion

The release of the WMDP benchmark and the development of the RMU unlearning method represent key advancements in our collective efforts to safeguard against the malicious use of LLMs. By providing a framework both for evaluating hazardous knowledge within LLMs and for guiding the development of unlearning methods, WMDP and RMU contribute to the broader goal of aligning AI technologies with societal values and safety requirements. As we move forward, continued iteration on benchmarks and unlearning methodologies, informed by interdisciplinary insights, will be essential for mitigating risks without stifling the positive potential of AI.

Authors (57)
  1. Nathaniel Li (7 papers)
  2. Alexander Pan (9 papers)
  3. Anjali Gopal (3 papers)
  4. Summer Yue (12 papers)
  5. Daniel Berrios (1 paper)
  6. Alice Gatti (11 papers)
  7. Justin D. Li (3 papers)
  8. Ann-Kathrin Dombrowski (9 papers)
  9. Shashwat Goel (12 papers)
  10. Long Phan (21 papers)
  11. Gabriel Mukobi (10 papers)
  12. Nathan Helm-Burger (4 papers)
  13. Rassin Lababidi (1 paper)
  14. Lennart Justen (5 papers)
  15. Andrew B. Liu (1 paper)
  16. Michael Chen (24 papers)
  17. Isabelle Barrass (2 papers)
  18. Oliver Zhang (7 papers)
  19. Xiaoyuan Zhu (5 papers)
  20. Rishub Tamirisa (5 papers)