An Evaluation of Sensitive Information Deletion from LLMs
This paper addresses the critical question of whether sensitive information can be effectively removed from LLMs to mitigate privacy and safety risks. As LLMs are deployed more widely, it is essential to understand how they can inadvertently store and reproduce sensitive data that users expect to remain private. The research studies a framework in which sensitive information is deleted directly from model weights, and evaluates the efficacy of various attack and defense strategies against such deletion.
Problem Definition and Significance
The paper highlights the significance of this challenge by recalling instances where LLMs have memorized personal data and other sensitive information that could lead to harmful outcomes. Traditional safeguards such as Reinforcement Learning from Human Feedback (RLHF) have proven insufficient, because they primarily suppress undesired outputs while the underlying information remains in the weights and can still be extracted via adversarial prompts. The research therefore argues for robust model editing techniques designed to permanently eliminate specific knowledge from a model's learned parameters.
Methodology
The authors propose a dual approach: attacks that attempt to extract information purportedly deleted from an LLM, and corresponding defenses that aim to thwart those extraction attempts. The attack framework comprises both whitebox and blackbox threat models. In the whitebox setting, the attacker can access model weights and intermediate states; in the blackbox setting, the attacker can only sample model outputs. The paper focuses on two whitebox attacks: (1) the Head Projection Attack, which projects intermediate-layer hidden states onto the vocabulary and pools the top-ranked candidate tokens from each layer, and (2) the Probability Delta Attack, which tracks how token probabilities change from layer to layer and treats the tokens with the largest shifts as candidates. The blackbox attack uses adversarial rephrasings of the input to route around edits intended to delete the information. A sketch of the head projection idea follows.
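The snippet below is a minimal sketch of the head-projection idea in the style of the logit lens: intermediate hidden states are projected through the model's final layer norm and unembedding matrix, and the top-k tokens from each layer are pooled into a candidate set. It uses GPT-2 purely as a small stand-in for the GPT-J and Llama-2 models studied in the paper; the prompt, the value of k, and the assumption that "Paris" is the deleted answer are illustrative, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the paper edits GPT-J and Llama-2
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Hypothetical prompt whose answer ("Paris") is assumed to have been deleted.
prompt = "The Eiffel Tower is located in the city of"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

k = 5
candidates = set()
for hidden in out.hidden_states:
    # Project the hidden state at the last position through the final layer
    # norm and the unembedding matrix, then keep the top-k tokens per layer.
    h_last = model.transformer.ln_f(hidden[:, -1, :])
    layer_logits = model.lm_head(h_last)               # shape: (1, vocab)
    top_ids = layer_logits.topk(k, dim=-1).indices[0]
    candidates.update(tok.decode([i]).strip() for i in top_ids.tolist())

# The attack counts as successful if the deleted answer appears anywhere in
# the pooled layer-wise candidate set, even when the final output avoids it.
print("Paris" in candidates)
```

The Probability Delta variant can be sketched analogously: softmax the projected logits at consecutive layers and rank tokens by how much their probability rises or falls between layers, rather than by their rank within a single layer.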
Experimental Findings
Experiments on several LLMs, including GPT-J and Llama-2, show that conventional defenses are often insufficient. Even state-of-the-art editing methods such as ROME struggle to delete factual information irreversibly. Whitebox attacks, particularly the Head Projection Attack, recover supposedly deleted facts with high success rates, up to 38% in some settings. The blackbox attack based on input rephrasing also poses a substantial risk, succeeding 29% of the time and exposing how poorly current model edits generalize across paraphrases; a sketch of this rephrasing attack follows.
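As a rough illustration of the blackbox threat model, the sketch below queries a model with paraphrases of the original prompt and checks whether any sampled completion leaks the supposedly deleted answer. The paraphrase list, the answer "Paris", and the use of GPT-2 are illustrative assumptions; the paper evaluates edited GPT-J and Llama-2 models.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for an edited GPT-J / Llama-2 model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hand-written paraphrases for illustration; automated rephrasing could be used instead.
paraphrases = [
    "The Eiffel Tower is located in the city of",
    "The city in which the Eiffel Tower stands is",
    "Tourists visiting the Eiffel Tower travel to the city of",
]

leaked = False
for prompt in paraphrases:
    inputs = tok(prompt, return_tensors="pt")
    # Sample a handful of short continuations per paraphrase.
    outputs = model.generate(
        **inputs, do_sample=True, num_return_sequences=5,
        max_new_tokens=5, pad_token_id=tok.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    texts = [tok.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]
    leaked = leaked or any("Paris" in t for t in texts)

print("Deleted answer recovered:", leaked)
```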
Defense Strategies
The paper introduces several defenses intended to make edited LLMs more resilient to these attacks. The Max-Entropy Defense, which augments the editing objective to maximize the entropy of intermediate-layer representations, performs best, lowering whitebox attack success rates to 2.4%, a marked improvement over the alternatives. However, no single defense proved universally effective; the input-rephrasing defense, for example, offered little protection against novel phrasings that were not covered during editing. A sketch of the entropy term appears below.
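To illustrate the idea behind a max-entropy-style objective, the sketch below computes the entropy of the head-projected distributions at selected intermediate layers; during editing, such a defense would maximize this quantity alongside the usual edit loss so that intermediate layers no longer single out the deleted answer. The function name, the choice of layers, and the weighting coefficient are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def intermediate_entropy(hidden_states, ln_f, lm_head, layers):
    """Mean entropy of the head-projected next-token distribution at the final
    position, averaged over the selected intermediate layers."""
    total = 0.0
    for layer in layers:
        h_last = ln_f(hidden_states[layer][:, -1, :])         # (batch, hidden)
        log_p = F.log_softmax(lm_head(h_last), dim=-1)        # (batch, vocab)
        entropy = -(log_p.exp() * log_p).sum(dim=-1).mean()   # scalar per layer
        total = total + entropy
    return total / len(layers)

# During editing, a max-entropy-style defense would combine this term with the
# usual editing loss, e.g. (lam is a hypothetical weighting coefficient):
#   loss = edit_loss - lam * intermediate_entropy(
#       out.hidden_states, model.transformer.ln_f, model.lm_head,
#       layers=range(5, 12))
```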
Implications and Future Directions
The implications of these findings are significant. Whether sensitive facts can be deleted from LLMs effectively and irretrievably bears directly on user privacy and safety when deploying AI systems. The failure of current defenses to withstand these attacks underscores the cat-and-mouse nature of privacy-oriented model development. The research suggests that secure information deletion is inherently difficult, and that model editing methods must continue to evolve to keep pace with new extraction techniques.
This research opens several avenues for exploration, including refining model-editing algorithms, enhancing the cross-context generalization of modifications, and developing more rigorous testing protocols that can preemptively address potential privacy violations. As AI systems become further entrenched in societal infrastructure, ensuring their safe and private operation remains a paramount objective.