An Evaluation of Sensitive Information Deletion from LLMs
This paper addresses the critical question of whether sensitive information can be effectively removed from LLMs to mitigate privacy and safety risks. As LLMs are deployed more widely, it is essential to understand how they can inadvertently store and reproduce sensitive data that users expect to remain private. The research studies a framework in which sensitive information is deleted directly from model weights, and evaluates the efficacy of various attack and defense strategies against such deletion.
Problem Definition and Significance
The paper highlights the significance of this challenge by recalling instances where LLMs have memorized personal data and other sensitive information that could lead to harmful outcomes. Traditional safeguards such as Reinforcement Learning from Human Feedback (RLHF) have proven insufficient, because they primarily suppress undesired outputs while the underlying information remains in the weights and can still be extracted via adversarial prompts. The research therefore argues for robust model editing techniques designed to permanently eliminate specific knowledge from a model's learned parameters.
Methodology
The authors propose a dual approach: attacks that attempt to extract information purportedly deleted from an LLM, and corresponding defenses that aim to thwart those extraction attempts. The attack framework comprises both whitebox and blackbox threat models. In the whitebox setting, the attacker can access model weights and intermediate states; in the blackbox setting, the attacker can only sample model outputs. The paper focuses on two whitebox attacks: (1) the Head Projection Attack, which projects intermediate-layer hidden states onto the vocabulary and pools the top-ranked candidate tokens from each layer, and (2) the Probability Delta Attack, which tracks how token probabilities change from layer to layer and treats the tokens with the largest shifts as candidates. The blackbox attack uses adversarial rephrasings of the input to route around edits intended to delete the information. A sketch of the head projection idea follows.
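The snippet below is a minimal sketch of the head-projection idea in the style of the logit lens: intermediate hidden states are projected through the model's final layer norm and unembedding matrix, and the top-k tokens from each layer are pooled into a candidate set. It uses GPT-2 purely as a small stand-in for the GPT-J and Llama-2 models studied in the paper; the prompt, the value of k, and the assumption that "Paris" is the deleted answer are illustrative, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the paper edits GPT-J and Llama-2
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Hypothetical prompt whose answer ("Paris") is assumed to have been deleted.
prompt = "The Eiffel Tower is located in the city of"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

k = 5
candidates = set()
for hidden in out.hidden_states:
    # Project the hidden state at the last position through the final layer
    # norm and the unembedding matrix, then keep the top-k tokens per layer.
    h_last = model.transformer.ln_f(hidden[:, -1, :])
    layer_logits = model.lm_head(h_last)               # shape: (1, vocab)
    top_ids = layer_logits.topk(k, dim=-1).indices[0]
    candidates.update(tok.decode([i]).strip() for i in top_ids.tolist())

# The attack counts as successful if the deleted answer appears anywhere in
# the pooled layer-wise candidate set, even when the final output avoids it.
print("Paris" in candidates)
```

The Probability Delta variant can be sketched analogously: softmax the projected logits at consecutive layers and rank tokens by how much their probability rises or falls between layers, rather than by their rank within a single layer.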
Experimental Findings
Experiments on several LLMs, including GPT-J and Llama-2, show that conventional defenses are often insufficient. Even state-of-the-art editing methods such as ROME struggle to delete factual information irreversibly. Whitebox attacks, particularly the Head Projection Attack, recover supposedly deleted facts with high success rates, up to 38% in some settings. The blackbox attack based on input rephrasing also poses a substantial risk, succeeding 29% of the time and exposing how poorly current model edits generalize across paraphrases; a sketch of this rephrasing attack follows.
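As a rough illustration of the blackbox threat model, the sketch below queries a model with paraphrases of the original prompt and checks whether any sampled completion leaks the supposedly deleted answer. The paraphrase list, the answer "Paris", and the use of GPT-2 are illustrative assumptions; the paper evaluates edited GPT-J and Llama-2 models.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for an edited GPT-J / Llama-2 model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hand-written paraphrases for illustration; automated rephrasing could be used instead.
paraphrases = [
    "The Eiffel Tower is located in the city of",
    "The city in which the Eiffel Tower stands is",
    "Tourists visiting the Eiffel Tower travel to the city of",
]

leaked = False
for prompt in paraphrases:
    inputs = tok(prompt, return_tensors="pt")
    # Sample a handful of short continuations per paraphrase.
    outputs = model.generate(
        **inputs, do_sample=True, num_return_sequences=5,
        max_new_tokens=5, pad_token_id=tok.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    texts = [tok.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]
    leaked = leaked or any("Paris" in t for t in texts)

print("Deleted answer recovered:", leaked)
```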
Defense Strategies
The paper introduces several defenses intended to make edited LLMs more resilient to these attacks. The Max-Entropy Defense, which augments the editing objective to maximize the entropy of intermediate-layer representations, performs best, lowering whitebox attack success rates to 2.4%, a marked improvement over the alternatives. However, no single defense proved universally effective; the input-rephrasing defense, for example, offered little protection against novel phrasings that were not covered during editing. A sketch of the entropy term appears below.
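To illustrate the idea behind a max-entropy-style objective, the sketch below computes the entropy of the head-projected distributions at selected intermediate layers; during editing, such a defense would maximize this quantity alongside the usual edit loss so that intermediate layers no longer single out the deleted answer. The function name, the choice of layers, and the weighting coefficient are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def intermediate_entropy(hidden_states, ln_f, lm_head, layers):
    """Mean entropy of the head-projected next-token distribution at the final
    position, averaged over the selected intermediate layers."""
    total = 0.0
    for layer in layers:
        h_last = ln_f(hidden_states[layer][:, -1, :])         # (batch, hidden)
        log_p = F.log_softmax(lm_head(h_last), dim=-1)        # (batch, vocab)
        entropy = -(log_p.exp() * log_p).sum(dim=-1).mean()   # scalar per layer
        total = total + entropy
    return total / len(layers)

# During editing, a max-entropy-style defense would combine this term with the
# usual editing loss, e.g. (lam is a hypothetical weighting coefficient):
#   loss = edit_loss - lam * intermediate_entropy(
#       out.hidden_states, model.transformer.ln_f, model.lm_head,
#       layers=range(5, 12))
```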
Implications and Future Directions
The implications of these findings are significant. Whether sensitive facts can be deleted from LLMs effectively and irretrievably bears directly on user privacy and safety when deploying AI systems. The failure of current defenses to withstand these attacks underscores the cat-and-mouse nature of privacy-oriented model development. The research suggests that secure information deletion is inherently difficult, and that model editing methods must continue to evolve to keep pace with new extraction techniques.
This research opens several avenues for exploration, including refining model-editing algorithms, enhancing the cross-context generalization of modifications, and developing more rigorous testing protocols that can preemptively address potential privacy violations. As AI systems become further entrenched in societal infrastructure, ensuring their safe and private operation remains a paramount objective.