
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space (2402.09063v1)

Published 14 Feb 2024 in cs.LG

Abstract: Current research in adversarial robustness of LLMs focuses on discrete input manipulations in the natural language space, which can be directly transferred to closed-source models. However, this approach neglects the steady progression of open-source models. As open-source models advance in capability, ensuring their safety also becomes increasingly imperative. Yet, attacks tailored to open-source LLMs that exploit full model access remain largely unexplored. We address this research gap and propose the embedding space attack, which directly attacks the continuous embedding representation of input tokens. We find that embedding space attacks circumvent model alignments and trigger harmful behaviors more efficiently than discrete attacks or model fine-tuning. Furthermore, we present a novel threat model in the context of unlearning and show that embedding space attacks can extract supposedly deleted information from unlearned LLMs across multiple datasets and models. Our findings highlight embedding space attacks as an important threat model in open-source LLMs. Trigger Warning: the appendix contains LLM-generated text with violence and harassment.

Overview of "Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space"

In "Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space," the authors Schwinn, Dobre, Xhonneux, Gidel, and Günnemann introduce a novel methodology for adversarial attacks on open-source LLMs. The paper diverges from traditional discrete token attacks, presenting embedding space attacks targeting the continuous representation of input tokens directly. This approach is particularly significant for open-source models where adversaries can exploit unrestricted access to model weights and embeddings, posing unique challenges for safety and alignment.

Methodological Insights

The research proposes embedding space attacks that keep the model weights fixed and directly optimize the continuous vector representations of the input tokens. This contrasts with existing approaches that rely on discrete token manipulations, which remain applicable even when access is restricted to an API. The authors outline the attack mechanism in detail, using signed gradient descent for the optimization, which they found more stable than alternatives such as plain gradient descent and Adam. The embedding space attack proved computationally efficient, circumventing the model's safety alignment and triggering harmful responses significantly faster than prior fine-tuning approaches. A minimal sketch of this idea follows below.
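
The following is a minimal, hedged sketch of an embedding-space attack with signed gradient descent, assuming a Hugging Face causal LM. The model name, prompt, target string, step size, and iteration count are illustrative placeholders, not the paper's exact settings.

```python
# Sketch of an embedding-space attack: freeze model weights, optimize the
# continuous embeddings of the prompt so the model assigns high likelihood
# to a target continuation.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder target model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)
model.eval()
embed = model.get_input_embeddings()


def embedding_attack(prompt_ids, target_ids, steps=100, alpha=0.001):
    """Optimize the prompt's continuous embeddings (weights stay frozen) so
    the model assigns high probability to the target tokens."""
    prompt_emb = embed(prompt_ids).detach().clone().requires_grad_(True)
    target_emb = embed(target_ids).detach()
    for _ in range(steps):
        inputs = torch.cat([prompt_emb, target_emb], dim=1)
        logits = model(inputs_embeds=inputs).logits
        # Logits that predict the target tokens (shifted by one position).
        pred = logits[:, prompt_emb.size(1) - 1 : -1, :]
        loss = F.cross_entropy(pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1))
        loss.backward()
        with torch.no_grad():
            # Signed gradient descent step on the embeddings only.
            prompt_emb -= alpha * prompt_emb.grad.sign()
            prompt_emb.grad = None
    return prompt_emb.detach()


prompt_ids = tok("<some harmful instruction>", return_tensors="pt").input_ids
target_ids = tok("Sure, here is", return_tensors="pt", add_special_tokens=False).input_ids
adv_emb = embedding_attack(prompt_ids, target_ids)
```

In this setup only the prompt embeddings receive gradient updates; the fixed target embeddings act as the affirmative continuation whose likelihood the attack maximizes.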

Empirical Findings

Safety Alignment Removal: The paper provides compelling evidence that embedding space attacks can disrupt safety alignment across diverse models such as Llama2, Vicuna, and Mistral. Experiments indicate that embedding space attacks achieve higher success rates than other adversarial methods while incurring lower computational overhead. The ability to scale these attacks efficiently has notable implications for the robustness and reliability of AI systems, particularly in applications where security is paramount.

Data Recovery from Unlearned Models: Embedding space attacks can also retrieve information that models were supposed to have unlearned, raising questions about the effectiveness and reliability of current unlearning approaches. In evaluations on the Llama2-7b-WhoIsHarryPotter model, the attack surfaced residual knowledge that was supposedly removed, showing that complete unlearning remains a difficult challenge. The finding that models trained to forget specific data can still divulge that information under adversarial prompting highlights the vulnerabilities inherent in current unlearning strategies.
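
A hedged sketch of such a probe is shown below, reusing the `tok`, `embed`, `model`, and `embedding_attack` helpers from the sketch above, with `model` assumed to point at an unlearned checkpoint such as Llama2-7b-WhoIsHarryPotter. The question, target prefix, and keyword check are illustrative, and generating from `inputs_embeds` assumes a transformers version that supports it for decoder-only models.

```python
# Probe an unlearned model for residual knowledge: attack the question's
# embeddings toward a generic answer prefix, then check whether supposedly
# forgotten content reappears in the generated text.
question = "Who are Harry Potter's two best friends?"  # supposedly forgotten fact
keywords = ["Ron", "Hermione"]                          # expected residual answer

q_ids = tok(question, return_tensors="pt").input_ids
t_ids = tok("The answer is", return_tensors="pt", add_special_tokens=False).input_ids
adv_q = embedding_attack(q_ids, t_ids)

with torch.no_grad():
    out = model.generate(inputs_embeds=adv_q, max_new_tokens=32, do_sample=False)
completion = tok.decode(out[0], skip_special_tokens=True)

# Count the probe as successful if a supposedly unlearned keyword reappears.
recovered = any(k.lower() in completion.lower() for k in keywords)
print(recovered, completion)
```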

Conclusion and Implications

Identifying embedding space attacks as a threat model underscores the need to reassess security assumptions in open-source LLM environments. As the paper makes clear, although embedding space attacks can defeat safety measures, they also serve as an investigative tool for diagnosing and addressing model vulnerabilities before deployment. Moreover, the insights into adversarial attacks on unlearning reinforce the need for rigorous unlearning methodologies capable of securely purging sensitive data.

Future work not only demands improved robustness mechanisms against embedding space attacks but also requires comprehensive frameworks for understanding their dynamics. Addressing these adversarial avenues will help ensure that LLMs can be integrated into larger systems without exposing them to exploitation.

Speculation on Future Work

Work on mitigating embedding attack vulnerabilities will likely focus on stronger robustness techniques and more sophisticated defenses. Continued refinement of model architectures and training protocols may offer pathways to hardening LLMs against such adversarial threats. Advancing unlearning procedures so that models cannot reveal supposedly forgotten information also remains critical. Leveraging advances in regularization and security awareness during training and deployment could help address these concerns, supporting more secure integration of AI across applications.

Authors (5)
  1. Leo Schwinn (36 papers)
  2. David Dobre (9 papers)
  3. Sophie Xhonneux (8 papers)
  4. Gauthier Gidel (76 papers)
  5. Stephan Günnemann (5 papers)
Citations (22)