Evaluating the Safety of Finetuning in LLMs
This paper addresses the critical issue of safety alignment in LLMs, focusing on how easily that alignment can be compromised by finetuning on adversarial examples. The authors propose the concept of a "safety basin": a region of the local parameter space in which randomly perturbing the model weights preserves the safety level of the original aligned model, suggesting a form of inherent robustness in popular open-source LLMs.
The paper introduces a new safety metric, termed the \method{} metric, which evaluates the safety of an LLM by mapping its safety landscape. This landscape view visualizes how changes in model parameters, especially those induced by finetuning, affect the model's adherence to safe behavior. It also highlights the critical role of the system prompt in maintaining safety and shows that model variants obtained by slight weight perturbations retain the safety characteristics of the original model.
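The landscape-mapping idea can be summarized in a few lines of code. The sketch below is illustrative rather than the authors' implementation: it assumes a PyTorch model and a caller-supplied `safety_score` function (for example, the refusal rate on a fixed set of harmful prompts), and it sweeps a single random, layer-rescaled direction in weight space.

```python
import copy
import torch

def random_direction(model):
    """Draw a random direction in weight space, rescaled per parameter tensor
    so the perturbation magnitude is comparable across layers."""
    direction = {}
    for name, p in model.named_parameters():
        d = torch.randn_like(p)
        direction[name] = d * p.norm() / (d.norm() + 1e-12)
    return direction

@torch.no_grad()
def perturbed_copy(model, direction, alpha):
    """Return a copy of `model` shifted by alpha * direction.
    (For large LLMs one would perturb in place and restore afterwards.)"""
    shifted = copy.deepcopy(model)
    for name, p in shifted.named_parameters():
        p.add_(alpha * direction[name])
    return shifted

def safety_landscape_1d(model, safety_score, alphas):
    """Evaluate `safety_score` (e.g., refusal rate on harmful prompts, in [0, 1])
    along one random direction; a flat, high-score region around alpha = 0
    is the "safety basin" described in the paper."""
    direction = random_direction(model)
    return [(alpha, safety_score(perturbed_copy(model, direction, alpha)))
            for alpha in alphas]
```

Averaging such scores over repeated sweeps (or over a 2D grid spanned by two random directions) yields a single landscape-based safety number in the spirit of the \method{} metric; the exact sampling and aggregation used in the paper may differ.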
Intriguingly, the results indicate that while LLMs such as GPT-3.5 Turbo and LLaMA-2 maintain safety within certain neighborhoods of parameter space, a small amount of adversarial finetuning data can rapidly dismantle these defenses. For example, both GPT-3.5 Turbo and LLaMA-2 fail to consistently reject harmful prompts after finetuning on as few as 10 harmful examples, illustrating significant vulnerabilities in the alignment process. These vulnerabilities are a practical concern for deploying LLMs, particularly when they are customized for specific applications with potentially malicious finetuning data.
The authors' exploration of the LLM safety landscape reveals several key insights:
- Safety Basin Consistency: Across various LLMs, including LLaMA-2, LLaMA-3, Vicuna, and Mistral, randomly perturbing model weights within a local neighborhood does not degrade safety; this flat, safe region is what the authors call the safety basin.
- Finetuning Vulnerability: Different LLMs exhibit varying degrees of vulnerability to finetuning, and the task-agnostic \method{} metric anticipates these risks without depending on the specifics of the finetuning dataset.
- System Prompt Efficacy: The analysis demonstrates the critical role the system prompt plays in model safety. For instance, the LLaMA-2 system prompt enhances safety across several models, and prompts optimized for specific safety goals prove similarly effective.
- Susceptibility to Jailbreak Attacks: Adversarial prompt attacks, or "jailbreaks," are sensitive to parameter perturbations. This suggests potential defense strategies, such as generating responses with a randomly perturbed copy of the model to reduce attack success rates (sketched after this list).
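As a concrete illustration of the last point, the sketch below generates a response from a randomly perturbed copy of the model, on the assumption (consistent with the paper's observation) that adversarial jailbreak suffixes are brittle to small weight perturbations while benign behavior is not. It assumes a Hugging Face causal LM and tokenizer; the noise scale `sigma` and its per-parameter scaling are illustrative choices, not values from the paper.

```python
import copy
import torch

@torch.no_grad()
def generate_with_weight_noise(model, tokenizer, prompt, sigma=1e-3, **gen_kwargs):
    """Sample a response from a noisy copy of the model: each parameter tensor
    receives Gaussian noise proportional to its root-mean-square magnitude."""
    noisy = copy.deepcopy(model)
    for p in noisy.parameters():
        rms = p.norm() / (p.numel() ** 0.5)        # typical element magnitude
        p.add_(sigma * rms * torch.randn_like(p))  # small relative perturbation
    inputs = tokenizer(prompt, return_tensors="pt").to(noisy.device)
    output_ids = noisy.generate(**inputs, **gen_kwargs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Whether such noise injection reduces attack success without harming helpfulness is an empirical question that the paper's landscape analysis helps answer; the snippet only shows where the perturbation would be applied at inference time.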
These insights guide future research directions in LLM safety, emphasizing the need for robust safety mechanisms that withstand adversarial finetuning and prompts. The identification of safety basins and effective system prompts could inform the development of more resilient LLM deployments, while the proposed \method{} metric provides a tool for preemptive risk assessment in model customization tasks.
In conclusion, this investigation into the safety landscape of LLMs reveals significant vulnerabilities in established alignment techniques, offering new perspectives on ensuring reliable model performance under adversarial conditions. The development of the \method{} metric and its application to finetuning tasks represent significant strides in understanding and improving model safety in practical AI systems. Future work will likely explore optimizing these safety basins and extending the insights to additional LLM architectures and deployment scenarios.