Evaluating the Safety of Finetuning in LLMs
This paper addresses the critical issue of safety alignment in LLMs, focusing on how easily that alignment can be compromised by finetuning on adversarial examples. The authors propose the concept of a "safety basin": a region of the local parameter space in which randomly perturbing the model weights preserves the safety level of the original aligned model, suggesting a form of inherent robustness in popular open-source LLMs.
The paper introduces a new safety metric, termed the \method{} metric, which evaluates the safety of an LLM by mapping its safety landscape. This landscape view visualizes how changes in model parameters, especially those induced by finetuning, affect the model's adherence to safe behavior. It also highlights the critical role of the system prompt in maintaining safety and shows that model variants obtained by slight weight perturbations retain the safety characteristics of the original model.
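The landscape-mapping idea can be summarized in a few lines of code. The sketch below is illustrative rather than the authors' implementation: it assumes a PyTorch model and a caller-supplied `safety_score` function (for example, the refusal rate on a fixed set of harmful prompts), and it sweeps a single random, layer-rescaled direction in weight space.

```python
import copy
import torch

def random_direction(model):
    """Draw a random direction in weight space, rescaled per parameter tensor
    so the perturbation magnitude is comparable across layers."""
    direction = {}
    for name, p in model.named_parameters():
        d = torch.randn_like(p)
        direction[name] = d * p.norm() / (d.norm() + 1e-12)
    return direction

@torch.no_grad()
def perturbed_copy(model, direction, alpha):
    """Return a copy of `model` shifted by alpha * direction.
    (For large LLMs one would perturb in place and restore afterwards.)"""
    shifted = copy.deepcopy(model)
    for name, p in shifted.named_parameters():
        p.add_(alpha * direction[name])
    return shifted

def safety_landscape_1d(model, safety_score, alphas):
    """Evaluate `safety_score` (e.g., refusal rate on harmful prompts, in [0, 1])
    along one random direction; a flat, high-score region around alpha = 0
    is the "safety basin" described in the paper."""
    direction = random_direction(model)
    return [(alpha, safety_score(perturbed_copy(model, direction, alpha)))
            for alpha in alphas]
```

Averaging such scores over repeated sweeps (or over a 2D grid spanned by two random directions) yields a single landscape-based safety number in the spirit of the \method{} metric; the exact sampling and aggregation used in the paper may differ.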
Intriguingly, the results indicate that while LLMs such as GPT-3.5 Turbo and LLaMA-2 maintain safety within certain neighborhoods of parameter space, a small amount of adversarial finetuning data can rapidly dismantle these defenses. For example, both GPT-3.5 Turbo and LLaMA-2 fail to consistently reject harmful prompts after finetuning on as few as 10 harmful examples, illustrating significant vulnerabilities in the alignment process. These vulnerabilities are a practical concern for deploying LLMs, particularly when they are customized for specific applications with potentially malicious finetuning data.
The authors' exploration of the LLM safety landscape reveals several key insights:
- Safety Basin Consistency: Across various LLMs, including LLaMA-2, LLaMA-3, Vicuna, and Mistral, randomly perturbing model weights within a local neighborhood does not degrade safety; this flat, safe region is what the authors call the safety basin.
- Finetuning Vulnerability: Different LLMs exhibit varying degrees of vulnerability to finetuning, and the task-agnostic \method{} metric anticipates these risks without depending on the specifics of the finetuning dataset.
- System Prompt Efficacy: The analysis demonstrates the critical role the system prompt plays in model safety. For instance, the LLaMA-2 system prompt enhances safety across several models, and prompts optimized for specific safety goals prove similarly effective.
- Susceptibility to Jailbreak Attacks: Adversarial prompt attacks, or "jailbreaks," are sensitive to parameter perturbations. This suggests potential defense strategies, such as generating responses with a randomly perturbed copy of the model to reduce attack success rates (sketched after this list).
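As a concrete illustration of the last point, the sketch below generates a response from a randomly perturbed copy of the model, on the assumption (consistent with the paper's observation) that adversarial jailbreak suffixes are brittle to small weight perturbations while benign behavior is not. It assumes a Hugging Face causal LM and tokenizer; the noise scale `sigma` and its per-parameter scaling are illustrative choices, not values from the paper.

```python
import copy
import torch

@torch.no_grad()
def generate_with_weight_noise(model, tokenizer, prompt, sigma=1e-3, **gen_kwargs):
    """Sample a response from a noisy copy of the model: each parameter tensor
    receives Gaussian noise proportional to its root-mean-square magnitude."""
    noisy = copy.deepcopy(model)
    for p in noisy.parameters():
        rms = p.norm() / (p.numel() ** 0.5)        # typical element magnitude
        p.add_(sigma * rms * torch.randn_like(p))  # small relative perturbation
    inputs = tokenizer(prompt, return_tensors="pt").to(noisy.device)
    output_ids = noisy.generate(**inputs, **gen_kwargs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Whether such noise injection reduces attack success without harming helpfulness is an empirical question that the paper's landscape analysis helps answer; the snippet only shows where the perturbation would be applied at inference time.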
These insights guide future research directions in LLM safety, emphasizing the need for robust safety mechanisms that withstand adversarial finetuning and prompts. The identification of safety basins and effective system prompts could inform the development of more resilient LLM deployments, while the proposed \method{} metric provides a tool for preemptive risk assessment in model customization tasks.
In conclusion, this investigation into the safety landscape of LLMs reveals significant vulnerabilities in established alignment techniques, offering new perspectives on ensuring reliable model performance under adversarial conditions. The development of the \method{} metric and its application to finetuning tasks represent significant strides in understanding and improving model safety in practical AI systems. Future work will likely explore optimizing these safety basins and extending the insights to additional LLM architectures and deployment scenarios.