
What is in Your Safe Data? Identifying Benign Data that Breaks Safety (2404.01099v2)

Published 1 Apr 2024 in cs.LG, cs.AI, cs.CL, and cs.CR

Abstract: Current LLMs, even those tuned for safety and alignment, are susceptible to jailbreaking. Some have found that just further fine-tuning an aligned model with benign data (i.e., data without harmful content) surprisingly leads to substantial degradation in safety. We delve into the data-centric aspects of why benign fine-tuning inadvertently contributes to jailbreaking. First, we represent fine-tuning data through two lenses: representation and gradient spaces. Additionally, we propose a bi-directional anchoring method that, during the selection process, prioritizes data points that are close to harmful examples and far from benign ones. Our approach effectively identifies subsets of benign data that are more likely to degrade the model's safety after fine-tuning. Training on just 100 of these seemingly benign datapoints surprisingly leads to the fine-tuned model affirmatively responding to >70% of tested harmful requests, compared to <20% after fine-tuning on randomly selected data. We also observe that the selected data frequently appear as lists, bullet points, or math questions, indicating a systematic pattern in fine-tuning data that contributes to jailbreaking.

Identifying Benign Data Prone to Facilitating Jailbreaking in LLMs Through Fine-Tuning

Introduction

LLMs, despite rigorous safety and alignment fine-tuning, are prone to producing harmful or misaligned content when further fine-tuned on seemingly benign data. This paper explores how benign fine-tuning can inadvertently compromise safety, proposing a data-centric approach to identify potentially harmful subsets within benign data. By examining the fine-tuning process through representation and gradient spaces and introducing a bi-directional anchoring method, this research sheds light on the characteristics of benign data that disproportionately degrade model safety upon fine-tuning. The findings suggest that even limited exposure to certain benign data can drastically increase a model's propensity to output harmful content.

Representational and Gradient-Based Data Characterization

The paper characterizes benign fine-tuning data through representational and gradient-based features to determine how closely each example relates to known harmful examples. In representation matching, the final hidden states of model outputs are used to measure similarity between data points in representation space. Gradient matching instead leverages the directions in which model parameters are updated during fine-tuning, under the hypothesis that data points whose gradients also reduce the loss on harmful examples are likely to degrade safety.
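The sketch below illustrates how these two featurizations could be computed for a single example with a HuggingFace causal LM: the representation feature is the final hidden state of the last token, and the gradient feature is a flattened slice of the loss gradient. The model name, the last-token pooling, and the restriction to the final transformer block are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal featurization sketch (assumptions noted above); not the paper's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # illustrative; any aligned causal LM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def representation_feature(text: str) -> torch.Tensor:
    """Final hidden state of the last token, used for representation matching."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # hidden_states[-1] has shape (batch, seq_len, hidden_dim); pool the last token.
    return outputs.hidden_states[-1][0, -1, :]


def gradient_feature(text: str) -> torch.Tensor:
    """Flattened loss gradient, used for gradient matching."""
    inputs = tokenizer(text, return_tensors="pt")
    labels = inputs["input_ids"].clone()
    model.zero_grad()
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    # Restrict to the last transformer block to keep the feature dimension
    # manageable (layer name is model-specific; the paper's projection may differ).
    grads = [p.grad.flatten() for name, p in model.named_parameters()
             if p.grad is not None and "layers.31." in name]
    return torch.cat(grads)
```

Similarity between two examples can then be computed as the cosine similarity of these feature vectors in either space.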

Bi-Directional Anchoring for Data Selection

The paper introduces a bi-directional anchoring approach for gradient-based data selection: each benign candidate is scored both by its proximity to known harmful anchor examples and by its distance from safe, benign anchor examples. Considering attraction to harmfulness and repulsion from safety together yields a more nuanced assessment of the risk of fine-tuning on a particular benign data point.
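A minimal sketch of this selection rule, assuming precomputed feature vectors (representation or gradient features as above): each candidate is scored by its mean cosine similarity to harmful anchors minus its mean cosine similarity to safe anchors, and the top-scoring candidates are selected. The anchor sets, the mean aggregation, and the unweighted difference are assumptions for illustration, not necessarily the paper's exact scoring.

```python
# Bi-directional anchoring sketch over precomputed features; the paper's exact
# anchor sets and scoring may differ.
import torch
import torch.nn.functional as F


def bidirectional_anchor_scores(candidates: torch.Tensor,       # (N, d) benign candidates
                                harmful_anchors: torch.Tensor,  # (H, d) harmful examples
                                safe_anchors: torch.Tensor      # (S, d) safe examples
                                ) -> torch.Tensor:
    cand = F.normalize(candidates, dim=-1)
    harm = F.normalize(harmful_anchors, dim=-1)
    safe = F.normalize(safe_anchors, dim=-1)
    sim_harm = cand @ harm.T  # (N, H) cosine similarity to harmful anchors
    sim_safe = cand @ safe.T  # (N, S) cosine similarity to safe anchors
    # Attraction to harmfulness minus repulsion from safety.
    return sim_harm.mean(dim=1) - sim_safe.mean(dim=1)


def select_top_k(candidates, harmful_anchors, safe_anchors, k=100):
    """Indices of the k candidates most likely to degrade safety under this score."""
    scores = bidirectional_anchor_scores(candidates, harmful_anchors, safe_anchors)
    return torch.topk(scores, k).indices
```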

Empirical Evaluations on Model Safety

Empirical results underscore the efficacy of the proposed methods in identifying harmful subsets within benign datasets. Fine-tuning on merely 100 carefully selected benign examples markedly increased the model's likelihood of complying with harmful requests, demonstrating that these methods can reliably pinpoint data prone to undermining LLM safety. Specifically, fine-tuning on data chosen via representation matching and gradient matching substantially elevated the Attack Success Rate (ASR) of the tested LLMs relative to fine-tuning on randomly selected data.
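For concreteness, an ASR of this kind can be approximated as the fraction of harmful test prompts whose responses contain no refusal phrase. The keyword list below is a common heuristic and an assumption here; the paper's exact judging protocol may differ.

```python
# Keyword-based ASR approximation; the marker list is illustrative.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai", "i apologize")


def attack_success_rate(responses: list[str]) -> float:
    """Fraction of responses to harmful prompts that contain no refusal marker."""
    complied = sum(
        1 for r in responses
        if not any(marker in r.lower() for marker in REFUSAL_MARKERS)
    )
    return complied / max(len(responses), 1)
```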

Analysis of Potentially Harmful Data Patterns

Further investigation into the data selected via the proposed methods uncovered the frequent presence of list and bullet-point formats, as well as mathematical questions within the potentially harmful subsets. This pattern suggests that not only the content but also the structural presentation of fine-tuning data influences the safety of the resulting models.
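As a rough illustration of how such structural patterns might be flagged in practice, the heuristic below checks an example for bullet or numbered-list formatting and simple math cues; the regexes and thresholds are assumptions, not the paper's analysis code.

```python
import re

# Illustrative structural heuristics (assumed, not from the paper).
BULLET_RE = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", re.MULTILINE)
MATH_RE = re.compile(r"\d+\s*[-+*/=]\s*\d+|\\frac|\\sum|solve for", re.IGNORECASE)


def structural_flags(example: str) -> dict:
    """Flag list-like formatting and math content in a fine-tuning example."""
    bullet_lines = len(BULLET_RE.findall(example))
    return {
        "is_list_like": bullet_lines >= 3,
        "has_math": bool(MATH_RE.search(example)),
    }
```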

Reshaping Safe Fine-Tuning Practices

This paper’s findings carry significant implications for safe fine-tuning practices in AI development. By characterizing the benign data most likely to cause safety degradation, the work allows AI practitioners to refine data selection for fine-tuning and mitigate the risk of unintentionally compromising model safety. The approach for identifying potentially harmful benign data also opens a new avenue for developing more robust safety evaluations and fine-tuning protocols.

Conclusion

The research presented in this paper highlights the nuanced and sometimes counterintuitive ways in which benign data can facilitate the degradation of safety in LLMs during fine-tuning. Through a detailed analysis of fine-tuning data in both representation and gradient spaces and the introduction of a novel bi-directional anchoring method, this work not only elucidates the mechanisms behind this phenomenon but also provides practical tools for identifying and mitigating risks. As LLMs continue to be fine-tuned for a myriad of applications, understanding and addressing the potential for benign data to compromise model safety will be paramount for ethical and responsible AI development.

Authors (3)
  1. Luxi He (9 papers)
  2. Mengzhou Xia (34 papers)
  3. Peter Henderson (67 papers)
Citations (22)