
What is in Your Safe Data? Identifying Benign Data that Breaks Safety (2404.01099v2)

Published 1 Apr 2024 in cs.LG, cs.AI, cs.CL, and cs.CR

Abstract: Current LLMs, even those tuned for safety and alignment, are susceptible to jailbreaking. Some have found that just further fine-tuning an aligned model with benign data (i.e., data without harmful content) surprisingly leads to substantial degradation in safety. We delve into the data-centric aspects of why benign fine-tuning inadvertently contributes to jailbreaking. First, we represent fine-tuning data through two lenses: representation and gradient spaces. Additionally, we propose a bi-directional anchoring method that, during the selection process, prioritizes data points that are close to harmful examples and far from benign ones. Our approach effectively identifies subsets of benign data that are more likely to degrade the model's safety after fine-tuning. Training on just 100 of these seemingly benign datapoints surprisingly leads to the fine-tuned model affirmatively responding to >70% of tested harmful requests, compared to <20% after fine-tuning on randomly selected data. We also observe that the selected data frequently appear as lists, bullet points, or math questions, indicating a systematic pattern in fine-tuning data that contributes to jailbreaking.

Identifying Benign Data Prone to Facilitating Jailbreaking in LLMs Through Fine-Tuning

Introduction

LLMs, despite rigorous safety and alignment fine-tuning, are prone to producing harmful or misaligned content when further fine-tuned on seemingly benign data. This paper explores how benign fine-tuning can inadvertently compromise safety, proposing a data-centric approach to identify potentially harmful subsets within benign data. By examining the fine-tuning process through representation and gradient spaces and introducing a bi-directional anchoring method, this research sheds light on the characteristics of benign data that disproportionately degrade model safety upon fine-tuning. The findings suggest that even limited exposure to certain benign data can drastically increase a model's propensity to output harmful content.

Representational and Gradient-Based Data Characterization

The paper characterizes benign fine-tuning data through representational and gradient-based features to determine how closely each example relates to known harmful examples. In representation matching, the final hidden states of model outputs are used to measure similarity between data points in representation space. Gradient matching instead leverages the directions in which model parameters are updated during fine-tuning, under the hypothesis that data points whose gradients also reduce the loss on harmful examples are likely to degrade safety.
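The sketch below illustrates how these two featurizations could be computed for a single example with a HuggingFace causal LM: the representation feature is the final hidden state of the last token, and the gradient feature is a flattened slice of the loss gradient. The model name, the last-token pooling, and the restriction to the final transformer block are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal featurization sketch (assumptions noted above); not the paper's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # illustrative; any aligned causal LM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def representation_feature(text: str) -> torch.Tensor:
    """Final hidden state of the last token, used for representation matching."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # hidden_states[-1] has shape (batch, seq_len, hidden_dim); pool the last token.
    return outputs.hidden_states[-1][0, -1, :]


def gradient_feature(text: str) -> torch.Tensor:
    """Flattened loss gradient, used for gradient matching."""
    inputs = tokenizer(text, return_tensors="pt")
    labels = inputs["input_ids"].clone()
    model.zero_grad()
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    # Restrict to the last transformer block to keep the feature dimension
    # manageable (layer name is model-specific; the paper's projection may differ).
    grads = [p.grad.flatten() for name, p in model.named_parameters()
             if p.grad is not None and "layers.31." in name]
    return torch.cat(grads)
```

Similarity between two examples can then be computed as the cosine similarity of these feature vectors in either space.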

Bi-Directional Anchoring for Data Selection

The paper introduces a bi-directional anchoring approach for gradient-based data selection: each benign candidate is scored both by its proximity to known harmful anchor examples and by its distance from safe, benign anchor examples. Considering attraction to harmfulness and repulsion from safety together yields a more nuanced assessment of the risk of fine-tuning on a particular benign data point.
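A minimal sketch of this selection rule, assuming precomputed feature vectors (representation or gradient features as above): each candidate is scored by its mean cosine similarity to harmful anchors minus its mean cosine similarity to safe anchors, and the top-scoring candidates are selected. The anchor sets, the mean aggregation, and the unweighted difference are assumptions for illustration, not necessarily the paper's exact scoring.

```python
# Bi-directional anchoring sketch over precomputed features; the paper's exact
# anchor sets and scoring may differ.
import torch
import torch.nn.functional as F


def bidirectional_anchor_scores(candidates: torch.Tensor,       # (N, d) benign candidates
                                harmful_anchors: torch.Tensor,  # (H, d) harmful examples
                                safe_anchors: torch.Tensor      # (S, d) safe examples
                                ) -> torch.Tensor:
    cand = F.normalize(candidates, dim=-1)
    harm = F.normalize(harmful_anchors, dim=-1)
    safe = F.normalize(safe_anchors, dim=-1)
    sim_harm = cand @ harm.T  # (N, H) cosine similarity to harmful anchors
    sim_safe = cand @ safe.T  # (N, S) cosine similarity to safe anchors
    # Attraction to harmfulness minus repulsion from safety.
    return sim_harm.mean(dim=1) - sim_safe.mean(dim=1)


def select_top_k(candidates, harmful_anchors, safe_anchors, k=100):
    """Indices of the k candidates most likely to degrade safety under this score."""
    scores = bidirectional_anchor_scores(candidates, harmful_anchors, safe_anchors)
    return torch.topk(scores, k).indices
```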

Empirical Evaluations on Model Safety

Empirical results underscore the efficacy of the proposed methods in identifying harmful subsets within benign datasets. Fine-tuning on merely 100 carefully selected benign examples markedly increased the model's likelihood of complying with harmful requests, demonstrating that these methods can reliably pinpoint data prone to undermining LLM safety. Specifically, fine-tuning on data chosen via representation matching and gradient matching substantially elevated the Attack Success Rate (ASR) of the tested LLMs relative to fine-tuning on randomly selected data.
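For concreteness, an ASR of this kind can be approximated as the fraction of harmful test prompts whose responses contain no refusal phrase. The keyword list below is a common heuristic and an assumption here; the paper's exact judging protocol may differ.

```python
# Keyword-based ASR approximation; the marker list is illustrative.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai", "i apologize")


def attack_success_rate(responses: list[str]) -> float:
    """Fraction of responses to harmful prompts that contain no refusal marker."""
    complied = sum(
        1 for r in responses
        if not any(marker in r.lower() for marker in REFUSAL_MARKERS)
    )
    return complied / max(len(responses), 1)
```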

Analysis of Potentially Harmful Data Patterns

Further investigation into the data selected via the proposed methods uncovered the frequent presence of list and bullet-point formats, as well as mathematical questions within the potentially harmful subsets. This pattern suggests that not only the content but also the structural presentation of fine-tuning data influences the safety of the resulting models.
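As a rough illustration of how such structural patterns might be flagged in practice, the heuristic below checks an example for bullet or numbered-list formatting and simple math cues; the regexes and thresholds are assumptions, not the paper's analysis code.

```python
import re

# Illustrative structural heuristics (assumed, not from the paper).
BULLET_RE = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+", re.MULTILINE)
MATH_RE = re.compile(r"\d+\s*[-+*/=]\s*\d+|\\frac|\\sum|solve for", re.IGNORECASE)


def structural_flags(example: str) -> dict:
    """Flag list-like formatting and math content in a fine-tuning example."""
    bullet_lines = len(BULLET_RE.findall(example))
    return {
        "is_list_like": bullet_lines >= 3,
        "has_math": bool(MATH_RE.search(example)),
    }
```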

Reshaping Safe Fine-Tuning Practices

This paper’s findings carry significant implications for safe fine-tuning practices in AI development. By characterizing the benign data most likely to cause safety degradation, the work allows AI practitioners to refine data selection for fine-tuning and mitigate the risk of unintentionally compromising model safety. The approach for identifying potentially harmful benign data also opens a new avenue for developing more robust safety evaluations and fine-tuning protocols.

Conclusion

The research presented in this paper highlights the nuanced and sometimes counterintuitive ways in which benign data can facilitate the degradation of safety in LLMs during fine-tuning. Through a detailed analysis of fine-tuning data in both representation and gradient spaces and the introduction of a novel bi-directional anchoring method, this work not only elucidates the mechanisms behind this phenomenon but also provides practical tools for identifying and mitigating risks. As LLMs continue to be fine-tuned for a myriad of applications, understanding and addressing the potential for benign data to compromise model safety will be paramount for ethical and responsible AI development.

Authors (3)
  1. Luxi He (9 papers)
  2. Mengzhou Xia (34 papers)
  3. Peter Henderson (67 papers)
Citations (22)