
Immunization against harmful fine-tuning attacks (2402.16382v2)

Published 26 Feb 2024 in cs.CL

Abstract: LLMs are often trained with safety guards intended to prevent harmful text generation. However, such safety training can be removed by fine-tuning the LLM on harmful datasets. While this emerging threat (harmful fine-tuning attacks) has been characterized by previous work, there is little understanding of how we should proceed in constructing and validating defenses against these attacks, especially when defenders have no control over the fine-tuning process. We introduce a formal framework based on the training budget of an attacker, which we call "Immunization" conditions. Using a formal characterization of the harmful fine-tuning problem, we provide a thorough description of what a successful defense must comprise and establish a set of guidelines for how rigorous, confidence-building defense research should proceed.
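As a rough sketch (not the paper's own notation), the immunization idea can be read as a budget-bounded robustness condition: a safety-trained model \theta counts as immunized if no fine-tuning attack within the attacker's training budget can push its harmfulness past an acceptable tolerance. The symbols below (\mathcal{A}, \mathrm{cost}, B, D_{\mathrm{harm}}, \mathrm{harm}, \tau) are placeholders introduced here for illustration, not the paper's definitions.

\forall \mathcal{A} \ \text{with}\ \mathrm{cost}(\mathcal{A}) \le B:\qquad \mathrm{harm}\big(\mathcal{A}(\theta, D_{\mathrm{harm}})\big) \le \tau

Here \mathcal{A}(\theta, D_{\mathrm{harm}}) denotes the model obtained by running an attack procedure \mathcal{A} (e.g., full fine-tuning or LoRA) on \theta with a harmful dataset, B is the attacker's training budget, and \tau is the maximum harmfulness the defender tolerates. A complementary condition would require that benign fine-tuning within the same budget still succeeds, so that the defense does not simply destroy trainability.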
