Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation (2409.01586v4)

Published 3 Sep 2024 in cs.CL and cs.AI

Abstract: Harmful fine-tuning attacks pose serious safety concerns for LLM fine-tuning-as-a-service. While existing defenses attempt to mitigate the issue, their performance remains far from satisfactory, and the root cause of the problem has not been fully uncovered. In this paper, we show that harmful perturbation of the model weights is a probable cause of broken alignment. To attenuate the negative impact of harmful perturbation, we propose an alignment-stage solution dubbed Booster. Technically, we append a loss regularizer to the original alignment loss in the alignment stage's optimization. The regularizer attenuates the reduction in the model's harmful loss after a simulated harmful perturbation, thereby mitigating the subsequent fine-tuning risk. Empirical results show that Booster effectively reduces the harmful score of fine-tuned models while maintaining downstream-task performance. Our code is available at https://github.com/git-disl/Booster.
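The abstract describes the mechanism only at a high level: alongside the alignment loss f(w), a regularizer penalizes how much the harmful loss h(w) would drop after a simulated one-step harmful perturbation of the weights, roughly min_w f(w) + λ·[h(w) − h(w − α·∇h(w)/‖∇h(w)‖)]. The PyTorch sketch below illustrates one training step under that reading. The first-order gradient approximation, the helper loss_fn(model, batch), and the values of λ and α are our assumptions for illustration, not the authors' exact algorithm; see the paper and linked repository for the real implementation.

```python
import torch

def booster_step(model, align_batch, harmful_batch, loss_fn, optimizer,
                 lam=5.0, alpha=0.1):
    """One alignment-stage update with a Booster-style regularizer (sketch).

    Update direction (first-order approximation, our reading of the abstract):
        grad f(w) + lam * (grad h(w) - grad h(w'))
    with w' = w - alpha * g / ||g||, g = grad h(w), where f is the alignment
    loss and h the harmful loss. `loss_fn(model, batch)` is a hypothetical
    helper returning a scalar loss; lam and alpha are illustrative values.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient g of the harmful loss h at the current weights w.
    g = torch.autograd.grad(loss_fn(model, harmful_batch), params)
    g_norm = torch.sqrt(sum((gi ** 2).sum() for gi in g)) + 1e-12

    # Simulate one harmful fine-tuning step: w' = w - alpha * g / ||g||.
    with torch.no_grad():
        for p, gi in zip(params, g):
            p.sub_(alpha * gi / g_norm)
    g_pert = torch.autograd.grad(loss_fn(model, harmful_batch), params)
    with torch.no_grad():  # undo the simulated perturbation
        for p, gi in zip(params, g):
            p.add_(alpha * gi / g_norm)

    # Alignment gradient plus the regularizer's first-order gradient.
    optimizer.zero_grad()
    loss_fn(model, align_batch).backward()
    with torch.no_grad():
        for p, gi, gpi in zip(params, g, g_pert):
            if p.grad is None:
                p.grad = torch.zeros_like(p)
            p.grad.add_(lam * (gi - gpi))
    optimizer.step()
```

Intuitively, if a single simulated harmful step barely lowers h at the aligned weights, later harmful fine-tuning should also make slower progress, which is the attenuation effect the abstract claims.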

