A safety realignment framework via subspace-oriented model fusion for large language models (2405.09055v1)

Published 15 May 2024 in cs.CL

Abstract: The current safeguard mechanisms for LLMs are susceptible to jailbreak attacks, making them inherently fragile. Even fine-tuning on apparently benign data for downstream tasks can jeopardize safety. One potential solution is to conduct safety fine-tuning subsequent to downstream fine-tuning. However, there is a risk of catastrophic forgetting during safety fine-tuning, where LLMs may regain safety measures but lose the task-specific knowledge acquired during downstream fine-tuning. In this paper, we introduce a safety realignment framework through subspace-oriented model fusion (SOMF), aiming to combine the safeguard capabilities of the initially aligned model and the current fine-tuned model into a realigned model. Our approach begins by disentangling all task vectors from the weights of each fine-tuned model. We then identify safety-related regions within these vectors via subspace masking techniques. Finally, we fuse the initially safety-aligned LLM with all task vectors based on the identified safety subspace. We validate that our safety realignment framework satisfies the safety requirements of a single fine-tuned model as well as multiple models during their fusion. Our findings confirm that SOMF preserves safety without notably compromising performance on downstream tasks, including instruction following in Chinese, English, and Hindi, as well as problem-solving capabilities in Code and Math.
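
The fusion step described in the abstract lends itself to a compact illustration. Below is a minimal PyTorch sketch, not the authors' implementation: task vectors are taken as weight deltas against the initially aligned model, and a hypothetical per-parameter binary mask (`safety_masks`, standing in for the paper's learned subspace masks) suppresses updates inside safety-related regions so the aligned behavior is retained.

```python
import torch

def extract_task_vector(aligned, finetuned):
    """Task vector: element-wise difference between a fine-tuned
    model's weights and the initially aligned model's weights."""
    return {name: finetuned[name] - aligned[name] for name in aligned}

def somf_fuse(aligned, task_vectors, safety_masks, scale=1.0):
    """Fuse the aligned model with one or more task vectors.

    safety_masks[name] is a hypothetical 0/1 tensor, same shape as the
    parameter: 1 marks coordinates inside the identified safety subspace,
    whose task-specific updates are zeroed to preserve alignment.
    """
    fused = {}
    for name, w0 in aligned.items():
        delta = sum(tv[name] for tv in task_vectors)  # combine all task vectors
        fused[name] = w0 + scale * (1.0 - safety_masks[name]) * delta
    return fused

# Toy usage with tensors standing in for model state dicts.
aligned = {"w": torch.zeros(4)}
finetuned = {"w": torch.tensor([0.5, -0.3, 0.2, 0.1])}
tv = extract_task_vector(aligned, finetuned)
mask = {"w": torch.tensor([1.0, 0.0, 0.0, 1.0])}  # positions 0, 3 safety-critical
print(somf_fuse(aligned, [tv], mask))  # {'w': tensor([0.0, -0.3, 0.2, 0.0])}
```

In the paper the safety masks are identified by subspace masking rather than given a priori; the sketch only shows how such a mask would gate task-vector fusion against the initially aligned weights.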

Authors (5)
  1. Xin Yi (37 papers)
  2. Shunfan Zheng (5 papers)
  3. Linlin Wang (35 papers)
  4. Xiaoling Wang (42 papers)
  5. Liang He (202 papers)
Citations (13)