No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks (2405.16229v1)

Published 25 May 2024 in cs.CL and cs.CR

Abstract: The existing safety alignment of LLMs has been found to be fragile and can be easily compromised through different strategies, such as fine-tuning on a few harmful examples or manipulating the prefix of the generation results. However, the attack mechanisms of these strategies are still underexplored. In this paper, we ask the following question: while these approaches can all significantly compromise safety, do their attack mechanisms exhibit strong similarities? To answer this question, we break down the safeguarding process of an LLM, when it encounters harmful instructions, into three stages: (1) recognizing harmful instructions, (2) generating an initial refusal tone, and (3) completing the refusal response. Accordingly, we investigate whether and how different attack strategies influence each stage of this safeguarding process. We use techniques such as the logit lens and activation patching to identify the model components that drive specific behaviors, and we apply cross-model probing to examine representation shifts after an attack. In particular, we analyze the two most representative types of attack approaches: Explicit Harmful Attack (EHA) and Identity-Shifting Attack (ISA). Surprisingly, we find that their attack mechanisms diverge dramatically. Unlike ISA, EHA tends to aggressively target the harmful-recognition stage. While both EHA and ISA disrupt the latter two stages, the extent and mechanisms of their attacks differ significantly. Our findings underscore the importance of understanding LLMs' internal safeguarding process and suggest that diverse defense mechanisms are required to effectively cope with various types of attacks.
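
For readers unfamiliar with the logit lens mentioned in the abstract, the sketch below illustrates the basic idea: intermediate hidden states are projected through the model's final layer norm and unembedding matrix to reveal which token the model favors at each layer. This is a minimal illustration, not the authors' code; it assumes PyTorch and Hugging Face transformers and uses GPT-2 as a stand-in, whereas the paper studies safety-aligned chat LLMs.

```python
# Minimal logit-lens sketch (illustrative only, not the paper's implementation).
# Each layer's hidden state is passed through the final layer norm and the
# unembedding head to see which next token the model "leans toward" mid-stack.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Tell me how to"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (num_layers + 1) tensors of shape [batch, seq, d_model].
for layer_idx, h in enumerate(out.hidden_states):
    # Apply the final layer norm before unembedding, as in the standard logit lens.
    logits = model.lm_head(model.transformer.ln_f(h[:, -1, :]))
    top_token = tokenizer.decode(logits.argmax(dim=-1))
    print(f"layer {layer_idx:2d} -> predicted next token: {top_token!r}")
```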

Authors (6)
  1. Chak Tou Leong (22 papers)
  2. Yi Cheng (78 papers)
  3. Kaishuai Xu (16 papers)
  4. Jian Wang (966 papers)
  5. Hanlin Wang (17 papers)
  6. Wenjie Li (183 papers)
Citations (10)