Robust LLM safeguarding via refusal feature adversarial training (2409.20089v2)

Published 30 Sep 2024 in cs.LG, cs.CL, and cs.CR

Abstract: LLMs are vulnerable to adversarial attacks that can elicit harmful responses. Defending against such attacks remains challenging due to the opacity of jailbreaking mechanisms and the high computational cost of training LLMs robustly. We demonstrate that adversarial attacks share a universal mechanism for circumventing LLM safeguards that works by ablating a dimension in the residual stream embedding space called the refusal feature. We further show that the operation of refusal feature ablation (RFA) approximates the worst-case perturbation of offsetting model safety. Based on these findings, we propose Refusal Feature Adversarial Training (ReFAT), a novel algorithm that efficiently performs LLM adversarial training by simulating the effect of input-level attacks via RFA. Experimental results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks, with considerably less computational overhead compared to existing adversarial training methods.
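
The abstract's two key operations lend themselves to a compact illustration. Below is a minimal sketch, under assumptions not stated in the abstract (PyTorch, a Hugging Face causal LM with a LLaMA-style `model.model.layers` layout, and a difference-of-means estimate of the refusal direction), of how refusal feature ablation and a ReFAT-style training step could be wired together. All function and parameter names (`estimate_refusal_direction`, `refat_step`, `p_ablate`) are hypothetical, not the authors' implementation.

```python
# Hypothetical sketch of refusal feature ablation (RFA) and a ReFAT-style
# training step; names, the difference-of-means estimator, and the hook-based
# ablation are assumptions, not the paper's released code.
import torch


def estimate_refusal_direction(harmful_acts: torch.Tensor,
                               harmless_acts: torch.Tensor) -> torch.Tensor:
    """Estimate the refusal feature as the normalized difference between mean
    residual-stream activations on harmful vs. harmless prompts."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()


def ablate_refusal_feature(resid: torch.Tensor,
                           direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of the residual stream along the refusal
    direction, simulating the effect of a successful jailbreak."""
    coeff = resid @ direction                      # projection coefficients
    return resid - coeff.unsqueeze(-1) * direction


def refat_step(model, input_ids, labels, direction, p_ablate=0.5):
    """One ReFAT-style step: with probability p_ablate, run the forward pass
    with the refusal feature ablated, then optimize the usual refusal loss
    on the perturbed activations."""
    hooks = []
    if torch.rand(()) < p_ablate:
        def _ablate_hook(_module, _inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = ablate_refusal_feature(hidden, direction.to(hidden.dtype))
            return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
        for layer in model.model.layers:           # assumes LLaMA-style layout
            hooks.append(layer.register_forward_hook(_ablate_hook))
    try:
        loss = model(input_ids=input_ids, labels=labels).loss
        loss.backward()
    finally:
        for h in hooks:
            h.remove()
    return loss
```

Because the ablation is applied to hidden activations rather than searching over input tokens, each training step costs roughly one forward/backward pass, which is the source of the efficiency gain the abstract claims over input-level adversarial training.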
