Learning diverse attacks on large language models for robust red-teaming and safety tuning (2405.18540v1)

Published 28 May 2024 in cs.CL, cs.CR, and cs.LG

Abstract: Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of LLMs. Developing effective protection against many modes of attack prompts requires discovering diverse attacks. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker LLM to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts. We find that the attacks generated by our method are effective against a wide range of target LLMs, both with and without safety tuning, and transfer well between target LLMs. Finally, we demonstrate that models safety-tuned using a dataset of red-teaming prompts generated by our method are robust to attacks from other RL-based red-teaming approaches.
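
The method the abstract describes has two phases: GFlowNet fine-tuning trains the attacker so that the probability of generating a prompt is (approximately) proportional to its reward, which spreads probability mass over many effective attacks rather than collapsing onto one; a secondary smoothing phase then re-fits the attacker on high-reward prompts collected during the first phase. Below is a minimal, hypothetical sketch of the trajectory-balance-style loss such GFlowNet fine-tuning of an autoregressive sampler minimizes. The tiny recurrent policy, the toy reward, and all hyperparameters are placeholders for illustration, not the paper's implementation (which fine-tunes a pretrained LLM against a reward based on an auxiliary toxicity classifier).

```python
import torch
import torch.nn as nn

# Toy autoregressive "attacker" over a tiny vocabulary. In the paper's setting
# this would be a pretrained LLM; here it is a stand-in so the sketch runs.
VOCAB, MAXLEN = 16, 8

class TinyAttacker(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, 32)   # +1 for a BOS token
        self.rnn = nn.GRU(32, 64, batch_first=True)
        self.head = nn.Linear(64, VOCAB)
        self.log_z = nn.Parameter(torch.zeros(1))  # learned estimate of log Z

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)                        # next-token logits

def sample_with_logprob(model, batch_size):
    """Sample prompts autoregressively, accumulating log p_theta(prompt)."""
    tokens = torch.full((batch_size, 1), VOCAB)    # start from BOS
    logp = torch.zeros(batch_size)
    for _ in range(MAXLEN):
        dist = torch.distributions.Categorical(logits=model(tokens)[:, -1])
        nxt = dist.sample()
        logp = logp + dist.log_prob(nxt)
        tokens = torch.cat([tokens, nxt[:, None]], dim=1)
    return tokens[:, 1:], logp                     # strip BOS

def toy_log_reward(prompts):
    """Placeholder for log R(x). In the paper, the reward reflects how harmful
    the target LLM's response to the prompt is (e.g., a toxicity classifier's
    score); here we simply reward occurrences of an arbitrary token."""
    return (prompts == 3).float().sum(dim=1)

model = TinyAttacker()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):
    prompts, logp = sample_with_logprob(model, batch_size=32)
    # Trajectory balance: drive (log Z + log p_theta(x) - log R(x))^2 to zero.
    # At the optimum, p_theta(x) = R(x) / Z, i.e. sampling proportional to reward.
    loss = (model.log_z + logp - toy_log_reward(prompts)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the optimum of this loss is p_theta(x) proportional to R(x) rather than argmax of R, the attacker is pushed toward covering all high-reward prompts instead of a single mode, which is the property the abstract contrasts with mode-collapsing RL objectives. The abstract's secondary smoothing phase would then, roughly, collect the highest-reward prompts encountered during this training and re-fit the attacker on them by maximum likelihood; that detail lives in the paper body rather than this sketch.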

Authors (11)
  1. Seanie Lee
  2. Minsu Kim
  3. Lynn Cherif
  4. David Dobre
  5. Juho Lee
  6. Sung Ju Hwang
  7. Kenji Kawaguchi
  8. Gauthier Gidel
  9. Yoshua Bengio
  10. Nikolay Malkin
  11. Moksh Jain