DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints (2405.19026v2)

Published 29 May 2024 in cs.LG, cs.AI, cs.CL, and cs.CR

Abstract: Recent advances in LLM assistants have made them indispensable, raising significant concerns over managing their safety. Automated red teaming offers a promising alternative to the labor-intensive and error-prone manual probing for vulnerabilities, providing more consistent and scalable safety evaluations. However, existing approaches often compromise diversity by focusing on maximizing attack success rate. Additionally, methods that decrease the cosine similarity from historical embeddings with semantic diversity rewards lead to novelty stagnation as history grows. To address these issues, we introduce DiveR-CT, which relaxes conventional constraints on the objective and semantic reward, granting greater freedom for the policy to enhance diversity. Our experiments demonstrate DiveR-CT's marked superiority over baselines by 1) generating data that perform better in various diversity metrics across different attack success rate levels, 2) better-enhancing resiliency in blue team models through safety tuning based on collected data, 3) allowing dynamic control of objective weights for reliable and controllable attack success rates, and 4) reducing susceptibility to reward overoptimization. Overall, our method provides an effective and efficient approach to LLM red teaming, accelerating real-world deployment.


Summary

  • The paper introduces a novel red teaming method that relaxes reward constraints to enhance query diversity without sacrificing attack success rates.
  • The paper employs constrained reinforcement learning and a dynamic semantic diversity reward to generate more varied and novel attack queries compared to previous methods.
  • The paper demonstrates experimentally that its approach mitigates reward overoptimization and that safety-tuning blue team models on the collected data strengthens their resilience.

A Formal Analysis of DiveR-CT in Enhancing Automated Red Teaming for LLM Safety

Managing the safety of LLMs is of paramount importance as they become increasingly prevalent in diverse applications. Manual red teaming, where experts identify vulnerabilities by interacting with these models, is labor-intensive and subjective. Consequently, automated red teaming has emerged as a promising substitute, offering consistency and scalability. However, existing methods often compromise data diversity in their pursuit of maximal attack success rates (ASR). This paper introduces a novel approach, Diversity-enhanced Red teaming with Relaxing ConstrainTs (DiveR-CT), which enhances diversity without sacrificing effectiveness.

Methodology

The methodology centers on treating unsafe rewards as threshold constraints rather than strict optimization targets. This broader formulation lets the policy balance achieving successful attacks with maintaining semantic diversity. A further contribution is a dynamic nearest-neighbor reward for measuring semantic diversity, which addresses limitations observed in previous methods such as curiosity-driven red teaming (CRT).

Constrained Objectives

In traditional red teaming, the objective predominantly focuses on maximizing ASR, which leads to a narrow set of attack queries and potential overoptimization. DiveR-CT relaxes this by employing constrained reinforcement learning (CRL), in which unsafe rewards are treated as constraints. This shift not only broadens the space of viable queries but also mitigates overfitting of the red team policy to high-confidence unsafe scores. Experimental results validate that this approach significantly improves the diversity of red teaming outputs across different ASR levels.
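
To make the relaxation concrete, one illustrative formulation (the notation below is an assumption for exposition, not the paper's exact equations) casts the unsafe reward as a threshold constraint and handles it via a Lagrangian relaxation:

```latex
% Maximize diversity subject to a floor on the unsafe (attack-success) reward:
\max_{\pi} \; \mathbb{E}_{x \sim \pi}\big[ r_{\mathrm{div}}(x) \big]
\quad \text{s.t.} \quad
\mathbb{E}_{x \sim \pi}\big[ r_{\mathrm{unsafe}}(x) \big] \ge \tau

% Lagrangian relaxation with multiplier \lambda \ge 0:
\mathcal{L}(\pi, \lambda) =
\mathbb{E}_{x \sim \pi}\big[ r_{\mathrm{div}}(x) \big]
+ \lambda \Big( \mathbb{E}_{x \sim \pi}\big[ r_{\mathrm{unsafe}}(x) \big] - \tau \Big)
```

Under this view, any policy that clears the threshold \(\tau\) satisfies the constraint, so the optimizer is free to spend its remaining capacity on diversity rather than pushing the unsafe score ever higher.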

Dynamic Semantic Diversity Reward

A novel semantic reward mechanism based on a nearest-neighbor approach is proposed, ensuring the reward adapts as the history of generated queries grows. This mitigates the diminishing returns seen in methods like CRT, where rewards based on similarity to the full set of historical embeddings lead to novelty stagnation. DiveR-CT's use of dynamic targets fosters uniform coverage of the semantic space, prompting the generation of semantically diverse attack queries.
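
The sketch below illustrates the general idea of a k-nearest-neighbor diversity reward; the function name, the choice of k, and the use of unit-normalized sentence embeddings are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def knn_diversity_reward(query_emb: np.ndarray,
                         history: np.ndarray,
                         k: int = 16) -> float:
    """Illustrative k-nearest-neighbor semantic diversity reward.

    query_emb: (d,) unit-normalized embedding of the new attack query
               (e.g. from a sentence encoder).
    history:   (n, d) unit-normalized embeddings of past queries.

    Scoring novelty against only the k closest past queries keeps the
    signal informative as the archive grows, unlike averaging distance
    over the entire history, which saturates.
    """
    if history.shape[0] == 0:
        return 1.0  # maximal novelty when nothing has been generated yet
    sims = history @ query_emb              # cosine similarities (unit vectors)
    k = min(k, sims.shape[0])
    top_k = np.sort(sims)[-k:]              # the k most similar neighbors
    return float(np.mean(1.0 - top_k))      # mean distance to local neighborhood
```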

Experimental Validation

The effectiveness of DiveR-CT was validated through comprehensive experiments. The researchers report significant findings in several areas:

  1. Diversity Metrics: DiveR-CT outperforms existing methods in both lexical and semantic diversity metrics. Across various ASR levels, DiveR-CT consistently generates more diverse and semantically rich queries compared to CRT and traditional RL-based methods.
  2. Attack Success Rate: The proposed method supports dynamic control of objective weights, enabling reliable ASR adjustment. This flexibility allows the attack success rate to be steered toward a target level without compromising the quality and diversity of the generated queries (a minimal sketch of such a weight update follows this list).
  3. Overoptimization Resistance: By avoiding strict maximization of unsafe reward scores, DiveR-CT shows reduced susceptibility to reward overoptimization. The generated queries transfer well to a held-out safety classifier not seen during training, evidencing the method’s generalizability and robustness.
  4. Enhanced Blue Team Resilience: Fine-tuning blue team models with data generated by DiveR-CT improves their resilience to adversarial attacks more effectively than when using data from existing methods. This is attributed to the higher diversity and broader range of vulnerabilities explored by DiveR-CT.
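
One simple way to realize dynamic weight control is dual ascent on the constraint's Lagrange multiplier. The function below is a hedged sketch under that assumption; the names, learning rate, and plain projected-gradient step are illustrative, not the paper's code:

```python
def update_lagrange_multiplier(lmbda: float,
                               mean_unsafe_reward: float,
                               threshold: float,
                               lr: float = 0.05) -> float:
    """Illustrative dual-ascent step for a thresholded unsafe-reward constraint.

    Raises the weight on the attack-success signal when the batch average
    falls below the target threshold, and lowers it once the constraint
    is comfortably satisfied, freeing capacity for diversity rewards.
    """
    violation = threshold - mean_unsafe_reward   # > 0 when constraint is violated
    return max(0.0, lmbda + lr * violation)      # project onto lambda >= 0
```

Raising the threshold should push the policy toward higher ASR, while lowering it leaves more slack for diversity, which matches the controllability reported above.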

The experiments extended to varied scenarios, including different safety classifiers and more robust blue team targets such as Llama-2-7b-chat-hf and Meta-Llama-3-8B-Instruct. In all instances, DiveR-CT maintained controlled ASR and superior diversity metrics, reinforcing its applicability across contexts.

Implications and Future Directions

The practical implications of DiveR-CT are significant. By enabling more comprehensive and diverse automated red teaming, this method enhances the safety protocols of LLMs, ensuring they are robust against a wider array of potential exploits. From a theoretical perspective, DiveR-CT's approach to incorporating constraints in reinforcement learning opens new avenues for balancing objectives in adversarial settings.

Future research could explore extending DiveR-CT’s principles to multi-turn interactions and incorporating domain knowledge to ensure uniform topic coverage in red teaming queries. Additionally, the use of dynamic thresholds could be refined further to adapt to evolving contexts and model behaviors.

Conclusion

DiveR-CT represents an innovative step in automated red teaming, addressing the critical balance between attack effectiveness and query diversity. By employing constrained objectives and dynamic rewards, it alleviates the issues inherent in existing methods, such as overoptimization and limited query diversity. This method significantly contributes to enhancing the safety and robustness of LLMs, mitigating potential risks in their deployment. As automated red teaming continues to evolve, DiveR-CT stands out as a pivotal development in ensuring the comprehensive evaluation and fortification of LLMs against adversarial threats.