DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints (2405.19026v2)
Abstract: Recent advances in LLM assistants have made them indispensable, raising significant concerns over managing their safety. Automated red teaming offers a promising alternative to labor-intensive and error-prone manual probing for vulnerabilities, providing more consistent and scalable safety evaluations. However, existing approaches often compromise diversity by focusing on maximizing the attack success rate. Moreover, methods that reward semantic diversity by decreasing cosine similarity to historical embeddings suffer novelty stagnation as the history grows. To address these issues, we introduce DiveR-CT, which relaxes the conventional constraints on the objective and the semantic reward, granting the policy greater freedom to enhance diversity. Our experiments demonstrate DiveR-CT's marked superiority over baselines: 1) it generates data that score better on various diversity metrics across different attack success rate levels; 2) it better enhances the resilience of blue-team models through safety tuning on the collected data; 3) it allows dynamic control of objective weights for reliable and controllable attack success rates; and 4) it reduces susceptibility to reward overoptimization. Overall, our method provides an effective and efficient approach to LLM red teaming, accelerating real-world deployment.
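The novelty-stagnation problem the abstract alludes to can be illustrated with a minimal sketch (this is an illustration of the general idea, not the paper's actual implementation; all function names are hypothetical). A diversity reward computed against the *mean* similarity to all historical embeddings dilutes as the history grows, whereas comparing only against the nearest past embeddings keeps the signal informative:

```python
import math

def cosine_sim(a, b):
    # Plain cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mean_history_reward(emb, history):
    # Reward = 1 - mean cosine similarity to ALL past embeddings.
    # As the history grows, each new sample moves the mean less and
    # less, so the reward flattens out ("novelty stagnation").
    return 1.0 - sum(cosine_sim(emb, h) for h in history) / len(history)

def nearest_neighbor_reward(emb, history, k=1):
    # Reward = 1 - mean similarity to only the k MOST similar past
    # embeddings; duplicates of any earlier prompt are penalized
    # regardless of how large the history has become.
    sims = sorted((cosine_sim(emb, h) for h in history), reverse=True)
    return 1.0 - sum(sims[:k]) / k
```

With a history of `[[1, 0], [0, 1]]`, re-emitting the embedding `[1, 0]` still earns a mean-based reward of 0.5, while the nearest-neighbor reward is 0, correctly flagging it as a repeat.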