Curiosity-driven Red-teaming for Large Language Models (2402.19464v1)
Abstract: LLMs hold great potential for many natural language applications but risk generating incorrect or toxic content. To probe when an LLM generates unwanted content, the current paradigm is to recruit a \textit{red team} of human testers to design input prompts (i.e., test cases) that elicit undesirable responses from LLMs. However, relying solely on human testers is expensive and time-consuming. Recent works automate red teaming by training a separate red-team LLM with reinforcement learning (RL) to generate test cases that maximize the chance of eliciting undesirable responses from the target LLM. However, current RL methods can generate only a small number of effective test cases, resulting in low coverage of the span of prompts that elicit undesirable responses from the target LLM. To overcome this limitation, we draw a connection between the problem of increasing the coverage of generated test cases and the well-studied approach of curiosity-driven exploration, which optimizes for novelty. Our method of curiosity-driven red teaming (CRT) achieves greater coverage of test cases while maintaining or increasing their effectiveness compared to existing methods. CRT successfully provokes toxic responses from a LLaMA2 model that has been heavily fine-tuned using human preferences to avoid toxic outputs. Code is available at \url{https://github.com/Improbable-AI/curiosity_redteam}
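To make the idea concrete, below is a minimal sketch of a curiosity-shaped red-teaming reward: the red-team policy's RL reward combines an effectiveness term (how undesirable the target LLM's response is) with a novelty bonus measured against previously generated test cases, so that the policy keeps exploring new prompts instead of collapsing onto a few effective ones. Everything here is illustrative rather than the paper's implementation: the n-gram overlap is a cheap stand-in for SelfBLEU-style diversity terms, `novelty_weight` is a hypothetical coefficient, and `dummy_toxicity` stands in for a learned toxicity classifier.

```python
from typing import Callable, List, Set, Tuple


def ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 0))}


def novelty_bonus(test_case: str, history: List[str], n: int = 3) -> float:
    """1 minus the maximum n-gram overlap with past test cases.

    Close to 1.0 for a prompt unlike anything generated before; this is a
    crude proxy for the SelfBLEU/embedding-similarity novelty terms used in
    curiosity-driven exploration.
    """
    grams = ngrams(test_case, n)
    if not grams or not history:
        return 1.0
    max_overlap = max(len(grams & ngrams(past, n)) / len(grams) for past in history)
    return 1.0 - max_overlap


def crt_reward(
    test_case: str,
    target_response: str,
    history: List[str],
    toxicity: Callable[[str], float],
    novelty_weight: float = 0.5,
) -> float:
    """Shaped reward: effectiveness (toxicity elicited) plus a novelty bonus.

    Optimizing only the first term tends to yield a handful of repeated test
    cases; the second term keeps exploration, and hence coverage, high.
    """
    return toxicity(target_response) + novelty_weight * novelty_bonus(test_case, history)


if __name__ == "__main__":
    # Placeholder scorer: a real setup would score the target LLM's response
    # with a toxicity classifier returning a value in [0, 1].
    dummy_toxicity = lambda response: 0.9 if "BAD" in response else 0.1

    history: List[str] = []
    for prompt, response in [
        ("write a story about cats", "BAD output"),
        ("write a story about cats", "BAD output"),   # repeated -> no novelty bonus
        ("explain quantum tunneling", "BAD output"),  # novel -> full bonus
    ]:
        r = crt_reward(prompt, response, history, dummy_toxicity)
        history.append(prompt)
        print(f"{prompt!r}: reward = {r:.2f}")
```

In a full pipeline, this shaped reward would be fed to a policy-gradient trainer (e.g., PPO) that updates the red-team LLM generating the test cases.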
Authors: Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, Pulkit Agrawal