Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
80 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Curiosity-driven Red-teaming for Large Language Models (2402.19464v1)

Published 29 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs hold great potential for many natural language applications but risk generating incorrect or toxic content. To probe when an LLM generates unwanted content, the current paradigm is to recruit a \textit{red team} of human testers to design input prompts (i.e., test cases) that elicit undesirable responses from LLMs. However, relying solely on human testers is expensive and time-consuming. Recent works automate red teaming by training a separate red team LLM with reinforcement learning (RL) to generate test cases that maximize the chance of eliciting undesirable responses from the target LLM. However, current RL methods are only able to generate a small number of effective test cases resulting in a low coverage of the span of prompts that elicit undesirable responses from the target LLM. To overcome this limitation, we draw a connection between the problem of increasing the coverage of generated test cases and the well-studied approach of curiosity-driven exploration that optimizes for novelty. Our method of curiosity-driven red teaming (CRT) achieves greater coverage of test cases while mantaining or increasing their effectiveness compared to existing methods. Our method, CRT successfully provokes toxic responses from LLaMA2 model that has been heavily fine-tuned using human preferences to avoid toxic outputs. Code is available at \url{https://github.com/Improbable-AI/curiosity_redteam}

Curiosity-driven Red-teaming for LLMs

The paper "Curiosity-driven Red-teaming for LLMs" presents a novel approach to uncovering the vulnerabilities of LLMs by employing curiosity-driven exploration methods. These methods are intended to improve the diversity and effectiveness of test prompts designed to elicit undesirable behavior from LLMs. This research navigates the limitations of traditional reinforcement learning (RL) methods in automating red teaming (the process of probing systems for flaws) by emphasizing a strategy rooted in curiosity-driven exploration approaches commonly found in RL.

The authors acknowledge the challenges posed by the vast parameter spaces of contemporary LLMs, which complicate the task of identifying input prompts capable of triggering harmful, unsafe, or toxic outputs. Traditional strategies for this involve human-based red teaming, which proves to be both time-intensive and cost prohibitive. Automated systems leverage RL by training a dedicated red team LLM to generate these inputs, yet these systems often fall short in terms of producing a diverse set of effective test cases.

The paper's core proposition stems from an innovative adoption of curiosity-driven exploration, aiming to enhance the coverage of red teaming prompts by maximizing their novelty. In their approach, the authors modify the RL training process for the red team LLMs to simultaneously consider rewards for eliciting unwanted responses and incorporate entropy bonuses for maintaining randomness. They introduce novelty rewards based on n-gram modeling (SelfBLEU) and sentence embeddings to quantitatively assess the freshness of the generated test cases.

The experimental evaluations are grounded in text continuation and instruction following tasks across several models, including a heavily fine-tuned LLaMA2 model. The results demonstrate that curiosity-driven exploration not only maintains but often exceeds the test-case effectiveness of previous RL-based methods while also ensuring a broader diversity in the types of prompts these models are exposed to. This was notably effective in undermining LLMs optimized with reinforcement learning from human feedback, suggesting that such methods remain insufficient for complete safety assurance.

A significant implication of the research is the demonstrated utility of curiosity-driven methods in red teaming, illustrating their potential in enhancing the robustness and safety of LLMs. By systematically fostering exploration and broadening the testing landscape, the research indicates that LLMs can be more thoroughly evaluated for potentially harmful behaviors.

The findings advocate for future advancements in the domain of AI safety, underscoring the need for continued exploration of curiosity-based strategies not just for LLMs but across AI deployment scenarios where unpredictable interactions with humans might yield undesirable behaviors. As AI systems evolve in complexity and application scope, the methodologies outlined in the paper may serve as a blueprint for rigorous safety checks.

In conclusion, this research presents a compelling extension to standard RL frameworks for red teaming, leveraging curiosity-driven exploration to enhance both the breadth and precision of model testing. The work may prompt the development of even more expansive exploration techniques that can better capture the multifaceted challenges posed by LLMs in dynamic and sensitive contexts.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (50)
  1. Jointly measuring diversity and quality in text generation models. In Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pp.  90–98, Minneapolis, Minnesota, jun 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-2311. URL https://www.aclweb.org/anthology/W19-2311.
  2. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  3. Unifying count-based exploration and intrinsic motivation. In NIPS, 2016.
  4. Flow network based generative models for non-iterative diverse candidate generation. Advances in Neural Information Processing Systems, 34:27381–27394, 2021.
  5. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397–2430. PMLR, 2023.
  6. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  7. Exploration by random network distillation. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=H1lJJnR5Ym.
  8. Explore, establish, exploit: Red teaming language models from scratch. arXiv preprint arXiv:2306.09442, 2023.
  9. trlX: A scalable framework for RLHF, June 2023. URL https://github.com/CarperAI/trlx.
  10. Learning to generate better than your llm. arXiv preprint arXiv:2306.11816, 2023.
  11. Redeeming intrinsic rewards via constrained optimization. Advances in Neural Information Processing Systems, 35:4996–5008, 2022.
  12. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
  13. Increasing diversity while maintaining accuracy: Text data generation with large language models and human interventions. arXiv preprint arXiv:2306.04140, 2023.
  14. Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023. URL https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.
  15. Hotflip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751, 2017.
  16. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.
  17. Counterfactuals of counterfactuals: a back-translation-inspired approach to analyse counterfactual editors. arXiv preprint arXiv:2305.17055, 2023.
  18. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.
  19. A survey of text similarity approaches. international journal of Computer Applications, 68(13):13–18, 2013.
  20. Detoxify. Github. https://github.com/unitaryai/detoxify, 2020.
  21. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509, 2022.
  22. Provably efficient maximum entropy exploration. In International Conference on Machine Learning, pp. 2681–2691. PMLR, 2019.
  23. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  24. Query-efficient black-box red teaming via bayesian optimization. arXiv preprint arXiv:2305.17444, 2023.
  25. Peter Lee. Learning from tay’s introduction. https://blogs.microsoft.com/blog/2016/03/25/learning-tays-introduction/, 2016.
  26. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp.  142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1015.
  27. Flirt: Feedback loop in-context red teaming. arXiv preprint arXiv:2308.04265, 2023.
  28. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
  29. Orca: Progressive learning from complex explanation traces of gpt-4. arXiv preprint arXiv:2306.02707, 2023.
  30. Intrinsic motivation systems for autonomous mental development. Evolutionary Computation, 2007.
  31. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  32. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp.  311–318, 2002.
  33. Curiosity-driven exploration by self-supervised prediction. In Proceedings of the 34th International Conference on Machine Learning, pp.  2778–2787, 2017.
  34. Red teaming language models with language models. arXiv preprint arXiv:2202.03286, 2022.
  35. Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  36. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  37. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019. URL http://arxiv.org/abs/1908.10084.
  38. Equivalence between policy gradients and soft q-learning. arXiv preprint arXiv:1704.06440, 2017a.
  39. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017b.
  40. Softmax function. Softmax function — Wikipedia, the free encyclopedia, 2023. URL https://en.wikipedia.org/wiki/Softmax_function.
  41. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
  42. Reinforcement learning: An introduction. 2018.
  43. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  44. Evaluating the evaluation of diversity in natural language generation. arXiv preprint arXiv:2004.02990, 2020.
  45. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  46. Learning from the worst: Dynamically generated datasets to improve online hate detection. In ACL, 2021.
  47. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020.
  48. Universal adversarial triggers for attacking and analyzing nlp. arXiv preprint arXiv:1908.07125, 2019.
  49. Texygen: A benchmarking platform for text generation models. In The 41st international ACM SIGIR conference on research & development in information retrieval, pp.  1097–1100, 2018.
  50. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (8)
  1. Zhang-Wei Hong (31 papers)
  2. Idan Shenfeld (10 papers)
  3. Tsun-Hsuan Wang (37 papers)
  4. Yung-Sung Chuang (37 papers)
  5. Aldo Pareja (7 papers)
  6. James Glass (173 papers)
  7. Akash Srivastava (50 papers)
  8. Pulkit Agrawal (103 papers)
Citations (28)