
Gradient-Based Language Model Red Teaming (2401.16656v1)

Published 30 Jan 2024 in cs.CL

Abstract: Red teaming is a common strategy for identifying weaknesses in generative language models (LMs), where adversarial prompts are produced that trigger an LM to generate unsafe responses. Red teaming is instrumental for both model alignment and evaluation, but is labor-intensive and difficult to scale when done by humans. In this paper, we present Gradient-Based Red Teaming (GBRT), a red teaming method for automatically generating diverse prompts that are likely to cause an LM to output unsafe responses. GBRT is a form of prompt learning, trained by scoring an LM response with a safety classifier and then backpropagating through the frozen safety classifier and LM to update the prompt. To improve the coherence of input prompts, we introduce two variants that add a realism loss and fine-tune a pretrained model to generate the prompts instead of learning the prompts directly. Our experiments show that GBRT is more effective at finding prompts that trigger an LM to generate unsafe responses than a strong reinforcement learning-based red teaming approach, and succeeds even when the LM has been fine-tuned to produce safer outputs.

Introduction

Recent advances in generative language models (LMs) have demonstrated remarkable capabilities in generating coherent text across a range of tasks. However, the vast output space of these models can lead to the generation of harmful content, which remains a significant obstacle to deploying them in real-world applications. To mitigate this risk, red teaming, an approach in which designated individuals think like an adversary to challenge a system, has been used to probe models for vulnerabilities by finding inputs that trigger unwanted outputs. Red teaming LMs by hand, however, is time-consuming and scales poorly.

Automated Red Teaming

In light of these challenges, the authors develop an automated approach called Gradient-Based Red Teaming (GBRT). GBRT uses gradient-based optimization to craft prompts that induce unsafe responses from an LM. Prompts are updated iteratively by backpropagating a loss derived from the response's safety score, a measure obtained from a frozen safety classifier that distinguishes safe from unsafe responses. Because prompts and responses consist of discrete tokens, GBRT relaxes them with the Gumbel-softmax trick so that gradients can flow from the classifier, through the LM, and back into the prompt. The key innovation is the direct optimization of prompts using gradient information, rather than optimizing against the safety score alone, as prior work based on reinforcement learning does.
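The loop below is a minimal, self-contained sketch of this idea. Tiny randomly initialized networks stand in for the paper's frozen pretrained LM and safety classifier, and the shapes, toy recurrence, loop counts, and all names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for the frozen models. In the paper these are large
# pretrained networks; tiny random ones keep the sketch runnable.
vocab_size, embed_dim, prompt_len, resp_len = 100, 32, 8, 8
embedding = torch.nn.Embedding(vocab_size, embed_dim)   # shared token embeddings
lm_head = torch.nn.Linear(embed_dim, vocab_size)        # stands in for the frozen LM
safety_head = torch.nn.Linear(embed_dim, 1)             # stands in for the frozen classifier
for module in (embedding, lm_head, safety_head):
    for p in module.parameters():
        p.requires_grad_(False)

# The only trainable parameters: one logit vector per prompt position.
prompt_logits = torch.randn(prompt_len, vocab_size, requires_grad=True)
optimizer = torch.optim.Adam([prompt_logits], lr=0.1)

for step in range(200):
    # Gumbel-softmax yields a differentiable soft one-hot token per position.
    soft_prompt = F.gumbel_softmax(prompt_logits, tau=1.0)        # (P, V)
    prompt_emb = soft_prompt @ embedding.weight                   # (P, D)

    # Decode a response while staying differentiable by feeding soft token
    # distributions back in (a crude analogue of the paper's soft decoding).
    context = prompt_emb.mean(dim=0)                              # toy context summary
    resp_emb = []
    for _ in range(resp_len):
        token_dist = F.gumbel_softmax(lm_head(context), tau=1.0)  # soft next token
        emb = token_dist @ embedding.weight
        resp_emb.append(emb)
        context = context + emb                                   # toy recurrence
    resp_emb = torch.stack(resp_emb)                              # (R, D)

    # Score prompt + response with the frozen classifier and push toward
    # "unsafe"; gradients flow through classifier and LM into prompt_logits.
    p_safe = torch.sigmoid(safety_head(torch.cat([prompt_emb, resp_emb]).mean(dim=0)))
    loss = p_safe.sum()                                           # minimize P(safe)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Read off the hard adversarial prompt after optimization.
adversarial_prompt = prompt_logits.argmax(dim=-1)
```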

Enhanced Coherence Using Auxiliary Losses

Two variants improve the realism and coherence of the generated prompts. The first adds a realism loss, which penalizes prompt token distributions that a pretrained LM finds unlikely, keeping the prompts closer to natural language patterns (sketched below). The second fine-tunes a separate LM dedicated to generating the red teaming prompts rather than learning the prompt logits directly; this variant can also be combined with the realism loss and biases the prompt model toward more plausible inputs. Numerically, the GBRT-RealismLoss method outperforms the other approaches, generating a significantly higher fraction of unique prompts that lead to unsafe responses. This underscores the impact that targeted auxiliary losses can have on the behavior of automated red teaming systems like GBRT.
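One way to read the realism loss is as a soft cross-entropy between the learned prompt distributions and a frozen pretrained LM's next-token predictions. The sketch below, which reuses the toy setup above, follows that reading; the context update and the 0.5 weight are assumptions for illustration.

```python
def realism_loss(soft_prompt, embedding, lm_head):
    """Penalize prompt positions that the frozen LM finds unlikely.

    soft_prompt: (P, V) differentiable token distributions from Gumbel-softmax.
    """
    total = 0.0
    context = torch.zeros(embedding.embedding_dim)
    for t in range(soft_prompt.shape[0]):
        next_logp = F.log_softmax(lm_head(context), dim=-1)     # LM's next-token prediction
        total = total - (soft_prompt[t] * next_logp).sum()      # soft cross-entropy
        context = context + soft_prompt[t] @ embedding.weight   # toy context update
    return total / soft_prompt.shape[0]

# Combined objective: safety loss plus a weighted realism term.
# loss = p_safe.sum() + 0.5 * realism_loss(soft_prompt, embedding, lm_head)
```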

Empirical Evidence and Human Evaluation

Empirical evaluation confirms the efficacy of GBRT and its variants against a reinforcement learning-based red teaming baseline and a set of human-crafted adversarial prompts from an existing dataset. The GBRT variants, notably GBRT-RealismLoss, produce a higher rate of successful prompts, as judged by independent safety classifiers. Human raters agree that the GBRT-RealismLoss prompts score well on coherence, though they are also rated as more toxic than the baselines'. Applying GBRT to a model fine-tuned to be safer yields a lower, yet non-negligible, rate of successful red teaming prompts, showing that the approach remains effective across varying model sensitivities.
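The headline metric, the fraction of generated prompts that are both unique and trigger a response an independent classifier flags as unsafe, can be computed roughly as follows. The generate_fn and classify_fn interfaces and the 0.5 threshold are assumptions, not the paper's API.

```python
def unique_success_rate(prompts, generate_fn, classify_fn, threshold=0.5):
    """Fraction of generated prompts that are unique AND elicit an unsafe response.

    classify_fn(prompt, response) is assumed to return P(unsafe) from an
    independent safety classifier; generate_fn(prompt) samples the target LM.
    """
    successes = {
        p for p in set(prompts)
        if classify_fn(p, generate_fn(p)) > threshold
    }
    return len(successes) / max(len(prompts), 1)
```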

Conclusion

The advancements presented in GBRT demonstrate a systematic approach to uncovering vulnerabilities in LLMs while underlining the nuances of automated red teaming. GBRT not only scales red teaming efforts but also introduces methods for producing coherent red teaming prompts. These contributions hold substantial promise for improving LM safety mechanisms, although the potential for misuse by bad actors should be acknowledged and guarded against. With continued research and development in generative AI, tools like GBRT are poised to play a critical role in fortifying the next generation of LLMs against the inadvertent generation of unsafe content.

Authors (3)
  1. Nevan Wichers
  2. Carson Denison
  3. Ahmad Beirami