
Explore, Establish, Exploit: Red Teaming Language Models from Scratch (2306.09442v3)

Published 15 Jun 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Deploying large language models (LMs) can pose hazards from harmful outputs such as toxic or false text. Prior work has introduced automated tools that elicit harmful outputs to identify these risks. While this is a valuable step toward securing models, these approaches rely on a pre-existing way to efficiently classify undesirable outputs. Using a pre-existing classifier does not allow for red-teaming to be tailored to the target model. Furthermore, when failures can be easily classified in advance, red-teaming has limited marginal value because problems can be avoided by simply filtering training data and/or model outputs. Here, we consider red-teaming "from scratch," in which the adversary does not begin with a way to classify failures. Our framework consists of three steps: 1) Exploring the model's range of behaviors in the desired context; 2) Establishing a definition and measurement for undesired behavior (e.g., a classifier trained to reflect human evaluations); and 3) Exploiting the model's flaws using this measure to develop diverse adversarial prompts. We use this approach to red-team GPT-3 to discover classes of inputs that elicit false statements. In doing so, we construct the CommonClaim dataset of 20,000 statements labeled by humans as common-knowledge-true, common-knowledge-false, or neither. We are making code and data available.
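Below is a minimal, illustrative sketch of the three-step Explore/Establish/Exploit loop described in the abstract. It is not the authors' released code: every name here (explore, establish, exploit, label_fn, the keyword-based stand-in classifier) is a hypothetical placeholder, and the prompt search uses plain best-of-N scoring where the paper develops a learned classifier over human labels and a trained adversarial prompt generator.

```python
# Minimal sketch of the Explore / Establish / Exploit red-teaming loop from the
# abstract. All names are hypothetical placeholders, not the paper's code.
import random
from typing import Callable, Iterable, List, Tuple


def explore(target_model: Callable[[str], str], prompts: List[str], n: int) -> List[str]:
    """Step 1 (Explore): sample the target model's outputs in the intended context."""
    picked = random.sample(prompts, min(n, len(prompts)))
    return [target_model(p) for p in picked]


def establish(outputs: List[str], label_fn: Callable[[str], int]) -> Callable[[str], float]:
    """Step 2 (Establish): turn human labels on sampled outputs into a failure measure.

    The paper trains a classifier on human labels (e.g. CommonClaim); a crude
    keyword-overlap score stands in for that classifier here.
    """
    bad_tokens = {
        tok
        for text in outputs
        if label_fn(text) == 1  # 1 = labeled as the undesired behavior
        for tok in text.lower().split()
    }

    def score(text: str) -> float:
        toks = text.lower().split()
        return sum(t in bad_tokens for t in toks) / max(len(toks), 1)

    return score


def exploit(
    target_model: Callable[[str], str],
    score: Callable[[str], float],
    candidate_prompts: Iterable[str],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Step 3 (Exploit): find prompts whose outputs maximize the failure score.

    The paper optimizes a prompt generator for diverse high-scoring prompts;
    simple best-of-N ranking is used here to keep the sketch short.
    """
    scored = [(p, score(target_model(p))) for p in candidate_prompts]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    # Toy usage: a "model" that echoes its prompt, and a label_fn flagging "green".
    model = lambda p: p
    outputs = explore(model, ["the sky is green", "water is wet"], n=2)
    measure = establish(outputs, label_fn=lambda t: int("green" in t))
    print(exploit(model, measure, ["the sky is green today", "water is wet"]))
```

In the paper itself, the Establish step replaces the keyword score with a classifier trained to reflect human judgments, and the Exploit step uses that classifier as a reward signal when searching for adversarial prompts.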

Authors (5)
  1. Stephen Casper (40 papers)
  2. Jason Lin (8 papers)
  3. Joe Kwon (5 papers)
  4. Gatlen Culp (1 paper)
  5. Dylan Hadfield-Menell (54 papers)
Citations (73)