Evolving Diverse Red-team Language Models in Multi-round Multi-agent Games (2310.00322v5)
Abstract: The primary challenge in deploying LLMs is ensuring their harmlessness. Red teams can identify vulnerabilities by attacking LLMs and thereby improve their safety. However, current efforts rely heavily on single-round prompt designs and unilateral red-team optimization against fixed blue teams. These static approaches significantly reduce generation diversity, a phenomenon known as mode collapse, which makes it difficult to discover potential risks in increasingly complex human-LLM interactions. Here we introduce the dynamic Red Team Game (RTG) to comprehensively analyze multi-round offensive and defensive interactions between the red team and the blue team. Furthermore, we develop a Gamified Red Team Solver (GRTS) with diversity measures that mitigates mode collapse and theoretically guarantees convergence to an approximate Nash equilibrium, yielding better strategies for both teams. Empirical results demonstrate that GRTS explores diverse and implicit attacks that adaptively exploit various LLMs, surpassing the constraints of specific attack modes. Notably, the geometric structure we unveil for the red-team task aligns with the spinning-top hypothesis, confirming the necessity of constructing a diverse LLM population as a promising proxy for heterogeneous human expert red-teamers. This paves the way for scalable toxicity detection and safety alignment of LLMs.
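The abstract describes GRTS as a population-based game solver in the PSRO/double-oracle family: red and blue populations are grown iteratively, a restricted meta-game over the current populations is solved for an approximate Nash equilibrium, and a diversity measure on the red side counters mode collapse. The paper itself trains LLM policies; the sketch below is only a minimal, self-contained toy illustrating that control flow, with 2-D vectors standing in for red/blue policies, a made-up payoff function standing in for attack success rate, and all names (`payoff`, `nash_fictitious_play`, `best_response`, `div_weight`) invented for illustration rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def payoff(r, b):
    # Toy zero-sum payoff standing in for "attack success rate":
    # red strategy r attacks blue strategy b (both 2-D vectors).
    return float(np.sin(r @ b) - 0.1 * np.linalg.norm(r - b))

def nash_fictitious_play(P, iters=2000):
    # Approximate Nash equilibrium of the restricted meta-game P
    # (rows: red population, cols: blue population) via fictitious play.
    m, n = P.shape
    red, blue = np.ones(m) / m, np.ones(n) / n
    for t in range(1, iters + 1):
        br_r = np.eye(m)[np.argmax(P @ blue)]   # red best response
        br_b = np.eye(n)[np.argmin(red @ P)]    # blue best response
        red += (br_r - red) / (t + 1)           # running averages of
        blue += (br_b - blue) / (t + 1)         # empirical play
    return red, blue

def best_response(pop_opp, meta_opp, maximize, own_pop, div_weight):
    # Random-search best response against the opponent's Nash mixture,
    # plus a diversity bonus rewarding distance from one's own
    # population (the anti-mode-collapse ingredient).
    best, best_val = None, -np.inf
    for _ in range(256):
        cand = rng.normal(size=2)
        val = sum(w * payoff(cand, o) if maximize else -w * payoff(o, cand)
                  for w, o in zip(meta_opp, pop_opp))
        val += div_weight * min(np.linalg.norm(cand - p) for p in own_pop)
        if val > best_val:
            best, best_val = cand, val
    return best

red_pop, blue_pop = [rng.normal(size=2)], [rng.normal(size=2)]
for _ in range(10):  # outer PSRO-style iterations
    P = np.array([[payoff(r, b) for b in blue_pop] for r in red_pop])
    red_meta, blue_meta = nash_fictitious_play(P)
    red_pop.append(best_response(blue_pop, blue_meta, True, red_pop, 0.3))
    blue_pop.append(best_response(red_pop, red_meta, False, blue_pop, 0.0))
```

In the actual method, the best-response step would presumably correspond to fine-tuning a red-team (or blue-team) LLM against the opponent mixture, and the diversity bonus to a semantic diversity measure over generated attacks; the toy only mirrors the solver's structure, not the paper's training details.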
Authors: Chengdong Ma, Ziran Yang, Minquan Gao, Hai Ci, Jun Gao, Xuehai Pan, Yaodong Yang