Evolving Diverse Red-team Language Models in Multi-round Multi-agent Games (2310.00322v5)
Abstract: The primary challenge in deploying LLMs is ensuring their harmlessness. Red teams can identify vulnerabilities by attacking LLMs and thereby improve their safety. However, current efforts rely heavily on single-round prompt designs and unilateral red-team optimization against fixed blue teams. These static approaches significantly reduce generation diversity, a phenomenon known as mode collapse, which makes it difficult to discover potential risks in increasingly complex human-LLM interactions. Here we introduce the dynamic Red Team Game (RTG) to comprehensively analyze multi-round offensive and defensive interactions between the red team and the blue team. Furthermore, we develop a Gamified Red Team Solver (GRTS) with diversity measures that mitigates mode collapse and theoretically guarantees convergence to an approximate Nash equilibrium, yielding better strategies for both teams. Empirical results demonstrate that GRTS explores diverse and implicit attacks that adaptively exploit various LLMs, surpassing the constraints of specific attack modes. Notably, the geometric structure we unveil for the red-team task aligns with the spinning-top hypothesis, confirming the necessity of constructing a diverse LLM population as a promising proxy for heterogeneous human expert red-teamers. This paves the way for scalable toxicity detection and safety alignment of LLMs.
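The abstract describes GRTS as a population-based game solver in the PSRO/double-oracle family: red and blue populations are grown iteratively, a restricted meta-game over the current populations is solved for an approximate Nash equilibrium, and a diversity measure on the red side counters mode collapse. The paper itself trains LLM policies; the sketch below is only a minimal, self-contained toy illustrating that control flow, with 2-D vectors standing in for red/blue policies, a made-up payoff function standing in for attack success rate, and all names (`payoff`, `nash_fictitious_play`, `best_response`, `div_weight`) invented for illustration rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def payoff(r, b):
    # Toy zero-sum payoff standing in for "attack success rate":
    # red strategy r attacks blue strategy b (both 2-D vectors).
    return float(np.sin(r @ b) - 0.1 * np.linalg.norm(r - b))

def nash_fictitious_play(P, iters=2000):
    # Approximate Nash equilibrium of the restricted meta-game P
    # (rows: red population, cols: blue population) via fictitious play.
    m, n = P.shape
    red, blue = np.ones(m) / m, np.ones(n) / n
    for t in range(1, iters + 1):
        br_r = np.eye(m)[np.argmax(P @ blue)]   # red best response
        br_b = np.eye(n)[np.argmin(red @ P)]    # blue best response
        red += (br_r - red) / (t + 1)           # running averages of
        blue += (br_b - blue) / (t + 1)         # empirical play
    return red, blue

def best_response(pop_opp, meta_opp, maximize, own_pop, div_weight):
    # Random-search best response against the opponent's Nash mixture,
    # plus a diversity bonus rewarding distance from one's own
    # population (the anti-mode-collapse ingredient).
    best, best_val = None, -np.inf
    for _ in range(256):
        cand = rng.normal(size=2)
        val = sum(w * payoff(cand, o) if maximize else -w * payoff(o, cand)
                  for w, o in zip(meta_opp, pop_opp))
        val += div_weight * min(np.linalg.norm(cand - p) for p in own_pop)
        if val > best_val:
            best, best_val = cand, val
    return best

red_pop, blue_pop = [rng.normal(size=2)], [rng.normal(size=2)]
for _ in range(10):  # outer PSRO-style iterations
    P = np.array([[payoff(r, b) for b in blue_pop] for r in red_pop])
    red_meta, blue_meta = nash_fictitious_play(P)
    red_pop.append(best_response(blue_pop, blue_meta, True, red_pop, 0.3))
    blue_pop.append(best_response(red_pop, red_meta, False, blue_pop, 0.0))
```

In the actual method, the best-response step would presumably correspond to fine-tuning a red-team (or blue-team) LLM against the opponent mixture, and the diversity bonus to a semantic diversity measure over generated attacks; the toy only mirrors the solver's structure, not the paper's training details.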
Authors: Chengdong Ma, Ziran Yang, Minquan Gao, Hai Ci, Jun Gao, Xuehai Pan, Yaodong Yang