Self-playing Adversarial Language Game Enhances LLM Reasoning (2404.10642v3)

Published 16 Apr 2024 in cs.CL and cs.LG

Abstract: We explore the potential of self-play training for LLMs in a two-player adversarial language game called Adversarial Taboo. In this game, an attacker and a defender communicate around a target word only visible to the attacker. The attacker aims to induce the defender to speak the target word unconsciously, while the defender tries to infer the target word from the attacker's utterances. To win the game, both players must have sufficient knowledge about the target word and high-level reasoning ability to infer and express in this information-reserved conversation. Hence, we are curious about whether LLMs' reasoning ability can be further enhanced by Self-Playing this Adversarial language Game (SPAG). With this goal, we select several open-source LLMs and let each act as the attacker and play with a copy of itself as the defender on an extensive range of target words. Through reinforcement learning on the game outcomes, we observe that the LLMs' performances uniformly improve on a broad range of reasoning benchmarks. Furthermore, iteratively adopting this self-play process can continuously promote LLMs' reasoning abilities. The code is available at https://github.com/Linear95/SPAG.

Summary

  • The paper introduces SPAG, a self-play training scheme built on the adversarial language game Adversarial Taboo, which enhances LLM reasoning without relying on additional human-annotated reasoning data.
  • It combines imitation learning with reinforcement learning from self-play episodes to iteratively improve performance across multiple reasoning benchmarks.
  • Experiments with models like LLaMA-2-7B and Baichuan-2-13B demonstrate consistent gains in reasoning and strategic gameplay against GPT-4.

Enhancing Reasoning in LLMs through Adversarial Language Game Self-Play

Introduction to SPAG

Recent LLMs such as GPT-4 and LLaMA have driven remarkable progress in natural language understanding and generation. Despite this success, improving LLMs' reasoning abilities remains a significant challenge. This paper introduces Self-Play of an Adversarial language Game (SPAG), an approach that aims to strengthen LLM reasoning without additional human-annotated reasoning data by having a model play the adversarial language game Adversarial Taboo against a copy of itself and learn from the outcomes.

Adversarial Taboo and Self-Play

In Adversarial Taboo, an "attacker" and a "defender" converse around a target word that only the attacker can see. The attacker aims to induce the defender to utter the target word without realizing it, while the defender tries to infer the word from the attacker's utterances and wins by guessing it correctly before saying it inadvertently. Playing either role well demands solid knowledge of the target word and strategic reasoning to steer, or decode, a deliberately information-scarce conversation. SPAG exploits this structure through self-play: a single LLM plays both roles against a copy of itself and learns from the resulting game outcomes through reinforcement learning.
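
To make the game dynamics concrete, here is a minimal Python sketch of one self-play episode under the rules summarized above, with one common rule convention assumed: an explicit correct guess wins for the defender, a wrong guess or an inadvertent mention of the target word wins for the attacker, and the episode is a tie after a fixed number of turns. The helpers `generate_utterance` and `extract_guess` are hypothetical stand-ins for prompting the same LLM in each role; they are not functions from the released SPAG code.

```python
# Illustrative sketch of a single Adversarial Taboo self-play episode.
# `generate_utterance` and `extract_guess` are hypothetical helpers that
# wrap LLM calls for the two roles; the rule details are assumptions.

def play_episode(llm, target_word, max_turns=5):
    history = []
    for _ in range(max_turns):
        # The attacker sees the target word; the defender does not.
        attack = generate_utterance(llm, role="attacker",
                                    target_word=target_word, history=history)
        history.append(("attacker", attack))

        defense = generate_utterance(llm, role="defender", history=history)
        history.append(("defender", defense))

        # An explicit guess ends the game: correct -> defender wins,
        # wrong -> attacker wins.
        guess = extract_guess(defense)
        if guess is not None:
            winner = "defender" if guess.lower() == target_word.lower() else "attacker"
            return history, winner

        # Saying the target word without an explicit guess is an attacker win.
        if target_word.lower() in defense.lower():
            return history, "attacker"

    # Nobody won within the turn limit: the episode is a tie.
    return history, "tie"
```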

Reinforcement Learning (RL) Approach

SPAG proceeds in two stages. First, the LLM is taught to follow the game's rules through imitation learning on game episodes collected from a stronger model such as GPT-4. Second, the model plays many self-play episodes against a copy of itself, and the game outcomes drive a reinforcement learning update that iteratively refines its play and, in turn, its reasoning ability. Notably, the paper adopts an offline reinforcement learning scheme to avoid the inefficiency of fully online policy optimization over long, naturally generated dialogues.
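
As a rough illustration of the second stage, the snippet below sketches one offline, advantage-weighted policy-gradient update over a batch of recorded self-play utterances, assuming a HuggingFace-style causal LM and a scalar per-utterance advantage derived from the game outcome (e.g., positive for the winner's turns). It is a simplification for intuition only: the actual SPAG objective includes importance weighting and regularization terms that are omitted here, and the episode field names are illustrative.

```python
import torch
import torch.nn.functional as F

def offline_spag_step(model, tokenizer, episodes, optimizer, device="cpu"):
    """One offline, advantage-weighted policy-gradient step.

    Each element of `episodes` is assumed to be a dict with:
      - "prompt":    the game context shown to the acting player
      - "response":  the utterance that player generated
      - "advantage": a scalar derived from the game outcome
    """
    model.train()
    total_loss = 0.0
    for ep in episodes:
        prompt_ids = tokenizer(ep["prompt"], return_tensors="pt").input_ids.to(device)
        # Avoid inserting special tokens in the middle of the sequence.
        response_ids = tokenizer(ep["response"], add_special_tokens=False,
                                 return_tensors="pt").input_ids.to(device)
        input_ids = torch.cat([prompt_ids, response_ids], dim=1)

        logits = model(input_ids).logits
        # Logits at position t predict token t+1, so the slice below covers
        # exactly the response tokens.
        resp_logits = logits[:, prompt_ids.size(1) - 1:-1, :]
        log_probs = F.log_softmax(resp_logits, dim=-1)
        token_logps = log_probs.gather(-1, response_ids.unsqueeze(-1)).squeeze(-1)

        # Advantage-weighted REINFORCE-style objective (negated for minimization).
        total_loss = total_loss - ep["advantage"] * token_logps.sum()

    loss = total_loss / len(episodes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```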

Empirical Validation

Two open-source pre-trained models, LLaMA-2-7B and Baichuan-2-13B, were used to test SPAG. Evaluation covered a range of reasoning benchmarks, including BIG-Bench Hard, ARC (Easy and Challenge), MuTual, WinoGrande, LogiQA 2.0, and PIQA, along with the broad-coverage knowledge benchmark MMLU. Reasoning performance improved across these benchmarks as the models underwent successive self-play epochs. In addition, when the SPAG-trained models were pitted against GPT-4 in Adversarial Taboo, their win rates rose consistently, indicating stronger gameplay strategy and, by extension, stronger reasoning.
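
The head-to-head comparison against GPT-4 amounts to playing the same game with the two models in opposing roles and tallying outcomes over a list of target words. The sketch below shows one simple protocol with roles alternating across words; `play_match` is a hypothetical two-player variant of the `play_episode` sketch above, and the paper's exact protocol (which reports attacker and defender performance) may differ.

```python
def head_to_head_win_rate(model_a, model_b, target_words, max_turns=5):
    """Tally model_a's wins, losses, and ties against model_b.

    `play_match(attacker, defender, word, max_turns)` is a hypothetical
    two-player variant of `play_episode` that returns "attacker",
    "defender", or "tie". Roles alternate so each model attacks on
    half of the target words.
    """
    wins = losses = ties = 0
    for i, word in enumerate(target_words):
        a_attacks = (i % 2 == 0)
        attacker, defender = (model_a, model_b) if a_attacks else (model_b, model_a)

        outcome = play_match(attacker, defender, word, max_turns=max_turns)

        if outcome == "tie":
            ties += 1
        elif (outcome == "attacker") == a_attacks:
            wins += 1
        else:
            losses += 1

    n = len(target_words)
    return wins / n, losses / n, ties / n
```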

Discussion and Future Directions

The findings suggest that self-play on strategic adversarial language games offers a viable path to improving LLM reasoning beyond what conventional training provides. The approach, inspired by the success of self-play in game-playing systems such as AlphaGo, underscores the potential of adversarial language games for advancing reasoning without relying on additional human-generated training data.

Given the significant performance gains observed, future work could explore the application of SPAG to a broader range of LLM architectures and reasoning tasks. Additionally, further refinement of the self-play and reinforcement learning methodologies could unlock even greater advancements in LLM reasoning capabilities. This paper represents a compelling step toward realizing more sophisticated, strategic, and reasoning-capable LLMs.

Concluding Remarks

In summary, SPAG emerges as a potent methodology for enhancing the reasoning abilities of LLMs, advocating for the utility of adversarial language games and self-play in AI development. The approach's success across various benchmarks and its direct impact on game-playing strategies emphasize its potential as a fundamental tool in the future development of AI reasoning skills.
