Efficacy of Language Model Self-Play in Non-Zero-Sum Games (2406.18872v2)

Published 27 Jun 2024 in cs.CL

Abstract: Game-playing agents like AlphaGo have achieved superhuman performance through self-play, which is theoretically guaranteed to yield optimal policies in competitive games. However, most language tasks are partially or fully cooperative, so it is an open question whether techniques like self-play can effectively be used to improve LLMs. We empirically investigate this question in a negotiation game setting known as Deal or No Deal (DoND). Crucially, the objective in DoND can be modified to produce a fully cooperative game, a strictly competitive one, or anything in between. We finetune LLMs in self-play over multiple rounds of filtered behavior cloning in DoND for each of these objectives and evaluate them in self-play and in collaboration with humans. We find that LLMs improve substantially in self-play, achieving 14-17x higher scores in task reward after finetuning. Further, the trained models generalize to both cooperation and competition with humans, scoring 2.5-6x higher than base models. We view these results as an early promising sign for LLM self-play in cooperative settings, despite a lack of theoretical guarantees.

Efficacy of LLM Self-Play in Non-Zero-Sum Games

The paper "Efficacy of Language Model Self-Play in Non-Zero-Sum Games" presents an empirical study of self-play as a mechanism for improving LLM performance beyond strictly competitive environments. The investigation centers on the "Deal or No Deal" (DoND) negotiation game, a versatile setting whose objective can be tuned to be fully cooperative, semi-competitive, or strictly competitive.

Key Contributions and Findings

  1. Game-Enriched Environment: The authors adapted the DoND game so that its objective can be varied across cooperative, semi-competitive, and strictly competitive settings. This adaptation is pivotal, as it permits evaluation of LLMs under varying degrees of cooperation and competition (a hedged sketch of such a configurable objective follows this list).
  2. Self-Play Algorithm Implementation: An iterative finetuning strategy based on filtered behavior cloning was used: the model played many games against itself, the highest-scoring dialogues were retained, and the model was finetuned on them. This process was repeated for ten rounds, starting from a pretrained GPT-3.5 model (a minimal sketch of the loop also appears after this list).
  3. Performance Metrics and Evaluation: Contrary to the conventional wisdom that self-play offers little benefit outside zero-sum games, significant improvements were observed:
    • Scores: Self-play produced large gains in the semi-competitive and cooperative settings, with task reward improving by up to 14x (semi-competitive) and 17x (cooperative). Notably, models trained via self-play exceeded the scores of baseline GPT-4 models.
    • Generalization: The improvements were not confined to self-play but extended to games with human partners, where scores improved by up to 6x in the semi-competitive setting and roughly 2.5x in the cooperative setting.
  4. Human Evaluation: Crowdsourced evaluations with human participants were carefully structured, with bonus incentives tied to game score; this encouraged genuine effort and yielded high-quality data for assessing how well the models generalize to human partners.
  5. Divergence in Strictly Competitive Games: The benefits of self-play did not carry over to the strictly competitive (zero-sum) setting. Models tended to overfit during self-play and failed to generalize when paired with other agents such as GPT-4 or human participants.
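
The configurable objective can be made concrete with a small sketch. The exact scoring rule from the paper is not reproduced here; the snippet below assumes a simple blend of own and partner points controlled by a cooperation coefficient lam, with lam = 1 giving the fully cooperative game, lam = 0 the semi-competitive one, and lam = -1 a strictly competitive (zero-sum) one. The item values and allocation are hypothetical, chosen only for illustration.

```python
# Hedged sketch of a DoND-style configurable objective.
# lam = 1.0  -> fully cooperative (maximize joint points)
# lam = 0.0  -> semi-competitive (maximize own points)
# lam = -1.0 -> strictly competitive (zero-sum)
# The paper's exact scoring rule may differ; this is illustrative only.

def points(values, allocation):
    """Dot product of an agent's private item values and the items it receives."""
    return sum(v * n for v, n in zip(values, allocation))

def reward(own_values, own_items, partner_values, partner_items, lam, agreed):
    if not agreed:                      # no valid agreement -> no reward
        return 0.0
    own = points(own_values, own_items)
    other = points(partner_values, partner_items)
    return own + lam * other

# Hypothetical example: 3 item types, values and split made up for illustration.
own_values, partner_values = [1, 4, 0], [2, 1, 3]
own_items, partner_items = [2, 1, 0], [0, 1, 2]
for lam in (1.0, 0.0, -1.0):
    print(lam, reward(own_values, own_items, partner_values, partner_items, lam, True))
```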
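
The self-play procedure itself can be sketched as a filtered behavior cloning loop. The helpers play_selfplay_game, score_dialogue, and finetune below are hypothetical stand-ins for dialogue rollout, the chosen objective, and an LLM finetuning call (the paper finetunes GPT-3.5 via an API); the number of games per round and the keep fraction are illustrative, not the paper's hyperparameters.

```python
import random

# --- Hypothetical stand-ins (not the paper's actual components) ---------------
def play_selfplay_game(model):
    """Placeholder: would prompt `model` for both sides of a DoND dialogue."""
    return {"transcript": "...", "score": random.random()}

def score_dialogue(dialogue):
    """Placeholder: would apply the chosen objective (cooperative, semi-competitive, ...)."""
    return dialogue["score"]

def finetune(model, dialogues):
    """Placeholder: would call an LLM finetuning API (the paper finetunes GPT-3.5)."""
    return model

# --- Iterative filtered behavior cloning via self-play ------------------------
def filtered_bc_selfplay(base_model, rounds=10, games_per_round=500, keep_frac=0.2):
    model = base_model
    for _ in range(rounds):
        # 1. Roll out self-play dialogues with the current model on both sides.
        dialogues = [play_selfplay_game(model) for _ in range(games_per_round)]
        # 2. Rank dialogues by task reward under the chosen objective.
        ranked = sorted(dialogues, key=score_dialogue, reverse=True)
        # 3. Keep only the highest-scoring dialogues (the "filter" in filtered BC).
        kept = ranked[: int(keep_frac * len(ranked))]
        # 4. Finetune on the retained dialogues and continue with the updated model.
        model = finetune(model, kept)
    return model
```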

Analytical Insights and Implications

  1. Dialogue Characteristics: The analysis revealed interesting patterns (a small sketch of how such metrics can be computed follows this list):
    • Dialogue lengths increased significantly in the cooperative setting, suggesting more elaborate communication strategies.
    • Conversely, semi-competitive dialogues became more concise over iterations, potentially as a strategy for reaching agreements faster.
    • Vocabulary usage also shifted, contracting in semi-competitive games and expanding in cooperative ones.
  2. Strategic Complexity: Despite the performance gains, the increase in scores was largely attributable to an improved ability to reach valid agreements rather than to the adoption of more sophisticated negotiation strategies. This suggests further headroom if self-play were combined with additional training methodologies.
  3. Filtered Behavioral Cloning: The observed improvements underscore the efficacy of filtered behavior cloning in steering models towards desirable behaviors through self-play, particularly in environments where LLMs can leverage their inherent generalization capabilities.
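
These dialogue-level trends can be measured with straightforward statistics. The snippet below is a minimal sketch assuming dialogues are available as plain-text transcripts grouped by training round (a hypothetical data layout, not the paper's); it computes mean dialogue length in whitespace tokens and vocabulary size per round.

```python
from collections import Counter

def dialogue_stats(dialogues):
    """Mean length (in whitespace tokens) and vocabulary size for a list of transcripts."""
    lengths, vocab = [], Counter()
    for text in dialogues:
        tokens = text.lower().split()
        lengths.append(len(tokens))
        vocab.update(tokens)
    mean_len = sum(lengths) / len(lengths) if lengths else 0.0
    return mean_len, len(vocab)

# Hypothetical per-round transcripts; real data would come from self-play logs.
rounds = {
    0: ["i want the book and one hat", "deal"],
    9: ["you take both balls , i keep the book and the hats", "agreed , that works for me"],
}
for r, ds in rounds.items():
    mean_len, vocab_size = dialogue_stats(ds)
    print(f"round {r}: mean length {mean_len:.1f} tokens, vocab size {vocab_size}")
```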

Implications and Future Directions

The promising outcomes in the cooperative and semi-competitive settings suggest that self-play is a viable technique for improving LLMs in broader real-world scenarios beyond traditional game environments. Future research could explore:

  • Integration with Advanced Techniques: Combining self-play with methods such as reinforcement learning from human feedback (RLHF) or natural language reflections could potentially address current limitations in strategic depth.
  • Population Play and Diverse Objectives: Investigating more sophisticated self-play configurations, such as diverse agent populations, could further mitigate overfitting and foster better generalization.
  • Real-World Applications: Extending the application of self-play beyond synthetic or game-based contexts to real-world tasks could validate its utility in practical AI scenarios.

Conclusion

This paper offers compelling evidence for the efficacy of LLM self-play in enhancing performance across a spectrum of cooperative and competitive tasks, providing a foundation for future explorations toward more intelligent and versatile AI systems. While challenges remain, especially in strictly competitive contexts, the evidence points to significant prospects for self-play in refining and advancing the capabilities of LLMs.

Authors (3)
  1. Austen Liao (2 papers)
  2. Nicholas Tomlin (10 papers)
  3. Dan Klein (99 papers)