Mastering No-Press Diplomacy via Human-Regularized Reinforcement Learning
The paper presents an approach for training AI agents to play no-press Diplomacy, a seven-player strategy game that blends cooperation and competition and allows no direct communication between players. The authors propose a reinforcement learning (RL) framework, RL-DiL-piKL, that regularizes self-play toward a model of human behavior so that the resulting agent performs well when interacting with human players.
Key Contributions
The central innovation of this work is the Distributional Lambda piKL (DiL-piKL) algorithm, which replaces the single regularization parameter λ of prior KL-regularized planning methods with a probability distribution over λ values. Rather than committing to a fixed λ, DiL-piKL samples λ at each iteration, letting the agent span a range of play styles, from closely imitating a human anchor policy to maximizing expected reward. This helps the agent adapt to human strategies, which often follow implicit conventions rather than pure utility maximization.
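To make the core update concrete, the sketch below shows one way a DiL-piKL-style step could look: a regularization strength λ is sampled from a discrete distribution, and the policy is the closed-form KL-regularized best response π(a) ∝ τ(a)·exp(Q(a)/λ), which pulls the reward-maximizing softmax toward the human anchor τ. This is a minimal illustration under assumptions, not the paper's implementation; `q_values`, `tau`, and the λ grid are hypothetical inputs.

```python
import numpy as np

def dil_pikl_style_step(q_values, tau, lambdas, lambda_probs, rng):
    """Illustrative DiL-piKL-style update for a single player (sketch only).

    q_values     -- estimated expected reward for each action
    tau          -- anchor (human imitation) policy over the same actions
    lambdas      -- support of the belief over the regularization strength
    lambda_probs -- probabilities of that belief
    """
    # Sample this iteration's regularization strength.
    lam = rng.choice(lambdas, p=lambda_probs)

    # Closed-form KL-regularized best response:
    #   argmax_pi E_pi[Q] - lam * KL(pi || tau)  =>  pi(a) ∝ tau(a) * exp(Q(a)/lam)
    logits = np.log(tau + 1e-12) + q_values / lam
    logits -= logits.max()            # for numerical stability
    pi = np.exp(logits)
    return pi / pi.sum(), lam

rng = np.random.default_rng(0)
q = np.array([1.0, 0.2, -0.5])        # toy action values
tau = np.array([0.2, 0.7, 0.1])       # toy human anchor policy
policy, lam = dil_pikl_style_step(q, tau, [0.01, 3.0], [0.5, 0.5], rng)
```

With a small sampled λ the policy concentrates on the highest-value action; with a large λ it stays close to the anchor, which is exactly the range of behaviors the distribution over λ is meant to cover.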
Methodology
DiL-piKL is embedded within a self-play RL framework whose goal is to iteratively refine both the policy and the value function. This involves:
- Policy regularization: the DiL-piKL planning algorithm computes reward-maximizing policies that are regularized toward a human-like anchor policy.
- Self-play reinforcement learning: the RL-DiL-piKL extension trains a value function through self-play in which every agent plans with DiL-piKL, so training partners behave like plausible models of human play (see the outline after this list).
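The outline below sketches how one self-play game in such a loop might be structured: every player samples its own λ, a planner produces regularized policies from the current value estimates, and the value network is trained on the resulting returns. All of the interfaces here (`game`, `planner`, `value_net`, `anchor`, `replay_buffer`, `LambdaBelief`) are hypothetical stand-ins, not the authors' code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LambdaBelief:
    """Discrete belief over regularization strengths (hypothetical helper)."""
    support: np.ndarray
    probs: np.ndarray

def self_play_game(game, value_net, anchor, lambda_belief, planner, replay_buffer, rng):
    """Outline of one self-play game in an RL-DiL-piKL-style training loop.

    Assumed (hypothetical) interfaces:
      - planner(state, value_net, anchor, lams) returns one regularized
        action distribution per player.
      - value_net.update(batch) takes a training step on sampled data.
    """
    state = game.initial_state()
    visited = []
    while not game.is_terminal(state):
        # Each player draws its own lambda, so training partners range from
        # near-human imitators to near reward-maximizers.
        lams = rng.choice(lambda_belief.support, size=game.num_players,
                          p=lambda_belief.probs)
        policies = planner(state, value_net, anchor, lams)

        # Sample one action per player from its regularized policy.
        actions = [rng.choice(len(pi), p=pi) for pi in policies]
        visited.append(state)
        state = game.next_state(state, actions)

    # Use the final returns as training targets for the value network.
    returns = game.returns(state)
    for s in visited:
        replay_buffer.add(s, returns)
    value_net.update(replay_buffer.sample_batch())
```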
The resulting agent, named Diplodocus, was evaluated both against baseline AI agents in simulation and against real human players in a large tournament, where it consistently ranked ahead of its peers.
Key Results
Agents trained with RL-DiL-piKL performed strongly:
- In a 200-game tournament with human participants, the two Diplodocus variants achieved the top two average scores, indicating the agent's strength in live play with humans.
- An Elo ratings model fit to the tournament results likewise placed Diplodocus ahead of the other agents in these mixed human-agent games (a brief Elo illustration follows this list).
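As background, the standard two-player Elo formula below shows how rating differences translate into expected scores; the paper fits a more involved ratings model suited to seven-player, score-based games, so this is only an illustration of the general idea.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard two-player Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# A 100-point rating advantage corresponds to roughly a 64% expected score.
print(round(elo_expected_score(1600, 1500), 2))  # -> 0.64
```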
Implications and Future Directions
This research suggests that modeling human-like behavior in AI agents improves their adaptability and effectiveness in mixed cooperative-competitive settings. More broadly, the approach speaks to multi-agent coordination problems in which agents must interact with both human players and other AI agents.
Future work could extend similar regularization strategies to games with richer communication or negotiation, potentially bridging the gap between purely computational optimality and human-compatible cooperation.
In conclusion, the RL-DiL-piKL framework offers a robust path toward AI systems capable of intuitive, strategic interaction in multi-agent environments, and marks a significant step toward human-compatible AI.