- The paper presents Self Other-Modeling (SOM), a method for multi-agent reinforcement learning where agents use their own strategy to infer others' hidden goals from observable behavior.
- SOM is implemented with two neural networks per agent (one for the agent's own actions and values, one for estimating the other agent's goal) that share parameters; the goal estimate is refined by backpropagating the discrepancy between the other agent's predicted and observed actions.
- Experiments in cooperative, adversarial, and mixed games show that SOM consistently outperforms baselines by inferring other agents' goals more accurately, leading to higher cumulative rewards.
Modeling Others using Oneself in Multi-Agent Reinforcement Learning
The paper presents an approach for multi-agent reinforcement learning (MARL) in environments with imperfect information, where agents must infer other agents' hidden goals from their observable behavior. The authors propose Self Other-Modeling (SOM), in which an agent uses its own policy to predict another agent's actions and refines its estimate of that agent's hidden goal online. Leveraging this inferred information about other agents' goals improves the agent's own policy learning.
Methodological Approach
The core of the presented approach is the SOM framework, which enables an agent to model another agent using its own policy infrastructure. Each agent acts in a two-player stochastic game with no communication channels, so the agents must deduce each other's hidden goals in order to maximize their respective utilities. The setting is effectively a partially observable Markov decision process in which each agent's optimal policy must account for both its own goal and its inferred estimate of the other agent's goal.
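To make the setting concrete, below is a minimal sketch of the interaction loop in such a game: each agent observes the shared state and its own goal, never the other agent's goal, and can only learn about that goal from the other agent's observed actions. The environment and agent interfaces used here (`reset`/`step`, `act`, `observe_other`) are illustrative assumptions, not an API from the paper, and the agents are assumed to act simultaneously for simplicity.

```python
# Hypothetical two-player, imperfect-information interaction loop.
# `env`, `agent_a`, and `agent_b` are assumed objects; their interfaces are
# illustrative, not taken from the paper.
def run_episode(env, agent_a, agent_b):
    """Each agent sees the shared state and its own (private) goal, but never
    the other agent's goal; it only observes the other agent's actions."""
    state, goal_a, goal_b = env.reset()      # goals are private to their owners
    done = False
    while not done:
        act_a = agent_a.act(state, own_goal=goal_a)
        act_b = agent_b.act(state, own_goal=goal_b)
        # Each agent refines its belief about the other's hidden goal from the
        # action it just observed (see the goal-inference sketch further below).
        agent_a.observe_other(state, act_b)
        agent_b.observe_other(state, act_a)
        state, rewards, done = env.step((act_a, act_b))
    return rewards
```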
SOM is implemented with two neural networks per agent: one computes the agent's own actions and value estimates, the other predicts the other agent's actions in order to infer its hidden goal. Both networks share the same parameters and differ only in how their inputs are arranged, so the agent literally reuses its own policy to model the other. The estimate of the other agent's goal is refined by backward optimization: the discrepancy between the action predicted for the other agent and the action actually observed is backpropagated into the goal estimate, progressively aligning the model with observed behavior over the course of an episode.
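A minimal sketch of this backward-optimization step is shown below, assuming PyTorch, a single shared policy/value network, a discrete action space, and a softmax-parameterized goal estimate updated with a fixed number of inner gradient steps each time the other agent acts. The names (`PolicyNet`, `infer_other_goal`) and hyperparameters are hypothetical; the paper's exact architecture and update schedule may differ.

```python
# Minimal SOM-style goal inference sketch (PyTorch assumed). All class and
# function names are illustrative, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """One network, used both to act (self mode) and to model the other agent
    (other mode). Inputs: state features, the acting agent's goal, and the
    goal of the remaining agent."""
    def __init__(self, state_dim, goal_dim, n_actions):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + 2 * goal_dim, 64),
            nn.ReLU(),
        )
        self.policy_head = nn.Linear(64, n_actions)
        self.value_head = nn.Linear(64, 1)

    def forward(self, state, acting_goal, other_goal):
        h = self.body(torch.cat([state, acting_goal, other_goal], dim=-1))
        return self.policy_head(h), self.value_head(h)

def infer_other_goal(net, state, own_goal, other_goal_logits,
                     observed_other_action, steps=5, lr=0.1):
    """Refine the estimate of the other agent's goal by gradient descent:
    run our own network in the other agent's role and push its predicted
    action distribution toward the action that was actually observed.
    Only the goal estimate is updated; the network weights are untouched
    because the optimizer holds nothing but `other_goal_logits`."""
    other_goal_logits = other_goal_logits.clone().detach().requires_grad_(True)
    opt = torch.optim.SGD([other_goal_logits], lr=lr)
    for _ in range(steps):
        goal_estimate = F.softmax(other_goal_logits, dim=-1)
        # Swap perspectives: the other agent is now the "acting" agent,
        # conditioned on its estimated goal and on our own (known) goal.
        action_logits, _ = net(state, goal_estimate, own_goal)
        loss = F.cross_entropy(action_logits, observed_other_action)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return other_goal_logits.detach()

# Hypothetical usage: batch size 1, 4-dim one-hot goals, 5 actions.
# net = PolicyNet(state_dim=10, goal_dim=4, n_actions=5)
# new_logits = infer_other_goal(net, torch.randn(1, 10),
#                               F.one_hot(torch.tensor([2]), 4).float(),
#                               torch.zeros(1, 4),
#                               torch.tensor([3]))
```

The design choice mirrored here is that the agent reuses the very same parameters it uses for its own policy, with the goal inputs swapped, to predict the other agent's action; only the goal estimate is optimized in this inner loop, while the network weights are left to the usual reinforcement learning update.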
Experimental Validation
To evaluate SOM's efficacy, experiments were conducted across three games requiring varying degrees of cooperation and competition: a cooperative Coin Game, an adversarial Recipe Game, and a partially cooperative Door Game with asymmetric roles. In these experiments, SOM consistently outperformed the baselines, inferring other agents' goals more accurately and earning higher cumulative rewards. In the Coin Game in particular, SOM learned to distinguish elements of the environment relevant to its own goal, those relevant to the other agent's goal, and those relevant to neither, leading to better joint strategies with its partner.
A notable aspect of SOM's performance appears in the Coin Game experiments, where the framework anticipated the co-player's strategy substantially better, yielding a clear increase in overall reward. This supports the claim that explicit opponent models embedded within an agent's policy improve strategic decision-making beyond what purely reactive policies achieve.
Implications and Future Directions
The insights gathered from employing SOM in MARL environments underscore its potential for tasks that demand strategic inference and adaptation to other agents' goals. The method's simplicity, together with its compatibility with standard neural architectures and reinforcement learning algorithms, makes it attractive for broader use in distributed multi-agent systems and games.
Potential extensions of this work include scaling to environments with more agents and more varied goal types, incorporating hierarchical goal structures, and adapting to opponents whose strategies change over time. The approach also has implications for human-robot interaction, particularly for improving collaborative task performance, and could serve as a foundation for richer human-AI interaction models in mixed-agent domains.
In summary, SOM is a compelling illustration of leveraging self-models for dynamic goal inference in multi-agent scenarios, with implications for both theoretical development and real-world multi-agent systems.