- The paper presents Self Other-Modeling (SOM), a method for multi-agent reinforcement learning where agents use their own strategy to infer others' hidden goals from observable behavior.
- SOM is implemented with two neural networks per agent (one for the agent's own actions and values, one for estimating the other agent's goal) that share parameters; the goal estimate is refined by backpropagating the discrepancy between the other agent's predicted and observed actions.
- Experiments in cooperative, adversarial, and mixed games show that SOM consistently outperforms baselines by inferring other agents' goals more accurately, leading to higher cumulative rewards.
Modeling Others using Oneself in Multi-Agent Reinforcement Learning
The paper presents an approach for multi-agent reinforcement learning (MARL) in environments with imperfect information, where agents must infer other agents' hidden goals from their observable behavior. The authors propose Self Other-Modeling (SOM), in which an agent uses its own policy to predict another agent's actions and refines its estimate of that agent's hidden goal online. Leveraging this inferred information about other agents' goals improves the agent's own policy learning.
Methodological Approach
The core of the presented approach is the SOM framework, which enables an agent to model another agent using its own policy infrastructure. Each agent acts in a two-player stochastic game with no communication channels, so the agents must deduce each other's hidden goals in order to maximize their respective utilities. The setting is effectively a partially observable Markov decision process in which each agent's optimal policy must account for both its own goal and its inferred estimate of the other agent's goal.
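To make the setting concrete, below is a minimal sketch of the interaction loop in such a game: each agent observes the shared state and its own goal, never the other agent's goal, and can only learn about that goal from the other agent's observed actions. The environment and agent interfaces used here (`reset`/`step`, `act`, `observe_other`) are illustrative assumptions, not an API from the paper, and the agents are assumed to act simultaneously for simplicity.

```python
# Hypothetical two-player, imperfect-information interaction loop.
# `env`, `agent_a`, and `agent_b` are assumed objects; their interfaces are
# illustrative, not taken from the paper.
def run_episode(env, agent_a, agent_b):
    """Each agent sees the shared state and its own (private) goal, but never
    the other agent's goal; it only observes the other agent's actions."""
    state, goal_a, goal_b = env.reset()      # goals are private to their owners
    done = False
    while not done:
        act_a = agent_a.act(state, own_goal=goal_a)
        act_b = agent_b.act(state, own_goal=goal_b)
        # Each agent refines its belief about the other's hidden goal from the
        # action it just observed (see the goal-inference sketch further below).
        agent_a.observe_other(state, act_b)
        agent_b.observe_other(state, act_a)
        state, rewards, done = env.step((act_a, act_b))
    return rewards
```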
SOM is implemented with two neural networks per agent: one computes the agent's own actions and value estimates, the other predicts the other agent's actions in order to infer its hidden goal. Both networks share the same parameters and differ only in how their inputs are arranged, so the agent literally reuses its own policy to model the other. The estimate of the other agent's goal is refined by backward optimization: the discrepancy between the action predicted for the other agent and the action actually observed is backpropagated into the goal estimate, progressively aligning the model with observed behavior over the course of an episode.
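A minimal sketch of this backward-optimization step is shown below, assuming PyTorch, a single shared policy/value network, a discrete action space, and a softmax-parameterized goal estimate updated with a fixed number of inner gradient steps each time the other agent acts. The names (`PolicyNet`, `infer_other_goal`) and hyperparameters are hypothetical; the paper's exact architecture and update schedule may differ.

```python
# Minimal SOM-style goal inference sketch (PyTorch assumed). All class and
# function names are illustrative, not from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """One network, used both to act (self mode) and to model the other agent
    (other mode). Inputs: state features, the acting agent's goal, and the
    goal of the remaining agent."""
    def __init__(self, state_dim, goal_dim, n_actions):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + 2 * goal_dim, 64),
            nn.ReLU(),
        )
        self.policy_head = nn.Linear(64, n_actions)
        self.value_head = nn.Linear(64, 1)

    def forward(self, state, acting_goal, other_goal):
        h = self.body(torch.cat([state, acting_goal, other_goal], dim=-1))
        return self.policy_head(h), self.value_head(h)

def infer_other_goal(net, state, own_goal, other_goal_logits,
                     observed_other_action, steps=5, lr=0.1):
    """Refine the estimate of the other agent's goal by gradient descent:
    run our own network in the other agent's role and push its predicted
    action distribution toward the action that was actually observed.
    Only the goal estimate is updated; the network weights are untouched
    because the optimizer holds nothing but `other_goal_logits`."""
    other_goal_logits = other_goal_logits.clone().detach().requires_grad_(True)
    opt = torch.optim.SGD([other_goal_logits], lr=lr)
    for _ in range(steps):
        goal_estimate = F.softmax(other_goal_logits, dim=-1)
        # Swap perspectives: the other agent is now the "acting" agent,
        # conditioned on its estimated goal and on our own (known) goal.
        action_logits, _ = net(state, goal_estimate, own_goal)
        loss = F.cross_entropy(action_logits, observed_other_action)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return other_goal_logits.detach()

# Hypothetical usage: batch size 1, 4-dim one-hot goals, 5 actions.
# net = PolicyNet(state_dim=10, goal_dim=4, n_actions=5)
# new_logits = infer_other_goal(net, torch.randn(1, 10),
#                               F.one_hot(torch.tensor([2]), 4).float(),
#                               torch.zeros(1, 4),
#                               torch.tensor([3]))
```

The design choice mirrored here is that the agent reuses the very same parameters it uses for its own policy, with the goal inputs swapped, to predict the other agent's action; only the goal estimate is optimized in this inner loop, while the network weights are left to the usual reinforcement learning update.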
Experimental Validation
To evaluate SOM's efficacy, experiments were conducted across three games requiring varying degrees of cooperation and competition: a cooperative Coin Game, an adversarial Recipe Game, and a partially cooperative Door Game with asymmetric roles. In these experiments, SOM consistently outperformed the baselines, inferring other agents' goals more accurately and earning higher cumulative rewards. In the Coin Game in particular, SOM learned to distinguish elements of the environment relevant to its own goal, those relevant to the other agent's goal, and those relevant to neither, leading to better joint strategies with its partner.
A notable aspect of SOM's performance appears in the Coin Game experiments, where the framework anticipated the co-player's strategy substantially better, yielding a clear increase in overall reward. This supports the claim that explicit opponent models embedded within an agent's policy improve strategic decision-making beyond what purely reactive policies achieve.
Implications and Future Directions
The insights gathered from employing SOM in MARL environments underscore its potential for tasks that demand strategic inference and adaptation to other agents' goals. The method's simplicity, together with its compatibility with standard neural architectures and reinforcement learning algorithms, makes it attractive for broader use in distributed multi-agent systems and games.
Potential extensions of this work include scaling to environments with more agents and more varied goal types, incorporating hierarchical goal structures, and adapting to opponents whose strategies change over time. The approach also has implications for human-robot interaction, particularly for improving collaborative task performance, and could serve as a foundation for richer human-AI interaction models in mixed-agent domains.
In summary, SOM is a compelling illustration of leveraging self-models for dynamic goal inference in multi-agent scenarios, with implications for both theoretical development and real-world multi-agent systems.