Simultaneous AlphaZero in Multi-Agent Markov Games

Updated 20 December 2025
  • The paper introduces Simultaneous AlphaZero, which extends the classic AlphaZero by handling joint action selection in simultaneous-action Markov games.
  • It embeds matrix game solvers into MCTS using UCB-based exploration and neural network estimates, reducing exploitability in benchmark domains like Dubin Tag and custody maintenance.
  • The research offers theoretical guarantees and practical improvements in strategic decision-making while addressing computational challenges for large joint action spaces.

Simultaneous AlphaZero generalizes the AlphaZero paradigm to environments where multiple agents act simultaneously at each decision point. Unlike the original sequential setting, simultaneous-action Markov games require joint action selection and equilibrium-solving within tree search, necessitating new algorithmic mechanisms to handle the resulting strategic interaction and partial observability induced by bandit feedback. Modern approaches address these challenges by embedding matrix game solvers into planning, leveraging advanced regret minimization and neural networks for policy and value estimation. This article reviews core concepts, algorithmic methodology, theoretical guarantees, empirical studies, and frontiers in simultaneous AlphaZero research.

1. Problem Setting: Markov Games with Simultaneous Actions

Simultaneous AlphaZero operates in the framework of deterministic, two-player, zero-sum Markov games, where at each state $s$ both players $i \in \{1,2\}$ simultaneously select actions $a^i$ from finite action sets $\mathcal{A}^i$. The system is formally defined by the tuple

$$(\mathcal{S},\,\mathcal{A}^1,\,\mathcal{A}^2,\,T,\,r,\,\gamma),$$

where $\mathcal{S}$ is the state space (discrete or continuous), $T:\mathcal{S}\times\mathcal{A}^1\times\mathcal{A}^2\to\mathcal{S}$ describes deterministic transitions, $r^1(s,a^1,a^2) = -r^2(s,a^1,a^2)$ assigns zero-sum rewards, and $\gamma\in[0,1]$ is a discount factor. A (potentially stochastic) policy $\pi^i:\mathcal{S}\to\Delta(\mathcal{A}^i)$ maps states to distributions over actions.

The value for player 1 under joint policy $(\pi^1,\pi^2)$ starting from state $s$ is

$$V^{\pi^1,\pi^2}(s) = \mathbb{E}\Big[\sum_{t=0}^\infty \gamma^t\, r^1(s_t,a^1_t,a^2_t) \,\Big|\, s_0 = s,\; a^i_t \sim \pi^i(s_t)\Big].$$

A Nash equilibrium is attained when each policy is the best response to the other; that is,

$$V^{\pi^{1*},\pi^{2*}}(s) = \max_{\pi^1} \min_{\pi^2} V^{\pi^1,\pi^2}(s) = \min_{\pi^2} \max_{\pi^1} V^{\pi^1,\pi^2}(s).$$

This formulation is what necessitates joint action selection and matrix-game solving at each stage of tree search (Becker et al., 13 Dec 2025).
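
For concreteness, the tuple $(\mathcal{S},\mathcal{A}^1,\mathcal{A}^2,T,r,\gamma)$ can be packaged as a small container of callables. The sketch below is illustrative only; all field names are assumptions, and the paper does not prescribe a concrete interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class ZeroSumMarkovGame:
    """Container for the tuple (S, A^1, A^2, T, r, gamma) of a deterministic,
    two-player, zero-sum Markov game. Field names are illustrative only."""
    actions1: Sequence[Any]                         # finite action set A^1
    actions2: Sequence[Any]                         # finite action set A^2
    transition: Callable[[Any, Any, Any], Any]      # deterministic T(s, a1, a2)
    reward1: Callable[[Any, Any, Any], float]       # r^1(s, a1, a2) = -r^2(s, a1, a2)
    is_terminal: Callable[[Any], bool]              # terminal-state predicate
    gamma: float = 0.99                             # discount factor in [0, 1]
```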

2. Algorithmic Methodology

2.1 Matrix Game at Tree Nodes

In turn-based AlphaZero, each MCTS node conducts a scalar argmax to select the action. In contrast, Simultaneous AlphaZero builds a local zero-sum matrix game at each node, with the payoff matrix for player 1 defined via

$$Q_{ij}(s) = r^1(s, a^1_i, a^2_j) + \gamma\, \hat V\big(T(s, a^1_i, a^2_j)\big),$$

where $\hat V$ is the current value estimate for the successor state and the pairs $(a^1_i, a^2_j)$ enumerate the joint action space.
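
Concretely, the payoff matrix can be built by enumerating both action sets and querying the environment model and the value network. This is a minimal sketch, assuming NumPy and the illustrative ZeroSumMarkovGame container above; value_fn stands in for the learned value estimate $\hat V$, and none of these names come from the paper.

```python
import numpy as np

def local_payoff_matrix(game, state, value_fn):
    """Build Q_ij(s) = r^1(s, a^1_i, a^2_j) + gamma * V_hat(T(s, a^1_i, a^2_j)).
    `game` follows the illustrative ZeroSumMarkovGame container above and
    `value_fn` stands in for the learned value network; a sketch, not the
    paper's implementation."""
    Q = np.empty((len(game.actions1), len(game.actions2)))
    for i, a1 in enumerate(game.actions1):
        for j, a2 in enumerate(game.actions2):
            s_next = game.transition(state, a1, a2)
            Q[i, j] = game.reward1(state, a1, a2) + game.gamma * value_fn(s_next)
    return Q
```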

2.2 Exploration–Exploitation via Bandit-Augmented Payoff

Because only the sampled payoff $Q_{i_t j_t}(s)$ can be observed at each rollout, Simultaneous AlphaZero applies a UCB-style augmentation to the matrix game:

$$\widetilde Q_{ij}(s) = Q_{ij}(s) + c_{\rm PUCT}\, \frac{P(s,a^1_i)\, P(s,a^2_j)\, \sqrt{N(s)}}{1 + N(s,a^1_i,a^2_j)},$$

where $N(s)$ is the total visit count for node $s$, $N(s,a^1_i,a^2_j)$ is the joint-action visit count, $P(s,a^1_i)$ and $P(s,a^2_j)$ are prior probabilities from the policy network, and $c_{\rm PUCT}$ is a tunable constant.

A regret-minimizing matrix game solver processes $\widetilde Q(s)$ under bandit feedback, returning stochastic strategies $\tilde\pi^1(\cdot \mid s)$ and $\tilde\pi^2(\cdot \mid s)$. The next joint action is then sampled and the tree is traversed recursively.
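
The augmentation itself is an elementwise operation on the payoff matrix. The sketch below assumes NumPy arrays for the priors and joint visit counts, and takes $N(s)$ to be the sum of the joint-action counts; names, shapes, and the default constant are assumptions rather than the paper's interface.

```python
import numpy as np

def puct_augmented_payoffs(Q, prior1, prior2, joint_counts, c_puct=1.25):
    """Optimistic payoff matrix used while traversing the tree.
    Q            -- current payoff estimates Q_ij(s), shape (|A^1|, |A^2|)
    prior1/2     -- policy-network priors P(s, a^1_i) and P(s, a^2_j)
    joint_counts -- joint visit counts N(s, a^1_i, a^2_j), same shape as Q
    c_puct       -- tunable exploration constant."""
    n_s = joint_counts.sum()  # N(s), taken here as the sum of joint-action counts
    bonus = c_puct * np.outer(prior1, prior2) * np.sqrt(n_s) / (1.0 + joint_counts)
    return Q + bonus
```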

2.3 MCTS with Simultaneous Moves: Pseudocode

A single MCTS simulation in Simultaneous AlphaZero proceeds as follows (cf. Becker et al., 13 Dec 2025):

```
Function Simulate(node n):
  s ← state(n)
  if terminal(s):
    return 0
  if n is leaf:
    Expand(n)
    Vs ← SolveLocalMatrixGame(n)
    return Vs
  else:
    (π1,π2) ← SolveAugmentedMatrixGame(n)
    sample a1∼π1, a2∼π2
    child ← child_of(n,(a1,a2))
    v_child ← Simulate(child)
    update N(s), N(s,a1,a2)
    Q(s)_{ij} ← R(s,a1_i,a2_j) + γ·V(child_ij)
    (π1,π2, Vs) ← SolveLocalMatrixGame(n)
    return Vs
```

At the tree root, the final equilibrium strategies are extracted by solving the root's matrix game without the UCB exploration terms, as sketched below.
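
Any exact zero-sum matrix game solver can be used for this extraction step. The following is a minimal linear-programming sketch, assuming NumPy and SciPy; the paper does not prescribe this particular solver, and the function names are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def maximin_strategy(Q):
    """Maximin mixed strategy and game value for the row (maximizing) player
    of a zero-sum matrix game with payoff matrix Q (rows = player 1 actions)."""
    m, n = Q.shape
    # Decision variables z = [x_1, ..., x_m, v]; maximize v  <=>  minimize -v.
    c = np.r_[np.zeros(m), -1.0]
    # For every column j:  v - sum_i Q[i, j] * x_i <= 0.
    A_ub = np.hstack([-Q.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # x must lie on the probability simplex.
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

def root_strategies(Q_root):
    """Equilibrium strategies at the root: player 1 maximizes Q_root while
    player 2 maximizes -Q_root^T (i.e. minimizes player 1's payoff)."""
    pi1, value = maximin_strategy(Q_root)
    pi2, _ = maximin_strategy(-Q_root.T)
    return pi1, pi2, value

# Example: matching pennies has value 0 and uniform equilibrium strategies.
print(root_strategies(np.array([[1.0, -1.0], [-1.0, 1.0]])))
```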

3. Regret-Minimizing Matrix Bandit Solver

Crucial to Simultaneous AlphaZero is the use of a regret-optimal solver for matrix games under bandit feedback (Becker et al., 13 Dec 2025). At each node, repeated play over $T$ steps observes the entries $Q_{i_t j_t}$ only for the sampled pairs $(i_t, j_t)$. The external regret of player 1 after $T$ rounds is

$$R_T^1 = \max_i \sum_{t=1}^T Q_{i, j_t} - \sum_{t=1}^T Q_{i_t, j_t}.$$

A solver such as the UCB-augmented method of O’Donoghue et al. (2021) guarantees $R_T^1 = O(\sqrt{T \log T})$. At each iteration, empirical means $\widehat Q^t_{ij}$ and counts $n_{ij}(t)$ are combined as

$$\widetilde Q^t_{ij} = \widehat Q^t_{ij} + c\sqrt{\frac{\log t}{1 + n_{ij}(t)}},$$

with strategies refined via regret matching or linear programming. Under this scheme the players' average strategies converge to a Nash equilibrium of the matrix game despite the partial feedback.
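
To illustrate the flavor of such a solver, the sketch below runs self-play on a fixed payoff matrix under bandit feedback, combining optimistic empirical means with regret matching. The constants, the regret-matching update, and the returned average strategies are illustrative choices under stated assumptions; they do not reproduce the exact method of O’Donoghue et al. (2021).

```python
import numpy as np

def regret_matching(regrets):
    """Mixed strategy proportional to positive cumulative regrets
    (uniform when no action has positive regret)."""
    pos = np.maximum(regrets, 0.0)
    total = pos.sum()
    return pos / total if total > 0 else np.full(len(regrets), 1.0 / len(regrets))

def bandit_matrix_solver(Q_true, T=20000, c=1.0, seed=0):
    """Self-play on a zero-sum matrix game in which only the sampled entry
    Q_true[i, j] is revealed each round. Returns the players' average
    strategies, which approximate an equilibrium as T grows."""
    rng = np.random.default_rng(seed)
    m, n = Q_true.shape
    means = np.zeros((m, n))                  # empirical means \hat Q^t_{ij}
    counts = np.zeros((m, n))                 # visit counts    n_{ij}(t)
    reg1, reg2 = np.zeros(m), np.zeros(n)     # cumulative regrets
    avg1, avg2 = np.zeros(m), np.zeros(n)     # running strategy sums
    for t in range(1, T + 1):
        q_opt = means + c * np.sqrt(np.log(t + 1) / (1.0 + counts))  # optimistic matrix
        pi1, pi2 = regret_matching(reg1), regret_matching(reg2)
        i = rng.choice(m, p=pi1)
        j = rng.choice(n, p=pi2)
        payoff = Q_true[i, j]                 # bandit feedback: a single entry
        counts[i, j] += 1
        means[i, j] += (payoff - means[i, j]) / counts[i, j]
        reg1 += q_opt[:, j] - q_opt[i, j]     # player 1 maximizes the payoff
        reg2 += q_opt[i, j] - q_opt[i, :]     # player 2 minimizes it
        avg1 += pi1
        avg2 += pi2
    return avg1 / T, avg2 / T

# Rock-paper-scissors: average strategies should approach the uniform equilibrium.
rps = np.array([[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]])
print(bandit_matrix_solver(rps))
```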

4. Empirical Evaluation Across Benchmark Domains

Simultaneous AlphaZero’s performance has been validated in two principal domains (Becker et al., 13 Dec 2025):

4.1 Continuous-State Pursuit–Evasion (Dubin Tag)

  • State: Relative 2D positions and agent headings.
  • Dynamics: Dubin vehicle with discrete angular controls.
  • Reward: +1 for attacker reaching a goal, +1 for defender interception, 0 otherwise.
  • Training: 50k self-play episodes, evaluated against a full-information best-response solver.
  • Metrics: Best-response value and exploitability $e(\pi)$ (see the note on exploitability below).
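
Exploitability is not defined explicitly above. One common convention for two-player zero-sum games, which may differ from the paper's exact normalization, measures how much each player could gain by best-responding to the other's fixed policy:

$$e(\pi) = \tfrac{1}{2}\Big(\max_{\pi^{1\prime}} V^{\pi^{1\prime},\,\pi^2}(s_0) \;-\; \min_{\pi^{2\prime}} V^{\pi^1,\,\pi^{2\prime}}(s_0)\Big),$$

which is nonnegative and equals zero exactly at a Nash equilibrium.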

Key findings:

  • Defender’s policy exploitability decreases during training.
  • Incorporating 500 MCTS simulations at test time achieves an additional ~30% reduction in exploitability relative to the raw network policy.

4.2 Space Domain Awareness – Custody Maintenance

  • State: Relative orbital geometry, illumination, visibility masks.
  • Actions: Discrete thrust maneuvers for observer and target.
  • Reward: +1 per maintained custody step, −1 per occlusion.
  • Training: Analogous self-play and evaluation as above.

Highlights:

  • Learned value networks replicate the expected geometry (concentric custody retention in daylight, dips in eclipse).
  • Raw-network exploitability decreases from roughly 0.4 to roughly 0.1 over the course of training.
  • MCTS at test time further halves exploitability, underscoring the robustness benefit of planning on top of the learned policy.

5. Theoretical Limitations and Directions for Extension

Simultaneous AlphaZero, as formalized in (Becker et al., 13 Dec 2025), is restricted to deterministic, two-player, zero-sum settings with finite action spaces. Computational complexity per tree node is $O(|\mathcal{A}^1|\times|\mathcal{A}^2|)$, which can be prohibitive for large joint action spaces.

Proposed extensions include:

  • Stochastic transitions and partial observability.
  • Continuous-action adaptations (e.g., covariance-matrix adaptation within tree search).
  • Generalization to more than two players and to general-sum payoffs.
  • Improved theoretical analysis of exploitability bounds with function approximation.

A plausible implication is that these limitations currently restrict real-world applicability to domains with a moderate number of agents and actions, though methodological advances in continuous-action reinforcement learning and scalable equilibrium computation may unlock broader utility.

6. Comparison with Alternative Approaches

Alternative algorithms for simultaneous-move games, such as Albatross (Mahlau et al., 2024), replace PUCT/MCTS with fixed-depth lookahead and temperature-parameterized logit equilibria (SBRLE), targeting cooperation and bounded rationality. These methods model agent heterogeneity or rationality explicitly and perform online opponent modeling during test-time interactions, contrasting with Simultaneous AlphaZero’s per-node Nash equilibrium approach.

Other simultaneous learning paradigms, such as AlphaViT for multi-game simultaneous training (Fujita, 2024) and AZ_db for league-based diversity in simultaneous multi-agent play (Zahavy et al., 2023), focus on shared neural architectures and diversity-promoting objectives, but do not replace joint-action matrix-game solving at the planning level.

7. Significance and Broader Impact

Simultaneous AlphaZero establishes a principled approach to planning in Markov games where agents act concurrently. The integration of regret-minimizing matrix bandit solvers into MCTS generalizes the AlphaZero recipe to a substantial new class of adversarial and cooperative environments, with practical advantages in robustness against sophisticated or exploitative opponents. The algorithmic design and empirical validation indicate a dependable framework for multistep, simultaneous-action environments, and a foundation for future research on more general and scalable forms of multi-agent tree search (Becker et al., 13 Dec 2025).
