Two-Agent Adversarial Flow Networks
- The paper introduces AFlowNets, an off-policy self-play framework that minimizes trajectory-balance loss to achieve unique Nash equilibrium strategies.
- Methodology leverages adversarial extensions of GFlowNets with branch-adjustment to bias moves towards shorter wins in turn-based zero-sum games.
- Empirical results in Connect-4 demonstrate that AFlowNets outperform AlphaZero by achieving over 80% optimal move rates under comparable computational constraints.
Two-Agent Adversarial Flow Networks (AFlowNets) are a variant of generative flow networks (GFlowNets) designed for modeling and solving two-player zero-sum games within a flow-based formalism. By extending expected flow network (EFlowNet) principles to adversarial, turn-based environments, AFlowNets provide a theoretically grounded, off-policy self-play algorithm that seeks a unique Nash equilibrium policy pair through the minimization of trajectory-balance loss. Empirical evaluations demonstrate that AFlowNets achieve high optimal move rates and outperform AlphaZero in the classical game of Connect-4 under comparable computational constraints (Jiralerspong et al., 2023).
1. Background and Theoretical Foundation
GFlowNets are sequential generative models that sample trajectories on directed trees, producing objects (terminal states) with probability proportional to a predefined reward function. Each state $s$ is associated with a non-negative flow $F(s)$, and a policy $\pi$ governs transitions along the tree. The flow-matching constraints require
$F(s) = \sum_{s' \in \mathrm{Ch}(s)} F(s'),\quad \forall s \notin \mathcal X, \qquad F(x) = R(x),\quad \forall x \in \mathcal X$
where $\mathrm{Ch}(s)$ denotes the children of $s$ and $R$ is the reward function on the terminal states $\mathcal X$.
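The flow-matching constraints can be checked by hand on a small tree. The sketch below is illustrative only: the tree, the state names, and the reward values are made up, and the policy $\pi(s' \mid s) = F(s') / F(s)$ is the usual flow-induced choice rather than anything specific to this article.

```python
from typing import Dict, List

# Hypothetical toy tree: internal states map to their children,
# terminal states carry a reward.
children: Dict[str, List[str]] = {
    "s0": ["a", "b"],
    "a": ["x1", "x2"],
    "b": ["x3"],
}
reward: Dict[str, float] = {"x1": 1.0, "x2": 3.0, "x3": 4.0}


def flow(s: str) -> float:
    """F(x) = R(x) at terminal states; otherwise the sum of the child flows."""
    if s in reward:
        return reward[s]
    return sum(flow(c) for c in children[s])


def policy(s: str) -> Dict[str, float]:
    """Policy induced by the flows: pi(s' | s) = F(s') / F(s)."""
    total = flow(s)
    return {c: flow(c) / total for c in children[s]}


def terminal_probs(s: str = "s0", p: float = 1.0) -> Dict[str, float]:
    """Probability of ending in each terminal state when following the policy."""
    if s in reward:
        return {s: p}
    out: Dict[str, float] = {}
    for c, pc in policy(s).items():
        out.update(terminal_probs(c, p * pc))
    return out


Z = flow("s0")            # total flow at the root
probs = terminal_probs()  # sampling distribution over terminal states
```

In this toy case each terminal state is reached with probability $R(x) / Z$, which is the "proportional to reward" property stated above.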
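EFlowNets extend this formalism to stochastic environments, partitioning nonterminal states into agent states $\mathcal S_{\text{agent}}$ and environment states $\mathcal S_{\text{env}}$. For environment states $s \in \mathcal S_{\text{env}}$, transitions follow a fixed stochastic kernel $P_{\text{env}}(s' \mid s)$. The EFlowNets introduce expected detailed balance (EDB) constraints:
- Agent step: $F(s)\,\pi(s' \mid s) = F(s')$ for $s \in \mathcal S_{\text{agent}}$, $s' \in \mathrm{Ch}(s)$.
- Environment step: $F(s) = \sum_{s' \in \mathrm{Ch}(s)} P_{\text{env}}(s' \mid s)\, F(s')$ for $s \in \mathcal S_{\text{env}}$.
- Terminal step: $F(x) = R(x)$ for $x \in \mathcal X$.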
Flows and agent policies are uniquely determined under mild conditions; when no environment states are present, this reduces to standard (deterministic) GFlowNets (Jiralerspong et al., 2023).
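A small numerical sketch may help make the environment step concrete. It assumes that the flow at an environment state is the kernel-weighted average of its children's flows, and that the agent chooses each child in proportion to its flow. The tree, the kernel, and the rewards below are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy tree: one agent state whose children are environment states.
agent_children: Dict[str, List[str]] = {"s0": ["e1", "e2"]}
# Fixed environment kernel, stored as (child, probability) pairs.
env_kernel: Dict[str, List[Tuple[str, float]]] = {
    "e1": [("x1", 0.5), ("x2", 0.5)],
    "e2": [("x3", 0.5), ("x4", 0.5)],
}
reward: Dict[str, float] = {"x1": 2.0, "x2": 6.0, "x3": 1.0, "x4": 3.0}


def expected_flow(s: str) -> float:
    """Flow under the assumed constraints for terminal, environment, agent states."""
    if s in reward:                      # terminal step: F(x) = R(x)
        return reward[s]
    if s in env_kernel:                  # environment step: expectation over the kernel
        return sum(p * expected_flow(c) for c, p in env_kernel[s])
    return sum(expected_flow(c) for c in agent_children[s])  # agent step


def agent_policy(s: str) -> Dict[str, float]:
    """Agent policy induced by the flows: pi(s' | s) = F(s') / F(s)."""
    total = expected_flow(s)
    return {c: expected_flow(c) / total for c in agent_children[s]}


pi0 = agent_policy("s0")
```

Here the agent prefers `e1` over `e2` by the ratio of their expected rewards, which is the sense in which the agent policy accounts for environment stochasticity.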
2. Adversarial Flow Networks for Two-Player Zero-Sum Games
In the two-agent adversarial context, the state set of the game tree is partitioned into disjoint player states $\mathcal S_1$, $\mathcal S_2$ and terminal states $\mathcal X$. Player $i$ acts at states $s \in \mathcal S_i$, while the other player’s actions are modeled as part of a stochastic environment from $i$’s perspective.
Each player $i$ maintains:
- A flow function $F_i$,
- A policy $\pi_i(s' \mid s)$ for $s \in \mathcal S_i$.
The adversarial EDB constraints for player $i$ are:
- Own move: $F_i(s)\,\pi_i(s' \mid s) = F_i(s')$ for $s \in \mathcal S_i$.
- Opponent move: $F_i(s) = \sum_{s' \in \mathrm{Ch}(s)} \pi_j(s' \mid s)\, F_i(s')$ for $s \in \mathcal S_j$ ($j \neq i$).
- Terminal: $F_i(x) = R_i(x)$ where $R_1(x)\, R_2(x) = 1$ to satisfy a zero-sum relationship (in log-domain).
To bias play toward shorter wins, a branch-adjustment is introduced, with distinct values assigned to a win, a draw, and a loss.
3. Trajectory-Balance Constraints and Training Objective
The global trajectory-balance (TB) constraint offers an alternative to local EDB constraints. For any complete trajectory $\tau = (s_0 \to s_1 \to \dots \to s_n)$ ending in a terminal state $s_n = x$: $Z \prod_{t:\, s_t \in \mathcal S_1} \pi_1(s_{t+1} \mid s_t) = R_1(x) \prod_{t:\, s_t \in \mathcal S_2} \pi_2(s_{t+1} \mid s_t)$ where $\mathcal S_i$ is the set of states at which player $i$ moves and the scalar $Z$ is unique.
The trajectory-balance loss for a trajectory $\tau$ is
$\mathcal L_{\mathrm{TB}}(\tau) = \left( \log Z + \sum_{t:\, s_t \in \mathcal S_1} \log \pi_1(s_{t+1} \mid s_t) - \log R_1(x) - \sum_{t:\, s_t \in \mathcal S_2} \log \pi_2(s_{t+1} \mid s_t) \right)^2$
The parameters $\theta$ are updated by minimizing the expected TB loss over sampled self-play trajectories. The resulting Nash equilibrium is unique (Jiralerspong et al., 2023).
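As a rough illustration, the loss for one trajectory can be computed from per-move probabilities. The sketch assumes the TB residual has the form $\log Z + \sum \log \pi_1 - \sum \log \pi_2 - \log R_1(x)$, with the sums taken over each player's own moves. The function name `tb_loss` and the trajectory encoding are hypothetical, and a real implementation would use an autodiff framework rather than plain floats.

```python
import math
from typing import List, Tuple

# One trajectory step: (player who moved, probability that player's policy
# assigned to the move actually played). Hypothetical encoding.
Step = Tuple[int, float]


def tb_loss(log_Z: float, steps: List[Step], log_R1: float) -> float:
    """Squared log-residual of the assumed two-player TB constraint."""
    own = sum(math.log(p) for player, p in steps if player == 1)
    opp = sum(math.log(p) for player, p in steps if player == 2)
    residual = log_Z + own - opp - log_R1
    return residual ** 2


# Toy trajectory: player 1 moves, player 2 replies, player 1 finishes.
steps = [(1, 0.5), (2, 0.25), (1, 1.0)]
balanced = tb_loss(math.log(2.0), steps, math.log(4.0))
unbalanced = tb_loss(0.0, steps, math.log(4.0))
```

With these toy numbers the constraint holds exactly when $Z = 2$, so `balanced` is zero up to rounding, while `unbalanced` is strictly positive.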
4. Training Algorithm and Implementation
Training proceeds via off-policy self-play coupled with experience replay. The algorithm alternates between two phases:
- Self-play Episode Generation: A fixed number of complete episodes are played by alternating moves sampled from $\pi_1$ and $\pi_2$, with all trajectories stored in a replay buffer.
- Parameter Updates: For a fixed number of steps, batches of trajectories are sampled from the buffer; the gradients of the summed $\mathcal L_{\mathrm{TB}}$ with respect to the trainable parameters are computed and used to perform gradient descent steps.
This approach enables off-policy updates, facilitating sample efficiency and stabilizing learning due to the existence of a unique minimax equilibrium.
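The alternation between the two phases can be sketched as a plain loop. This is a skeleton under stated assumptions, not the authors' implementation: `play_episode` and `update` are hypothetical callables supplied by the caller, the bounded buffer is a design choice made here, and all parameter names are illustrative.

```python
import random
from collections import deque
from typing import Any, Callable, Deque, List

Trajectory = Any  # whatever play_episode returns; left abstract in this sketch


def train(
    play_episode: Callable[[], Trajectory],
    update: Callable[[List[Trajectory]], None],
    rounds: int,
    episodes_per_round: int,
    updates_per_round: int,
    batch_size: int,
    buffer_size: int = 10_000,
    seed: int = 0,
) -> Deque[Trajectory]:
    """Alternate self-play generation with updates on replayed trajectories."""
    rng = random.Random(seed)
    buffer: Deque[Trajectory] = deque(maxlen=buffer_size)  # oldest entries drop out
    for _ in range(rounds):
        # Phase 1: generate complete self-play episodes and store them.
        for _ in range(episodes_per_round):
            buffer.append(play_episode())
        # Phase 2: update on batches that may come from older policies,
        # which is what makes the updates off-policy.
        for _ in range(updates_per_round):
            k = min(batch_size, len(buffer))
            update(rng.sample(list(buffer), k))
    return buffer
```

In practice `update` would sum the TB loss over the batch and take a gradient step; it is left abstract here so that the loop structure stays visible.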
5. Empirical Evaluation in Connect-4
AFlowNets have been evaluated on the Connect-4 environment, with states defined as board configurations plus the current player. Legal actions are drop-column moves, and the tree structure is memory-augmented to prevent transpositions.
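One way to picture this state representation is sketched below. It assumes a standard 6-by-7 board and takes "memory-augmented" to mean that the move history is stored in the state, so that two move orders reaching the same board count as different states. The class and function names are hypothetical, and win detection is omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple

ROWS, COLS = 6, 7  # standard Connect-4 dimensions


@dataclass(frozen=True)
class State:
    board: Tuple[Tuple[int, ...], ...]  # 0 = empty, 1 or 2 = disc; row 0 is the top
    player: int                         # player to move (1 or 2)
    history: Tuple[int, ...] = ()       # columns played so far


def initial_state() -> State:
    empty = tuple(tuple(0 for _ in range(COLS)) for _ in range(ROWS))
    return State(board=empty, player=1)


def legal_moves(state: State) -> List[int]:
    """A column is playable while its top cell is empty."""
    return [c for c in range(COLS) if state.board[0][c] == 0]


def step(state: State, col: int) -> State:
    """Drop the current player's disc into `col` and pass the turn."""
    if col not in legal_moves(state):
        raise ValueError(f"illegal move: column {col}")
    grid = [list(row) for row in state.board]
    for r in range(ROWS - 1, -1, -1):   # lowest empty cell in the column
        if grid[r][col] == 0:
            grid[r][col] = state.player
            break
    return State(
        board=tuple(tuple(row) for row in grid),
        player=3 - state.player,
        history=state.history + (col,),
    )
```

Because the history is part of the state, every state has exactly one parent, which keeps the state space a tree.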
Metrics include:
- Elo rating (BayesElo) from matches against uniform random agents and AlphaZero (with and without MCTS during test time).
- Optimal move rate: The fraction of positions in which the chosen move matches a minimax solver.
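The optimal move rate can be computed as below. The sketch assumes the solver may report several equally optimal moves for a position, and counts a match if the chosen move is any one of them. The function name and the sample data are hypothetical.

```python
from typing import Sequence, Set


def optimal_move_rate(chosen: Sequence[int], optimal: Sequence[Set[int]]) -> float:
    """Fraction of positions where the chosen move is among the solver's optimal moves."""
    if len(chosen) != len(optimal):
        raise ValueError("one chosen move and one optimal set per position")
    if not chosen:
        raise ValueError("need at least one position")
    hits = sum(1 for move, best in zip(chosen, optimal) if move in best)
    return hits / len(chosen)


# Hypothetical data: chosen columns and solver-optimal columns in four positions.
rate = optimal_move_rate([3, 3, 0, 6], [{3}, {2, 3}, {1}, {6}])
```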
Key reported results after 3 hours on a single RTX 8000 GPU include:
- AFlowNet selects the minimax move in over 80% of positions.
- Tournament Elo (mean ± std, 3 seeds):
| Agent | Elo |
|---|---|
| AFlowNet (variant 1) | 1190.8 ± 64.2 |
| AFlowNet (variant 2) | 1700.1 ± 60.0 |
| AFlowNet (variant 3) | 1835.3 ± 154.9 |
| AlphaZero (no MCTS) | ~700 |
| AlphaZero + MCTS | ~900 |
- Win–draw–loss scores (first-player view, out of 50 games) against AlphaZero + MCTS:
| Agent | vs AFlowNet (variant 1) | vs AFlowNet (variant 2) | vs AFlowNet (variant 3) |
|---|---|---|---|
| AFlowNet (variant 1) | - | 0–0–50 | 5–0–45 |
| AFlowNet (variant 2) | 50–0–0 | - | 20–0–30 |
| AFlowNet (variant 3) | 50–0–0 | 35–0–15 | - |
| AlphaZero+MCTS | 25–0–25 | 53–0–47 | 50–0–50 |
This demonstrates that AFlowNets substantially outperform AlphaZero baselines, especially when acting without MCTS for inference (Jiralerspong et al., 2023).
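To get a feel for what such Elo gaps mean, the standard logistic Elo formula gives an expected score from a rating difference. This is only an interpretation aid: BayesElo fits ratings under its own model, so the numbers below are approximate readings, not values from the paper.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# Illustrative reading of the reported ratings (approximate by construction).
strongest_vs_az_mcts = expected_score(1835.3, 900.0)
```

Under this formula, a gap of several hundred points corresponds to an expected score close to 1 for the higher-rated agent.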
6. Advantages, Limitations, and Extensions
AFlowNets confer several benefits:
- Single-pass policy rollout—eliminating the need for Monte Carlo tree search during training and inference.
- Off-policy self-play with replay buffers, facilitating convergence to a unique equilibrium without cyclic instability.
- Adaptability to both stochastic transitions (via EFlowNet formalism) and adversarial, turn-based interactions.
Limitations include:
- High variance of trajectory-balance objectives for long games and the requirement to store entire episodes.
- Necessity to tune the branch-adjustment factor and related hyperparameters to balance diversity of exploration against exploitation.
- Scalability to very large games, such as chess or Go, may require future advances, e.g., subtrajectory-based trajectory-balance formulations.
Potential extensions—motivated by the generality of the formalism—include multi-player and general-sum games, latent-variable modeling for incomplete-information games, adaptation to continuous action spaces, and hybridization of AFlowNet policies with limited-depth look-ahead or search.
AFlowNets thus import the diversity-seeking, off-policy sampling virtues of GFlowNets into adversarial domains, yielding a principled and effective self-play learning method (Jiralerspong et al., 2023).