Equilibrium Policy Generalization (EPG)
- EPG is a formal learning framework that synthesizes worst-case robust pursuit policies for multi-agent Markov games via reinforcement learning with oracle-guided imitation.
- It integrates graph neural networks with soft KL penalties to balance policy optimization against dynamic programming-derived reference actions for zero-shot generalization.
- Empirical results on synthetic and urban graph environments demonstrate EPG’s significant improvement in capture success, robustness under partial observability, and scalability.
Equilibrium Policy Generalization (EPG) is a formal learning framework and reinforcement learning (RL) instantiation for synthesizing robust pursuit strategies in multi-agent Markov games, particularly graph-based pursuit-evasion games (PEGs). EPG aims to generalize worst-case performance guarantees to previously unseen environments by training policies over ensembles of environments for which optimal oracle equilibria are available. In the context of pursuit-evasion, this method enables the derivation of pursuer policies that achieve robust, zero-shot generalization to new graph topologies, even in the presence of partial observability and asynchronous adversarial evasion (Lu et al., 21 Nov 2025).
1. Formal Optimization Objective and Equilibrium Constraints
EPG constructs a saddle-point learning problem over a distribution $\mathcal{G}$ of environment graphs $G$ and initial states $s_0$. For each $G$, an oracle for the perfect-information equilibrium policy $\mu_G^*$ (typically computed by dynamic programming, DP) is presumed available. The objective is to learn a single parameterized pursuer policy $\pi_\theta$ such that, for any unseen graph $G$, $\pi_\theta$ yields near–worst-case value against the optimal evader $\nu_G^*$. The central optimization is

$$\max_\theta \; \mathbb{E}_{G \sim \mathcal{G},\, s_0}\!\left[\, \min_{\nu}\, V_G^{\pi_\theta,\, \nu}(s_0) \right] \quad \text{s.t.} \quad D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid s)\,\|\,\mu_G^*(\cdot \mid s)\big) \le \epsilon \;\;\; \forall s,$$

where $V_G^{\pi_\theta,\nu}$ is the infinite-horizon discounted value for the induced Markov game on $G$. In practice, the hard Kullback-Leibler (KL) constraint is implemented as a soft penalty:

$$\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{RL}}(\theta) + \lambda\, \mathbb{E}_{G,\, s}\!\left[ D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid s)\,\|\,\mu_G^*(\cdot \mid s)\big) \right],$$

where $\mathcal{L}_{\mathrm{RL}}$ is a standard RL policy-gradient or discrete Soft Actor-Critic (SAC) loss and $\lambda$ controls the trade-off between RL-driven exploitation and oracle-guided imitation (Lu et al., 21 Nov 2025).
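The soft-penalty form can be written down compactly. The following is a minimal sketch in PyTorch, assuming the oracle reference $\mu_G^*$ is available as a per-state action distribution and the RL term is supplied externally (e.g., a discrete SAC actor loss); the function and argument names are illustrative, not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def epg_soft_penalty_loss(policy_logits, oracle_probs, rl_loss, kl_coef=1.0):
    """Soft-KL form of the EPG objective (illustrative sketch).

    policy_logits : (batch, num_actions) logits of the learned pursuer policy pi_theta(.|s)
    oracle_probs  : (batch, num_actions) DP equilibrium reference distribution mu*_G(.|s)
    rl_loss       : scalar RL loss term L_RL(theta), e.g. a discrete SAC actor loss
    kl_coef       : lambda, trading off RL exploitation against oracle imitation
    """
    log_pi = F.log_softmax(policy_logits, dim=-1)
    # KL(pi_theta || mu*_G), computed exactly over the discrete action set
    kl = (log_pi.exp() * (log_pi - torch.log(oracle_probs + 1e-8))).sum(dim=-1).mean()
    return rl_loss + kl_coef * kl
```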
This approach exploits the property that each $\mu_G^*$ is provably worst-case optimal on its graph $G$, so optimizing against these references across many graphs induces an agent that captures structure-invariant principles of robust pursuit. The intended outcome is zero-shot generalization: $\pi_\theta$ achieves reliable performance even on graphs and initializations not seen during training.
2. Graph Neural Network Policy Architecture
EPG employs a homogeneous, decomposable policy for the pursuers, factorized sequentially over the $N_p$ pursuer agents as

$$\pi_\theta(a_p \mid s) \;=\; \prod_{i=1}^{N_p} \pi_\theta\!\big(a_p^{i} \,\big|\, s,\; a_p^{1}, \ldots, a_p^{i-1}\big).$$

At each step, the policy receives as input:
- $\mathrm{Pos}_t$: the set of possible evader positions
- $\mathrm{belief}_t$: a belief distribution over the evader's location
A shared Graph Neural Network (GNN) backbone with masked self-attention layers processes these features into node embeddings. The encoder consists of 6 masked self-attention layers, each leveraging the graph adjacency mask $M_A$:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + M_A\right)V, \qquad (M_A)_{uv} = \begin{cases} 0 & \text{if } (u, v) \in E \text{ or } u = v, \\ -\infty & \text{otherwise.} \end{cases}$$

A query is then formed from the pursuer's current focus node, and an unmasked attention-based pointer network over the neighbor node embeddings produces the per-pursuer action distribution (Lu et al., 21 Nov 2025).
This parameter sharing makes the policy agnostic to graph size, topology, and degree, supporting zero-shot transfer across arbitrary graph families.
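A minimal sketch of the encoder/decoder structure described above, written in PyTorch. The head count, hidden width, and normalization layout are assumptions; only the adjacency-masked self-attention and the attention-based pointer over neighbor embeddings follow the description, and this is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSelfAttentionLayer(nn.Module):
    """One encoder layer whose attention is restricted to graph edges (plus self-loops)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, h, adj):
        # h: (batch, num_nodes, dim); adj: (num_nodes, num_nodes) bool, True where an
        # edge or self-loop exists. MultiheadAttention blocks positions where the mask
        # is True, so the adjacency is negated before being passed as attn_mask.
        out, _ = self.attn(h, h, h, attn_mask=~adj)
        h = self.norm1(h + out)
        return self.norm2(h + self.ff(h))

class NeighborPointerHead(nn.Module):
    """Unmasked attention pointer: a query from the pursuer's focus node scores its neighbors."""
    def __init__(self, dim):
        super().__init__()
        self.q_proj, self.k_proj = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, node_emb, focus_idx, neighbor_idx):
        # node_emb: (num_nodes, dim); focus_idx: int; neighbor_idx: (deg,) LongTensor
        q = self.q_proj(node_emb[focus_idx])            # query from the pursuer's focus node
        k = self.k_proj(node_emb[neighbor_idx])         # keys from neighbor embeddings
        scores = k @ q / node_emb.shape[-1] ** 0.5      # scaled dot-product compatibility
        return F.softmax(scores, dim=-1)                # per-pursuer move distribution
```

Stacking six such encoder layers and sharing all parameters across nodes and pursuers, as the paper describes, is what keeps the policy independent of graph size and degree.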
3. RL Algorithm, Training Losses, and Bellman Updates
The EPG learning process uses discrete Soft Actor-Critic as its backbone. Denoting $Q_\phi$ as the critic and $V_\psi$ as the value network (with target network $V_{\bar\psi}$), the main update steps are:
- Soft Q-update:
$$J_Q(\phi) = \mathbb{E}_{(s,a,r,s')}\!\left[\tfrac{1}{2}\big(Q_\phi(s,a) - (r + \gamma V_{\bar\psi}(s'))\big)^2\right]$$
- Value update:
$$J_V(\psi) = \mathbb{E}_{s}\!\left[\tfrac{1}{2}\big(V_\psi(s) - \mathbb{E}_{a \sim \pi_\theta}[\,Q_\phi(s,a) - \alpha \log \pi_\theta(a \mid s)\,]\big)^2\right]$$
- Policy actor update (w/o guidance):
$$J_\pi(\theta) = \mathbb{E}_{s}\!\left[\mathbb{E}_{a \sim \pi_\theta}[\,\alpha \log \pi_\theta(a \mid s) - Q_\phi(s,a)\,]\right]$$
- Policy actor update (EPG guidance; see the sketch below):
$$J_\pi^{\mathrm{EPG}}(\theta) = J_\pi(\theta) - \lambda\, \mathbb{E}_{s}\big[\log \pi_\theta(a^*_s \mid s)\big],$$
with $a^*_s = \mu_G^*(s)$ the DP reference action.
- Entropy coefficient update:
$$J(\alpha) = \mathbb{E}_{s}\!\left[\mathbb{E}_{a \sim \pi_\theta}[\,-\alpha \log \pi_\theta(a \mid s) - \alpha \bar{\mathcal{H}}\,]\right],$$
where $\bar{\mathcal{H}}$ is the target entropy.
Training batches are generated by uniform sampling over graphs $G \sim \mathcal{G}$, initializations, and transitions, with adversarial evader moves sampled from the DP equilibrium evader policy $\nu_G^*$ (Lu et al., 21 Nov 2025).
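The guided actor update is the only piece that departs from standard discrete SAC. The sketch below shows one way to add the oracle term, assuming the DP reference is provided as an action index per state; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def guided_actor_loss(logits, q_values, ref_actions, alpha, kl_coef):
    """Discrete SAC actor loss plus EPG's oracle-guidance term (illustrative sketch).

    logits      : (batch, num_actions) pursuer policy logits pi_theta(.|s)
    q_values    : (batch, num_actions) soft Q-values Q_phi(s, .)
    ref_actions : (batch,) DP reference actions a*_s = mu*_G(s)
    alpha       : entropy temperature
    kl_coef     : lambda, weight of the oracle-imitation penalty
    """
    log_pi = F.log_softmax(logits, dim=-1)
    pi = log_pi.exp()
    # Exact expectation over the discrete action set: E_{a~pi}[alpha*log pi - Q]
    sac_term = (pi * (alpha * log_pi - q_values)).sum(dim=-1).mean()
    # Guidance toward the deterministic DP reference: -log pi_theta(a*_s | s)
    guidance = F.nll_loss(log_pi, ref_actions)
    return sac_term + kl_coef * guidance
```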
4. Handling Partial Observability via Belief Preservation
To extend EPG into partially observable settings (where pursuers do not always know the evader’s location), EPG is integrated with a belief preservation module embodying the following constructs:
- $\mathrm{Pos}_t$: set of possible evader positions at time $t$. Initialized to the evader's true start, updated deterministically by neighborhood propagation and elimination of observed nodes.
- $\mathrm{belief}_t$: distribution over possible evader locations at time $t$, updated recursively as
$$\mathrm{belief}_{t+1}(s_e') \;\propto\; \sum_{s_e \in \mathrm{Pos}_t} \mathrm{belief}_t(s_e)\, P_e(s_e' \mid s_e), \qquad s_e' \in \mathrm{Pos}_{t+1},$$
where the evader transition model $P_e$ defaults to uniform unless further information is available. Both updates are sketched in code below.
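A minimal sketch of both updates on an adjacency-list graph. The uniform-move default follows the description above; whether the evader may stay in place and the data-structure choices are assumptions.

```python
from collections import defaultdict

def update_possible_positions(pos_set, neighbors, observed_empty):
    """Deterministic Pos update: propagate one step along edges, drop observed-empty nodes."""
    propagated = set()
    for v in pos_set:
        propagated.add(v)                   # the evader may stay put (assumption)
        propagated.update(neighbors[v])     # or move to any adjacent node
    return propagated - observed_empty

def update_belief(belief, neighbors, observed_empty):
    """Recursive belief update with the default uniform evader-move model."""
    new_belief = defaultdict(float)
    for v, p in belief.items():
        targets = [v] + list(neighbors[v])  # uniform over staying or moving (default model)
        for u in targets:
            new_belief[u] += p / len(targets)
    for u in observed_empty:                # eliminate nodes the pursuers observe to be empty
        new_belief.pop(u, None)
    total = sum(new_belief.values())
    return {u: p / total for u, p in new_belief.items()} if total > 0 else {}
```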
Two pursuit policies are then constructed:
- Position-worst-case:
$$\mu(s_p, \mathrm{Pos}) = \arg\min_{n_p \in \mathrm{Neigh}(s_p)} \; \max_{n_e \in \mathrm{Neigh}(\mathrm{Pos})} D(n_p, n_e)$$
- Belief-averaged:
$$\mu(s_p, \mathrm{belief}) = \arg\min_{n_p \in \mathrm{Neigh}(s_p)} \frac{\sum_{s_e} \mathrm{belief}(s_e)\, \max_{n_e \in \mathrm{Neigh}(s_e)} D(n_p, n_e)}{\sum_{s_e} \mathrm{belief}(s_e)}$$
When the possible-position set is a singleton, both policies coincide with the perfect-information DP policy (Lu et al., 21 Nov 2025).
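The two decision rules above reduce to a few lines once the DP table is available. The sketch below treats $D$ as a precomputed lookup `dist[n_p][n_e]` and uses closed neighborhoods (a node counts as its own neighbor); both are assumptions about details the description leaves implicit.

```python
def position_worst_case_move(pursuer_node, pos_set, neighbors, dist):
    """argmin over pursuer moves of the worst-case D over the evader's reachable set."""
    def worst_case(n_p):
        return max(dist[n_p][n_e]
                   for s_e in pos_set
                   for n_e in [s_e] + list(neighbors[s_e]))   # closed neighborhood (assumption)
    return min(neighbors[pursuer_node], key=worst_case)

def belief_averaged_move(pursuer_node, belief, neighbors, dist):
    """argmin over pursuer moves of the belief-weighted worst-case D."""
    total = sum(belief.values())            # assumes a non-empty, positive belief
    def avg_worst_case(n_p):
        return sum(p * max(dist[n_p][n_e] for n_e in [s_e] + list(neighbors[s_e]))
                   for s_e, p in belief.items()) / total
    return min(neighbors[pursuer_node], key=avg_worst_case)
```

When `pos_set` (equivalently, the support of `belief`) is a single node, both functions return the same move, matching the perfect-information DP policy as stated above.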
5. Zero-Shot Experimental Protocols and Robustness Metrics
EPG’s effectiveness is assessed under an experimental protocol focusing on generalization and worst-case robustness:
- Training environments: 150 synthetic random graphs (grids, dungeons) and 150 urban subgraphs extracted from Google Maps.
- Testing (zero-shot): 10 previously unseen graphs (e.g., grids, Scotland-Yard, Google-Maps Downtown, notable landmarks).
- Opponents: four evader behaviors (static, DP synchronous, DP asynchronous, and a best-responding asynchronous evader trained against the learned pursuer policy $\pi_\theta$).
- Evaluation metric: Success rate, defined as capture within 128 steps averaged over 500 random initializations.
- Baselines: PSRO (Policy Space Response Oracles) trained directly on the test graphs, and extended DP pursuer baselines leveraging Pos and belief modules.
Empirical results demonstrate that EPG-trained GNN policies attain capture success rates upward of $50\%$ under all four evader types, consistently outperforming policies trained directly on the test graphs with PSRO. Ablations show that belief averaging outperforms simple position-set tracking, that larger observation radii improve performance monotonically, and that more accurate belief propagation further improves robustness. Scalability is also substantiated: inference on the test graphs completes in under $0.01$ seconds, while naive DP recomputation exceeds $60$ seconds (Lu et al., 21 Nov 2025).
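For concreteness, the success-rate metric amounts to the following evaluation loop; the environment and policy interfaces (`env_factory`, `policy`, `evader`) are hypothetical stand-ins, since the paper does not prescribe an API.

```python
import random

def success_rate(policy, evader, env_factory, num_episodes=500, horizon=128):
    """Fraction of episodes in which the pursuers capture the evader within `horizon` steps."""
    captures = 0
    for _ in range(num_episodes):
        env = env_factory(seed=random.randrange(2**31))   # fresh random initialization
        state = env.reset()
        for _ in range(horizon):
            state, captured = env.step(policy(state), evader(state))
            if captured:
                captures += 1
                break
    return captures / num_episodes
```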
6. Significance, Limitations, and Extensions
EPG, when combined with graph neural policies and belief preservation, constitutes the first framework for computing worst-case robust, real-time pursuer policies for pursuit-evasion games that generalize zero-shot to new graphs and operate under both partial observability and asynchronous adversarial evasion.
A plausible implication is that the EPG paradigm, instantiated with efficient RL and GNN architectures, resolves key scalability bottlenecks inherent in classical DP approaches, while simultaneously preserving robustness properties previously unattainable by neural or meta-RL methods. However, the practical deployment of EPG relies on the availability of equilibrium oracles for the full training suite, and its performance bounds are inherited from the underlying DP and belief update fidelity. Extensions of the framework to alternative multi-agent competitive domains, or to more complex observation models, represent avenues for further investigation (Lu et al., 21 Nov 2025).