Equilibrium Policy Generalization (EPG)
- EPG is a formal learning framework that synthesizes worst-case robust pursuit policies for multi-agent Markov games via reinforcement learning with oracle-guided imitation.
- It integrates graph neural networks with soft KL penalties to balance policy optimization against dynamic programming-derived reference actions for zero-shot generalization.
- Empirical results on synthetic and urban graph environments demonstrate EPG’s significant improvement in capture success, robustness under partial observability, and scalability.
Equilibrium Policy Generalization (EPG) is a formal learning framework and reinforcement learning (RL) instantiation for synthesizing robust pursuit strategies in multi-agent Markov games, particularly graph-based pursuit-evasion games (PEGs). EPG aims to generalize worst-case performance guarantees to previously unseen environments by training policies over ensembles of environments for which optimal oracle equilibria are available. In the context of pursuit-evasion, this method enables the derivation of pursuer policies that achieve robust, zero-shot generalization to new graph topologies, even in the presence of partial observability and asynchronous adversarial evasion (Lu et al., 21 Nov 2025).
1. Formal Optimization Objective and Equilibrium Constraints
EPG constructs a saddle-point learning problem over a distribution $\mathcal{G}$ of environment graphs $G$ and initial states $s_0$. For each $G$, an oracle for the perfect-information equilibrium policy $\mu_G^*$ (typically computed by dynamic programming, DP) is presumed available. The objective is to learn a single parameterized pursuer policy $\pi_\theta$ such that, for any unseen graph $G$, $\pi_\theta$ yields near–worst-case value against the optimal evader $\nu_G^*$. The central optimization is

$$\max_\theta \; \mathbb{E}_{G \sim \mathcal{G},\, s_0}\!\left[\, \min_{\nu}\, V_G^{\pi_\theta,\, \nu}(s_0) \right] \quad \text{s.t.} \quad D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid s)\,\|\,\mu_G^*(\cdot \mid s)\big) \le \epsilon \;\;\; \forall s,$$

where $V_G^{\pi_\theta,\nu}$ is the infinite-horizon discounted value for the induced Markov game on $G$. In practice, the hard Kullback-Leibler (KL) constraint is implemented as a soft penalty:

$$\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{RL}}(\theta) + \lambda\, \mathbb{E}_{G,\, s}\!\left[ D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid s)\,\|\,\mu_G^*(\cdot \mid s)\big) \right],$$

where $\mathcal{L}_{\mathrm{RL}}$ is a standard RL policy-gradient or discrete Soft Actor-Critic (SAC) loss and $\lambda$ controls the trade-off between RL-driven exploitation and oracle-guided imitation (Lu et al., 21 Nov 2025).
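The soft-penalty form can be written down compactly. The following is a minimal sketch in PyTorch, assuming the oracle reference $\mu_G^*$ is available as a per-state action distribution and the RL term is supplied externally (e.g., a discrete SAC actor loss); the function and argument names are illustrative, not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def epg_soft_penalty_loss(policy_logits, oracle_probs, rl_loss, kl_coef=1.0):
    """Soft-KL form of the EPG objective (illustrative sketch).

    policy_logits : (batch, num_actions) logits of the learned pursuer policy pi_theta(.|s)
    oracle_probs  : (batch, num_actions) DP equilibrium reference distribution mu*_G(.|s)
    rl_loss       : scalar RL loss term L_RL(theta), e.g. a discrete SAC actor loss
    kl_coef       : lambda, trading off RL exploitation against oracle imitation
    """
    log_pi = F.log_softmax(policy_logits, dim=-1)
    # KL(pi_theta || mu*_G), computed exactly over the discrete action set
    kl = (log_pi.exp() * (log_pi - torch.log(oracle_probs + 1e-8))).sum(dim=-1).mean()
    return rl_loss + kl_coef * kl
```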
This approach exploits the property that each $\mu_G^*$ is provably worst-case optimal on its graph $G$, so optimizing against these references across many graphs induces an agent that captures structure-invariant principles of robust pursuit. The intended outcome is zero-shot generalization: $\pi_\theta$ achieves reliable performance even on graphs and initializations not seen during training.
2. Graph Neural Network Policy Architecture
EPG employs a homogeneous, decomposable policy for the pursuers, factorized sequentially over the $N_p$ pursuer agents as

$$\pi_\theta(a_p \mid s) \;=\; \prod_{i=1}^{N_p} \pi_\theta\!\big(a_p^{i} \,\big|\, s,\; a_p^{1}, \ldots, a_p^{i-1}\big).$$

At each step, the policy receives as input:
- $\mathrm{Pos}_t$: the set of possible evader positions
- $\mathrm{belief}_t$: a belief distribution over the evader's location
A shared Graph Neural Network (GNN) backbone with masked self-attention layers processes these features into node embeddings. The encoder consists of 6 masked self-attention layers, each leveraging the graph adjacency mask $M_A$:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}} + M_A\right)V, \qquad (M_A)_{uv} = \begin{cases} 0 & \text{if } (u, v) \in E \text{ or } u = v, \\ -\infty & \text{otherwise.} \end{cases}$$

A query is then formed from the pursuer's current focus node, and an unmasked attention-based pointer network over the neighbor node embeddings produces the per-pursuer action distribution (Lu et al., 21 Nov 2025).
This parameter sharing makes the policy agnostic to graph size, topology, and degree, supporting zero-shot transfer across arbitrary graph families.
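A minimal sketch of the encoder/decoder structure described above, written in PyTorch. The head count, hidden width, and normalization layout are assumptions; only the adjacency-masked self-attention and the attention-based pointer over neighbor embeddings follow the description, and this is not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSelfAttentionLayer(nn.Module):
    """One encoder layer whose attention is restricted to graph edges (plus self-loops)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, h, adj):
        # h: (batch, num_nodes, dim); adj: (num_nodes, num_nodes) bool, True where an
        # edge or self-loop exists. MultiheadAttention blocks positions where the mask
        # is True, so the adjacency is negated before being passed as attn_mask.
        out, _ = self.attn(h, h, h, attn_mask=~adj)
        h = self.norm1(h + out)
        return self.norm2(h + self.ff(h))

class NeighborPointerHead(nn.Module):
    """Unmasked attention pointer: a query from the pursuer's focus node scores its neighbors."""
    def __init__(self, dim):
        super().__init__()
        self.q_proj, self.k_proj = nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, node_emb, focus_idx, neighbor_idx):
        # node_emb: (num_nodes, dim); focus_idx: int; neighbor_idx: (deg,) LongTensor
        q = self.q_proj(node_emb[focus_idx])            # query from the pursuer's focus node
        k = self.k_proj(node_emb[neighbor_idx])         # keys from neighbor embeddings
        scores = k @ q / node_emb.shape[-1] ** 0.5      # scaled dot-product compatibility
        return F.softmax(scores, dim=-1)                # per-pursuer move distribution
```

Stacking six such encoder layers and sharing all parameters across nodes and pursuers, as the paper describes, is what keeps the policy independent of graph size and degree.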
3. RL Algorithm, Training Losses, and Bellman Updates
The EPG learning process uses discrete Soft Actor-Critic as its backbone. Denoting $Q_\phi$ as the critic and $V_\psi$ as the value network (with target network $V_{\bar\psi}$), the main update steps are:
- Soft Q-update:
$$J_Q(\phi) = \mathbb{E}_{(s,a,r,s')}\!\left[\tfrac{1}{2}\big(Q_\phi(s,a) - (r + \gamma V_{\bar\psi}(s'))\big)^2\right]$$
- Value update:
$$J_V(\psi) = \mathbb{E}_{s}\!\left[\tfrac{1}{2}\big(V_\psi(s) - \mathbb{E}_{a \sim \pi_\theta}[\,Q_\phi(s,a) - \alpha \log \pi_\theta(a \mid s)\,]\big)^2\right]$$
- Policy actor update (w/o guidance):
$$J_\pi(\theta) = \mathbb{E}_{s}\!\left[\mathbb{E}_{a \sim \pi_\theta}[\,\alpha \log \pi_\theta(a \mid s) - Q_\phi(s,a)\,]\right]$$
- Policy actor update (EPG guidance; see the sketch below):
$$J_\pi^{\mathrm{EPG}}(\theta) = J_\pi(\theta) - \lambda\, \mathbb{E}_{s}\big[\log \pi_\theta(a^*_s \mid s)\big],$$
with $a^*_s = \mu_G^*(s)$ the DP reference action.
- Entropy coefficient update:
$$J(\alpha) = \mathbb{E}_{s}\!\left[\mathbb{E}_{a \sim \pi_\theta}[\,-\alpha \log \pi_\theta(a \mid s) - \alpha \bar{\mathcal{H}}\,]\right],$$
where $\bar{\mathcal{H}}$ is the target entropy.
Training batches are generated by uniform sampling over graphs $G \sim \mathcal{G}$, initializations, and transitions, with adversarial evader moves sampled from the DP equilibrium evader policy $\nu_G^*$ (Lu et al., 21 Nov 2025).
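The guided actor update is the only piece that departs from standard discrete SAC. The sketch below shows one way to add the oracle term, assuming the DP reference is provided as an action index per state; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def guided_actor_loss(logits, q_values, ref_actions, alpha, kl_coef):
    """Discrete SAC actor loss plus EPG's oracle-guidance term (illustrative sketch).

    logits      : (batch, num_actions) pursuer policy logits pi_theta(.|s)
    q_values    : (batch, num_actions) soft Q-values Q_phi(s, .)
    ref_actions : (batch,) DP reference actions a*_s = mu*_G(s)
    alpha       : entropy temperature
    kl_coef     : lambda, weight of the oracle-imitation penalty
    """
    log_pi = F.log_softmax(logits, dim=-1)
    pi = log_pi.exp()
    # Exact expectation over the discrete action set: E_{a~pi}[alpha*log pi - Q]
    sac_term = (pi * (alpha * log_pi - q_values)).sum(dim=-1).mean()
    # Guidance toward the deterministic DP reference: -log pi_theta(a*_s | s)
    guidance = F.nll_loss(log_pi, ref_actions)
    return sac_term + kl_coef * guidance
```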
4. Handling Partial Observability via Belief Preservation
To extend EPG into partially observable settings (where pursuers do not always know the evader’s location), EPG is integrated with a belief preservation module embodying the following constructs:
- $\mathrm{Pos}_t$: set of possible evader positions at time $t$. Initialized to the evader's true start, updated deterministically by neighborhood propagation and elimination of observed nodes.
- $\mathrm{belief}_t$: distribution over possible evader locations at time $t$, updated recursively as
$$\mathrm{belief}_{t+1}(s_e') \;\propto\; \sum_{s_e \in \mathrm{Pos}_t} \mathrm{belief}_t(s_e)\, P_e(s_e' \mid s_e), \qquad s_e' \in \mathrm{Pos}_{t+1},$$
where the evader transition model $P_e$ defaults to uniform unless further information is available. Both updates are sketched in code below.
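A minimal sketch of both updates on an adjacency-list graph. The uniform-move default follows the description above; whether the evader may stay in place and the data-structure choices are assumptions.

```python
from collections import defaultdict

def update_possible_positions(pos_set, neighbors, observed_empty):
    """Deterministic Pos update: propagate one step along edges, drop observed-empty nodes."""
    propagated = set()
    for v in pos_set:
        propagated.add(v)                   # the evader may stay put (assumption)
        propagated.update(neighbors[v])     # or move to any adjacent node
    return propagated - observed_empty

def update_belief(belief, neighbors, observed_empty):
    """Recursive belief update with the default uniform evader-move model."""
    new_belief = defaultdict(float)
    for v, p in belief.items():
        targets = [v] + list(neighbors[v])  # uniform over staying or moving (default model)
        for u in targets:
            new_belief[u] += p / len(targets)
    for u in observed_empty:                # eliminate nodes the pursuers observe to be empty
        new_belief.pop(u, None)
    total = sum(new_belief.values())
    return {u: p / total for u, p in new_belief.items()} if total > 0 else {}
```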
Two pursuit policies are then constructed:
- Position-worst-case:
$$\mu(s_p, \mathrm{Pos}) = \arg\min_{n_p \in \mathrm{Neigh}(s_p)} \; \max_{n_e \in \mathrm{Neigh}(\mathrm{Pos})} D(n_p, n_e)$$
- Belief-averaged:
$$\mu(s_p, \mathrm{belief}) = \arg\min_{n_p \in \mathrm{Neigh}(s_p)} \frac{\sum_{s_e} \mathrm{belief}(s_e)\, \max_{n_e \in \mathrm{Neigh}(s_e)} D(n_p, n_e)}{\sum_{s_e} \mathrm{belief}(s_e)}$$
When the possible-position set is a singleton, both policies coincide with the perfect-information DP policy (Lu et al., 21 Nov 2025).
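The two decision rules above reduce to a few lines once the DP table is available. The sketch below treats $D$ as a precomputed lookup `dist[n_p][n_e]` and uses closed neighborhoods (a node counts as its own neighbor); both are assumptions about details the description leaves implicit.

```python
def position_worst_case_move(pursuer_node, pos_set, neighbors, dist):
    """argmin over pursuer moves of the worst-case D over the evader's reachable set."""
    def worst_case(n_p):
        return max(dist[n_p][n_e]
                   for s_e in pos_set
                   for n_e in [s_e] + list(neighbors[s_e]))   # closed neighborhood (assumption)
    return min(neighbors[pursuer_node], key=worst_case)

def belief_averaged_move(pursuer_node, belief, neighbors, dist):
    """argmin over pursuer moves of the belief-weighted worst-case D."""
    total = sum(belief.values())            # assumes a non-empty, positive belief
    def avg_worst_case(n_p):
        return sum(p * max(dist[n_p][n_e] for n_e in [s_e] + list(neighbors[s_e]))
                   for s_e, p in belief.items()) / total
    return min(neighbors[pursuer_node], key=avg_worst_case)
```

When `pos_set` (equivalently, the support of `belief`) is a single node, both functions return the same move, matching the perfect-information DP policy as stated above.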
5. Zero-Shot Experimental Protocols and Robustness Metrics
EPG’s effectiveness is assessed under an experimental protocol focusing on generalization and worst-case robustness:
- Training environments: 150 synthetic random graphs (grids, dungeons) and 150 urban subgraphs extracted from Google Maps.
- Testing (zero-shot): 10 previously unseen graphs (e.g., grids, Scotland-Yard, Google-Maps Downtown, notable landmarks).
- Opponents: four evader behaviors (static, DP synchronous, DP asynchronous, and a best-responding asynchronous evader trained against the learned pursuer policy $\pi_\theta$).
- Evaluation metric: Success rate, defined as capture within 128 steps averaged over 500 random initializations.
- Baselines: PSRO (Policy Space Response Oracles) trained directly on the test graphs, and extended DP pursuer baselines leveraging Pos and belief modules.
Empirical results demonstrate that EPG-trained GNN policies attain capture success rates upward of $50\%$ under all four evader types, consistently outperforming policies trained directly on the test graphs with PSRO. Ablations show that belief averaging outperforms simple position-set tracking, that larger observation radii improve performance monotonically, and that more accurate belief propagation further improves robustness. Scalability is also substantiated: inference on the test graphs completes in under $0.01$ seconds, while naive DP recomputation exceeds $60$ seconds (Lu et al., 21 Nov 2025).
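For concreteness, the success-rate metric amounts to the following evaluation loop; the environment and policy interfaces (`env_factory`, `policy`, `evader`) are hypothetical stand-ins, since the paper does not prescribe an API.

```python
import random

def success_rate(policy, evader, env_factory, num_episodes=500, horizon=128):
    """Fraction of episodes in which the pursuers capture the evader within `horizon` steps."""
    captures = 0
    for _ in range(num_episodes):
        env = env_factory(seed=random.randrange(2**31))   # fresh random initialization
        state = env.reset()
        for _ in range(horizon):
            state, captured = env.step(policy(state), evader(state))
            if captured:
                captures += 1
                break
    return captures / num_episodes
```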
6. Significance, Limitations, and Extensions
EPG, when combined with graph neural policies and belief preservation, constitutes the first framework for computing worst-case robust, real-time pursuer policies for pursuit-evasion games that generalize zero-shot to new graphs and operate under both partial observability and asynchronous adversarial evasion.
A plausible implication is that the EPG paradigm, instantiated with efficient RL and GNN architectures, resolves key scalability bottlenecks inherent in classical DP approaches, while simultaneously preserving robustness properties previously unattainable by neural or meta-RL methods. However, the practical deployment of EPG relies on the availability of equilibrium oracles for the full training suite, and its performance bounds are inherited from the underlying DP and belief update fidelity. Extensions of the framework to alternative multi-agent competitive domains, or to more complex observation models, represent avenues for further investigation (Lu et al., 21 Nov 2025).