Equilibrium Policy Generalization (EPG)

Updated 28 November 2025
  • EPG is a formal learning framework that synthesizes worst-case robust pursuit policies for multi-agent Markov games via reinforcement learning with oracle-guided imitation.
  • It integrates graph neural networks with soft KL penalties to balance policy optimization against dynamic programming-derived reference actions for zero-shot generalization.
  • Empirical results on synthetic and urban graph environments demonstrate EPG’s significant improvement in capture success, robustness under partial observability, and scalability.

Equilibrium Policy Generalization (EPG) is a formal learning framework and reinforcement learning (RL) instantiation for synthesizing robust pursuit strategies in multi-agent Markov games, particularly graph-based pursuit-evasion games (PEGs). EPG aims to generalize worst-case performance guarantees to previously unseen environments by training policies over ensembles of environments for which optimal oracle equilibria are available. In the context of pursuit-evasion, this method enables the derivation of pursuer policies that achieve robust, zero-shot generalization to new graph topologies, even in the presence of partial observability and asynchronous adversarial evasion (Lu et al., 21 Nov 2025).

1. Formal Optimization Objective and Equilibrium Constraints

EPG constructs a saddle-point learning problem over a distribution $\mathcal D$ of environment graphs $G_i$ and initial states $s_0$. For each $G_i$, oracles for perfect-information equilibrium policies $(\mu_i^*, \nu_i^*)$, typically computed by dynamic programming (DP), are presumed available. The objective is to learn a single parameterized pursuer policy $\pi_\theta$ such that, for any unseen graph $G_j \notin \mathcal{G}$, $\pi_\theta$ yields near-worst-case value against the optimal evader $\nu_j^*$. The central optimization is

$$\max_{\pi\in\Pi}\;\min_{G_i\sim\mathcal D}\; V^{\pi,\nu_i^*}_{G_i}(s_0) \quad\text{s.t.}\quad \forall s,\;\mathrm{KL}\bigl(\mu_i^*(s)\,\|\,\pi(s)\bigr)\le\varepsilon,$$

where $V^{\pi,\nu}_G(s_0)$ is the infinite-horizon discounted value of the induced Markov game on $G$. In practice, the hard Kullback-Leibler (KL) constraint is implemented as a soft penalty,

$$\mathcal L(\theta\mid s) = J_{\pi}(\theta\mid s) + \beta\,\mathrm{KL}\bigl(\mu_i^*(s)\,\|\,\pi_\theta(s)\bigr),$$

where $J_\pi$ is a standard RL policy-gradient or discrete Soft Actor-Critic (SAC) loss and $\beta>0$ controls the trade-off between RL-driven exploitation and oracle-guided imitation (Lu et al., 21 Nov 2025).
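A minimal sketch of this soft-penalty relaxation, assuming the DP oracle $\mu_i^*(s)$ and the learned policy $\pi_\theta(\cdot\mid s)$ are available as discrete distributions over actions; all function and variable names below are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def epg_penalized_loss(rl_loss, oracle_probs, policy_logits, beta=1.0):
    """Soft relaxation of the KL constraint: L(theta|s) = J_pi + beta * KL(mu* || pi_theta).

    rl_loss       -- scalar RL objective J_pi(theta|s), e.g. a SAC actor loss
    oracle_probs  -- (B, A) DP-oracle action distribution mu_i*(s) per sampled state
    policy_logits -- (B, A) unnormalized logits of pi_theta(.|s)
    beta          -- penalty weight trading RL-driven exploitation against oracle imitation
    """
    log_pi = F.log_softmax(policy_logits, dim=-1)  # log pi_theta(a|s)
    # KL(mu* || pi_theta) = sum_a mu*(a) * [log mu*(a) - log pi_theta(a)]
    kl = (oracle_probs * (torch.log(oracle_probs.clamp_min(1e-12)) - log_pi)).sum(-1)
    return rl_loss + beta * kl.mean()
```

When $\mu_i^*(s)$ is a deterministic (one-hot) DP action, the KL term reduces to $-\log\pi_\theta(a^*\mid s)$, which is the form that appears in the guided actor update of Section 3.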

This approach exploits the property that each $\nu_i^*$ is provably worst-case optimal on $G_i$, so optimizing $\pi_\theta$ against these oracles across many graphs induces an agent that captures structure-invariant principles of robust pursuit. The intended outcome is zero-shot generalization: $\pi_\theta$ achieves reliable performance even on graphs and initializations not seen during training.

2. Graph Neural Network Policy Architecture

EPG employs a homogeneous, decomposable policy for $m$ pursuers, factorized sequentially as

$$\pi(a_1,\dots,a_m \mid s) = \prod_{\ell=1}^{m} \pi(a_\ell \mid s, a_1,\dots,a_{\ell-1}).$$

At each step, the policy receives as input (a feature-construction sketch follows the list):

  • the shortest-path distances of all nodes to each pursuer, one vector in $\mathbb{R}^n$ per pursuer
  • $\mathrm{Pos}$: the set of possible evader positions
  • $\mathrm{belief}$: a belief distribution over the evader's location
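For illustration, the per-node features could be assembled as below, with plain breadth-first search standing in for whatever distance computation the authors use; the helper names and feature layout are assumptions, not the paper's interface:

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from `source` to every node of an unweighted graph.

    adj    -- dict: node -> iterable of neighbour nodes
    source -- starting node (a pursuer position)
    """
    dist = {v: float("inf") for v in adj}   # unreachable nodes keep distance inf
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if dist[v] == float("inf"):
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def node_features(adj, pursuer_positions, possible_set, belief):
    """Per-node feature rows: [distance to each pursuer, membership in Pos, belief mass]."""
    dists = [bfs_distances(adj, p) for p in pursuer_positions]
    return [
        [d[v] for d in dists] + [float(v in possible_set), belief.get(v, 0.0)]
        for v in adj
    ]
```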

A shared Graph Neural Network (GNN) backbone with masked self-attention layers processes these features into node embeddings. The encoder consists of 6 masked self-attention layers, each leveraging the graph adjacency mask $M$:

$$u_{ij} = \frac{q_i^\top k_j}{\sqrt{d}}, \qquad w_{ij} = \frac{\exp(u_{ij})}{\sum_{j'} \exp(u_{ij'})}, \qquad h'_i = \sum_{j=1}^{n} \bigl(w_{ij}\, M_{ij}\bigr)\, v_j.$$

A query is formed from the pursuer's current focus node, followed by an unmasked attention-based pointer network over neighbor node embeddings to produce $\pi(a_\ell \mid \cdot)$ (Lu et al., 21 Nov 2025).
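A minimal single-head version of the masked attention step above, applying the adjacency mask multiplicatively after the softmax exactly as in the displayed equation; learned query/key/value projections, multi-head structure, and the 6-layer stacking are omitted for brevity, so this is a sketch rather than the paper's encoder:

```python
import torch

def masked_self_attention(x, adj_mask):
    """One single-head masked self-attention step:
    u_ij = q_i^T k_j / sqrt(d),  w_ij = softmax_j(u_ij),  h'_i = sum_j (w_ij * M_ij) v_j.

    x        -- (n, d) node embeddings
    adj_mask -- (n, n) graph adjacency mask M
    """
    n, d = x.shape
    # In practice q, k, v come from learned linear maps; identity maps keep the sketch short.
    q, k, v = x, x, x
    u = q @ k.t() / d ** 0.5          # (n, n) attention scores
    w = torch.softmax(u, dim=-1)      # row-normalized weights
    return (w * adj_mask) @ v         # adjacency mask applied as in the formula above
```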

This parameter sharing makes the policy agnostic to graph size, topology, and degree, supporting zero-shot transfer across arbitrary graph families.

3. RL Algorithm, Training Losses, and Bellman Updates

The EPG learning process uses discrete Soft Actor-Critic as its backbone. Denoting $Q_\phi(s,a)$ as the critic and $V_\psi(s)$ as the value network, the main update steps are (a combined sketch in code follows the list):

  • Soft Q-update:

$$J_Q(\phi) = \mathbb{E}\left[\tfrac{1}{2}\bigl(Q_\phi(s,a) - (r + \gamma\, V_\psi(s'))\bigr)^2\right]$$

  • Value update:

$$V_\psi(s) \leftarrow \mathbb{E}_{a\sim\pi_\theta(s)}\bigl[Q_\phi(s,a) - \alpha\log\pi_\theta(s,a)\bigr]$$

  • Policy actor update (without EPG guidance):

$$J_\pi(\theta) = \mathbb{E}_{s,a}\bigl[\alpha\log\pi_\theta(s,a) - Q_\phi(s,a)\bigr]$$

  • Policy actor update (with EPG guidance):

$$\mathcal L(\theta\mid s) = J_\pi(\theta\mid s) + \beta\,\mathrm{KL}\bigl(\mu^*(s)\,\|\,\pi_\theta(s)\bigr) = J_\pi(\theta\mid s) - \beta\log\pi_\theta(s, a^*)$$

with $a^* = \mu^*(s)$ the DP reference action.

  • Entropy coefficient update:

$$J(\alpha) = \mathbb{E}_{s,a}\bigl[-\alpha\bigl(\log\pi_\theta(s,a) + \bar H\bigr)\bigr]$$
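Taken together, these updates can be composed as in the PyTorch-style sketch below, which assumes the networks' outputs are already available as the tensors named in the docstring; the tensor names are illustrative and this is a composition of the displayed losses, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def discrete_sac_epg_losses(q_sa, v_next, reward, policy_logits, q_all,
                            oracle_action, alpha, beta, gamma=0.99, target_entropy=0.5):
    """q_sa          -- (B,) Q_phi(s, a) for the taken actions
       v_next        -- (B,) V_psi(s') from the (target) value network
       reward        -- (B,) immediate rewards
       policy_logits -- (B, A) logits of pi_theta(.|s)
       q_all         -- (B, A) Q_phi(s, .) over all actions
       oracle_action -- (B,) long tensor of DP reference actions a* = mu*(s)
       alpha         -- learnable scalar tensor (in practice parameterized via log alpha)
    """
    log_pi = F.log_softmax(policy_logits, dim=-1)
    pi = log_pi.exp()

    # Soft Q-update toward the target r + gamma * V_psi(s')
    q_loss = 0.5 * (q_sa - (reward + gamma * v_next.detach())).pow(2).mean()

    # Regression target for V_psi: E_{a~pi}[Q(s,a) - alpha * log pi] (exact in the discrete case)
    v_target = (pi * (q_all - alpha * log_pi)).sum(-1).detach()

    # Actor loss J_pi plus EPG guidance term  -beta * log pi_theta(a* | s)
    actor_loss = (pi * (alpha * log_pi - q_all)).sum(-1).mean()
    guidance = -beta * log_pi.gather(-1, oracle_action.unsqueeze(-1)).squeeze(-1).mean()

    # Entropy-coefficient loss J(alpha) = E[-alpha (log pi + H_bar)]
    entropy = -(pi * log_pi).sum(-1)
    alpha_loss = (alpha * (entropy - target_entropy)).mean()

    return q_loss, v_target, actor_loss + guidance, alpha_loss
```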

Training batches are generated by uniform sampling over graphs $G_i$, initializations, and transitions, with adversarial evader moves sampled from $\nu_i^*$ (Lu et al., 21 Nov 2025).

4. Handling Partial Observability via Belief Preservation

To extend EPG to partially observable settings (where pursuers do not always know the evader's location), EPG is integrated with a belief preservation module embodying the following constructs (an update sketch follows the list):

  • $\mathrm{Pos}_t \subseteq V$: the set of possible evader positions at time $t$, initialized to the evader's true start and updated deterministically by neighborhood propagation and elimination of observed nodes.
  • $\mathrm{belief}_t : V \to [0,1]$: a distribution over possible evader locations at time $t$, updated recursively as $\mathrm{belief}_{t+1}(s_e) = \sum_{v \in \mathrm{Neighbor}(s_e)} \nu(v, s_e)\,\mathrm{belief}_t(v)$, where $\nu$ defaults to uniform unless further information is available.
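An illustrative update of these two constructs, assuming an unweighted adjacency dict, a default uniform evader motion model over neighbours (whether the evader may also wait in place is not specified here), and renormalization after eliminating observed-empty nodes; the helper names are placeholders:

```python
def update_possible_set(adj, possible, observed_empty):
    """Pos_{t+1}: propagate the possible-position set along graph edges,
    then eliminate nodes the pursuers have just observed to be empty."""
    propagated = set()
    for v in possible:
        propagated.update(adj[v])
    return propagated - set(observed_empty)

def update_belief(adj, belief, observed_empty):
    """belief_{t+1}(s_e) = sum_{v in Neighbor(s_e)} nu(v, s_e) * belief_t(v),
    with nu taken as uniform over each node's neighbours by default."""
    new_belief = {}
    for v, mass in belief.items():
        neighbours = list(adj[v])
        if not neighbours:
            continue
        for s_e in neighbours:
            new_belief[s_e] = new_belief.get(s_e, 0.0) + mass / len(neighbours)
    for v in observed_empty:               # zero out nodes seen to be empty
        new_belief.pop(v, None)
    total = sum(new_belief.values()) or 1.0
    return {v: m / total for v, m in new_belief.items()}
```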

Two pursuit policies are then constructed (both are sketched in code after the list):

  • Position-worst-case:

$$\mu(s_p, \mathrm{Pos}) = \arg\min_{n_p \in \mathrm{Neigh}(s_p)}\; \max_{n_e \in \mathrm{Neigh}(\mathrm{Pos})} D(n_p, n_e)$$

  • Belief-averaged:

$$\mu(s_p, \mathrm{belief}) = \arg\min_{n_p}\; \frac{\sum_{s_e} \mathrm{belief}(s_e)\, \max_{n_e \in \mathrm{Neigh}(s_e)} D(n_p, n_e)}{\sum_{s_e} \mathrm{belief}(s_e)}$$
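A direct sketch of both rules, assuming an all-pairs shortest-path lookup `D` (indexed as `D[u][v]`, e.g. precomputed during DP preprocessing) and treating $\mathrm{Neigh}(\cdot)$ as the adjacency map; whether a node counts as its own neighbour follows whatever convention the paper uses:

```python
def position_worst_case_action(adj, D, s_p, possible):
    """mu(s_p, Pos): the pursuer move minimizing the worst-case distance
    to any node the evader could reach next (Neigh(Pos))."""
    def worst_case(n_p):
        return max(D[n_p][n_e] for s_e in possible for n_e in adj[s_e])
    return min(adj[s_p], key=worst_case)

def belief_averaged_action(adj, D, s_p, belief):
    """mu(s_p, belief): the pursuer move minimizing the belief-weighted
    worst-case distance to the evader's possible next positions."""
    total = sum(belief.values())
    def weighted(n_p):
        return sum(mass * max(D[n_p][n_e] for n_e in adj[s_e])
                   for s_e, mass in belief.items()) / total
    return min(adj[s_p], key=weighted)
```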

When the possible-position set is a singleton, both policies coincide with the perfect-information DP policy (Lu et al., 21 Nov 2025).

5. Zero-Shot Experimental Protocols and Robustness Metrics

EPG’s effectiveness is assessed under an experimental protocol focusing on generalization and worst-case robustness:

  • Training environments: 150 synthetic random graphs (grids, dungeons) and 150 urban subgraphs from Google Maps, all with $|V| \leq 500$.
  • Testing (zero-shot): 10 previously unseen graphs (e.g., $10\times10$ grids, Scotland-Yard, Google-Maps Downtown, notable landmarks).
  • Opponents: four evader behaviors, namely static, DP synchronous, DP asynchronous, and a best-responder asynchronous evader trained against the learned $\pi_\theta$.
  • Evaluation metric: Success rate, defined as capture within 128 steps averaged over 500 random initializations.
  • Baselines: PSRO (Policy Space Response Oracles) trained directly on the test graphs, and extended DP pursuer baselines leveraging Pos and belief modules.

Empirical results demonstrate that EPG-trained GNN policies attain 50–100% capture success under all four evader types, consistently outperforming policies trained directly on the test graphs with PSRO. Ablations show that belief averaging outperforms simple position-set tracking, that larger observation radii $R$ enhance performance monotonically, and that more accurate belief propagation further improves robustness. Scalability is substantiated: inference on graphs with $|V|\approx 2000$ nodes completes in under 0.01 seconds, while naive DP recomputation exceeds 60 seconds (Lu et al., 21 Nov 2025).

6. Significance, Limitations, and Extensions

EPG, when combined with graph neural policies and belief preservation, constitutes the first framework for computing worst-case robust, real-time pursuer policies for pursuit-evasion games that generalize zero-shot to new graphs and operate under partial observability and asynchronous evasion.

A plausible implication is that the EPG paradigm, instantiated with efficient RL and GNN architectures, resolves key scalability bottlenecks inherent in classical DP approaches, while simultaneously preserving robustness properties previously unattainable by neural or meta-RL methods. However, the practical deployment of EPG relies on the availability of equilibrium oracles $(\mu^*, \nu^*)$ for the full training suite, and its performance bounds are inherited from the underlying DP and belief update fidelity. Extensions of the framework to alternative multi-agent competitive domains, or to more complex observation models, represent avenues for further investigation (Lu et al., 21 Nov 2025).
