Graph-Based Agent Planning (GAP)

Updated 1 November 2025

Graph-Based Agent Planning (GAP) is a framework for multi-agent decision-making that models interactions via explicit graph structures to address non-local, high-order cost functions.
The approach reframes planning as a variational inference problem and uses perturbative expansions to decompose the global objective agentwise, yielding a scalable set of coupled per-agent ODEs.
By solving these coupled ODEs within an expectation-maximization (EM) loop, GAP achieves near-optimal performance on tasks such as epidemic control and forest management under weak coupling.

Graph-Based Agent Planning (GAP) is a framework for multi-agent decision-making and control in domains where agent interactions are structured by an explicit graph. Instead of relying on global coordination, agents are linked by a static or dynamic graph, and the system dynamics, rewards, and constraints are mediated through the graph topology. Formally, GAP provides a principled relaxation of planning in graph-based Markov decision processes (GMDPs), multi-agent MDPs in which each agent's transitions depend only on its own state and action and on the states of its parents in the graph. This relaxation admits scalable approximate planning even under non-local, high-order cost functions. The approach is motivated by applications where non-local objectives predominate, such as epidemic control, forest management, networked opinion dynamics, and synchrony in distributed systems.

1. Planning as Variational Inference on Graphs

A key conceptual step in GAP is the reduction of the planning problem to a variational inference task. The usual objective is the discounted value of a policy $\pi$ in continuous time,

$V^\pi_p(s_0) = \mathsf{E}_p\left[\int_0^\infty dt \, \gamma^t R(S(t),A(t)) \mid S(0)=s_0, \pi\right],$

and the optimal policy is $\pi^* = \arg\max_\pi V^\pi_p(s_0)$. Rather than maximizing this expected cumulative reward directly, GAP follows the “planning-as-inference” principle: it introduces an auxiliary indicator variable $Z(t)$ of global success and maximizes a variational lower bound on the marginal log-likelihood that success is achieved over the whole trajectory,

$\ln p(Z_{[0,T]} = 1 \mid \pi, s_0) \geq \mathcal{F}[q, \pi] + V^\pi_q(s_0)$

where

$\mathcal{F}[q, \pi] = - D_{\text{KL}}[q(\text{trajectories}) \;\|\; p(\text{trajectories} \mid \pi, s_0)]$

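Here, $q$ is a variational distribution over trajectories, and $V^\pi_q(s_0)$ is the value defined as above but with trajectories drawn from $q$ instead of the true dynamics $p$. The planning objective becomes maximizing this lower bound jointly over $q$ and $\pi$, which recasts planning as an inference task over a trajectory space whose structure is inherited from the agent interaction graph.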
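The bound itself is a direct consequence of Jensen's inequality. The derivation below is a sketch that assumes the common planning-as-inference choice $p(Z_{[0,T]}=1 \mid \tau) = \exp(\mathcal{R}[\tau])$, where $\tau$ is a trajectory and $\mathcal{R}[\tau] \le 0$ is its accumulated discounted reward, so that $\mathsf{E}_q[\mathcal{R}[\tau]] = V^\pi_q(s_0)$. This likelihood model is our assumption, not a detail stated in the text above.

```latex
\begin{aligned}
\ln p(Z_{[0,T]} = 1 \mid \pi, s_0)
  &= \ln \mathsf{E}_{p(\tau \mid \pi, s_0)}\!\left[ e^{\mathcal{R}[\tau]} \right]
   = \ln \mathsf{E}_{q(\tau)}\!\left[ e^{\mathcal{R}[\tau]} \,\frac{p(\tau \mid \pi, s_0)}{q(\tau)} \right] \\
  &\geq \mathsf{E}_{q(\tau)}\!\left[ \mathcal{R}[\tau] \right]
   - \mathsf{E}_{q(\tau)}\!\left[ \ln \frac{q(\tau)}{p(\tau \mid \pi, s_0)} \right]
   \qquad \text{(Jensen, since $\ln$ is concave)} \\
  &= V^\pi_q(s_0) - D_{\text{KL}}\!\left[ q(\tau) \,\|\, p(\tau \mid \pi, s_0) \right]
   = \mathcal{F}[q, \pi] + V^\pi_q(s_0).
\end{aligned}
```

Equality holds when $q(\tau) \propto p(\tau \mid \pi, s_0)\, e^{\mathcal{R}[\tau]}$, the posterior over trajectories given success, so the bound is tight exactly when $q$ can represent that posterior.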

2. Variational Perturbative Approach and Series Expansion

The primary technical advance in graph-based planning for GMDPs is the application of variational perturbation theory (VPT), a technique borrowed from statistical physics, to structured agent planning. The KL term in the lower bound is systematically expanded as a power series in a small interaction parameter $\varepsilon$ that measures the strength of coupling across the agent graph:

$\mathcal{F}[q, \pi] = \mathcal{F}^{(0)}[q, \pi] + \varepsilon \mathcal{F}^{(1)}[q, \pi] + \ldots$

The zeroth-order term $\mathcal{F}^{(0)}$ corresponds to a mean-field-like decoupling in which agents evolve independently, while higher-order terms incorporate the structured dependencies introduced by the graph. To make this expansion possible, the transition kernel of agent $n$ is written as

$p_h(y_n \mid x_n, u_n, a_n) = p_h(y_n \mid x_n, a_n) + \varepsilon \tilde{g}(y_n, x_n, u_n, a_n)$

where $u_n$ denotes the joint configuration of agent $n$'s parent nodes and $\tilde{g}$ carries the entire dependence on them; because both conditional distributions are normalized, $\tilde{g}$ sums to zero over $y_n$. For weakly coupled GMDPs this permits an agentwise decomposition of the objective:

$f^h_t[q, \pi] = \sum_{n=1}^N f^h_{t,n}[q, \pi_n] + o(\varepsilon)$

Critically, this makes the approximation both scalable and controlled: in principle, the expansion can be extended to arbitrary order for increased fidelity.
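The kernel split and the agentwise decomposition can be illustrated on a toy model. The sketch below is our own construction, not the one used by Linzner et al.: it works in discrete time rather than with continuous-time rates, fixes each agent's action, and uses two binary agents in an SIS-style parent-child pair with made-up probabilities (`A1`, `P0`, `G` and the helper functions are hypothetical). It compares exact propagation of the joint distribution with an agentwise update in which the child only sees its parent's marginal.

```python
import numpy as np

# Hypothetical two-agent, discrete-time SIS-style example (state 1 = infected).
# Agent 1 is the parent of agent 2 and evolves on its own: A1[x1, y1].
A1 = np.array([[0.9, 0.1],
               [0.2, 0.8]])
# Uncoupled part of agent 2's kernel, P0[x2, y2], for a fixed action.
P0 = np.array([[0.95, 0.05],
               [0.30, 0.70]])
# Coupling term G[x2, x1, y2]: an infected parent raises the infection
# probability of a susceptible child. Each G[x2, x1, :] sums to zero,
# so P0 + eps * G stays normalised for every eps.
G = np.zeros((2, 2, 2))
G[0, 1, 1] = 1.0
G[0, 1, 0] = -1.0


def kernel(eps):
    """Weakly coupled kernel K[x2, x1, y2] = P0[x2, y2] + eps * G[x2, x1, y2]."""
    K = P0[:, None, :] + eps * G
    assert np.allclose(K.sum(axis=2), 1.0) and (K >= 0).all()
    return K


def agentwise_error(eps, steps=10):
    """|exact - agentwise| probability that agent 2 is infected after `steps`."""
    K = kernel(eps)
    q1 = np.array([0.0, 1.0])   # agent 1 starts infected
    q2 = np.array([1.0, 0.0])   # agent 2 starts susceptible
    joint = np.outer(q1, q2)    # exact joint distribution P[x1, x2]
    for _ in range(steps):
        # Exact propagation of the joint distribution over both agents.
        joint = np.einsum("ab,ac,bad->cd", joint, A1, K)
        # Agentwise propagation: agent 2 sees only agent 1's current marginal.
        q2 = np.einsum("a,b,bad->d", q1, q2, K)
        q1 = q1 @ A1
    return abs(joint.sum(axis=0)[1] - q2[1])


for eps in (0.08, 0.04, 0.02):
    print(f"eps={eps:.2f}  error={agentwise_error(eps):.2e}")
```

Halving `eps` shrinks the error by roughly a factor of four: the agentwise update is exact to first order in the coupling, and the neglected parent-child correlations enter only at $O(\varepsilon^2)$, consistent with the $o(\varepsilon)$ remainder above.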

3. Scalable Optimization via Coupled ODEs and EM Algorithm

Exploiting the agentwise expansion, the variational lower bound leads to a set of coupled ordinary differential equations (ODEs) for each agent’s marginal probabilities $q_n(t)$ and the corresponding Lagrange multipliers $\rho_n(t)$:

$\dot{q}_n(t) = q_n(t) \Omega_n(t)$

$\dot{\rho}_n(t) = [\Omega_n(t) + \Theta_n(t) + \Psi_n(t)] \rho_n(t)$

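where $\Omega_n, \Theta_n, \Psi_n$ aggregate the relevant stochastic transition rates, local rewards, and discounting effects. Because the weak-coupling expansion makes the objective decompose agentwise, there is one pair of (vector-valued) equations per agent, so the number of ODEs grows linearly with the number of agents $N$. An exact MDP solution instead operates on the joint state space, whose size grows exponentially ($|\mathcal{X}|^N$ for $N$ agents with $|\mathcal{X}|$ local states each), so the framework applies to substantially larger graphs than exact solutions.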
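A minimal sketch of the forward half of these equations, under assumptions of our own: two-state (susceptible/infected) agents on a ring, a fixed policy folded into constant rates, and a hypothetical first-order coupling in which each agent's infection rate grows by $\varepsilon$ times the expected number of infected neighbours under the current marginals. The backward equation for $\rho_n$, with its $\Theta_n$ and $\Psi_n$ terms, is not modelled here, and the function names and rate values are illustrative only.

```python
import numpy as np


def rate_matrices(q, adj, base_inf, recov, eps):
    """Per-agent generators Omega_n(t), shape (N, 2, 2); states 0=S, 1=I.
    Hypothetical SIS rates: infection rate base_inf + eps * (expected number
    of infected neighbours under the current marginals), recovery rate recov."""
    infection = base_inf + eps * (adj @ q[:, 1])
    omega = np.zeros((len(infection), 2, 2))
    omega[:, 0, 0] = -infection
    omega[:, 0, 1] = infection
    omega[:, 1, 0] = recov
    omega[:, 1, 1] = -recov
    return omega


def rhs(q, adj, base_inf, recov, eps):
    """dq_n/dt = q_n Omega_n(t) for all agents at once (q has shape (N, 2))."""
    return np.einsum("ni,nij->nj", q, rate_matrices(q, adj, base_inf, recov, eps))


def integrate(q0, adj, base_inf, recov, eps, t_end, steps=1000):
    """Classical fourth-order Runge-Kutta integration of the N coupled ODEs."""
    q, h = q0.copy(), t_end / steps

    def f(x):
        return rhs(x, adj, base_inf, recov, eps)

    for _ in range(steps):
        k1 = f(q)
        k2 = f(q + 0.5 * h * k1)
        k3 = f(q + 0.5 * h * k2)
        k4 = f(q + h * k3)
        q = q + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return q


# Ring of N agents: 2N marginal ODEs instead of a 2**N-state joint chain.
N = 50
adj = np.roll(np.eye(N), 1, axis=1) + np.roll(np.eye(N), -1, axis=1)
q0 = np.tile([0.9, 0.1], (N, 1))
qT = integrate(q0, adj, base_inf=0.05, recov=0.5, eps=0.2, t_end=10.0)
print(qT[:3, 1])  # infection probabilities of the first three agents at t = 10
```

With $N = 50$ the sketch integrates 100 scalar ODEs, whereas the exact master equation would need $2^{50}$ (about $10^{15}$) joint states. Since every row of each generator sums to zero, each $q_n$ stays normalised throughout the integration.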

The overall planning algorithm proceeds as an expectation-maximization (EM) loop:

E-step: Solve the ODEs for current marginal distributions $q_n(x; t)$ given fixed policies.
M-step: Update each agent’s policy locally using the derived marginals.

The resulting policy is optimal within the variational class (up to the local optima that any EM-style coordinate ascent can reach), and its accuracy relative to the exact optimum is controlled by the small parameter $\varepsilon$.
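The alternation can be illustrated with a deliberately stripped-down, static analogue; this is our own toy (mean-field variational EM), not the paper's algorithm. Agents on a ring each choose one action once, with no dynamics, so the E-step is a mean-field coordinate-ascent update of factorised action marginals $q_n$ rather than an ODE solve, and the M-step sets $\pi_n \leftarrow q_n$, which maximises the bound for fixed $q$. The reward model, `J`, `eps` and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical static toy: N agents on a ring each choose one of K actions once.
# Reward R(a) = sum_n r[n, a_n] + eps * sum_{(m, n) in edges} J[a_m, a_n].
N, K, eps = 6, 3, 0.3
r = rng.normal(size=(N, K))                  # local rewards
J = np.eye(K)                                # symmetric bonus when neighbours agree
edges = [(n, (n + 1) % N) for n in range(N)]
nbrs = [[m for e in edges for m in e if n in e and m != n] for n in range(N)]


def bound(q, pi):
    """F[q, pi] + V_q = E_q[R] - sum_n KL(q_n || pi_n) for factorised q and pi."""
    expected_r = np.sum(q * r) + eps * sum(q[m] @ J @ q[n] for m, n in edges)
    kl = np.sum(q * (np.log(q) - np.log(pi)))
    return expected_r - kl


def e_step(q, pi, sweeps=5):
    """Mean-field coordinate ascent on the bound for fixed pi:
    q_n(a) proportional to pi_n(a) * exp(E_{q_-n}[R | a_n = a]), up to a constant."""
    q = q.copy()
    for _ in range(sweeps):
        for n in range(N):
            local = r[n] + eps * sum(J @ q[m] for m in nbrs[n])  # relies on J symmetric
            logits = np.log(pi[n]) + local
            w = np.exp(logits - logits.max())                   # numerically stable softmax
            q[n] = w / w.sum()
    return q


pi = np.full((N, K), 1.0 / K)   # start from uniform policies
q = pi.copy()
history = []
for _ in range(30):
    q = e_step(q, pi)            # E-step: update the variational marginals
    history.append(bound(q, pi))
    pi = q.copy()                # M-step: pi_n <- q_n maximises the bound for fixed q
print(pi.argmax(axis=1), pi.max(axis=1).round(3))
```

Because each E-step coordinate update and each M-step exactly maximises the bound over its own block of variables, the recorded `history` never decreases; repeatedly reweighting $\pi_n$ by the exponentiated expected reward drives each policy toward a deterministic action, which may be a local rather than a global optimum.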

Summary Table: Per-Agent Dynamics

Component	Equation	Scaling
Marginals ODE	$\dot{q}_n(t) = q_n(t) \Omega_n(t)$	Linear in N
Dual Variables	$\dot{\rho}_n(t) = [\Omega_n(t) + \Theta_n(t) + \Psi_n(t)] \rho_n(t)$	Linear in N

4. Empirical Performance: Non-Local Rewards and Benchmark Results

Quantitative evaluations highlight the strengths of variational perturbative GAP, particularly in environments with non-local reward structures:

Benchmarks: Disease control, forest management with non-local couplings, voter/Ising opinion dynamics, and large-scale synchronization tasks.
Baselines: Approximate Policy Iteration (mean-field API) and Approximate Linear Programming (ALP).

Key findings:

In tasks with non-local cost functions, the proposed method stays within 0–2% of the optimal policy, whereas API and ALP deviate by up to 38% and 58%, respectively, in some cases performing no better than a random policy.
In large synchronization grids and voter models, the method realizes optimal synchronization (minimum order parameter), while mean-field and LP-based methods stagnate.
Scalability: Due to the linear agent scaling of the ODEs, the approach is feasible for larger graphs than previous variational and LP approaches.

Deviation from the optimal policy, in percent (Linzner et al., 2019)

Task	VPT-GAP (%)	API (%)	ALP (%)	Random policy (%)
Forest management	0–1	1–38	1–58	11–52
Opinion/voter	0–2	up to 10+	up to 10+	—

5. Applicability and Domain Impact

The variational perturbative formalism for GAP is broadly applicable to multi-agent systems structured by graphs with (weakly) non-local interactions. Notable application domains:

Opinion and consensus dynamics on social or physical networks (e.g., Ising/voter models).
Network control: epidemic or disease spread mitigation, resource allocation.
Distributed synchronization (sensor, robotic, or communication networks).

The systematic expansion accommodates both purely local and weakly non-local graph structures, enabling accurate, scalable planning where previous mean-field or LP-based approaches are inadequate.

A plausible implication is that the approach generalizes to any sufficiently weakly coupled GMDP, with the limiting factor being the accuracy of the truncated expansion as the coupling parameter increases.

6. Limitations and Future Research Directions

Principal limitations of current GAP methodology include:

Weak Coupling Restriction: The variational expansion is accurate only when the coupling parameter $\varepsilon$ is small. As coupling strength increases, higher-order corrections become necessary and the method’s performance may degrade.
Truncation of Series: Only leading- and low-order terms are tractable in practice. Strongly interacting or fully connected graphs are not well approximated.
Known Dynamics: The method assumes full knowledge of the system dynamics and has not yet been extended to reinforcement learning settings, where the transition model is unknown or accessible only via sampling. Extension to model-free or model-based RL is an open research frontier.

7. Mathematical Summary and Theoretical Guarantees

The central mathematical expressions governing GAP by VPT are:

Planning as lower bound maximization:

$\ln p(Z_{[0,T]} = 1 \mid \pi, s_0) \geq \mathcal{F}[q, \pi] + V^\pi_q(s_0)$

with

$\mathcal{F}[q, \pi] \equiv -D_{\text{KL}}[q(\text{trajectories}) \;\|\; p(\text{trajectories} \mid \pi, s_0)]$

Variational perturbative expansion:

$\mathcal{F}[q, \pi] = \mathcal{F}^{(0)}[q, \pi] + \varepsilon \mathcal{F}^{(1)}[q, \pi] + \ldots$

Agentwise decomposition: per-agent ODEs for $q_n$ and $\rho_n$, obtained as stationary points of a constrained Lagrangian in which the $\rho_n$ act as Lagrange multipliers.

These are controlled approximations: their error is governed by the truncation order and the coupling strength.
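The lower-bound relation listed first in this summary can be checked numerically on a toy problem with finitely many trajectories. The sketch assumes a common planning-as-inference choice, $p(Z=1 \mid \tau) = \exp(\mathcal{R}(\tau))$ with non-positive returns $\mathcal{R}(\tau)$; the trajectory probabilities and returns are random placeholders, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy: 8 discrete trajectories with probabilities p(tau | pi, s0)
# and non-positive returns R(tau), so that p(Z = 1 | tau) = exp(R(tau)) <= 1.
p = rng.dirichlet(np.ones(8))
R = -rng.uniform(0.0, 3.0, size=8)

log_lik = np.log(np.sum(p * np.exp(R)))            # ln p(Z = 1 | pi, s0)


def lower_bound(q):
    """F[q, pi] + V_q = -KL(q || p) + E_q[R]."""
    return -np.sum(q * np.log(q / p)) + np.sum(q * R)


q_random = rng.dirichlet(np.ones(8))               # an arbitrary variational q
q_opt = p * np.exp(R) / np.sum(p * np.exp(R))      # posterior over trajectories given Z = 1
print(log_lik, lower_bound(q_random), lower_bound(q_opt))
```

`q_opt` attains the log-likelihood exactly, while any other `q` falls below it by $D_{\text{KL}}[q \,\|\, q_{\text{opt}}]$; approximations enter once $q$ is restricted to a tractable family that cannot represent this posterior.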

In summary, Graph-Based Agent Planning via variational perturbative inference provides a powerful, systematically improvable toolkit for planning in large-scale, weakly coupled multi-agent systems structured by a graph. The approach clearly outperforms the API and ALP baselines on non-local planning tasks, scales linearly with system size, and rests on rigorous mathematical underpinnings. Its principal limitation is its dependence on the weak coupling regime, with extensions to strong interaction and model-uncertain (RL) settings being active directions for future research (Linzner et al., 2019).

References

Linzner et al. (2019). A Variational Perturbative Approach to Planning in Graph-based Markov Decision Processes.