Actor–Critic Algorithms for Stackelberg MFGs

Updated 23 June 2026
  • The paper introduces actor–critic algorithms that solve hierarchical Stackelberg mean field games by concurrently updating leader and follower policies alongside a mean field.
  • It details both single-loop tabular methods and deep BSDE-based architectures that achieve robust, efficient convergence in complex bi-level optimization settings.
  • The study provides non-asymptotic convergence guarantees, discusses gradient alignment, and compares empirical benchmarks across discrete and continuous domains.

Actor–critic algorithms for Stackelberg mean field games (SMFGs) constitute a class of learning methods that address structured bi-level optimization problems where a leader (principal) and a population of followers (agents) interact strategically in a mean field environment. These frameworks generalize classical Stackelberg games to infinite-population settings, introducing intricate dependencies between the leader’s strategy, the induced mean field, and the equilibrium responses of the followers. Two representative methodologies are the AC-SMFG single-loop algorithm for discrete-time, tabular settings (Zeng et al., 18 Sep 2025), and alternating deep actor–critic architectures for principal–agent mean field games with continuous state–action spaces (Campbell et al., 2021).

1. Problem Formulation in Stackelberg Mean Field Games

SMFGs are hierarchical games with a single leader and a continuum of homogeneous followers. The system is described by:

  • State and Action Spaces: Let $s_l \in \mathcal{S}_l$ (leader state), $b \in \mathcal{A}_l$ (leader action); $s_f \in \mathcal{S}_f$ (follower state), $a \in \mathcal{A}_f$ (follower action). The joint state is $s = (s_l, s_f)$.
  • Mean-Field Term: The mean-field distribution $\mu \in \Delta_{\mathcal{S}_f}$ encodes the population law over follower states.
  • Transition Dynamics: Leader and follower transitions are respectively $s_l' \sim P_l^\mu(s_l' \mid s_l, b)$ and $s_f' \sim P_f^\mu(s_f' \mid s_f, a, b)$. The full joint chain is $P^{\pi, \phi, \mu}(s' \mid s)$, induced by follower policy $\pi$, leader policy $\phi$, and mean field $\mu$.
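
The role of the mean field as a population law over follower states can be made concrete with a small sketch. The snippet below is only an illustration under simplifying assumptions: the leader action and the mean-field dependence of the kernel are held fixed, and the toy kernel, the uniform policy, and the function name `propagate_mean_field` are hypothetical, not taken from either paper.

```python
import numpy as np

def propagate_mean_field(mu, policy, kernel):
    """One step of the population law: mu'(s') = sum_{s,a} mu(s) pi(a|s) P(s'|s,a).

    mu:     (S,) distribution over follower states
    policy: (S, A) row-stochastic follower policy pi(a|s)
    kernel: (S, A, S) follower transition kernel, with the leader action and
            the mean field held fixed (a simplifying assumption)
    """
    return np.einsum("s,sa,sat->t", mu, policy, kernel)

# Hypothetical toy instance: 3 follower states, 2 follower actions.
rng = np.random.default_rng(0)
kernel = rng.random((3, 2, 3))
kernel /= kernel.sum(axis=-1, keepdims=True)   # normalize each P(.|s,a)
policy = np.full((3, 2), 0.5)                  # uniform follower policy
mu0 = np.array([1.0, 0.0, 0.0])                # all followers start in state 0

mu1 = propagate_mean_field(mu0, policy, kernel)
```

Iterating this map until it stops changing gives the stationary population law induced by a fixed pair of policies.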

The bi-level structure is captured as follows:

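  • Follower (representative) reward: The representative follower maximizes an entropy-regularized cumulative reward for a given leader policy $\phi$ and mean field $\mu$. This objective defines the follower’s best-response policy and, once the population law is required to be consistent with that policy, the mean-field fixed point.

  • Leader’s objective: The leader’s cumulative reward, evaluated at the followers’ best response and the induced mean-field fixed point, and maximized by anticipating equilibrium responses.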
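
To illustrate what an entropy-regularized best response looks like in the tabular case, here is a minimal sketch using soft value iteration. It assumes a discounted criterion with the leader policy and mean field already folded into a fixed reward table and kernel; the discount `gamma`, the temperature `tau`, and the toy problem are assumptions for illustration and are not the source's formulation.

```python
import numpy as np

def soft_best_response(reward, kernel, tau=0.1, gamma=0.9, iters=500):
    """Entropy-regularized (soft) value iteration for a finite MDP.

    reward: (S, A) follower reward with leader policy and mean field held fixed
    kernel: (S, A, S) transition probabilities P(s'|s,a)
    Returns the soft-optimal policy (S, A) and the soft value function (S,).
    """
    value = np.zeros(reward.shape[0])
    for _ in range(iters):
        q = reward + gamma * kernel @ value            # soft Q-values, shape (S, A)
        q_max = q.max(axis=1)                          # subtracted for stability
        value = q_max + tau * np.log(np.exp((q - q_max[:, None]) / tau).sum(axis=1))
    q = reward + gamma * kernel @ value
    logits = (q - q.max(axis=1, keepdims=True)) / tau
    policy = np.exp(logits)
    policy /= policy.sum(axis=1, keepdims=True)        # softmax of Q / tau
    return policy, value

# Hypothetical toy problem: action 0 always pays 1, action 1 pays 0,
# and the next state is uniform regardless of the action.
reward = np.array([[1.0, 0.0], [1.0, 0.0]])
kernel = np.full((2, 2, 2), 0.5)
policy, value = soft_best_response(reward, kernel)
```

The regularization makes the best response a softmax of the soft Q-values, so it is unique and varies smoothly with the reward, which is the kind of regularity that convergence analyses of regularized MFGs typically rely on.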

In the continuous-space principal–agent formulation (Campbell et al., 2021), the principal sets a terminal payment, and agents, indexed by type, control their dynamics in a mean-field Nash equilibrium. The equilibrium is characterized by a system of McKean–Vlasov forward–backward stochastic differential equations (MV-FBSDE).

2. Actor–Critic Algorithmic Architectures

Actor–critic algorithms for SMFGs implement coupled updates for the leader and follower policies and critics using shared trajectories and environment samples. Two paradigms are prominent:

(a) AC-SMFG (Tabular Softmax, Discrete Time)

  • Actors: Tabular softmax parameterizations for the leader policy $\phi$ and the follower policy $\pi$.
  • Critics: State-value estimators for the leader and for the follower, trained with TD updates.
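
A tabular softmax actor paired with a TD-trained state-value critic can be sketched as follows. This is a generic single-agent actor–critic step, not the AC-SMFG update itself: the discount factor, learning rates, toy MDP, and function names are all assumptions made for illustration.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def actor_critic_step(theta, value, s, a, r, s_next, gamma=0.9,
                      actor_lr=0.1, critic_lr=0.2):
    """One tabular actor-critic update from a single transition (s, a, r, s_next).

    theta: (S, A) softmax logits of the policy; value: (S,) state-value table.
    Both arrays are updated in place; the TD error is returned.
    """
    td_error = r + gamma * value[s_next] - value[s]
    value[s] += critic_lr * td_error                 # TD(0) critic update
    grad_log = -softmax(theta[s])                    # d log pi(a|s) / d theta[s, :]
    grad_log[a] += 1.0
    theta[s] += actor_lr * td_error * grad_log       # policy-gradient actor update
    return td_error

# Hypothetical toy MDP: action 0 pays reward 1, action 1 pays 0, next state is uniform.
rng = np.random.default_rng(0)
theta = np.zeros((2, 2))
value = np.zeros(2)
s = 0
for _ in range(5000):
    a = rng.choice(2, p=softmax(theta[s]))
    r = 1.0 if a == 0 else 0.0
    s_next = rng.integers(2)
    actor_critic_step(theta, value, s, a, r, s_next)
    s = s_next

policy = np.array([softmax(theta[i]) for i in range(2)])
```

In the Stackelberg setting there would be one such actor–critic pair for the leader and one for the follower, with rewards and transitions that also depend on the other player's policy and on the mean field.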

(b) Principal–Agent Deep Actor–Critic (Continuous Spaces)

  • Inner Actor (BSDE Solver): Deep neural networks parameterize solutions to the MV-FBSDE that define the Nash equilibrium for the principal’s given terminal payment. These produce sample agent trajectories and controls.
  • Outer Critic: A feed-forward network fits sampled values of the principal’s loss induced by that payment.

Key characteristic: The algorithms alternate between policy (actor) updates driven by policy gradients and critic/value updates aligning policy evaluation with expected returns, configured to accommodate the bi-level, equilibrium nature of SMFGs.

3. Single-Loop and Nested Algorithmic Schemes

Traditional approaches use nested loops: for each leader (principal) policy iteration, a full equilibrium computation for the followers (agents) is performed. This results in high sample complexity and inefficient coupling between levels.

The AC-SMFG algorithm (Zeng et al., 18 Sep 2025) implements a single-loop stochastic approximation in which all components (leader actor, follower actor, mean field, critics) are updated concurrently, each with its own step size:

  • Sample two Markovian trajectories at each iteration: one for standard actor–critic TD learning, one for empirically updating the mean field.
  • Leader actor update: a semi-gradient TD step on the leader’s policy parameters.
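
The idea of updating the mean field empirically from a sampled Markovian trajectory can be sketched as a simple stochastic-approximation recursion. The update rule below, which moves the estimate toward the indicator of the visited state, is a common choice and is assumed here; the step-size schedule, the fixed transition matrix, and the name `track_mean_field` are hypothetical and are not claimed to match the paper's exact update.

```python
import numpy as np

def track_mean_field(kernel, steps=20000, seed=0):
    """Estimate the population law from one Markovian follower trajectory.

    kernel: (S, S) follower state-transition matrix under fixed policies.
    Uses mu <- mu + xi_k * (e_s - mu) with xi_k = 1 / (k + 1), which keeps
    mu on the probability simplex at every step.
    """
    rng = np.random.default_rng(seed)
    n_states = kernel.shape[0]
    mu = np.full(n_states, 1.0 / n_states)
    s = 0
    for k in range(steps):
        s = rng.choice(n_states, p=kernel[s])   # advance the follower chain
        xi = 1.0 / (k + 1)                       # mean-field step size
        one_hot = np.zeros(n_states)
        one_hot[s] = 1.0
        mu += xi * (one_hot - mu)                # convex combination of mu and e_s
    return mu

# Hypothetical two-state chain whose stationary law is (5/6, 1/6).
kernel = np.array([[0.9, 0.1],
                   [0.5, 0.5]])
mu_hat = track_mean_field(kernel)
```

In a single-loop scheme the policies would keep changing while this recursion runs, so the estimate tracks a moving target rather than a fixed stationary law; the separate step size is what controls how quickly it follows.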

In the deep-learning principal–agent framework (Campbell et al., 2021), a nested actor–critic process utilizes:

  • Inner actor: neural solution to the MV-FBSDE (agents’ Nash equilibrium for a fixed payment).
  • Sample-based evaluation of the principal’s objective for multiple candidate payments, with the results stored in a replay buffer.
  • Critic net fits the loss surface over the space of candidate payments.
  • Policy improvement via gradient descent on the principal’s loss as approximated by the critic.
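
The critic-guided outer loop listed above can be sketched in a heavily simplified form. Everything specific here is an assumption: the payment is reduced to one scalar parameter, the expensive inner MV-FBSDE solve is replaced by a hypothetical noisy quadratic `sampled_principal_loss`, and the neural critic is replaced by a quadratic least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_principal_loss(param):
    """Hypothetical stand-in for the inner solver: noisy loss of a scalar payment parameter."""
    return (param - 2.0) ** 2 + 0.1 * rng.standard_normal()

# Replay buffer of (payment parameter, sampled loss) pairs.
params = rng.uniform(-1.0, 5.0, size=200)
losses = np.array([sampled_principal_loss(p) for p in params])

# Critic: quadratic least-squares fit of the loss surface (stand-in for a neural net).
coeffs = np.polyfit(params, losses, deg=2)        # coefficients, highest power first
critic_grad = np.polyder(np.poly1d(coeffs))       # derivative of the fitted critic

# Policy improvement: gradient descent on the critic, not on the true loss.
param = 0.0
for _ in range(200):
    param -= 0.1 * critic_grad(param)
```

The point of the design is that each evaluation of the true loss requires an equilibrium solve, whereas the fitted critic is cheap to differentiate. This sketch fits the critic once; an alternating scheme would refresh the buffer and refit as the payment moves.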

4. Convergence Theory and Gradient Alignment

A central challenge for SMFG algorithms is establishing convergence guarantees in the presence of bi-level, fixed-point dependencies and population-coupled feedback. The AC-SMFG algorithm’s complexity analysis (Zeng et al., 18 Sep 2025) makes use of:

  • Gradient-alignment condition (cf. Assumption 5): an inequality, holding for suitable constants, on the gradients that enter the leader’s update. It allows partial gradients in the leader’s update to approximate the full Stackelberg gradient, relaxing former strict independence assumptions between leader and followers.

Under standard Lipschitz, smoothness, and ergodicity conditions, with step sizes arranged in a multi-timescale structure, the analysis establishes a finite-time error bound and, from it, the sample complexity needed to reach a prescribed accuracy; the explicit rates are stated in (Zeng et al., 18 Sep 2025). This is the first non-asymptotic guarantee for Stackelberg MFG learning identified in the literature (Zeng et al., 18 Sep 2025).
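
As a rough illustration of what a multi-timescale structure means in practice, the sketch below builds polynomially decaying step sizes with different decay exponents, so that slower components take vanishingly small steps relative to faster ones. The exponents, the ordering of components, and the names are hypothetical choices for illustration, not the schedule analyzed in the paper.

```python
def step_size(k, scale, exponent):
    """Polynomially decaying step size scale / (k + 1)^exponent."""
    return scale / (k + 1) ** exponent

# Hypothetical exponents: the critic moves fastest, the leader actor slowest.
schedules = {"critic": 0.5, "follower_actor": 0.6, "mean_field": 0.7, "leader_actor": 0.8}

ratios = []
for k in (0, 10**2, 10**4, 10**6):
    sizes = {name: step_size(k, 1.0, e) for name, e in schedules.items()}
    # Ratio of slowest to fastest step size; it shrinks toward 0 as k grows.
    ratios.append(sizes["leader_actor"] / sizes["critic"])
```

Because the ratio tends to zero, the faster components see the slower ones as nearly frozen, which is the standard device that lets a single loop emulate nested computation.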

The proof relies on potential-Lyapunov arguments tracking the descent of residuals for leader optimality, mean field, follower optimality, and critic errors on appropriate timescales, employing the PL condition for the follower regularized MFG and Lipschitz continuity of best-responses.

5. Empirical Benchmarks and Applications

Empirical evaluation demonstrates the computational efficiency, robustness, and convergence properties of AC-SMFG and related actor–critic methods.

AC-SMFG is benchmarked on three canonical mean field economy models from MFGLib:

  • Market entrance (binary): Leader skews the market; followers coordinate aggregate choices under crowd effects.
  • Beach-bar positioning: Leader’s placement influences follower location choices with a crowd-avoidance penalty.
  • Equilibrium pricing: Leader sets inventory cost, followers adjust production; endogenous pricing emerges.

Outcomes:

  • AC-SMFG converges 3–5× faster in leader reward than classical nested (OneByOne, weighted alt-best-response) and PPO-based (ADAGE) baselines, attaining comparable or higher final rewards for both leader and followers.
  • In continuous domains (equilibrium pricing), AC-SMFG with Gaussian mean-field approximation matches PPO-based multi-agent approaches in performance.
  • Learning curves are smoother and exhibit lower variance than baselines, illustrating sample efficiency inherent to the single-loop structure.

The deep principal–agent method (Campbell et al., 2021) is applied to a Renewable-Energy-Certificate (REC) market:

  • Each agent manages inventory and capacity and optimizes expansion, generation, and trading in a mean-field market with a price-clearing constraint.
  • The principal’s penalty function is optimized by critic-guided policy descent over the space of candidate penalties.

Numerical findings:

  • Optimal penalty weights are concentrated on key nodes; principal loss improves significantly as the function class for the penalty is enriched.
  • Agent trajectories display Stackelberg-consistent behavior: lower-initialized agents undertake more aggressive expansion to avoid penalties, while equilibrium prices remain stable.
  • The alternating actor–critic architecture is crucial for efficient, robust solution of the bi-level MV-FBSDE optimization.

6. Comparative Analysis and Methodological Considerations

Table: Summary of Key Algorithmic Features

| Method | Policy Parametrization | Update Scheme | Convergence Guarantee |
|---|---|---|---|
| AC-SMFG (Zeng et al., 18 Sep 2025) | Tabular softmax (discrete) | Single-loop | Non-asymptotic (finite-time rate) |
| Principal–Agent (Campbell et al., 2021) | Deep BSDE networks (continuous) | Nested actor–critic | Empirical (no finite-sample theorem) |

Both methods avoid the inefficiency of strict nested-loop fictitious play baselines by architecting coupled actor–critic updates. AC-SMFG is notable for its simplicity (discrete TD updates) and provable convergence rate, contingent on gradient-alignment; the deep learning approach enables treatment of high-dimensional, nonlinear, constrained mean-field environments but currently lacks analogous sample-complexity bounds.

A plausible implication is that incorporating gradient alignment or multi-timescale analysis in deep architectures could lead to stronger theoretical guarantees for neural Stackelberg MFG solvers. Conversely, the scalability and expressivity of deep BSDE-based approaches may enable application to more complex multi-population or market-clearing environments that challenge tabular or simpler methods.

7. Extensions and Open Directions

Both frameworks naturally generalize:

  • AC-SMFG can extend to continuous controls via softmax/soft-actor parametrizations; function approximators (e.g., neural critics) replace tabular value functions.
  • Deep actor–critic methods based on MV-FBSDE are suited for high-dimensional, constrained, nonlinear Stackelberg MFGs, including principal–multi-agent market design, risk-sensitive controls, and contract design.

Current open questions include:

  • Establishing finite-time, non-asymptotic convergence guarantees for deep SMFG solvers.
  • Relaxing or verifying gradient-alignment in high-dimensional neural policy spaces.
  • Scalable mean-field learning under partial observability or non-Markovian population states.

These developments position actor–critic methods as foundational algorithmic tools for hierarchical control, mechanism design, and market regulation in large-scale decentralized systems governed by Stackelberg mean field game structures (Zeng et al., 18 Sep 2025, Campbell et al., 2021).
