Actor–Critic Algorithms for Stackelberg MFGs

Updated 23 June 2026

The paper introduces actor–critic algorithms that solve hierarchical Stackelberg mean field games by concurrently updating leader and follower policies alongside a mean field.
It details both single-loop tabular methods and deep BSDE-based architectures that achieve robust, efficient convergence in complex bi-level optimization settings.
The study provides non-asymptotic convergence guarantees, discusses gradient alignment, and compares empirical benchmarks across discrete and continuous domains.

Actor–critic algorithms for Stackelberg mean field games (SMFGs) constitute a class of learning methods that address structured bi-level optimization problems where a leader (principal) and a population of followers (agents) interact strategically in a mean field environment. These frameworks generalize classical Stackelberg games to infinite-population settings, introducing intricate dependencies between the leader’s strategy, the induced mean field, and the equilibrium responses of the followers. Two representative methodologies are the AC-SMFG single-loop algorithm for discrete-time, tabular settings (Zeng et al., 18 Sep 2025), and alternating deep actor–critic architectures for principal–agent mean field games with continuous state–action spaces (Campbell et al., 2021).

1. Problem Formulation in Stackelberg Mean Field Games

SMFGs are hierarchical games with a single leader and a continuum of homogeneous followers. The system is described by:

State and Action Spaces: Let $s_l \in \mathcal{S}_l$ (leader state), $b \in \mathcal{A}_l$ (leader action); $s_f \in \mathcal{S}_f$ (follower state), $a \in \mathcal{A}_f$ (follower action). The joint state is $s = (s_l, s_f)$.
Mean-Field Term: The mean-field distribution $\mu \in \Delta_{\mathcal{S}_f}$ encodes the population law over follower states.
Transition Dynamics: Leader and follower transitions are respectively $s_l' \sim P_l^\mu(s_l'|s_l, b)$ and $s_f' \sim P_f^\mu(s_f'|s_f, a, b)$. The full joint chain is $P^{\pi, \phi, \mu}(s'|s)$, induced by follower policy $\pi$, leader policy $\phi$, and mean field $\mu$.
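
As a minimal sketch of how a population law propagates under $\mu$-dependent follower transitions, the following pure-Python toy pushes a distribution forward until it stops changing. The two-state crowd-averse kernel `toy_kernel` and every name in the block are illustrative assumptions, with the leader and follower policies taken as already folded into the kernel; nothing here is taken from the cited papers.

```python
def toy_kernel(mu):
    """Hypothetical mu-dependent follower kernel on two states.

    Followers are crowd-averse: the probability of moving to state 0
    falls as the population mass already in state 0 grows.
    Returns P[s][s_next]; both rows coincide to keep the toy simple.
    """
    p_to_0 = 0.5 + 0.4 * (mu[1] - mu[0])  # always within [0.1, 0.9]
    row = [p_to_0, 1.0 - p_to_0]
    return [row, row]


def push_forward(mu, kernel):
    """One step of the population law: mu'(s') = sum_s mu(s) P(s'|s; mu)."""
    P = kernel(mu)
    n = len(mu)
    return [sum(mu[s] * P[s][s_next] for s in range(n)) for s_next in range(n)]


def mean_field_fixed_point(mu0, kernel, iters=200):
    """Iterate the push-forward map; it converges here because the toy map contracts."""
    mu = list(mu0)
    for _ in range(iters):
        mu = push_forward(mu, kernel)
    return mu


mu_star = mean_field_fixed_point([0.9, 0.1], toy_kernel)
```

In this toy the map reduces to $\mu_0' = 0.9 - 0.8\,\mu_0$, so the iteration settles at the balanced distribution. General kernels need not contract, which is one reason learning schemes damp the mean-field update with a step size.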

The bi-level structure is captured as follows:

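Follower (representative) reward: The representative follower's cumulative reward is regularized by the entropy of its policy. For a given leader policy $\phi$ and mean field $\mu$, maximizing this regularized reward defines the follower's best-response policy; the mean-field fixed point is the population law that is reproduced when all followers play that best response.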

Leader’s objective: The leader's cumulative reward, evaluated at the followers' best response and the induced mean-field fixed point, is maximized by anticipating these equilibrium responses.
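
To illustrate why the entropy term in the follower's reward yields a well-defined best response, consider a single state with fixed action rewards. The maximizer of expected reward plus $\tau$ times the policy entropy is the softmax of the rewards scaled by $1/\tau$. The sketch below checks this numerically; the reward values and the temperature `tau` are made-up inputs for illustration, not quantities from the cited papers.

```python
import math


def soft_best_response(rewards, tau):
    """Maximizer of sum_a pi(a) r(a) + tau * H(pi): softmax of r / tau."""
    m = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp((r - m) / tau) for r in rewards]
    total = sum(weights)
    return [w / total for w in weights]


def regularized_objective(pi, rewards, tau):
    """Expected reward plus tau times the Shannon entropy of pi."""
    expected = sum(p * r for p, r in zip(pi, rewards))
    entropy = -sum(p * math.log(p) for p in pi if p > 0.0)
    return expected + tau * entropy


rewards = [1.0, 0.5, -0.2]  # hypothetical one-step follower rewards
tau = 0.5                   # hypothetical entropy weight
pi_soft = soft_best_response(rewards, tau)
greedy = [1.0, 0.0, 0.0]    # deterministic argmax policy, for comparison
uniform = [1.0 / 3.0] * 3   # maximum-entropy policy, for comparison
```

Comparing `regularized_objective` at `pi_soft`, `greedy`, and `uniform` shows the softmax policy scoring highest. The regularized maximizer is unique and varies smoothly with the rewards, which is the property that fixed-point and gradient arguments typically rely on.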

In the continuous-space principal–agent formulation (Campbell et al., 2021), the principal sets a terminal payment and the agents, indexed by type, control their dynamics in a mean-field Nash equilibrium. The equilibrium is characterized by a system of McKean–Vlasov forward–backward stochastic differential equations (MV-FBSDE).

2. Actor–Critic Algorithmic Architectures

Actor–critic algorithms for SMFGs implement coupled updates for the leader and follower policies and critics using shared trajectories and environment samples. Two paradigms are prominent:

(a) AC-SMFG (Tabular Softmax, Discrete Time)

Actors: Tabular softmax parameterizations for the leader policy $\phi$ and the follower policy $\pi$.
Critics: State-value estimators for the leader and for the follower, trained with TD updates.
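
A tabular softmax actor can be sketched in a few lines. The block below is a hedged illustration rather than the AC-SMFG implementation: it stores one row of logits per state, converts a row to action probabilities, and computes the score function (the gradient of the log-probability with respect to the logits) that policy-gradient updates use. The state labels and logit values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of tabular logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    pi = softmax(logits)
    return [(1.0 if a == action else 0.0) - pi[a] for a in range(len(logits))]


# Hypothetical table: one row of logits per follower state.
theta = {"s0": [0.0, 0.0], "s1": [1.0, -1.0]}
pi_s1 = softmax(theta["s1"])
grad_s1 = score(theta["s1"], action=0)
```

An actor step would add a step size times an advantage estimate times `grad_s1` to the logits of the visited state; the entries of the score always sum to zero, so only the relative logits change.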

(b) Principal–Agent Deep Actor–Critic (Continuous Spaces)

Inner Actor (BSDE Solver): Deep neural networks parameterize solutions to the MV-FBSDE that define the agents' Nash equilibrium for the principal’s given terminal payment. These produce sample agent trajectories and controls.
Outer Critic: A feed-forward network fits sampled values of the principal’s loss induced by each candidate terminal payment.

Key characteristic: The algorithms alternate between policy (actor) updates driven by policy gradients and critic/value updates aligning policy evaluation with expected returns, configured to accommodate the bi-level, equilibrium nature of SMFGs.

3. Single-Loop and Nested Algorithmic Schemes

Traditional approaches use nested loops: for each leader (principal) policy iteration, a full equilibrium computation for the followers (agents) is performed. This results in high sample complexity and inefficient coupling between levels.

The AC-SMFG algorithm (Zeng et al., 18 Sep 2025) implements a single-loop stochastic approximation in which all components (leader actor, follower actor, mean field, critics) are updated concurrently, each with its own step size:

Sample two Markovian trajectories at each iteration: one for standard actor–critic TD learning, one for empirically updating the mean field.
Leader actor update: semi-gradient TD step.
Follower actor update: policy gradient TD step with entropy regularization.
Critic updates: TD(0) learning on both leader and follower critics.
Mean-field update: projection onto the simplex after empirical distribution shift.
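
Two of the steps listed above, the TD(0) critic update and the projected mean-field update, can be sketched as follows. This is a simplified illustration under stated assumptions: the discount factor `gamma`, the step sizes, and the one-hot empirical sample are hypothetical choices, and the projection uses the standard sort-based Euclidean projection onto the simplex rather than any routine confirmed by the source.

```python
def td0_update(V, s, reward, s_next, gamma, step):
    """Tabular TD(0): move V[s] toward reward + gamma * V[s_next]."""
    td_error = reward + gamma * V[s_next] - V[s]
    V[s] += step * td_error
    return td_error


def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    shift = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - 1.0) / j
        if u_j - candidate > 0.0:
            shift = candidate  # keep the shift from the largest valid j
    return [max(x - shift, 0.0) for x in v]


def mean_field_update(mu, visited_state, step):
    """Shift mu toward a one-hot empirical sample, then project."""
    shifted = [m + step * ((1.0 if i == visited_state else 0.0) - m)
               for i, m in enumerate(mu)]
    return project_to_simplex(shifted)


V = [0.0, 0.0]  # hypothetical two-state critic table
delta = td0_update(V, s=0, reward=1.0, s_next=1, gamma=0.9, step=0.5)
mu = mean_field_update([0.5, 0.5], visited_state=0, step=0.1)
clipped = project_to_simplex([2.0, 0.0])
```

With a step size in $[0, 1]$ the shifted vector is already a convex combination of distributions, so the projection leaves it unchanged; it only matters when noise or larger steps push the estimate off the simplex, as in the `clipped` example.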

In the deep-learning principal–agent framework (Campbell et al., 2021), a nested actor–critic process utilizes:

Inner actor: neural solution to the MV-FBSDE (agents’ Nash equilibrium for a fixed terminal payment).
Sample-based evaluation of the principal’s objective for multiple candidate payments, with the results stored in a replay buffer.
Critic net fits the principal's loss surface over the space of candidate payments.
Policy improvement via gradient descent on the principal's loss as approximated by the critic.
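
The nested loop above can be sketched on a one-dimensional toy problem. The block below is not the architecture of Campbell et al.: it replaces the MV-FBSDE solver with a hypothetical quadratic loss (`inner_equilibrium_loss`) and the neural critic with a local least-squares line, purely to show the evaluate, store, fit, and descend pattern. The scalar payment parameter `theta`, the perturbation offsets, and the learning rate are all assumptions.

```python
from collections import deque


def inner_equilibrium_loss(theta):
    """Hypothetical stand-in for the inner solve: returns the principal's
    loss at payment parameter theta. A real implementation would solve the
    agents' MV-FBSDE here; this toy uses a quadratic with minimum at 2."""
    return (theta - 2.0) ** 2


def fit_linear_critic(buffer):
    """Least-squares line through (theta, loss) pairs; returns its slope."""
    n = len(buffer)
    mean_t = sum(t for t, _ in buffer) / n
    mean_l = sum(l for _, l in buffer) / n
    cov = sum((t - mean_t) * (l - mean_l) for t, l in buffer)
    var = sum((t - mean_t) ** 2 for t, _ in buffer)
    return cov / var


def critic_guided_descent(theta0, outer_iters=20, lr=0.25):
    """Alternate sampling around theta, critic fitting, and a descent step."""
    offsets = [-0.2, -0.1, 0.0, 0.1, 0.2]
    buffer = deque(maxlen=len(offsets))  # small buffer of recent evaluations
    theta = theta0
    for _ in range(outer_iters):
        for d in offsets:
            candidate = theta + d
            buffer.append((candidate, inner_equilibrium_loss(candidate)))
        slope = fit_linear_critic(buffer)  # critic's estimate of dLoss/dtheta
        theta -= lr * slope
    return theta


theta_hat = critic_guided_descent(theta0=0.0)
```

Because the outer step differentiates the critic rather than the inner solver, the expensive equilibrium computation is only ever evaluated, never differentiated through. That is the design choice this pattern is meant to convey.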

4. Convergence Theory and Gradient Alignment

A central challenge for SMFG algorithms is establishing convergence guarantees in the presence of bi-level, fixed-point dependencies and population-coupled feedback. The AC-SMFG algorithm’s complexity analysis (Zeng et al., 18 Sep 2025) makes use of:

Gradient-alignment condition (cf. Assumption 5): There exist constants such that the partial gradient used in the leader’s update stays aligned with the full Stackelberg gradient, and an analogous inequality is assumed for a second term. This allows partial gradients in the leader’s update to approximate the full Stackelberg gradient, relaxing the strict leader–follower independence assumptions used in earlier work.

Under standard Lipschitz, smoothness, and ergodicity conditions, with suitably chosen step sizes and a multi-timescale structure, a finite-time bound on the algorithm's optimality residual is established, which implies a finite sample complexity for reaching any prescribed accuracy. This is the first non-asymptotic guarantee for Stackelberg MFG learning identified in the literature (Zeng et al., 18 Sep 2025).

The proof relies on potential-Lyapunov arguments tracking the descent of residuals for leader optimality, mean field, follower optimality, and critic errors on appropriate timescales, employing the PL condition for the follower regularized MFG and Lipschitz continuity of best-responses.
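
The multi-timescale idea can be made concrete with polynomially decaying step sizes. The exponents and the ordering below (critics fastest, leader actor slowest) are assumptions chosen to mirror common two-timescale actor–critic analyses; they are not the schedules from Zeng et al.

```python
def step_size(k, base, exponent):
    """Polynomially decaying step size base / (k + 1) ** exponent."""
    return base / (k + 1) ** exponent


# Hypothetical (base, exponent) pairs: a smaller exponent decays more
# slowly, so that component moves on a faster timescale.
SCHEDULES = {
    "critic": (0.5, 0.5),
    "follower_actor": (0.5, 0.6),
    "mean_field": (0.5, 0.7),
    "leader_actor": (0.5, 0.8),
}


def step_sizes_at(k):
    """Step size of every component at iteration k."""
    return {name: step_size(k, base, p) for name, (base, p) in SCHEDULES.items()}


early = step_sizes_at(10)
late = step_sizes_at(10_000)
ratio_early = early["leader_actor"] / early["critic"]
ratio_late = late["leader_actor"] / late["critic"]
```

The ratio of the slow step size to the fast one shrinks toward zero as the iteration count grows. From the slow component's point of view the fast components look nearly converged, which is what allows residuals to be tracked on separate timescales.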

5. Empirical Benchmarks and Applications

Empirical evaluation demonstrates the computational efficiency, robustness, and convergence properties of AC-SMFG and related actor–critic methods.

AC-SMFG (Zeng et al., 18 Sep 2025) is benchmarked on three canonical mean field economy models from MFGLib:

Market entrance (binary): Leader skews the market; followers coordinate aggregate choices under crowd effects.
Beach-bar positioning: Leader’s placement influences follower location choices with a crowd-avoidance penalty.
Equilibrium pricing: Leader sets inventory cost, followers adjust production; endogenous pricing emerges.

Outcomes:

AC-SMFG converges 3–5× faster in leader reward than classical nested (OneByOne, weighted alt-best-response) and PPO-based (ADAGE) baselines, attaining comparable or higher final rewards for both leader and followers.
In continuous domains (equilibrium pricing), AC-SMFG with Gaussian mean-field approximation matches PPO-based multi-agent approaches in performance.
Learning curves are smoother and exhibit lower variance than baselines, illustrating sample efficiency inherent to the single-loop structure.

The principal–agent deep actor–critic (Campbell et al., 2021) is applied to a Renewable-Energy-Certificate (REC) market:

Each agent manages inventory and capacity, optimizes expansion/generation/trading in a mean-field market with price-clearing constraint.
The principal’s penalty function is optimized by critic-guided descent over its parameterization.

Numerical findings:

Optimal penalty weights are concentrated on key nodes; principal loss improves significantly as the function class for the penalty function is enriched.
Agent trajectories display Stackelberg-consistent behavior—lower-initialized agents undertake more aggressive expansion to avoid penalties—while equilibrium prices are stable.
The alternating actor–critic architecture is crucial for efficient, robust solution of the bi-level MV-FBSDE optimization.

6. Comparative Analysis and Methodological Considerations

Table: Summary of Key Algorithmic Features

Method	Policy Parametrization	Update Scheme	Convergence Guarantee
AC-SMFG (Zeng et al., 18 Sep 2025)	Tabular Softmax (discrete)	Single-loop	Non-asymptotic (finite-sample rate)
Principal–Agent (Campbell et al., 2021)	Deep BSDE networks (continuous)	Nested actor–critic	Empirical (no finite-sample theorem)

Both methods avoid the inefficiency of strict nested-loop fictitious play baselines by coupling the actor and critic updates. AC-SMFG is notable for its simplicity (discrete TD updates) and its provable convergence rate, which is contingent on gradient alignment; the deep learning approach handles high-dimensional, nonlinear, constrained mean-field environments but currently lacks analogous sample-complexity bounds.

A plausible implication is that incorporating gradient alignment or multi-timescale analysis in deep architectures could lead to stronger theoretical guarantees for neural Stackelberg MFG solvers. Conversely, the scalability and expressivity of deep BSDE-based approaches may enable application to more complex multi-population or market-clearing environments that challenge tabular or simpler methods.

7. Extensions and Open Directions

Both frameworks naturally generalize:

AC-SMFG can extend to continuous controls via softmax/soft-actor parametrizations; function approximators (e.g., neural critics) replace tabular value functions.
Deep actor–critic methods based on MV-FBSDE are suited for high-dimensional, constrained, nonlinear Stackelberg MFGs, including principal–multi-agent market design, risk-sensitive controls, and contract design.

Current open questions include:

Establishing finite-time, non-asymptotic convergence guarantees for deep SMFG solvers.
Relaxing or verifying gradient-alignment in high-dimensional neural policy spaces.
Scalable mean-field learning under partial observability or non-Markovian population states.

These developments position actor–critic methods as foundational algorithmic tools for hierarchical control, mechanism design, and market regulation in large-scale decentralized systems governed by Stackelberg mean field game structures (Zeng et al., 18 Sep 2025, Campbell et al., 2021).

References

Zeng et al. (2025). Learning in Stackelberg Mean Field Games: A Non-Asymptotic Analysis.

Campbell et al. (2021). Deep Learning for Principal-Agent Mean Field Games.
