
Mean Actor-Critic (MAC) Methods

Updated 19 January 2026
  • Mean Actor-Critic (MAC) is a reinforcement learning approach that analytically averages over all possible actions to compute unbiased, lower-variance policy gradients.
  • MAC reduces variance by replacing stochastic action sampling with exact summation, leading to improved sample efficiency and more stable learning in both discrete-action and mean-field settings.
  • In mean-field control and game settings, MAC employs moment neural networks and analytic expectations to approximate value functions over distributions, and in linear-quadratic mean-field games it admits provably convergent learning of Nash equilibria.

Mean Actor-Critic (MAC) is a class of reinforcement learning (RL) algorithms that modify the classic actor-critic framework by analytically averaging policy gradients over the entire action set, rather than estimating gradients using only the actions actually executed. This approach yields lower-variance policy gradient estimates, improved sample efficiency, and, in the context of mean-field control, extends to learning over distributions of states and actions. The term “Mean Actor-Critic” has been used in both standard RL for discrete-action spaces (Allen et al., 2017) and in mean-field control/game theory (continuous/distributional MAC) (Pham et al., 2023, Frikha et al., 2023, Fu et al., 2019).

1. The Mean Actor-Critic Principle in Reinforcement Learning

Classical on-policy policy-gradient algorithms aim to maximize the expected return

$$J(\theta) = \mathbb{E}_{s\sim d^\pi,\,a\sim\pi_\theta}[Q^\pi(s,a)].$$

The policy-gradient theorem yields

$$\nabla_\theta J(\theta) = \mathbb{E}_{s\sim d^\pi,\,a\sim\pi_\theta}[\nabla_\theta\log\pi_\theta(a|s)\, Q^\pi(s,a)].$$

In standard actor-critic (AC) frameworks, this expectation is estimated by sampling both states $s$ and executed actions $a$. The MAC algorithm exploits the identity

$$\mathbb{E}_{a\sim\pi_\theta}\big[\nabla_\theta\log\pi_\theta(a|s)\, Q^\pi(s,a)\big] = \sum_{a\in\mathcal{A}} \pi_\theta(a|s)\, \nabla_\theta\log\pi_\theta(a|s)\, Q^\pi(s,a).$$

By pushing the sum inside the expectation over states, MAC computes the policy gradient at each sampled state by analytically averaging over all actions, using the explicit forms of $\pi_\theta(a|s)$ and $Q^\pi(s,a)$, rather than relying on the particular action that was sampled. This yields a policy-gradient estimator of the form

$$\nabla_\theta J_{\mathrm{MAC}}(\theta) = \mathbb{E}_{s\sim d^\pi}\left[\sum_{a\in\mathcal{A}} \pi_\theta(a|s)\, \nabla_\theta\log\pi_\theta(a|s)\, Q^\pi(s,a)\right].$$

This estimator applies to discrete $\mathcal{A}$ and can be implemented efficiently when the action set is of modest size (Allen et al., 2017).
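For concreteness, the following is a minimal PyTorch sketch of this estimator for a discrete action set; the networks `policy_net` and `q_net`, their shapes, and the batching are illustrative assumptions, not details from the paper. Differentiating the analytic average $\sum_a \pi_\theta(a|s)\,\widehat Q(s,a)$ with the critic held fixed reproduces the MAC gradient above, since $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$.

```python
import torch
import torch.nn as nn

# Minimal sketch of the MAC policy-gradient estimator for a discrete action set.
# `policy_net` maps states to action logits; `q_net` maps states to per-action
# Q-value estimates (both architectures are assumed here for illustration).

def mac_policy_loss(policy_net: nn.Module, q_net: nn.Module,
                    states: torch.Tensor) -> torch.Tensor:
    """Negative MAC objective: mean over states of sum_a pi(a|s) * Qhat(s, a)."""
    probs = torch.softmax(policy_net(states), dim=-1)   # (batch, |A|)
    with torch.no_grad():                               # critic treated as a fixed target
        q_values = q_net(states)                        # (batch, |A|)
    # Analytic average over ALL actions: no action sampling enters the gradient.
    expected_q = (probs * q_values).sum(dim=-1)         # (batch,)
    return -expected_q.mean()

# Usage (shapes assumed): loss = mac_policy_loss(policy_net, q_net, batch_states),
# then loss.backward() and an optimizer step on policy_net's parameters.
```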

2. Theoretical Properties: Variance Reduction

The MAC gradient estimator is unbiased: it corresponds exactly to the true policy gradient as per the policy-gradient theorem. However, by replacing stochastic sampling over actions with an exact sum (or expectation), MAC eliminates the variance associated with action sampling. Specifically, for parameter component $i$, the MAC estimator

$$\mathrm{MAC}_i = \frac{1}{T} \sum_{t=1}^T \sum_{a} \pi_\theta(a|s_t)\, \nabla_{\theta_i}\log\pi_\theta(a|s_t)\, \widehat Q(s_t, a; \omega)$$

has strictly lower variance than its standard actor-critic counterpart whenever $\pi_\theta(\cdot|s)$ is not deterministic, assuming independence of the $\widehat Q(s,a)$ errors across $(s, a)$. This follows directly from Jensen's inequality applied to the variance across the discrete action space (Allen et al., 2017). No additional baseline function is required: subtracting an action-independent baseline cancels out exactly in the MAC formulation.
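The effect is easy to check numerically. The toy script below (all numbers are illustrative assumptions) fixes a single state with a softmax policy and a critic table, then compares the single-sample actor-critic estimator with the MAC estimator: both have the same mean, but the MAC estimator has zero variance over the action draw.

```python
import numpy as np

# Toy check of the variance claim at one fixed state: for a softmax policy over
# K actions with logits theta, grad_theta log pi(a) = e_a - pi, so both
# estimators below target the same gradient, but MAC removes action sampling.
rng = np.random.default_rng(0)
K = 4
pi = np.array([0.1, 0.2, 0.3, 0.4])      # assumed policy probabilities at this state
Q = np.array([1.0, -0.5, 2.0, 0.3])      # assumed critic values Qhat(s, a)

def sampled_grad():
    a = rng.choice(K, p=pi)
    score = -pi.copy(); score[a] += 1.0  # grad of log pi(a|s) w.r.t. the logits
    return score * Q[a]

mac_grad = sum(pi[a] * (np.eye(K)[a] - pi) * Q[a] for a in range(K))

samples = np.stack([sampled_grad() for _ in range(100_000)])
print("mean of sampled estimator:", samples.mean(axis=0))   # approx. mac_grad
print("MAC estimator            :", mac_grad)
print("per-coordinate variance (sampled):", samples.var(axis=0))
print("per-coordinate variance (MAC)    : 0 (exact sum over actions)")
```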

3. Implementation and Algorithmic Steps

A high-level algorithmic outline for the discrete-action MAC in RL is as follows (Allen et al., 2017):

  1. Initialize policy parameters $\theta$ and critic parameters $\omega$.
  2. Roll out the current policy to collect a batch of $T$ state transitions $\{(s_t, a_t, r_t, s_{t+1})\}$.
  3. Critic Update: Fit $\widehat Q(s, a; \omega)$ (e.g., by TD(0), TD($\lambda$), or Monte Carlo) to approximate $Q^\pi(s,a)$.
  4. Policy Update: For each $s_t$ in the batch, compute the policy gradient

$$g_t = \sum_{a \in \mathcal{A}} \pi_\theta(a|s_t)\, \nabla_\theta\log\pi_\theta(a|s_t)\, \widehat Q(s_t,a;\omega)$$

and perform a gradient ascent step $\theta \leftarrow \theta + \alpha \cdot \frac{1}{T}\sum_{t=1}^T g_t$.
  5. Repeat until convergence, optionally decaying the step size.

The MAC framework naturally applies to cases where actions are discrete and $\mathcal{A}$ is not prohibitively large. Over-sampling critic updates per policy update can further enhance stability.
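As a concrete end-to-end illustration, the sketch below strings these steps into one training loop with a Monte Carlo critic fit. The environment (`CartPole-v1` via `gymnasium`), the network sizes, and the hyperparameters are illustrative assumptions rather than the settings used by Allen et al. (2017).

```python
import torch
import torch.nn as nn
import gymnasium as gym   # any discrete-action environment works here

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

for iteration in range(200):
    # Steps 1-2: roll out the current policy for one episode.
    obs, _ = env.reset()
    states, actions, rewards = [], [], []
    done = False
    while not done:
        s = torch.as_tensor(obs, dtype=torch.float32)
        probs = torch.softmax(policy(s), dim=-1)
        a = torch.multinomial(probs, 1).item()   # actions are still *executed* by sampling
        obs, r, terminated, truncated, _ = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        done = terminated or truncated

    # Step 3: critic update by Monte Carlo regression of Qhat(s_t, a_t) onto the return G_t.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    S = torch.stack(states)
    A = torch.as_tensor(actions)
    R = torch.as_tensor(returns, dtype=torch.float32)
    critic_loss = ((critic(S).gather(1, A.unsqueeze(1)).squeeze(1) - R) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Step 4: policy update with the MAC estimator (analytic average over ALL actions).
    probs = torch.softmax(policy(S), dim=-1)
    q_values = critic(S).detach()
    policy_loss = -(probs * q_values).sum(dim=-1).mean()
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()
```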

4. Applications in Mean-Field Control and Mean-Field Games

The MAC nomenclature has also been widely adopted in continuous-time reinforcement learning for mean-field control (MFC) and mean-field games (MFG). These settings generalize the single-agent RL context by introducing population-level effects: the state distribution $\mu_t$ of the population evolves along with the individual agents and enters the dynamics and costs.

In continuous-time mean-field MAC algorithms (Pham et al., 2023, Frikha et al., 2023), the state process evolves as a McKean–Vlasov SDE

$$dX_t = b(t, X_t, \mu_t, \alpha_t)\,dt + \sigma(t, X_t, \mu_t, \alpha_t)\,dW_t.$$

The learning task is to minimize

$$\mathbb{E}\Big[\int_0^T f(X_t, \mu_t, \alpha_t)\,dt + g(X_T, \mu_T)\Big]$$

over randomized policies $\pi_\theta$.
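In practice, expectations over $\mu_t$ are approximated by simulating an interacting particle system and replacing $\mu_t$ with the empirical measure of $M$ particles (see the Monte Carlo bullet below). The sketch that follows is a minimal Euler–Maruyama discretization of this idea for a toy one-dimensional model; the specific drift, volatility, costs, and control are illustrative assumptions, not a model from the cited papers.

```python
import numpy as np

# Minimal Euler-Maruyama particle discretization of a McKean-Vlasov SDE, with
# the population law mu_t approximated by the empirical measure of M particles.
# The mean-reverting drift, constant volatility, and quadratic costs below are
# illustrative assumptions only.
rng = np.random.default_rng(0)
M, N, T = 5_000, 100, 1.0          # particles, time steps, horizon
dt = T / N

def b(x, mean_mu, alpha):          # drift b(t, x, mu, alpha): pull toward the population mean
    return 0.5 * (mean_mu - x) + alpha

def sigma(x, mean_mu, alpha):      # volatility sigma(t, x, mu, alpha): constant here
    return 0.3 * np.ones_like(x)

X = rng.normal(0.0, 1.0, size=M)   # X_0 drawn from an assumed initial law
alpha = 0.1                        # a fixed open-loop control, for illustration
running_cost = 0.0
for _ in range(N):
    mean_mu = X.mean()             # low-order statistic of the empirical measure mu_t
    running_cost += dt * np.mean((X - mean_mu) ** 2 + alpha ** 2)   # example running cost f
    dW = rng.normal(0.0, np.sqrt(dt), size=M)
    X = X + b(X, mean_mu, alpha) * dt + sigma(X, mean_mu, alpha) * dW

terminal_cost = np.mean(X ** 2)    # example terminal cost g
print("estimated cost J(alpha):", running_cost + terminal_cost)
```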

Key features distinctive to mean-field MAC approaches:

  • Parameterization: The policy (“actor”) is typically a Gaussian whose mean is represented by a moment neural network (moment-NN) depending on $x$ and low-order empirical moments of $\mu_t$. The critic is also a moment-NN approximating the value function over the Wasserstein space of distributions.
  • Policy gradient: Incorporates both standard policy-gradient terms and additional terms arising from the mean-field (distributional) dependence, involving functional derivatives and explicit mean-field operators.
  • Critic loss: Enforces a martingale (HJB) condition using temporal-difference losses adapted to Wasserstein space inputs.
  • Monte Carlo over distributions: Training samples entire trajectories of particle systems to approximate distributional expectations.

The essential computation remains the same: MAC marginalizes out the action randomness in the policy gradient, using either an exact sum or an analytic expectation, which is particularly tractable for Gaussian policy parameterizations over continuous action spaces.
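As a schematic illustration of the moment-NN actor described above, the sketch below builds a Gaussian policy whose mean network takes the individual state together with the first $L$ empirical moments of the particle cloud. The feature choice (raw moments of one-dimensional particles), the dimensions, and the layer sizes are assumptions for illustration only.

```python
import numpy as np
import torch
import torch.nn as nn

# Sketch of a "moment neural network" actor: the Gaussian policy mean is a
# network of the individual state x and the first L empirical moments of mu_t,
# estimated from a particle batch. L, the moment choice, and all layer sizes
# are illustrative assumptions.
L = 2                                           # moment order (L = 2 or 3 in the experiments)

def moment_features(particles: np.ndarray) -> np.ndarray:
    """First L raw moments of the empirical measure (1D particles for simplicity)."""
    return np.array([np.mean(particles ** k) for k in range(1, L + 1)])

actor_mean = nn.Sequential(nn.Linear(1 + L, 32), nn.Tanh(), nn.Linear(32, 1))
log_std = torch.zeros(1, requires_grad=True)    # learned, state-independent std

def sample_action(x: float, particles: np.ndarray) -> torch.Tensor:
    feats = torch.as_tensor(
        np.concatenate(([x], moment_features(particles))), dtype=torch.float32)
    mean = actor_mean(feats)
    return torch.distributions.Normal(mean, log_std.exp()).rsample()

# Usage: a = sample_action(x=0.2, particles=np.random.randn(5_000))
```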

5. Empirical Findings and Benchmarks

Classic RL (Discrete-action, Non-Mean-Field):

  • MAC achieved faster learning than REINFORCE, Advantage REINFORCE, actor-critic, and advantage actor-critic baselines on CartPole; matched top baselines on LunarLander.
  • In six Atari 2600 games (Beamrider, Breakout, Pong, Q*bert, Seaquest, SpaceInvaders), MAC was competitive with state-of-the-art policy search methods (TRPO, Evolution Strategies, A2C), frequently outperforming or matching them after 50 million frames.
  • Lower empirical variance translated to smoother learning curves and higher sample efficiency, with particularly stable updates in high-variance games (Allen et al., 2017).

Mean-Field Control and Games:

  • Moment-NN MAC delivers sub-percent errors for linear-quadratic (LQ) models and robust tracking for nonlinear settings, even in two- and three-dimensional MFC problems (Pham et al., 2023).
  • In large-particle regimes ($M \sim 10^4$ or higher) and with moderate moment order ($L = 2$ or $3$), MAC’s learned policies and value functions closely match analytic solutions.
  • Actor-critic MAC in discrete-time linear-quadratic MFGs achieves provable global linear convergence to Nash equilibria, provided standard LQ-MFG structural assumptions are satisfied (Fu et al., 2019).

A summary table of select empirical domains is given below:

| Domain | MAC Variant | Key Outcome |
| --- | --- | --- |
| CartPole, LunarLander | Discrete-action MAC | Faster or equal learning; lower gradient variance |
| Atari 2600 (6 games) | Discrete-action MAC | Competitive/stable vs TRPO, ES, A2C (after 50M frames) |
| Systemic-risk LQ | Moment-NN MAC | Sub-percent value errors; matched control trajectories |
| 1D/2D/3D LQ optimal trade | Moment-NN MAC | Errors of 1–2%; robust for nonlinear, controlled-volatility settings |
| LQ mean-field games | Linear MAC (MFG) | Provable linear convergence to Nash equilibria |

6. Connections to Broader RL and Mean-Field Literature

The MAC approach formalizes a long-standing method for reducing variance in policy gradient estimators by replacing stochastic action sampling with analytical averaging, made feasible by tractable action spaces or efficient parameterizations. In RL, MAC is especially effective for discrete, moderate-size action sets. In mean-field settings, the combination of MAC with moment-NNs enables function approximation on spaces of distributions, a central technical problem in mean-field control and games.

The MAC formulation is distinct from baseline subtraction, advantage estimation, or entropy regularization. It is orthogonal to improvements in critic learning, function approximation architecture, or policy regularization. In mean-field game theory, MAC supports model-free learning of Nash equilibria under linear-quadratic structure with global convergence guarantees and extends naturally to high-dimensional, nonlinear, and volatility-controlled settings.

7. Limitations and Computational Considerations

MAC’s principal advantage is variance reduction without extra bias. However, the inner expectation over actions or analytic computation of integrals/sums can be prohibitive for large discrete or high-dimensional continuous action spaces. In MFC/MFG, efficiency is preserved by parameterizing policies and critics with moment-NNs and fitting over low-order empirical moments, but the computational cost scales with the number of particles and the order of moments used. Training in high dimensions or with high-order moments may require considerable compute time (reported as roughly $4 \times 10^4$ to $1.5 \times 10^5$ seconds on V100 GPUs for certain mean-field problems) (Pham et al., 2023). A plausible implication is that scalability to large-scale real-world MFC depends on further advances in neural architecture and sampling strategy.

References

  • (Allen et al., 2017) "Mean Actor Critic" — original discrete-action MAC in RL.
  • (Pham et al., 2023) "Actor critic learning algorithms for mean-field control with moment neural networks" — continuous-time mean-field MAC with moment-NN parameterization.
  • (Frikha et al., 2023) "Actor-Critic learning for mean-field control in continuous time" — continuous-time mean-field MAC with Wasserstein space parametrization.
  • (Fu et al., 2019) "Actor-Critic Provably Finds Nash Equilibria of Linear-Quadratic Mean-Field Games" — discrete-time MFG with provable convergence.
