
Mean Actor-Critic (MAC) Methods

Updated 19 January 2026
  • Mean Actor-Critic (MAC) is a reinforcement learning approach that analytically averages over all possible actions to compute unbiased, lower-variance policy gradients.
  • MAC reduces variance by replacing stochastic action sampling with exact summation, leading to improved sample efficiency and more stable learning in both discrete-action and mean-field settings.
  • In mean-field control and game settings, MAC employs moment neural networks and analytic expectations to approximate value functions over distributions, and in linear-quadratic mean-field games it admits provably convergent learning of Nash equilibria.

Mean Actor-Critic (MAC) is a class of reinforcement learning (RL) algorithms that modify the classic actor-critic framework by analytically averaging policy gradients over the entire action set, rather than estimating gradients using only the actions actually executed. This approach yields lower-variance policy gradient estimates, improved sample efficiency, and, in the context of mean-field control, extends to learning over distributions of states and actions. The term “Mean Actor-Critic” has been used in both standard RL for discrete-action spaces (Allen et al., 2017) and in mean-field control/game theory (continuous/distributional MAC) (Pham et al., 2023, Frikha et al., 2023, Fu et al., 2019).

1. The Mean Actor-Critic Principle in Reinforcement Learning

Classical on-policy policy-gradient algorithms aim to maximize the expected return

$$J(\theta) = \mathbb{E}_{s\sim d^\pi,\,a\sim\pi_\theta}[Q^\pi(s,a)].$$

The policy-gradient theorem yields

$$\nabla_\theta J(\theta) = \mathbb{E}_{s\sim d^\pi,\,a\sim\pi_\theta}[\nabla_\theta\log\pi_\theta(a|s)\, Q^\pi(s,a)].$$

In standard actor-critic (AC) frameworks, this expectation is estimated by sampling both states $s$ and executed actions $a$. The MAC algorithm exploits the identity

$$\mathbb{E}_{a\sim\pi_\theta}\big[\nabla_\theta\log\pi_\theta(a|s)\, Q^\pi(s,a)\big] = \sum_{a\in\mathcal{A}} \pi_\theta(a|s)\, \nabla_\theta\log\pi_\theta(a|s)\, Q^\pi(s,a).$$

By pushing the sum inside the expectation over states, MAC computes the policy gradient at each sampled state by analytically averaging over all actions, using the explicit forms of $\pi_\theta(a|s)$ and $Q^\pi(s,a)$, rather than relying on the particular action that was sampled. This yields a policy-gradient estimator of the form

$$\nabla_\theta J_{\mathrm{MAC}}(\theta) = \mathbb{E}_{s\sim d^\pi}\left[\sum_{a\in\mathcal{A}} \pi_\theta(a|s)\, \nabla_\theta\log\pi_\theta(a|s)\, Q^\pi(s,a)\right].$$

This estimator applies to discrete $\mathcal{A}$ and can be implemented efficiently when the action set is of modest size (Allen et al., 2017).
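For concreteness, the following is a minimal PyTorch sketch of this estimator for a discrete action set; the networks `policy_net` and `q_net`, their shapes, and the batching are illustrative assumptions, not details from the paper. Differentiating the analytic average $\sum_a \pi_\theta(a|s)\,\widehat Q(s,a)$ with the critic held fixed reproduces the MAC gradient above, since $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$.

```python
import torch
import torch.nn as nn

# Minimal sketch of the MAC policy-gradient estimator for a discrete action set.
# `policy_net` maps states to action logits; `q_net` maps states to per-action
# Q-value estimates (both architectures are assumed here for illustration).

def mac_policy_loss(policy_net: nn.Module, q_net: nn.Module,
                    states: torch.Tensor) -> torch.Tensor:
    """Negative MAC objective: mean over states of sum_a pi(a|s) * Qhat(s, a)."""
    probs = torch.softmax(policy_net(states), dim=-1)   # (batch, |A|)
    with torch.no_grad():                               # critic treated as a fixed target
        q_values = q_net(states)                        # (batch, |A|)
    # Analytic average over ALL actions: no action sampling enters the gradient.
    expected_q = (probs * q_values).sum(dim=-1)         # (batch,)
    return -expected_q.mean()

# Usage (shapes assumed): loss = mac_policy_loss(policy_net, q_net, batch_states),
# then loss.backward() and an optimizer step on policy_net's parameters.
```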

2. Theoretical Properties: Variance Reduction

The MAC gradient estimator is unbiased: it corresponds exactly to the true policy gradient as per the policy-gradient theorem. However, by replacing stochastic sampling over actions with an exact sum (or expectation), MAC eliminates the variance associated with action sampling. Specifically, for parameter component $i$, the MAC estimator

$$\mathrm{MAC}_i = \frac{1}{T} \sum_{t=1}^T \sum_{a} \pi_\theta(a|s_t)\, \nabla_{\theta_i}\log\pi_\theta(a|s_t)\, \widehat Q(s_t, a; \omega)$$

has strictly lower variance than its standard actor-critic counterpart whenever $\pi_\theta(\cdot|s)$ is not deterministic, assuming independence of the $\widehat Q(s,a)$ errors across $(s, a)$. This follows directly from Jensen's inequality applied to the variance across the discrete action space (Allen et al., 2017). No additional baseline function is required: subtracting an action-independent baseline cancels out exactly in the MAC formulation.
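The effect is easy to check numerically. The toy script below (all numbers are illustrative assumptions) fixes a single state with a softmax policy and a critic table, then compares the single-sample actor-critic estimator with the MAC estimator: both have the same mean, but the MAC estimator has zero variance over the action draw.

```python
import numpy as np

# Toy check of the variance claim at one fixed state: for a softmax policy over
# K actions with logits theta, grad_theta log pi(a) = e_a - pi, so both
# estimators below target the same gradient, but MAC removes action sampling.
rng = np.random.default_rng(0)
K = 4
pi = np.array([0.1, 0.2, 0.3, 0.4])      # assumed policy probabilities at this state
Q = np.array([1.0, -0.5, 2.0, 0.3])      # assumed critic values Qhat(s, a)

def sampled_grad():
    a = rng.choice(K, p=pi)
    score = -pi.copy(); score[a] += 1.0  # grad of log pi(a|s) w.r.t. the logits
    return score * Q[a]

mac_grad = sum(pi[a] * (np.eye(K)[a] - pi) * Q[a] for a in range(K))

samples = np.stack([sampled_grad() for _ in range(100_000)])
print("mean of sampled estimator:", samples.mean(axis=0))   # approx. mac_grad
print("MAC estimator            :", mac_grad)
print("per-coordinate variance (sampled):", samples.var(axis=0))
print("per-coordinate variance (MAC)    : 0 (exact sum over actions)")
```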

3. Implementation and Algorithmic Steps

A high-level algorithmic outline for the discrete-action MAC in RL is as follows (Allen et al., 2017):

  1. Initialize policy parameters $\theta$ and critic parameters $\omega$.
  2. Roll out the current policy to collect a batch of $T$ state transitions $\{(s_t, a_t, r_t, s_{t+1})\}$.
  3. Critic Update: Fit $\widehat Q(s, a; \omega)$ (e.g., by TD(0), TD($\lambda$), or Monte Carlo) to approximate $Q^\pi(s,a)$.
  4. Policy Update: For each $s_t$ in the batch, compute the policy gradient

$$g_t = \sum_{a \in \mathcal{A}} \pi_\theta(a|s_t)\, \nabla_\theta\log\pi_\theta(a|s_t)\, \widehat Q(s_t,a;\omega)$$

and perform a gradient ascent step $\theta \leftarrow \theta + \alpha \cdot \frac{1}{T}\sum_{t=1}^T g_t$.
  5. Repeat until convergence, optionally decaying the step size.

The MAC framework naturally applies to cases where actions are discrete and $\mathcal{A}$ is not prohibitively large. Over-sampling critic updates per policy update can further enhance stability.
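As a concrete end-to-end illustration, the sketch below strings these steps into one training loop with a Monte Carlo critic fit. The environment (`CartPole-v1` via `gymnasium`), the network sizes, and the hyperparameters are illustrative assumptions rather than the settings used by Allen et al. (2017).

```python
import torch
import torch.nn as nn
import gymnasium as gym   # any discrete-action environment works here

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

for iteration in range(200):
    # Steps 1-2: roll out the current policy for one episode.
    obs, _ = env.reset()
    states, actions, rewards = [], [], []
    done = False
    while not done:
        s = torch.as_tensor(obs, dtype=torch.float32)
        probs = torch.softmax(policy(s), dim=-1)
        a = torch.multinomial(probs, 1).item()   # actions are still *executed* by sampling
        obs, r, terminated, truncated, _ = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        done = terminated or truncated

    # Step 3: critic update by Monte Carlo regression of Qhat(s_t, a_t) onto the return G_t.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    S = torch.stack(states)
    A = torch.as_tensor(actions)
    R = torch.as_tensor(returns, dtype=torch.float32)
    critic_loss = ((critic(S).gather(1, A.unsqueeze(1)).squeeze(1) - R) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Step 4: policy update with the MAC estimator (analytic average over ALL actions).
    probs = torch.softmax(policy(S), dim=-1)
    q_values = critic(S).detach()
    policy_loss = -(probs * q_values).sum(dim=-1).mean()
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()
```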

4. Applications in Mean-Field Control and Mean-Field Games

The MAC nomenclature has also been widely adopted in continuous-time reinforcement learning for mean-field control (MFC) and mean-field games (MFG). These settings generalize the single-agent RL context by introducing population-level effects: the state distribution $\mu_t$ of the population evolves along with the individual agents and enters the dynamics and costs.

In continuous-time mean-field MAC algorithms (Pham et al., 2023, Frikha et al., 2023), the state process evolves as a McKean–Vlasov SDE

$$dX_t = b(t, X_t, \mu_t, \alpha_t)\,dt + \sigma(t, X_t, \mu_t, \alpha_t)\,dW_t.$$

The learning task is to minimize

$$\mathbb{E}\Big[\int_0^T f(X_t, \mu_t, \alpha_t)\,dt + g(X_T, \mu_T)\Big]$$

over randomized policies $\pi_\theta$.
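In practice, expectations over $\mu_t$ are approximated by simulating an interacting particle system and replacing $\mu_t$ with the empirical measure of $M$ particles (see the Monte Carlo bullet below). The sketch that follows is a minimal Euler–Maruyama discretization of this idea for a toy one-dimensional model; the specific drift, volatility, costs, and control are illustrative assumptions, not a model from the cited papers.

```python
import numpy as np

# Minimal Euler-Maruyama particle discretization of a McKean-Vlasov SDE, with
# the population law mu_t approximated by the empirical measure of M particles.
# The mean-reverting drift, constant volatility, and quadratic costs below are
# illustrative assumptions only.
rng = np.random.default_rng(0)
M, N, T = 5_000, 100, 1.0          # particles, time steps, horizon
dt = T / N

def b(x, mean_mu, alpha):          # drift b(t, x, mu, alpha): pull toward the population mean
    return 0.5 * (mean_mu - x) + alpha

def sigma(x, mean_mu, alpha):      # volatility sigma(t, x, mu, alpha): constant here
    return 0.3 * np.ones_like(x)

X = rng.normal(0.0, 1.0, size=M)   # X_0 drawn from an assumed initial law
alpha = 0.1                        # a fixed open-loop control, for illustration
running_cost = 0.0
for _ in range(N):
    mean_mu = X.mean()             # low-order statistic of the empirical measure mu_t
    running_cost += dt * np.mean((X - mean_mu) ** 2 + alpha ** 2)   # example running cost f
    dW = rng.normal(0.0, np.sqrt(dt), size=M)
    X = X + b(X, mean_mu, alpha) * dt + sigma(X, mean_mu, alpha) * dW

terminal_cost = np.mean(X ** 2)    # example terminal cost g
print("estimated cost J(alpha):", running_cost + terminal_cost)
```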

Key features distinctive to mean-field MAC approaches:

  • Parameterization: The policy (“actor”) is typically a Gaussian whose mean is represented by a moment neural network (moment-NN) depending on $x$ and low-order empirical moments of $\mu_t$. The critic is also a moment-NN approximating the value function over the Wasserstein space of distributions.
  • Policy gradient: Incorporates both standard policy-gradient terms and additional terms arising from the mean-field (distributional) dependence, involving functional derivatives and explicit mean-field operators.
  • Critic loss: Enforces a martingale (HJB) condition using temporal-difference losses adapted to Wasserstein space inputs.
  • Monte Carlo over distributions: Training samples entire trajectories of particle systems to approximate distributional expectations.

The essential computation remains the same: MAC marginalizes out the action randomness in the policy gradient, using either an exact sum or an analytic expectation, which is particularly tractable for Gaussian policy parameterizations over continuous action spaces.
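As a schematic illustration of the moment-NN actor described above, the sketch below builds a Gaussian policy whose mean network takes the individual state together with the first $L$ empirical moments of the particle cloud. The feature choice (raw moments of one-dimensional particles), the dimensions, and the layer sizes are assumptions for illustration only.

```python
import numpy as np
import torch
import torch.nn as nn

# Sketch of a "moment neural network" actor: the Gaussian policy mean is a
# network of the individual state x and the first L empirical moments of mu_t,
# estimated from a particle batch. L, the moment choice, and all layer sizes
# are illustrative assumptions.
L = 2                                           # moment order (L = 2 or 3 in the experiments)

def moment_features(particles: np.ndarray) -> np.ndarray:
    """First L raw moments of the empirical measure (1D particles for simplicity)."""
    return np.array([np.mean(particles ** k) for k in range(1, L + 1)])

actor_mean = nn.Sequential(nn.Linear(1 + L, 32), nn.Tanh(), nn.Linear(32, 1))
log_std = torch.zeros(1, requires_grad=True)    # learned, state-independent std

def sample_action(x: float, particles: np.ndarray) -> torch.Tensor:
    feats = torch.as_tensor(
        np.concatenate(([x], moment_features(particles))), dtype=torch.float32)
    mean = actor_mean(feats)
    return torch.distributions.Normal(mean, log_std.exp()).rsample()

# Usage: a = sample_action(x=0.2, particles=np.random.randn(5_000))
```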

5. Empirical Findings and Benchmarks

Classic RL (Discrete-action, Non-Mean-Field):

  • MAC achieved faster learning than REINFORCE, Advantage REINFORCE, actor-critic, and advantage actor-critic baselines on CartPole; matched top baselines on LunarLander.
  • In six Atari 2600 games (Beamrider, Breakout, Pong, Q*bert, Seaquest, SpaceInvaders), MAC was competitive with state-of-the-art policy search methods (TRPO, Evolution Strategies, A2C), frequently outperforming or matching them after 50 million frames.
  • Lower empirical variance translated to smoother learning curves and higher sample efficiency, with particularly stable updates in high-variance games (Allen et al., 2017).

Mean-Field Control and Games:

  • Moment-NN MAC delivers sub-percent errors for linear-quadratic (LQ) models and robust tracking for nonlinear settings, even in two- and three-dimensional MFC problems (Pham et al., 2023).
  • In large-particle regimes ($M \sim 10^4$ or higher) and with moderate moment order ($L = 2$ or $3$), MAC’s learned policies and value functions closely match analytic solutions.
  • Actor-critic MAC in discrete-time linear-quadratic MFGs achieves provable global linear convergence to Nash equilibria, provided standard LQ-MFG structural assumptions are satisfied (Fu et al., 2019).

A summary table of select empirical domains is given below:

| Domain | MAC Variant | Key Outcome |
| --- | --- | --- |
| CartPole, LunarLander | Discrete-action MAC | Faster or equal learning; lower gradient variance |
| Atari 2600 (6 games) | Discrete-action MAC | Competitive/stable vs TRPO, ES, A2C (after 50M frames) |
| Systemic-risk LQ | Moment-NN MAC | Sub-percent value errors; matched control trajectories |
| 1D/2D/3D LQ optimal trade | Moment-NN MAC | Errors of 1–2%; robust for nonlinear, controlled-volatility settings |
| LQ mean-field games | Linear MAC (MFG) | Provable linear convergence to Nash equilibria |

6. Connections to Broader RL and Mean-Field Literature

The MAC approach formalizes a long-standing method for reducing variance in policy gradient estimators by replacing stochastic action sampling with analytical averaging, made feasible by tractable action spaces or efficient parameterizations. In RL, MAC is especially effective for discrete, moderate-size action sets. In mean-field settings, the combination of MAC with moment-NNs enables function approximation on spaces of distributions, a central technical problem in mean-field control and games.

The MAC formulation is distinct from baseline subtraction, advantage estimation, or entropy regularization. It is orthogonal to improvements in critic learning, function approximation architecture, or policy regularization. In mean-field game theory, MAC supports model-free learning of Nash equilibria under linear-quadratic structure with global convergence guarantees and extends naturally to high-dimensional, nonlinear, and volatility-controlled settings.

7. Limitations and Computational Considerations

MAC’s principal advantage is variance reduction without extra bias. However, the inner expectation over actions or analytic computation of integrals/sums can be prohibitive for large discrete or high-dimensional continuous action spaces. In MFC/MFG, efficiency is preserved by parameterizing policies and critics with moment-NNs and fitting over low-order empirical moments, but the computational cost scales with the number of particles and the order of moments used. Training in high dimensions or with high-order moments may require considerable compute time (reported as roughly $4 \times 10^4$ to $1.5 \times 10^5$ seconds on V100 GPUs for certain mean-field problems) (Pham et al., 2023). A plausible implication is that scalability to large-scale real-world MFC depends on further advances in neural architecture and sampling strategy.

References

  • (Allen et al., 2017) "Mean Actor Critic" — original discrete-action MAC in RL.
  • (Pham et al., 2023) "Actor critic learning algorithms for mean-field control with moment neural networks" — continuous-time mean-field MAC with moment-NN parameterization.
  • (Frikha et al., 2023) "Actor-Critic learning for mean-field control in continuous time" — continuous-time mean-field MAC with Wasserstein space parametrization.
  • (Fu et al., 2019) "Actor-Critic Provably Finds Nash Equilibria of Linear-Quadratic Mean-Field Games" — discrete-time MFG with provable convergence.
