Actor-Critic Reinforcement Learning
- Actor-critic reinforcement learning is a framework with a dual-network architecture in which the actor selects actions and the critic estimates their value, enabling efficient policy improvement.
- The Mean Actor-Critic (MAC) method computes the gradient as an expectation over actions, significantly reducing variance compared to traditional sampling methods.
- Implementations typically utilize neural networks optimized by methods like RMSProp or Adam, enabling effective learning in discrete and continuous control tasks.
Actor-critic reinforcement learning (RL) algorithms constitute a fundamental class of policy-search methods that combine value-function estimation (critic) with explicit policy improvement (actor). The actor specifies actions via a parameterized policy, while the critic provides feedback by estimating value (typically the action-value function) to guide the actor’s updates. This architectural separation enables both efficient policy iteration and variance-reduced gradient estimation, and underpins most modern deep RL methods for discrete and continuous control.
1. Fundamental Architecture and Variance Reduction
In canonical policy-gradient RL, the agent seeks to maximize the expected (discounted) return
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)],$$
where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory and $R(\tau) = \sum_t \gamma^t r_t$ is the accumulated discounted reward. The basic score-function (REINFORCE) gradient estimator,
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big],$$
is unbiased but exhibits high variance, since it uses sampled returns as surrogates for the $Q$-function.
Actor-critic methods directly reduce this variance by learning a parametric critic $Q_w(s, a)$ and using it in place of the sampled return:
$$\nabla_\theta J(\theta) \approx \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s, a)\big],$$
possibly with a baseline $b(s)$ subtracted. However, this estimator still suffers from variance due to sampling a single action $a \sim \pi_\theta(\cdot \mid s)$ at each state $s$.
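As a concrete illustration of this sampled-action estimator, the sketch below (our own single-state "bandit" construction, not from the paper) draws one-sample actor-critic gradients for a softmax policy and compares their average against the exact gradient:

```python
import numpy as np

# Toy single-state ("bandit") illustration of the sampled-action estimator;
# our own construction, not from the paper. Softmax policy over 3 actions.
rng = np.random.default_rng(0)
Q = np.array([1.0, 2.0, 0.5])    # critic estimates Q_w(s, a)
theta = np.zeros(3)              # policy logits -> uniform softmax policy

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

pi = softmax(theta)

def sampled_ac_grad():
    """One-sample estimator: grad log pi(a) * Q_w(s, a), with a ~ pi."""
    a = rng.choice(3, p=pi)
    score = -pi.copy()
    score[a] += 1.0              # d log softmax / d theta = e_a - pi
    return score * Q[a]

# Each one-sample estimate is noisy; only their average matches the true
# gradient of the expected value sum_a pi(a) Q(a).
samples = np.array([sampled_ac_grad() for _ in range(20000)])
mean_grad = samples.mean(axis=0)
true_grad = sum(pi[a] * (np.eye(3)[a] - pi) * Q[a] for a in range(3))
per_sample_variance = samples.var(axis=0)
```

The averaged estimate converges to the true gradient, but each individual sample carries substantial variance, which is exactly what MAC removes.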
2. Mean Actor-Critic (MAC) and the Expectation over Actions
The Mean Actor-Critic (MAC) method (Allen et al., 2017) addresses the action-sampling variance by forming the gradient with an explicit expectation over the action space:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s}\Big[\sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, Q_w(s, a)\Big].$$
Empirically, MAC exhibits significantly reduced policy-gradient variance compared to standard actor-critic estimators, while maintaining identical bias properties (the bias being determined solely by the critic's approximation error). The reduction is strict unless the policy is deterministic. This construction enables stable and data-efficient policy optimization, particularly in high-dimensional or discrete-action settings, as verified on Atari and classic control benchmarks (Allen et al., 2017).
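For a discrete softmax policy, the per-state MAC term is an exact, deterministic sum over actions. A minimal sketch (toy single-state example, with illustrative names of our own):

```python
import numpy as np

# MAC forms the gradient as an explicit sum over actions rather than sampling
# a single action; toy single-state illustration, not the paper's code.
Q = np.array([1.0, 2.0, 0.5])           # critic values Q_w(s, a)
theta = np.array([0.2, -0.1, 0.0])      # softmax policy parameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

pi = softmax(theta)

# For a softmax, d pi(a)/d theta_j = pi[a] * (delta_aj - pi[j]).
dpi = np.diag(pi) - np.outer(pi, pi)    # dpi[a, j] = d pi(a)/d theta_j

# MAC gradient: sum_a (d pi(a)/d theta) * Q_w(s, a) -- deterministic given s.
mac_grad = dpi.T @ Q

# Sanity check: this equals the gradient of the expected value
# sum_a pi(a) Q(a), here verified by central finite differences.
def J(th):
    return softmax(th) @ Q

eps = 1e-6
fd_grad = np.zeros(3)
for j in range(3):
    e_j = eps * np.eye(3)[j]
    fd_grad[j] = (J(theta + e_j) - J(theta - e_j)) / (2 * eps)
```

No action is sampled at the state: given $s$ and the critic, the estimate is a fixed vector, which is the source of the variance reduction.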
3. Algorithmic Structure and Implementation
Typical implementations instantiate two function approximators: a policy network (actor), which outputs $\pi_\theta(a \mid s)$, and an action-value network (critic), which outputs $Q_w(s, a)$. Training proceeds as follows:
- Data collection: Roll out trajectories under the current policy.
- Critic update: Fit $Q_w$ to n-step returns, using, e.g., temporal-difference or Monte Carlo targets.
- Actor update: Compute gradients by backpropagation through the policy network, using either (a) the sampled-action estimator or (b) the MAC expectation over actions.
- Parameter update: Apply optimization steps (e.g., RMSProp, Adam) to both actor and critic.
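The four steps above can be sketched end-to-end on a toy bandit problem (our own minimal construction; the paper's experiments use neural networks on full MDPs):

```python
import numpy as np

# Minimal end-to-end sketch of the four training steps on a 3-armed bandit:
# softmax actor, tabular critic, MAC actor update, plain SGD. Illustrative
# only -- hyperparameters here are our own choices, not the paper's.
rng = np.random.default_rng(1)
true_means = np.array([0.2, 1.0, 0.5])   # unknown expected rewards per action
theta = np.zeros(3)                       # actor parameters (softmax logits)
Qw = np.zeros(3)                          # tabular critic estimates Q_w(a)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

alpha_critic, alpha_actor = 0.1, 0.05
for _ in range(5000):
    pi = softmax(theta)
    # 1. Data collection: act under the current policy, observe a reward.
    a = rng.choice(3, p=pi)
    r = true_means[a] + 0.1 * rng.standard_normal()
    # 2. Critic update: move Q_w(a) toward the observed return.
    Qw[a] += alpha_critic * (r - Qw[a])
    # 3. Actor update (MAC): full sum over actions; for a softmax,
    #    d pi(a)/d theta_j = pi[a] * (delta_aj - pi[j]) in closed form.
    dpi = np.diag(pi) - np.outer(pi, pi)
    grad = dpi.T @ Qw
    # 4. Parameter update: plain SGD here; RMSProp/Adam in practice.
    theta += alpha_actor * grad

final_pi = softmax(theta)
```

After training, the policy concentrates on the highest-reward action, with the critic supplying value estimates for all actions, sampled or not.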
For discrete-action, continuous-state settings, the MAC estimator is computed efficiently as a weighted sum over action gradients, leveraging the softmax output of the policy network (Allen et al., 2017).
4. Bias, Variance, and Theoretical Properties
Both classical and mean actor-critic use the same critic, so their gradient bias (due to the approximation error $Q_w \neq Q^{\pi}$) is algebraically identical. The crucial difference lies in variance:
- For AC, variance arises from action sampling: $\mathrm{Var}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s, a)\big] > 0$ in general, with $a \sim \pi_\theta(\cdot \mid s)$.
- For MAC, the per-state estimate is deterministic, so $\mathrm{Var}_{\mathrm{MAC}} \le \mathrm{Var}_{\mathrm{AC}}$, with equality only for deterministic policies.
By the law of total variance (conditioning on the state), this variance reduction is a strict improvement unless the actor has collapsed to a delta function.
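The variance comparison can be checked numerically. The sketch below (a two-state toy of our own construction) compares second moments of the two estimators; since both have the same mean, the gap in second moments equals the gap in variance:

```python
import numpy as np

# Two-state toy check of the variance claim (our own construction). Each
# state has its own softmax logits, so gradients for different states live
# in disjoint coordinates and squared norms simply add.
pi = np.array([[0.7, 0.3],      # pi(a|s0)
               [0.4, 0.6]])     # pi(a|s1)
Q = np.array([[1.0, -1.0],      # Q_w(s0, a)
              [0.5,  2.0]])     # Q_w(s1, a)
d = np.array([0.5, 0.5])        # state distribution

def score(s, a):
    """d/dtheta_s log pi(a|s) for per-state softmax logits: e_a - pi(.|s)."""
    g = -pi[s].copy()
    g[a] += 1.0
    return g

# AC samples a ~ pi(.|s) and uses grad log pi(a|s) * Q(s,a); its conditional
# mean at each state equals the MAC term sum_a grad pi(a|s) Q(s,a).
ac_second_moment = 0.0   # E[ ||g_AC||^2 ] over states and actions
mac_second_moment = 0.0  # E[ ||g_MAC||^2 ] over states (deterministic per s)
for s in range(2):
    mac = sum(pi[s, a] * score(s, a) * Q[s, a] for a in range(2))
    mac_second_moment += d[s] * (mac @ mac)
    for a in range(2):
        g = score(s, a) * Q[s, a]
        ac_second_moment += d[s] * pi[s, a] * (g @ g)
```

Because the first moments coincide, `ac_second_moment > mac_second_moment` is exactly the strict variance reduction predicted by conditioning on the state.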
5. Empirical Performance and Scalability
Empirical results highlight MAC's competitive or superior performance on classic control (CartPole, LunarLander) and Atari games. For instance, on CartPole, MAC attains a higher average episode length than advantage actor-critic (Adv-AC), with faster early learning. On a 6-game Atari subset, MAC matches or exceeds A2C and is competitive with TRPO and Evolution Strategies. Importantly, the approach scales gracefully to high-dimensional observations using deep convolutional architectures (Allen et al., 2017).
6. Practical Considerations: Networks, Regularization, and Hyperparameters
For high-dimensional tasks (e.g., Atari), architectures typically comprise three convolutional layers followed by a large fully connected layer, with two output heads: one for policy logits, one for $Q$-values. The policy loss is the MAC objective $L(\theta) = -\sum_{a} \pi_\theta(a \mid s)\, Q_w(s, a)$, a $Q$-weighted, cross-entropy-style term, and may include entropy regularization. Practical optimization uses RMSProp or Adam with tuned learning rates and batch sizes of 5–20 transitions per update. MAC does not require an explicit baseline: any state-dependent baseline $b(s)$ cancels algebraically in the sum over action gradients, since $\sum_a \nabla_\theta \pi_\theta(a \mid s)\, b(s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = 0$.
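A sketch of such a policy loss for a discrete policy head (our own formulation matching the MAC gradient; `beta` is an illustrative entropy weight, not a value from the paper):

```python
import numpy as np

# Sketch of a MAC-style actor loss for a discrete policy head. The gradient
# of -sum_a pi(a|s) Q_w(s,a) w.r.t. the logits is exactly the negative MAC
# gradient; `beta` is an illustrative entropy-regularization weight.
def mac_policy_loss(logits, q_values, beta=0.01):
    z = logits - logits.max()                 # stable softmax
    pi = np.exp(z) / np.exp(z).sum()
    expected_q = pi @ q_values                # sum_a pi(a|s) Q_w(s, a)
    entropy = -(pi * np.log(pi + 1e-12)).sum()
    return -(expected_q + beta * entropy)

logits = np.array([0.1, 0.4, -0.2])
q_values = np.array([1.0, 2.0, 0.5])
loss = mac_policy_loss(logits, q_values)
```

Minimizing this loss increases the probability of high-$Q$ actions while the entropy bonus keeps the policy from collapsing prematurely; in a deep-learning framework the same expression would be written on the softmax head and differentiated automatically.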
7. Limitations and Extensions
The MAC method is best suited to discrete-action settings, as summing over all actions remains tractable. For continuous-action domains, alternate strategies such as sampled-action gradients, deterministic policy gradients, or extensions utilizing second-order information (as in Guide Actor-Critic (Tangkaratt et al., 2017)) are necessary. MAC’s theoretical variance reduction holds independently of the critic’s quality—as critic variance (or bias) rises, so does the variance (or bias) of both MAC and AC. In the limit of perfect critic estimation, MAC yields the lowest possible variance among sample-based policy gradient estimators for discrete actions.
References:
- “Mean Actor-Critic” (Allen et al., 2017)