Actor-Critic Reinforcement Learning
- Actor-critic reinforcement learning is a framework with a dual-network architecture in which the actor selects actions and the critic estimates their value, enabling efficient policy improvement.
- The Mean Actor-Critic (MAC) method computes the gradient as an expectation over actions, significantly reducing variance compared to traditional sampling methods.
- Implementations typically utilize neural networks optimized by methods like RMSProp or Adam, enabling effective learning in discrete and continuous control tasks.
Actor-critic reinforcement learning (RL) algorithms constitute a fundamental class of policy-search methods that combine value-function estimation (critic) with explicit policy improvement (actor). The actor specifies actions via a parameterized policy, while the critic provides feedback by estimating value (typically the action-value function) to guide the actor’s updates. This architectural separation enables both efficient policy iteration and variance-reduced gradient estimation, and underpins most modern deep RL methods for discrete and continuous control.
1. Fundamental Architecture and Variance Reduction
In canonical policy-gradient RL, the agent seeks to maximize the expected (discounted) return
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)],$$
where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ is a trajectory and $R(\tau) = \sum_t \gamma^t r_t$ is the accumulated discounted reward. The basic score-function (REINFORCE) gradient estimator,
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\Big],$$
is unbiased but exhibits high variance, since it uses sampled returns as surrogates for the $Q$-function.
Actor-critic methods directly reduce this variance by learning a parametric critic $Q_w(s, a)$ and using it in place of the sampled return:
$$\nabla_\theta J(\theta) \approx \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s, a)\big],$$
possibly with a baseline $b(s)$ subtracted. However, this estimator still suffers from variance due to sampling a single action $a \sim \pi_\theta(\cdot \mid s)$ at each state $s$.
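As a concrete illustration of this sampled-action estimator, the sketch below (our own single-state "bandit" construction, not from the paper) draws one-sample actor-critic gradients for a softmax policy and compares their average against the exact gradient:

```python
import numpy as np

# Toy single-state ("bandit") illustration of the sampled-action estimator;
# our own construction, not from the paper. Softmax policy over 3 actions.
rng = np.random.default_rng(0)
Q = np.array([1.0, 2.0, 0.5])    # critic estimates Q_w(s, a)
theta = np.zeros(3)              # policy logits -> uniform softmax policy

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

pi = softmax(theta)

def sampled_ac_grad():
    """One-sample estimator: grad log pi(a) * Q_w(s, a), with a ~ pi."""
    a = rng.choice(3, p=pi)
    score = -pi.copy()
    score[a] += 1.0              # d log softmax / d theta = e_a - pi
    return score * Q[a]

# Each one-sample estimate is noisy; only their average matches the true
# gradient of the expected value sum_a pi(a) Q(a).
samples = np.array([sampled_ac_grad() for _ in range(20000)])
mean_grad = samples.mean(axis=0)
true_grad = sum(pi[a] * (np.eye(3)[a] - pi) * Q[a] for a in range(3))
per_sample_variance = samples.var(axis=0)
```

The averaged estimate converges to the true gradient, but each individual sample carries substantial variance, which is exactly what MAC removes.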
2. Mean Actor-Critic (MAC) and the Expectation over Actions
The Mean Actor-Critic (MAC) method (Allen et al., 2017) addresses the action-sampling variance by forming the gradient with an explicit expectation over the action space:
$$\nabla_\theta J(\theta) = \mathbb{E}_{s}\Big[\sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, Q_w(s, a)\Big].$$
Empirically, MAC exhibits significantly reduced policy-gradient variance compared to standard actor-critic estimators, while maintaining identical bias properties (the bias being determined solely by the critic's approximation error). The reduction is strict unless the policy is deterministic. This construction enables stable and data-efficient policy optimization, particularly in high-dimensional or discrete-action settings, as verified on Atari and classic control benchmarks (Allen et al., 2017).
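For a discrete softmax policy, the per-state MAC term is an exact, deterministic sum over actions. A minimal sketch (toy single-state example, with illustrative names of our own):

```python
import numpy as np

# MAC forms the gradient as an explicit sum over actions rather than sampling
# a single action; toy single-state illustration, not the paper's code.
Q = np.array([1.0, 2.0, 0.5])           # critic values Q_w(s, a)
theta = np.array([0.2, -0.1, 0.0])      # softmax policy parameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

pi = softmax(theta)

# For a softmax, d pi(a)/d theta_j = pi[a] * (delta_aj - pi[j]).
dpi = np.diag(pi) - np.outer(pi, pi)    # dpi[a, j] = d pi(a)/d theta_j

# MAC gradient: sum_a (d pi(a)/d theta) * Q_w(s, a) -- deterministic given s.
mac_grad = dpi.T @ Q

# Sanity check: this equals the gradient of the expected value
# sum_a pi(a) Q(a), here verified by central finite differences.
def J(th):
    return softmax(th) @ Q

eps = 1e-6
fd_grad = np.zeros(3)
for j in range(3):
    e_j = eps * np.eye(3)[j]
    fd_grad[j] = (J(theta + e_j) - J(theta - e_j)) / (2 * eps)
```

No action is sampled at the state: given $s$ and the critic, the estimate is a fixed vector, which is the source of the variance reduction.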
3. Algorithmic Structure and Implementation
Typical implementations instantiate two function approximators: a policy network (actor), which outputs $\pi_\theta(a \mid s)$, and an action-value network (critic), which outputs $Q_w(s, a)$. Training proceeds as follows:
- Data collection: Roll out trajectories under the current policy.
- Critic update: Fit $Q_w$ to n-step returns, using, e.g., temporal-difference or Monte Carlo targets.
- Actor update: Compute gradients by backpropagation through the policy network, using either (a) the sampled-action estimator or (b) the MAC expectation over actions.
- Parameter update: Apply optimization steps (e.g., RMSProp, Adam) to both actor and critic.
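The four steps above can be sketched end-to-end on a toy bandit problem (our own minimal construction; the paper's experiments use neural networks on full MDPs):

```python
import numpy as np

# Minimal end-to-end sketch of the four training steps on a 3-armed bandit:
# softmax actor, tabular critic, MAC actor update, plain SGD. Illustrative
# only -- hyperparameters here are our own choices, not the paper's.
rng = np.random.default_rng(1)
true_means = np.array([0.2, 1.0, 0.5])   # unknown expected rewards per action
theta = np.zeros(3)                       # actor parameters (softmax logits)
Qw = np.zeros(3)                          # tabular critic estimates Q_w(a)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

alpha_critic, alpha_actor = 0.1, 0.05
for _ in range(5000):
    pi = softmax(theta)
    # 1. Data collection: act under the current policy, observe a reward.
    a = rng.choice(3, p=pi)
    r = true_means[a] + 0.1 * rng.standard_normal()
    # 2. Critic update: move Q_w(a) toward the observed return.
    Qw[a] += alpha_critic * (r - Qw[a])
    # 3. Actor update (MAC): full sum over actions; for a softmax,
    #    d pi(a)/d theta_j = pi[a] * (delta_aj - pi[j]) in closed form.
    dpi = np.diag(pi) - np.outer(pi, pi)
    grad = dpi.T @ Qw
    # 4. Parameter update: plain SGD here; RMSProp/Adam in practice.
    theta += alpha_actor * grad

final_pi = softmax(theta)
```

After training, the policy concentrates on the highest-reward action, with the critic supplying value estimates for all actions, sampled or not.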
For discrete-action, continuous-state settings, the MAC estimator is computed efficiently as a weighted sum over action gradients, leveraging the softmax output of the policy network (Allen et al., 2017).
4. Bias, Variance, and Theoretical Properties
Both classical and mean actor-critic use the same critic, so their gradient bias (due to the approximation error $Q_w \neq Q^{\pi}$) is algebraically identical. The crucial difference lies in variance:
- For AC, variance arises from action sampling: $\mathrm{Var}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q_w(s, a)\big] > 0$ in general, with $a \sim \pi_\theta(\cdot \mid s)$.
- For MAC, the per-state estimate is deterministic, so $\mathrm{Var}_{\mathrm{MAC}} \le \mathrm{Var}_{\mathrm{AC}}$, with equality only for deterministic policies.
By the law of total variance (conditioning on the state), this variance reduction is a strict improvement unless the actor has collapsed to a delta function.
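The variance comparison can be checked numerically. The sketch below (a two-state toy of our own construction) compares second moments of the two estimators; since both have the same mean, the gap in second moments equals the gap in variance:

```python
import numpy as np

# Two-state toy check of the variance claim (our own construction). Each
# state has its own softmax logits, so gradients for different states live
# in disjoint coordinates and squared norms simply add.
pi = np.array([[0.7, 0.3],      # pi(a|s0)
               [0.4, 0.6]])     # pi(a|s1)
Q = np.array([[1.0, -1.0],      # Q_w(s0, a)
              [0.5,  2.0]])     # Q_w(s1, a)
d = np.array([0.5, 0.5])        # state distribution

def score(s, a):
    """d/dtheta_s log pi(a|s) for per-state softmax logits: e_a - pi(.|s)."""
    g = -pi[s].copy()
    g[a] += 1.0
    return g

# AC samples a ~ pi(.|s) and uses grad log pi(a|s) * Q(s,a); its conditional
# mean at each state equals the MAC term sum_a grad pi(a|s) Q(s,a).
ac_second_moment = 0.0   # E[ ||g_AC||^2 ] over states and actions
mac_second_moment = 0.0  # E[ ||g_MAC||^2 ] over states (deterministic per s)
for s in range(2):
    mac = sum(pi[s, a] * score(s, a) * Q[s, a] for a in range(2))
    mac_second_moment += d[s] * (mac @ mac)
    for a in range(2):
        g = score(s, a) * Q[s, a]
        ac_second_moment += d[s] * pi[s, a] * (g @ g)
```

Because the first moments coincide, `ac_second_moment > mac_second_moment` is exactly the strict variance reduction predicted by conditioning on the state.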
5. Empirical Performance and Scalability
Empirical results highlight MAC's competitive or superior performance on classic control (CartPole, LunarLander) and Atari games. For instance, on CartPole, MAC attains a higher average episode length than advantage actor-critic (Adv-AC), with faster early learning. On a 6-game Atari subset, MAC matches or exceeds A2C and is competitive with TRPO and Evolution Strategies. Importantly, the approach scales gracefully to high-dimensional observations using deep convolutional architectures (Allen et al., 2017).
6. Practical Considerations: Networks, Regularization, and Hyperparameters
For high-dimensional tasks (e.g., Atari), architectures typically comprise three convolutional layers followed by a large fully connected layer, with two output heads: one for policy logits, one for $Q$-values. The policy loss is the MAC objective $L(\theta) = -\sum_{a} \pi_\theta(a \mid s)\, Q_w(s, a)$, a $Q$-weighted, cross-entropy-style term, and may include entropy regularization. Practical optimization uses RMSProp or Adam with tuned learning rates and batch sizes of 5–20 transitions per update. MAC does not require an explicit baseline: any state-dependent baseline $b(s)$ cancels algebraically in the sum over action gradients, since $\sum_a \nabla_\theta \pi_\theta(a \mid s)\, b(s) = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = 0$.
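A sketch of such a policy loss for a discrete policy head (our own formulation matching the MAC gradient; `beta` is an illustrative entropy weight, not a value from the paper):

```python
import numpy as np

# Sketch of a MAC-style actor loss for a discrete policy head. The gradient
# of -sum_a pi(a|s) Q_w(s,a) w.r.t. the logits is exactly the negative MAC
# gradient; `beta` is an illustrative entropy-regularization weight.
def mac_policy_loss(logits, q_values, beta=0.01):
    z = logits - logits.max()                 # stable softmax
    pi = np.exp(z) / np.exp(z).sum()
    expected_q = pi @ q_values                # sum_a pi(a|s) Q_w(s, a)
    entropy = -(pi * np.log(pi + 1e-12)).sum()
    return -(expected_q + beta * entropy)

logits = np.array([0.1, 0.4, -0.2])
q_values = np.array([1.0, 2.0, 0.5])
loss = mac_policy_loss(logits, q_values)
```

Minimizing this loss increases the probability of high-$Q$ actions while the entropy bonus keeps the policy from collapsing prematurely; in a deep-learning framework the same expression would be written on the softmax head and differentiated automatically.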
7. Limitations and Extensions
The MAC method is best suited to discrete-action settings, as summing over all actions remains tractable. For continuous-action domains, alternate strategies such as sampled-action gradients, deterministic policy gradients, or extensions utilizing second-order information (as in Guide Actor-Critic (Tangkaratt et al., 2017)) are necessary. MAC’s theoretical variance reduction holds independently of the critic’s quality—as critic variance (or bias) rises, so does the variance (or bias) of both MAC and AC. In the limit of perfect critic estimation, MAC yields the lowest possible variance among sample-based policy gradient estimators for discrete actions.
References:
- “Mean Actor-Critic” (Allen et al., 2017)