
Actor-Critic (A2C) Algorithms

Updated 26 April 2026
  • Actor-Critic (A2C) is a reinforcement learning algorithm that combines policy gradient optimization with value function estimation to achieve low-variance, unbiased updates.
  • It utilizes a dual-network structure where the actor updates the policy via gradient ascent while the critic minimizes TD error to stabilize learning.
  • Recent variants such as distributional A2C, RLS-based methods, and momentum acceleration significantly enhance convergence rates and scalability across diverse RL tasks.

Actor-Critic (A2C) algorithms are a class of reinforcement learning (RL) methods that combine policy-gradient optimization with value function approximation. The A2C framework produces low-variance, unbiased estimators of the policy gradient by combining the critic's estimate of the value function with the actor's stochastic policy optimization. The "advantage" formulation is optimal among control variate constructions conditioned only on state, and has become foundational for modern deep policy gradient techniques due to its efficiency, scalability, and extensibility across discrete and continuous, single-agent and multi-agent, synchronous and asynchronous RL domains (Benhamou, 2019, Peng et al., 2017, Li et al., 2018, Parisi et al., 2018, Bhatnagar et al., 2022, Xiao et al., 2022, Veldhuizen, 2022, Wang et al., 2022, Dong et al., 2024).

1. Theoretical Foundations and Control Variate Perspective

A2C methods build on the REINFORCE stochastic policy gradient, which seeks to maximize the expected return $J(\theta)$ by ascending its gradient

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R_t\Big]$$

where $R_t$ is the total discounted return from time $t$. Direct estimation of this gradient suffers from high variance. Introducing a control variate (a baseline $b(s)$) preserves unbiasedness but reduces variance. The canonical actor-critic choice is $b(s) = V^\pi(s)$, the value function under the current policy, yielding the advantage $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$.

Formally, for the return random variable $R_t$ and the Hilbert space $\mathcal{L}^2$ of square-integrable functions on policy-generated trajectories:

  • The optimal $\mathcal{L}^2$ projection of $R_t$ onto functions measurable with respect to $(s_t, a_t)$ is $Q^\pi(s_t, a_t)$.
  • The optimal $\mathcal{L}^2$ projection onto functions of $s_t$ alone is $V^\pi(s_t)$.

Thus, the A2C gradient estimator

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^\pi(s_t, a_t)\Big]$$

is provably optimal among all state-conditioned baselines. By the law of total variance, the per-step variance reduction equals the variance of the value function across states,

$$\mathrm{Var}[R_t] - \mathrm{Var}\big[R_t - V^\pi(s_t)\big] = \mathrm{Var}\big(\mathbb{E}[R_t \mid s_t]\big) = \mathrm{Var}\big(V^\pi(s_t)\big)$$

(Benhamou, 2019).
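
A small numerical sketch of this variance reduction, using synthetic returns from a two-state toy problem (the states, means, and noise level are illustrative assumptions, not data from the cited work):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: two states with very different expected returns.
# The raw REINFORCE weight uses R_t; the baselined version uses R_t - V^pi(s_t).
n = 100_000
states = rng.integers(0, 2, size=n)                    # s_t in {0, 1}
mean_return = np.where(states == 0, 1.0, 10.0)         # E[R_t | s_t] = V^pi(s_t)
returns = mean_return + rng.normal(0.0, 1.0, size=n)   # R_t with unit conditional variance

advantage = returns - mean_return                      # subtract the optimal state baseline

print("Var[R_t]             ~", returns.var())    # ~ Var(V^pi(s_t)) + E[Var(R_t|s_t)] ~ 21.25
print("Var[R_t - V^pi(s_t)] ~", advantage.var())  # ~ E[Var(R_t|s_t)] = 1
```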

2. Canonical A2C Algorithm and Implementation

A2C employs two parameterized networks: the actor $\pi_\theta(a \mid s)$ and the critic $V_w(s)$. The typical synchronous A2C algorithm processes fixed-length experience rollouts and performs simultaneous, but decoupled, parameter updates. The key computation steps are:

  • For each transition $(s_t, a_t, r_t, s_{t+1})$, compute the one-step TD error as an unbiased advantage estimate:

$$\delta_t = r_t + \gamma V_w(s_{t+1}) - V_w(s_t)$$

  • The actor is updated via gradient ascent on the score-weighted advantage:

$$\theta \leftarrow \theta + \alpha\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$

  • The critic minimizes the squared TD error (treating the bootstrap target as fixed):

$$w \leftarrow w + \beta\, \delta_t\, \nabla_w V_w(s_t)$$

Actor and critic updates are decoupled and typically applied in batch or mini-batch fashion, ensuring tractable parallelization and scalability (Peng et al., 2017, Veldhuizen, 2022). The use of the one-step TD error as an advantage estimator enables efficient online, on-policy data utilization, facilitating adaptive control even with non-stationary tasks (Veldhuizen, 2022).
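
A minimal PyTorch-style sketch of one synchronous A2C update over a rollout batch; the network sizes, learning rates, and the discrete-action assumption are illustrative choices rather than settings from the cited papers:

```python
import torch
import torch.nn as nn

gamma = 0.99  # discount factor (assumed)
actor = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # logits over 2 actions
critic = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))  # state value V_w(s)
actor_opt = torch.optim.Adam(actor.parameters(), lr=7e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=7e-4)

def a2c_update(states, actions, rewards, next_states, dones):
    """One batched A2C step from a fixed-length rollout (tensors of shape [T, ...])."""
    values = critic(states).squeeze(-1)
    with torch.no_grad():  # bootstrap target treated as a constant
        next_values = critic(next_states).squeeze(-1)
        targets = rewards + gamma * (1.0 - dones) * next_values
    td_error = targets - values                 # one-step advantage estimate delta_t

    # Critic: minimize the squared TD error.
    critic_loss = td_error.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend log pi(a|s) weighted by the detached advantage.
    log_probs = torch.distributions.Categorical(logits=actor(states)).log_prob(actions)
    actor_loss = -(log_probs * td_error.detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```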

3. Variance-Optimal Extensions and Momentum Accelerations

Recent research frames A2C as the optimal single-dimensional control variate estimator, with extensions to multi-dimensional variance reduction. By constructing auxiliary, zero-mean control variates $c_1, \dots, c_k$ and solving for optimal weights $\lambda_1, \dots, \lambda_k$, the gradient estimator

$$\hat{g} = \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^\pi(s_t, a_t) - \sum_{i=1}^{k} \lambda_i\, c_i(s_t, a_t)$$

achieves strictly lower variance. In empirical studies (Atari, MuJoCo), multi-control-variate A2C variants yield up to 50% reduction in gradient variance and 20–40% faster convergence compared to classical A2C (Benhamou, 2019).
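
A brief empirical sketch of how such control-variate weights can be estimated from samples; the least-squares form below is the standard variance-minimizing solution, while the synthetic data and the scalar-gradient simplification are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# g: per-sample (scalar) gradient estimates; C: k zero-mean control variates correlated with g.
n, k = 50_000, 3
C = rng.normal(size=(n, k))
g = 2.0 + C @ np.array([0.8, -0.5, 0.3]) + rng.normal(scale=0.5, size=n)

# Optimal weights minimize Var(g - C @ lam):  lam* = Cov(C)^{-1} Cov(C, g).
lam = np.linalg.solve(np.cov(C, rowvar=False), np.cov(C.T, g)[:-1, -1])
g_cv = g - C @ lam  # lower variance, still unbiased because E[C] = 0

print("Var before:", g.var(), " Var after:", g_cv.var(), " mean (unchanged):", g_cv.mean())
```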

To further accelerate critic convergence under function approximation, heavy-ball momentum (HB-A2C) introduces a momentum recursion into the critic's update,

$$w_{t+1} = w_t + \beta_t\, \delta_t\, \nabla_w V_{w_t}(s_t) + \eta\,(w_t - w_{t-1})$$

or equivalently, in momentum-buffer form,

$$m_t = \eta\, m_{t-1} + \beta_t\, \delta_t\, \nabla_w V_{w_t}(s_t), \qquad w_{t+1} = w_t + m_t$$

The demonstrated convergence rate matches or surpasses that of prior single-time-scale A2C analyses, given appropriate selection of the momentum factor and step sizes. Theoretical analysis gives explicit guidelines for balancing initialization and stochastic approximation errors via the momentum factor (Dong et al., 2024).
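
A schematic of a heavy-ball momentum recursion applied to a linear critic; the feature map, step size `beta`, and momentum factor `eta` are generic assumptions and not the exact scheme or constants of the cited paper:

```python
import numpy as np

def hb_critic_step(w, w_prev, phi_s, phi_next, r, beta=0.05, eta=0.9, gamma=0.99):
    """One heavy-ball TD(0) step for a linear critic V_w(s) = w @ phi(s).

    Returns (new parameters, current parameters) so the caller can thread the
    momentum term w_t - w_{t-1} through successive updates.
    """
    delta = r + gamma * w @ phi_next - w @ phi_s            # one-step TD error
    w_new = w + beta * delta * phi_s + eta * (w - w_prev)   # semi-gradient step + heavy-ball momentum
    return w_new, w

# Threading the recursion over a short synthetic trajectory of random features.
rng = np.random.default_rng(2)
w, w_prev = np.zeros(4), np.zeros(4)
for _ in range(100):
    phi_s, phi_next = rng.normal(size=4), rng.normal(size=4)
    w, w_prev = hb_critic_step(w, w_prev, phi_s, phi_next, r=rng.normal())
```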

4. Algorithmic Variants: Distributional, RLS, TD-Regularized, and Multi-Agent A2C

Multiple algorithmic improvements extend A2C's applicability and performance:

  • Distributional A2C (DA2C/QR-A2C): Replaces scalar Q-estimation with quantile-based return distributions. The critic outputs $N$ quantile values per action, trained via quantile regression minimizing a Huber loss (a minimal sketch follows this list). Advantages are computed as differences in empirical quantiles, giving improved variance, robustness against multimodality, and enhanced stability on challenging domains such as Atari Assault. The choice of atom count $N$ trades off variance reduction against compute (Li et al., 2018).
  • Recursive Least Squares A2C (RLSSA2C, RLSNA2C): Applies RLS updates to both the critic and the actor's hidden layers. RLSSA2C uses standard policy gradients; RLSNA2C further employs Kronecker-factored RLS and natural policy gradients for the policy parameter. Both report improved sample efficiency and final performance on Atari and MuJoCo, at the cost of moderate additional computation (Wang et al., 2022).
  • TD-Regularized Actor-Critic: Integrates a quadratic penalty on the critic's TD error into the actor's loss, yielding

$$\mathcal{L}_{\text{actor}}^{\text{TD-reg}}(\theta) = \mathcal{L}_{\text{actor}}(\theta) + \eta\, \mathbb{E}\big[\delta_t^2\big]$$

This prevents large, potentially destabilizing policy updates when the critic is inaccurate. Empirical results show increased stability and sample efficiency, especially under function approximation or high critic bias (Parisi et al., 2018).

  • Asynchronous and Multi-Agent A2C: Formulations where policies are optimized over temporally extended macro-actions, permitting agents to update asynchronously and independently. CTDE (Centralized Training with Decentralized Execution) variants with individualized centralized critics empirically outperform both primitive-action A2C and value-based alternatives, particularly on long-horizon, large-scale cooperative multi-agent tasks (Xiao et al., 2022).
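
As referenced in the distributional A2C entry above, the following is a minimal sketch of a quantile-regression Huber loss for training a quantile critic; the atom count, the Huber threshold `kappa`, and the tensor layout are illustrative assumptions:

```python
import torch

def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """Quantile-regression Huber loss for a distributional critic.

    pred_quantiles : [batch, N]  predicted quantile values for the taken action
    target_samples : [batch, M]  samples (or target quantiles) of the return distribution
    """
    n = pred_quantiles.shape[1]
    # Midpoint quantile fractions tau_i = (2i + 1) / (2N), i = 0..N-1.
    tau = (torch.arange(n, dtype=pred_quantiles.dtype) + 0.5) / n

    # Pairwise residuals u_ij = target_j - pred_i, shape [batch, M, N].
    u = target_samples.unsqueeze(-1) - pred_quantiles.unsqueeze(1)
    huber = torch.where(u.abs() <= kappa, 0.5 * u.pow(2), kappa * (u.abs() - 0.5 * kappa))
    weight = (tau - (u.detach() < 0).float()).abs()  # asymmetric quantile weighting |tau - 1{u<0}|
    return (weight * huber).mean()

# Example: 51 predicted quantiles scored against 32 sampled Bellman targets.
loss = quantile_huber_loss(torch.randn(8, 51), torch.randn(8, 32))
```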

5. Two-Time-Scale Analysis and Critic-Actor Reversal

Classic A2C architectures use a two-time-scale stochastic approximation schema: the critic (value function estimator) is updated on a faster time-scale (larger step-size) than the actor (policy parameters). This design ensures the actor perceives a quasi-static critic during updates. However, theoretical and empirical results demonstrate that reversing these time-scales---i.e., running the actor on the fast scale and critic on the slow scale---remains convergent and equivalent in limiting policy optimality and computational cost. This critic-actor ("CA") variant behaves as value iteration, pushing the policy greedily with respect to a slowly updated value function, and matches standard actor-critic ("AC") in both tabular and nonlinear approximator settings (Bhatnagar et al., 2022). This finding broadens the admissible space of time-scale assignments in contemporary A2C implementations.
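
A minimal sketch of the two step-size assignments; the specific polynomial decay rates are assumptions chosen only to satisfy the usual requirement that the fast step size asymptotically dominates the slow one:

```python
# Illustrative two-time-scale step-size schedules at iteration t.
def actor_critic_schedules(t):
    # Standard AC: critic on the fast time scale, actor on the slow one.
    return {"critic_lr": 1.0 / (t + 1) ** 0.6, "actor_lr": 1.0 / (t + 1) ** 0.9}

def critic_actor_schedules(t):
    # Reversed CA variant: actor fast, critic slow; shown to remain convergent.
    return {"actor_lr": 1.0 / (t + 1) ** 0.6, "critic_lr": 1.0 / (t + 1) ** 0.9}
```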

6. Practical Applications and Empirical Performance

A2C and its variants have demonstrated efficacy across a broad spectrum of domains:

  • Robotics and Control: Successful PID parameter autotuning for simulated robotic arms demonstrates that A2C facilitates real-time, motion-level adaptation to varying target locations and task instances (Veldhuizen, 2022).
  • Dialogue and Imitation Learning: Adversarial A2C variants incorporate discriminator-driven intrinsic rewards for more efficient exploration and exploitation of expert-like behaviors in task-oriented dialogue management (Peng et al., 2017).
  • Atari and MuJoCo Benchmarks: Empirical evidence from Distributional A2C, RLS-based A2C, and momentum-accelerated A2C variants consistently shows improved sample efficiency, learning stability, and final performance relative to vanilla A2C (Li et al., 2018, Wang et al., 2022, Dong et al., 2024, Benhamou, 2019).
  • Multi-Agent Domains: Macro-action A2C and its CTDE variants yield superior performance for large-scale, temporally extended, and asynchronous multi-agent learning tasks (Xiao et al., 2022).

7. Insights, Open Problems, and Future Directions

A2C provides provably optimal variance reduction for policy-gradient estimation within its control variate class, yet several challenges and research frontiers remain:

  • Variance Reduction: Multi-dimensional control variates, distributional critics, and critic-momentum continue to be active areas for reducing gradient variance in high-dimensional, sparse-reward settings (Benhamou, 2019, Li et al., 2018, Dong et al., 2024).
  • Critic Bias and Instability: Methods such as TD regularization and RLS-based critics address function approximation bias and learning instability, but require additional hyperparameter tuning and theoretical analysis, especially under deep nonlinear approximation (Wang et al., 2022, Parisi et al., 2018).
  • Multi-Agent and Asynchronous Learning: Extending A2C to dynamic decentralized and asynchronous architectures improves scalability, but robust off-policy extensions and autonomous macro-action discovery remain open problems (Xiao et al., 2022).
  • Theoretical Convergence: While two-time-scale convergence is well-established for tabular and linear settings (Bhatnagar et al., 2022, Wang et al., 2022), non-asymptotic guarantees for deep A2C with nonlinear network approximation remain an open research question (Li et al., 2018).
  • Distributional Critic Integration: Quantile-based and risk-sensitive A2C methods provide enhanced robustness and richer learning signals, but trade-offs between computational cost and learning stability must be explicitly managed (Li et al., 2018).

A2C remains a foundational and actively evolving method in modern reinforcement learning, underpinning many subsequent algorithmic developments and sophisticated extensions across single- and multi-agent, on-policy and off-policy, and model-free and model-based RL.
