Actor-Critic RL Algorithms
- Actor-critic algorithms are reinforcement learning methods that combine policy evaluation and improvement by simultaneously training an actor and a critic.
- They employ a two-timescale update where the critic uses temporal-difference errors to guide the actor's policy gradient in a stable and efficient manner.
- Empirical results in domains like Atari and robotics demonstrate their practical effectiveness, while ongoing research addresses variance reduction and sample complexity.
Actor-critic algorithms form a fundamental class of methods in reinforcement learning (RL) that address sequential decision problems by jointly learning value functions (for policy evaluation) and policies (for policy improvement). These algorithms have evolved to address challenges related to stability, sample efficiency, exploration, representation learning, and practical deployment in high-dimensional and continuous control domains.
1. Core Principles and Algorithmic Structure
Actor-critic frameworks rest on the simultaneous training of two key components:
- Actor: Parameterizes the policy, typically as a differentiable function μ(u|x, θ) (for stochastic policies) or as a deterministic function in certain variants.
- Critic: Estimates value functions—such as the state value, action value, or advantage—usually via function approximation (linear or deep neural networks). The critic guides the actor by providing gradient signals through temporal-difference (TD) error or other value-based metrics.
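To make these two components concrete, the sketch below shows a minimal actor and critic as PyTorch modules, assuming a continuous action space with a diagonal Gaussian policy and a state-value critic; the class names, layer widths, and architectural choices are illustrative rather than drawn from any cited work.

```python
import torch
import torch.nn as nn


class GaussianActor(nn.Module):
    """Stochastic policy mu(u | x, theta): a diagonal Gaussian over actions."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent std

    def forward(self, x):
        mean = self.body(x)
        return torch.distributions.Normal(mean, self.log_std.exp())


class ValueCritic(nn.Module):
    """State-value estimate V(x) used to form TD errors that guide the actor."""

    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.v = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                               nn.Linear(hidden, 1))

    def forward(self, x):
        return self.v(x).squeeze(-1)
```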
At each time step (or episode), a canonical update proceeds as follows:
- The actor samples actions from the policy.
- The critic evaluates these actions via temporal-difference learning, forming the (average-reward) TD error
  $$\delta_t = r_{t+1} - \hat{\eta}_t + \hat{V}(x_{t+1}) - \hat{V}(x_t),$$
  where $\hat{V}$ is the critic's value estimate and $\hat{\eta}_t$ is the running estimate of the average reward.
- The actor is updated using the likelihood-ratio (policy-gradient) term weighted by the TD error,
  $$\theta_{t+1} = \theta_t + \beta_t\,\delta_t\,\psi_t, \qquad \psi_t = \nabla_\theta \log \mu(u_t \mid x_t, \theta),$$
  where $\psi_t$ is the likelihood-ratio gradient of the log-policy (0909.2934). A minimal sketch of this loop follows the list.
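For illustration, the sketch below instantiates this loop with a linear critic, a softmax actor over discrete actions, and the average-reward TD error above. The environment interface (`reset()`, and `step()` returning `(next_state, reward, done)`), the feature map `phi`, and the step sizes are assumptions made for the example rather than details of the cited algorithm.

```python
import numpy as np


def linear_actor_critic(env, phi, n_actions, steps=100_000,
                        alpha=0.05, beta=0.005, kappa=0.01, seed=0):
    """One-step average-reward actor-critic with a linear critic and softmax actor.

    phi(x) -> feature vector for state x; alpha/beta/kappa are the critic, actor,
    and average-reward step sizes (critic on the faster timescale: alpha > beta).
    """
    rng = np.random.default_rng(seed)
    x = env.reset()
    d = phi(x).shape[0]
    v = np.zeros(d)                        # critic weights: V(x) ~ v . phi(x)
    theta = np.zeros((n_actions, d))       # actor weights for the softmax policy
    eta = 0.0                              # running average-reward estimate

    for _ in range(steps):
        f = phi(x)
        logits = theta @ f
        pi = np.exp(logits - logits.max())
        pi /= pi.sum()
        u = rng.choice(n_actions, p=pi)

        x_next, r, done = env.step(u)
        f_next = phi(x_next)

        delta = r - eta + v @ f_next - v @ f            # TD error delta_t
        eta += kappa * (r - eta)                        # average-reward tracking
        v += alpha * delta * f                          # critic update (fast)
        grad_log = -np.outer(pi, f)                     # grad_theta log mu(u|x,theta)
        grad_log[u] += f
        theta += beta * delta * grad_log                # actor update (slow)

        x = env.reset() if done else x_next
    return theta, v
```

Note that the critic step size `alpha` is chosen larger than the actor step size `beta`, mirroring the two-timescale structure discussed in the next section.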
2. Convergence Properties and Theoretical Guarantees
Convergence of actor-critic algorithms depends on the chosen timescales for actor and critic updates, the properties of the function approximators, and technical conditions on the Markov process.
- Two-timescale methods: The critic is updated with a larger stepsize (fast timescale) than the actor. This decoupling allows the critic to track the policy effectively, resulting in convergence to a local optimum under standard regularity assumptions. Analyzed using stochastic approximation and ODE techniques, these methods converge almost surely to a local maximum in the average reward or closely related criterion (1310.3697).
- Single timescale methods: Actor and critic share the same step-size schedule. The convergence in this case is only to a neighborhood of a local maximum, with the neighborhood's radius dependent on step-size constants and approximation error (0909.2934).
- Finite-time analysis: For target-based actor-critic methods featuring a third timescale (such as those with Polyak-averaged target networks), convergence is to a ball whose radius is determined by the approximation bias. The worst-case sample complexity for reaching an $\epsilon$-approximate stationary point can increase relative to standard two-timescale schemes (2106.07472).
These results typically rely on conditions such as irreducibility and aperiodicity of the induced Markov chain, full column rank in the linear function approximator, and boundedness of gradient and policy derivatives.
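The timescale separation invoked above is usually formalized through Robbins-Monro-type step-size conditions; a standard form, with critic step sizes $\alpha_t$ and actor step sizes $\beta_t$, is

$$\sum_t \alpha_t = \sum_t \beta_t = \infty, \qquad \sum_t \left(\alpha_t^2 + \beta_t^2\right) < \infty, \qquad \frac{\beta_t}{\alpha_t} \to 0,$$

so that the critic equilibrates on the faster timescale while the actor drifts slowly enough to be treated as quasi-static.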
3. Function Approximation and Feature Design
Modern actor-critic algorithms employ both linear and nonlinear function approximators:
- Linear function approximation: Enables tractable analysis and is sufficient for certain domains (0909.2934, 1310.3697). Critic updates utilize eligibility traces, with features designed to be full-rank and bounded.
- Neural network-based critics/actors: Widely used for high-dimensional or continuous spaces. Recent works model both actor and critic as deep networks, allowing representation learning that evolves with training (beyond the static regime of linear or kernel methods). Overparameterized two-layer networks capture data-adaptive features, with dynamics analyzed in mean-field/Wasserstein perspectives (2112.13530).
- Distributional representations: Distributional actor-critic algorithms (e.g., GMAC) model the full value distribution rather than only its expectation, typically leveraging Gaussian mixture models and Wasserstein or Cramér distances for update objectives (2105.11366).
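To illustrate the distributional idea, the sketch below shows a generic critic head that parameterizes the return distribution as a Gaussian mixture using standard PyTorch distributions; it is a minimal illustration of the representation, not the GMAC architecture, and the layer sizes and number of components are arbitrary.

```python
import torch
import torch.nn as nn
import torch.distributions as D


class MixtureValueHead(nn.Module):
    """Critic head modeling the return distribution as a K-component Gaussian mixture."""

    def __init__(self, obs_dim, n_components=5, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.logits = nn.Linear(hidden, n_components)      # mixture weights
        self.means = nn.Linear(hidden, n_components)       # component means
        self.log_scales = nn.Linear(hidden, n_components)  # component scales (log)

    def forward(self, x):
        h = self.trunk(x)
        mix = D.Categorical(logits=self.logits(h))
        comp = D.Normal(self.means(h), self.log_scales(h).exp())
        return D.MixtureSameFamily(mix, comp)  # .mean recovers the scalar value estimate
```

The scalar value used elsewhere in the update can be recovered as the mixture mean, while distributional objectives (e.g., Cramér-type distances) operate on the full distribution object.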
Feature design—especially the use of compatible features—ensures that the approximated value function does not introduce bias to the policy gradient, particularly in settings involving variance-penalized objectives (1310.3697).
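Concretely, compatible features are the score-function features of the policy, with the critic constrained to be linear in them; in the notation used above,

$$\psi(x,u) = \nabla_\theta \log \mu(u \mid x, \theta), \qquad \hat{Q}_w(x,u) = w^{\top}\psi(x,u),$$

and when $w$ is fit by minimizing the squared error to the true action values, substituting $\hat{Q}_w$ into the policy gradient introduces no bias.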
4. Exploration, Stability, and Advanced Variants
Several innovations enhance the robustness, exploration capability, and sample efficiency of actor-critic algorithms:
- Variance Reduction: Mean Actor-Critic (MAC) eliminates sampling variance in the policy gradient by analytically averaging over all actions rather than relying solely on sampled actions (1709.00503); see the sketch after this list.
- Bootstrapped Critics and Target Networks: Polyak-averaged targets and multi-critic variants (e.g., Triple Q in OPAC) mitigate overestimation bias and stabilize learning, at the expense of increased sample complexity (2012.06555).
- Exploration by Intrinsic Rewards: Intrinsic Plausible Novelty Score (IPNS) modules assign exploration bonuses based on both state novelty (measured in a latent embedding via autoencoders) and potential to improve the policy (via the value estimate). This approach yields marked gains in sample efficiency and stability (2210.00211).
- Optimism in the Face of Uncertainty: Algorithms like Optimistic Actor-Critic (OAC) use upper confidence bounds computed from bootstrapped critics to shift the exploration distribution, correcting for the tendency toward under-exploration in Gaussian policy settings (1910.12807).
- Meta-gradient and Self-Tuning Approaches: Methods such as STAC employ meta-gradients to adapt critical hyperparameters online (including discount factors, entropy coefficients, and V-trace mixing parameters), reducing the need for manual tuning and improving generalization (2002.12928).
- Risk-sensitive and Mean-Field Control: Variance-adjusted objectives (1310.3697) address risk control in decision making. Recent mean-field actor-critic algorithms handle problems where the state is itself a distribution, leveraging moment neural networks for efficient approximation (2309.04317).
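To illustrate the MAC-style variance reduction for discrete actions, the snippet below computes a surrogate policy loss whose gradient is the all-actions policy gradient, i.e., the expectation over actions is taken analytically rather than by sampling. The tensor shapes and the use of a softmax policy are assumptions made for the example, not details of the reference implementation.

```python
import torch


def mean_actor_loss(policy_logits, q_values):
    """Surrogate policy loss whose gradient equals
    -E_s[ sum_a grad pi(a|s) * Q(s,a) ], i.e. the all-actions (MAC-style)
    policy gradient, with no sampling over actions.

    policy_logits: [batch, n_actions] raw actor outputs
    q_values:      [batch, n_actions] critic estimates (held fixed via detach)
    """
    pi = torch.softmax(policy_logits, dim=-1)
    return -(pi * q_values.detach()).sum(dim=-1).mean()
```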
5. Empirical Performance and Practical Applications
Empirical studies consistently validate advanced actor-critic algorithms across a diverse range of benchmarks:
- Atari and MuJoCo: Algorithms such as RLS-based A2C (2201.05918), STAC/STACX (2002.12928), GMAC (2105.11366), OPAC (2012.06555), OAC (1910.12807), and PAC-Bayesian SAC (2301.12776) demonstrate state-of-the-art sample efficiency and/or final performance across classical control, Atari, and continuous control tasks.
- Robot manipulation and goal-based RL: Multi-actor-critic schemes (AACHER) integrated with Hindsight Experience Replay yield a substantial boost in sparse-reward environments. Averaging over multiple actors and critics is found to bolster robustness and stability in physical robot tasks such as FetchPush and FetchPickAndPlace (2210.12892).
- Simulation-based optimization and design: Actor-Critic frameworks serve as sampling-based design optimizers when gradients are unavailable, with applications spanning adversarial attacks, robot control, and surrogate-based design in engineering (2111.00435).
6. Limitations, Trade-offs, and Future Directions
Key trade-offs and limitations persist:
- Single timescale AC algorithms yield faster initial learning but guarantee only convergence to a neighborhood (whose size depends on step-sizes and function-approximation error) rather than the optimum (0909.2934).
- Adding target networks or increasing the number of critics usually improves stability and reduces bias, at the expense of slower theoretical sample complexity (2106.07472).
- More aggressive greedification in the value update (as in value-improved AC) can improve empirical performance but may demand tuning to avoid instability due to overestimation bias (2406.01423).
Ongoing research focuses on:
- Further reducing variance and improving sample efficiency by adaptive critic/actor step-sizes and exploration schedules.
- Extending theoretical analyses to nonlinear, deep network critics and multi-timescale learning in the mean-field or infinite-width regime (2112.13530).
- Incorporating risk-awareness, robustness, and meta-learning as first-class objectives in large-scale real-world domains.
- Exploring hierarchical and multi-agent approaches, as well as bridging simulation-based optimization with RL-driven design (2111.00435).
Actor-critic algorithms thus remain central to both the theory and practice of modern reinforcement learning, supported by a growing body of theoretical guarantees, algorithmic innovations, and empirical validations across increasingly diverse and complex domains.