Actor–Critic Algorithms

Updated 7 June 2026

Actor–Critic algorithms are reinforcement learning methods that combine an actor for policy optimization with a critic for value estimation to reduce variance and enhance stability.
They employ both on-policy and off-policy learning techniques, incorporating methods like natural gradients and advantage estimation to improve sample efficiency and convergence.
Advanced extensions, such as Actor–Advisor and dual actor–critic frameworks, further stabilize training and boost performance in complex tasks including continuous control and multi-agent environments.

Actor–critic algorithms are a class of reinforcement learning (RL) methods that integrate explicit policy optimization (actor) with value-function estimation (critic). The actor parameterizes and updates the policy, while the critic estimates relevant value functions to guide the policy improvements. This structure provides a powerful framework that unifies policy-gradient and value-based RL, enabling efficient variance reduction, stabilization, and off-policy learning. Actor–critic approaches have been foundational to advances in deep RL, continuous control, constrained/multi-agent RL, and variance-sensitive or risk-aware optimization.

1. Fundamental Structure: Actor, Critic, and Policy Gradient

The standard actor–critic setting models an agent interacting with an MDP defined by state space $\mathcal S$, action space $\mathcal A$, transition kernel $P$, reward function $R$, and discount factor $\gamma$.

Actor: A parameterized policy $\pi_\theta(a|s)$ selects actions based on the current state. Policy parameters $\theta$ are adapted using estimates of the policy gradient. A canonical result is the policy gradient theorem:

$\nabla_\theta J(\theta) = \mathbb E_{s \sim d^\pi,\, a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s)\, Q^\pi(s,a) \right]$

where $Q^\pi(s,a)$ is the state–action value function.

Critic: The critic estimates $Q^\pi(s,a)$ or substitutes it with an approximation $Q_w(s,a)$, with $w$ learned by (possibly off-policy) temporal-difference (TD) or Monte Carlo regression. Many architectures optimize the critic by minimizing the squared Bellman error:

$L(w) = \mathbb E_{(s,a,r,s',a')} \left[ \left( r + \gamma Q_w(s',a') - Q_w(s,a) \right)^2 \right]$

Policy Update: The actor uses critic estimates to perform variance-reduced or bias-reduced policy optimization via policy-gradient or natural-gradient steps. Alternating actor and critic updates yields a generalized policy iteration (GPI) loop that supports stable learning and efficient credit assignment (Plisnier et al., 2019).
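The actor, critic, and policy-update roles above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: it uses a tabular softmax actor and a state-value critic trained by TD(0), with the TD error standing in for the critic's signal to the actor. The names `softmax`, `actor_critic_step`, and the toy one-state problem are hypothetical.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic_step(theta, v, s, a, r, s_next, done,
                      gamma=0.99, alpha_actor=0.1, alpha_critic=0.2):
    """One in-place update of a tabular actor (theta) and critic (v).

    theta[s][b] is the preference for action b in state s (assumed layout);
    v[s] is the critic's state-value estimate.
    """
    # Critic: TD(0) error against a one-step bootstrapped target.
    target = r if done else r + gamma * v[s_next]
    delta = target - v[s]
    v[s] += alpha_critic * delta

    # Actor: policy-gradient step, with delta as the advantage estimate.
    # For a softmax policy, d log pi(a|s) / d theta[s][b] = 1[b == a] - pi(b|s).
    pi = softmax(theta[s])
    for b in range(len(pi)):
        grad_log = (1.0 if b == a else 0.0) - pi[b]
        theta[s][b] += alpha_actor * delta * grad_log
    return delta


# Toy usage: a single state where action 1 pays reward 1 and action 0 pays 0.
random.seed(0)
theta = [[0.0, 0.0]]
v = [0.0]
for _ in range(2000):
    pi = softmax(theta[0])
    a = 0 if random.random() < pi[0] else 1
    r = 1.0 if a == 1 else 0.0
    actor_critic_step(theta, v, 0, a, r, 0, done=True)
final_policy = softmax(theta[0])
```

In this toy problem the policy shifts its probability mass toward the rewarding action while the critic tracks the policy's average reward, which is the GPI loop in miniature.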

2. Algorithmic Variants and Design Principles

On-policy vs. Off-policy

Traditional (on-policy) actor–critic algorithms update both actor and critic using samples from the current policy, ensuring unbiased policy improvement. To improve sample efficiency, modern variants (e.g., DDPG, TD3, SAC, OPAC) employ replay buffers and separate the actor’s trajectory distribution from the critic’s, enabling off-policy training and reuse of past experience (Roy et al., 2020, Zhang et al., 2020).

A key design principle is ensuring that the critic either estimates $Q^\pi$ (the value of the current policy) or that the gradient estimate remains unbiased despite critic–policy mismatches. The failure to do so may lead to bias and instability, especially in the presence of strong off-policy learning (Plisnier et al., 2019).

Policy-Gradient Estimation

The variance of policy-gradient estimators is a central challenge. The “sampled-action” estimator,

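$\hat g_{\text{sampled}} = \nabla_\theta \log \pi_\theta(a|s)\, Q^\pi(s,a), \quad a \sim \pi_\theta(\cdot|s),$

can exhibit high variance. Mean Actor–Critic (MAC) reduces variance by computing the analytical expectation over all actions:

$\hat g_{\text{MAC}} = \sum_{a \in \mathcal A} \nabla_\theta \pi_\theta(a|s)\, Q^\pi(s,a)$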

Strictly lower variance occurs for stochastic policies (Allen et al., 2017).
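The relationship between the two estimators can be checked numerically. The sketch below assumes a single state, a tabular softmax policy, and fixed action values; the gradient is taken with respect to the softmax preferences. The function names and the numbers are illustrative assumptions only.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def sampled_action_grad(pi, q, a):
    """Score-function estimate for one sampled action a.

    Component b is (1[b == a] - pi[b]) * q[a], i.e. grad log pi(a) times Q(a).
    """
    return [((1.0 if b == a else 0.0) - pi[b]) * q[a] for b in range(len(pi))]


def mean_actor_critic_grad(pi, q):
    """Sum over all actions of grad pi(a) * Q(a).

    For a softmax policy this simplifies to pi[b] * (q[b] - sum_a pi[a] q[a]).
    """
    v = sum(p * qa for p, qa in zip(pi, q))
    return [pi[b] * (q[b] - v) for b in range(len(pi))]


pi = softmax([0.0, 0.5, -0.5])   # hypothetical preferences
q = [1.0, 2.0, 0.0]              # hypothetical action values
n = len(pi)

mac = mean_actor_critic_grad(pi, q)

# Exact expectation and per-component variance of the sampled estimator,
# obtained by enumerating actions instead of drawing samples.
per_action = [sampled_action_grad(pi, q, a) for a in range(n)]
expected_sampled = [sum(pi[a] * per_action[a][b] for a in range(n))
                    for b in range(n)]
sampled_variance = [sum(pi[a] * (per_action[a][b] - mac[b]) ** 2
                        for a in range(n))
                    for b in range(n)]
```

Both estimators share the same mean, but the sampled-action estimator carries extra variance from the choice of action, whereas the mean estimator has none from that source for a given state.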

Advantage Actor–Critic and Natural Gradients

Advantage Actor–Critic (A2C) and its deep variants (A3C, PPO, etc.) use the advantage $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$ for variance reduction.
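A short derivation shows why subtracting a state-dependent baseline such as $V^\pi(s)$ leaves the policy gradient unbiased. It is written for a discrete action set as a simplifying assumption; the continuous case replaces the sum with an integral.

```latex
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta(\cdot|s)}\left[\nabla_\theta \log \pi_\theta(a|s)\, V^\pi(s)\right]
  &= V^\pi(s) \sum_{a} \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s) \\
  &= V^\pi(s) \sum_{a} \nabla_\theta \pi_\theta(a|s) \\
  &= V^\pi(s)\, \nabla_\theta \sum_{a} \pi_\theta(a|s)
   = V^\pi(s)\, \nabla_\theta 1 = 0 .
\end{aligned}
```

Because the baseline term has zero expectation, weighting the score function by the advantage instead of the action value changes only the variance of the estimator, not its mean.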

Natural policy gradients approximate the step in the Fisher metric, which can be interpreted as an approximate greedy policy iteration step, offering geometric convergence under increasing step sizes and compatible policy parameterization (Chen et al., 2022).

Off-policy Correction and Emphatic Methods

For off-policy data, standard TD critics can diverge under linear function approximation, a consequence of the “deadly triad” of bootstrapping, function approximation, and off-policy sampling. Convergent actor–critic solutions leverage state-value critics with off-policy corrections, such as importance sampling, GTD(λ), and Emphatic-TD(λ). With appropriate actor traces and two-time-scale updates, policy-gradient algorithms are guaranteed to converge under off-policy training (Maei, 2018).
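The simplest of these corrections, per-step importance sampling, can be sketched for a tabular state-value critic. This is an illustration under assumptions: the tabular setting does not exhibit the divergence that motivates GTD(λ) or Emphatic-TD(λ), and the function name, table layouts, and toy problem are hypothetical.

```python
import random


def off_policy_td0_update(v, s, a, r, s_next, done, target_pi, behavior_mu,
                          gamma=0.9, alpha=0.1):
    """Importance-weighted TD(0) update of v[s], performed in place.

    target_pi[s][a] and behavior_mu[s][a] are action probabilities under the
    policy being evaluated and the policy that generated the data.
    """
    rho = target_pi[s][a] / behavior_mu[s][a]   # importance ratio
    target = r if done else r + gamma * v[s_next]
    delta = target - v[s]
    v[s] += alpha * rho * delta
    return rho, delta


# Toy usage: one state, reward equals the action index (0 or 1).
# Data comes from a uniform behavior policy, but the critic evaluates a
# target policy that picks action 1 with probability 0.9.
random.seed(1)
target_pi = [[0.1, 0.9]]
behavior_mu = [[0.5, 0.5]]
v = [0.0]
for _ in range(5000):
    a = random.randrange(2)
    r = float(a)
    off_policy_td0_update(v, 0, a, r, 0, True, target_pi, behavior_mu,
                          alpha=0.01)
```

With the ratio applied, the estimate settles near the target policy's expected reward of 0.9 rather than the behavior policy's 0.5.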

3. Advanced Architectures: Extensions and Stabilizations

Actor–Advisor

The Actor–Advisor architecture decouples the actor and critic: the actor is updated on unbiased Monte Carlo returns, while the sampling distribution is shaped by an off-policy critic via a softmax (policy shaping). This hybrid approach allows the actor to benefit from off-policy critic guidance for efficient exploration while retaining unbiased policy gradients and robust convergence (Plisnier et al., 2019).

Key features:

Actor update: always with MC returns, preserving unbiasedness
Critic: any off-policy Q*-oriented learner, e.g., Double DQN
Policy shaping: action sampling via mixture of learned policy and advisory softmax-Q policy
Generalizes to safe RL, domain-knowledge integration, and RL transfer/bootstrapping by varying the advice source
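The policy-shaping step can be sketched as follows. The combination rule is an assumption: this version multiplies the actor's action probabilities by a softmax over the advisor's Q-values and renormalizes, which is one common form of policy shaping; a convex mixture of the two distributions would be another reading. The function names and the temperature parameter are hypothetical.

```python
import math


def softmax(prefs):
    """Numerically stable softmax over a list of values."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def shaped_policy(actor_probs, q_values, temperature=1.0):
    """Combine actor probabilities with softmax-Q advice (product form)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    advice = softmax([q / temperature for q in q_values])
    unnormalized = [p * adv for p, adv in zip(actor_probs, advice)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


actor_probs = [0.5, 0.3, 0.2]   # hypothetical actor output for one state
q_values = [0.0, 1.0, 3.0]      # hypothetical advisor Q-values
behavior = shaped_policy(actor_probs, q_values)
```

Actions are sampled from `behavior`, while the actor itself would still be trained on Monte Carlo returns as described above.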

Value-Improved Actor–Critic

The Value-Improved Actor–Critic (VI-AC) framework introduces a second greedification operator applied solely for updating the value estimate in the critic, not to the policy itself. The actor continues with gradient-based updates, while the critic's target is computed by a more greedy operator, e.g., mean-top-k action selection. This hybrid yields a convergence guarantee under Generalized Policy Iteration theory and consistent gains over standard off-policy baselines in continuous control (Oren et al., 2024).
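A mean-top-k critic target can be sketched as below, assuming the critic has already been evaluated on several actions sampled at the next state. The function names and the sampling setup are illustrative assumptions, not the authors' implementation.

```python
def mean_top_k(values, k):
    """Average of the k largest entries of values."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    top = sorted(values, reverse=True)[:k]
    return sum(top) / k


def value_improved_target(r, gamma, next_q_samples, k, done=False):
    """Bootstrapped critic target using a mean-top-k backup.

    next_q_samples holds Q(s', a_i) for actions a_i sampled at s'.
    k = len(next_q_samples) recovers the plain sample mean;
    k = 1 recovers a max over the sampled actions.
    """
    if done:
        return r
    return r + gamma * mean_top_k(next_q_samples, k)


next_q = [1.0, 4.0, 2.0, 3.0]   # hypothetical critic outputs at s'
target = value_improved_target(r=1.0, gamma=0.5, next_q_samples=next_q, k=2)
```

The parameter `k` interpolates between evaluating the sampled actions on average and backing up the best one, which is the sense in which the critic's operator is "more greedy" than the actor's.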

Stackelberg Actor–Critic

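Interpreting actor–critic interaction through a Stackelberg game lens, one can formulate the actor as the leader and the critic as the follower. Writing the actor objective as $J(\theta, w)$ and the critic loss as $L(\theta, w)$ for critic parameters $w$, with the critic's best response $w^*(\theta) = \arg\min_w L(\theta, w)$, the Stackelberg actor update uses the total derivative, with all terms on the right evaluated at $w = w^*(\theta)$:

$\frac{d J(\theta, w^*(\theta))}{d\theta} = \nabla_\theta J(\theta, w) - \left( \nabla^2_{w\theta} L(\theta, w) \right)^\top \left( \nabla^2_{w} L(\theta, w) \right)^{-1} \nabla_w J(\theta, w)$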

This correction accounts for the response of the critic to changes in the actor, reducing cycling and improving convergence rates (Zheng et al., 2021).
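A scalar worked example makes the correction term concrete. The quadratic objective and loss below are hypothetical, chosen so that the follower's best response has a closed form. With actor parameter `theta` and critic parameter `w`, the total derivative is the partial derivative of the actor objective minus the mixed second derivative of the critic loss, times the inverse of the critic's own second derivative, times the actor objective's derivative in `w`.

```python
def stackelberg_gradient(dJ_dtheta, dJ_dw, d2L_dwdtheta, d2L_dw2):
    """Total derivative of the leader's objective (scalar case).

    Implements dJ/dtheta - d2L/(dw dtheta) * (d2L/dw2)^(-1) * dJ/dw,
    which follows from the implicit function theorem applied to dL/dw = 0.
    """
    return dJ_dtheta - d2L_dwdtheta * dJ_dw / d2L_dw2


# Hypothetical problem:
#   actor objective J(theta, w) = theta * w - 0.5 * theta**2
#   critic loss     L(theta, w) = 0.5 * (w - 2 * theta)**2
# The critic's best response is w*(theta) = 2 * theta.
def J(theta, w):
    return theta * w - 0.5 * theta ** 2


def best_response(theta):
    return 2.0 * theta


theta = 0.7
w = best_response(theta)

partial_only = w - theta            # dJ/dtheta with w held fixed
total = stackelberg_gradient(
    dJ_dtheta=w - theta,            # dJ/dtheta
    dJ_dw=theta,                    # dJ/dw
    d2L_dwdtheta=-2.0,              # d2L/(dw dtheta)
    d2L_dw2=1.0,                    # d2L/dw2
)

# Finite-difference check on the composed objective theta -> J(theta, w*(theta)).
eps = 1e-6
numeric = (J(theta + eps, best_response(theta + eps))
           - J(theta - eps, best_response(theta - eps))) / (2 * eps)
```

Here the composed objective is $1.5\,\theta^2$, so the total derivative is $3\theta$, while the partial derivative with the critic held fixed gives only $\theta$: ignoring the critic's response underestimates the gradient by a factor of three in this example.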

Dual Actor–Critic and Lagrangian Frameworks

Dual Actor–Critic (Dual-AC) frames policy optimization as a two-player saddle-point game derived from the Bellman LP duality. An augmented Lagrangian with path regularization and multi-step bootstrapping stabilizes the minimax objective, enabling robust optimization under function approximation (Dai et al., 2017). Lagrangian- and constrained-optimization perspectives motivate primal–dual variants for constrained (multi-agent) RL (A. et al., 2015, Diddigi et al., 2019).

4. Sample Complexity, Convergence, and Bias-Variance Trade-offs

The convergence and sample complexity of actor–critic methods are determined by the interplay of policy-gradient step size, critic estimation error, horizon length, and function-approximation properties.

Sample complexity bounds: The rate of convergence to a stationary point depends on the convergence exponent of the critic, which differs between critics such as finite-state TD(0) and GTD, with the horizon/inner-loop lengths chosen optimally (Kumar et al., 2019, Chen et al., 2022).
Bias–variance–representation trade-off: Discount factors and horizon control in the critic determine not only the variance of value estimates but also the representational complexity and learnability of the critic network (smaller $\gamma$ or horizon reduces variance and linearizes the value function at the expense of introducing some bias) (Zhang et al., 2020).
Single vs. two time-scale algorithms: Single time-scale actor–critic algorithms promise greater biological and computational plausibility but only guarantee convergence to a neighborhood of a local optimum, with the residual gradient proportional to algorithmic time-scaling factors and the feature-approximation error (0909.2934).
Variance-sensitive and risk-aware actor–critic: By extending compatible features and TD learning to second-order (variance) terms, variance-adjusted AC methods can optimize risk-sensitive objectives such as mean–variance trade-off in return, at the expense of slower convergence and higher sample requirements (Tamar et al., 2013).

5. Exploration, Intrinsic Rewards, and Domain-Specific Adaptations

Intrinsic Motivation and Novelty

Sample efficiency and thorough state-space coverage are improved by augmenting reward signals with intrinsic bonuses. Actor–critic exploration strategies include count/pseudocount bonuses, prediction-error-based curiosity, entropy maximization, and “plausible novelty” that combines state novelty (e.g., via latent autoencoder encodings) with value relevance, driving exploration systematically toward states both novel and promising (Banerjee et al., 2022).
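The simplest of these strategies, a count-based bonus, can be sketched as follows. The bonus form `beta / sqrt(N(s))`, the class name, and the requirement that states be hashable are assumptions made for illustration; pseudocount and novelty-based methods replace the exact count with a learned estimate.

```python
import math
from collections import defaultdict


class CountBonus:
    """Adds an exploration bonus that decays with state visit counts."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(int)

    def augmented_reward(self, state, extrinsic_reward):
        """Record a visit to state and return reward plus intrinsic bonus."""
        self.counts[state] += 1
        bonus = self.beta / math.sqrt(self.counts[state])
        return extrinsic_reward + bonus


bonus = CountBonus(beta=0.2)
first_visit = bonus.augmented_reward("s0", 0.0)    # count is 1
for _ in range(2):
    bonus.augmented_reward("s0", 0.0)              # counts 2 and 3
fourth_visit = bonus.augmented_reward("s0", 0.0)   # count is 4
novel_state = bonus.augmented_reward("s1", 0.0)    # count is 1 for s1
```

The augmented reward is what the critic would be trained on, so rarely visited states look temporarily more valuable and the actor is drawn toward them.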

Adversarial and Multi-Agent Learning

Adversarially Guided Actor–Critic (AGAC) introduces a third "adversary" policy trained to mimic the actor. The actor, penalized for being predictable to the adversary, is incentivized to generate strategies not easily predicted from prior experience, significantly boosting exploration and robustness in environments with sparse rewards or high diversity (Flet-Berliac et al., 2021).

Multi-agent actor–critic frameworks, especially in general-sum or constrained stochastic games, employ two-time-scale architectures and primal–dual Lagrangian relaxations to guarantee Nash or Stackelberg equilibrium seeking and enforce global constraints (e.g., via consensus updates in distributed variants) (Prasad et al., 2014, Diddigi et al., 2019).

6. Empirical Performance and Practical Recommendations

Empirical Benchmarks

Actor–critic algorithms consistently match or outperform pure value-based methods when equipped with robust off-policy critics, proper step-size schedules, and action–value-sensitive exploration. Empirical evaluations on MuJoCo, Atari, and control tasks show sample-efficiency gains, final-policy improvement, and robustness to hyperparameter choices (Roy et al., 2020, Tasdighi et al., 2023, Banerjee et al., 2022, Wang et al., 2022).

Implementation Guidance

Critic discount tuning: For undiscounted objectives, setting critic discount $\gamma < 1$ is empirically optimal due to improved representation learning and lower variance (Zhang et al., 2020).
Auxiliary tasks and head sharing: In actor–critic with discount mismatch, adding an auxiliary policy head achieves both unbiasedness and sample efficiency.
Recursive least squares (RLS) critics and natural gradients: RLS-based updates accelerate convergence of deep actor–critic algorithms and reduce sample complexity without significant computational overhead (Wang et al., 2022).
Policy improvement operators: Value-improvement steps (e.g., mean-top-k, clipped Gaussian) for the critic targets, combined with classical SAPIO steps for the actor, yield enhanced stability and optimality guarantees (Oren et al., 2024).

Limitations and Considerations

Convergence to a global optimum, as opposed to a local one or a neighborhood of it, is not guaranteed under function approximation, non-convexity, or single time-scale updates.
Off-policy training requires controlling bias, using compatible traces or emphatic measures.
Stabilization measures—such as dual averaging, mirror descent, path regularization, and entropy constraints—are often necessary under nonlinearity and batch learning regimes.

7. Extensions and Open Directions

Recent progress in actor–critic research encompasses areas such as:

PAC-Bayesian actor–critics, which employ generalization bounds for the critic, regularize learning, and leverage randomized critics for improved exploration (Tasdighi et al., 2023).
Compatible gradient estimators that circumvent critic–actor non-alignment in deep deterministic policy gradient architectures via zeroth-order (two-point