Two-Timescale Critic-Actor Algorithm

Updated 12 October 2025
  • Two-timescale critic-actor algorithms are iterative methods that update policy (actor) and value (critic) with distinct fast and slow learning rates to optimize Markov decision processes.
  • They employ specialized step-size schedules and linear function approximation to ensure robust convergence and improved sample complexity in both discounted and average-reward settings.
  • Empirical studies demonstrate that these methods outperform standard actor-critic approaches in both unconstrained and constrained reinforcement learning benchmarks.

A two-timescale critic-actor algorithm is a class of stochastic, iterative algorithms for solving Markov decision processes where the policy parameters (actor) and value function parameters (critic) are updated in parallel, but with different learning rates—one “fast” and one “slow.” The literature distinguishes standard actor-critic (AC) methods, in which the critic updates on the faster timescale (emulating policy iteration), from their “reversed” counterparts, termed critic-actor (CA), where the actor operates on the faster timescale and the critic lags behind (emulating value iteration). Both paradigms are widely used in reinforcement learning and are theoretically and empirically analyzed for both discounted and average-reward settings under function approximation, as well as in constrained MDPs.

1. Algorithmic Foundations and Structure

Two-timescale critic-actor algorithms iteratively update the actor (policy parameters, θ) and critic (value function parameters, v), each with a dedicated step size schedule:

  • Faster timescale: one parameter (θ or v, depending on the scheme) employs a larger step size (e.g., αₜ), so it adapts quickly.
  • Slower timescale: the other parameter (v or θ) is updated with a smaller step size (e.g., βₜ, with βₜ/αₜ → 0 as t → ∞); the fast update therefore sees the slow variable as nearly stationary, while the slow update effectively tracks the equilibrated fast variable.
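
As a concrete illustration of this condition, the minimal sketch below builds two polynomially decaying schedules whose slow-to-fast ratio vanishes; the exponents are assumptions for the sketch, not values prescribed by the cited papers.

```python
# Illustrative polynomially decaying schedules; the exponents here are
# assumptions for the sketch.  Any 0 < sigma < nu <= 1 gives beta(t)/alpha(t) -> 0.
sigma, nu = 0.55, 0.9

def alpha(t: int) -> float:
    """Fast-timescale (larger) step size."""
    return (t + 1) ** -sigma

def beta(t: int) -> float:
    """Slow-timescale (smaller) step size."""
    return (t + 1) ** -nu

# Timescale separation: the slow/fast ratio decreases toward zero.
for t in (10, 1_000, 100_000):
    print(f"t={t:>7d}  beta/alpha = {beta(t) / alpha(t):.4f}")
```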

The distinguishing feature is thus the ordering of the two step-size schedules, not the form of the updates themselves. Specializing to the average-reward setting with linear function approximation, the general recursion is (Panda et al., 5 Oct 2025; Panda et al., 2 Feb 2024):

$$
\begin{aligned}
\text{Actor update:} &\quad \theta_{t+1} = \mathcal{P}_{\Theta}\!\left(\theta_t + a(t)\,\delta_t\, G_t^{-1}\,\Psi(s_t, a_t)\right) \\
\text{Critic update:} &\quad v_{t+1} = \Gamma\!\left(v_t + b(t)\,\delta_t\, f(s_t)\right) \\
\text{Average reward:} &\quad L_{t+1} = L_t + d(t)\,\big(r_t - L_t\big)
\end{aligned}
$$

where:

  • $\delta_t = r_t - L_t + v_t^\top \big(f(s_{t+1}) - f(s_t)\big)$ is the TD error,
  • $f(\cdot)$ denotes the critic’s feature vector for linear function approximation,
  • $G_t$ is an empirical Fisher information matrix (for the natural gradient),
  • $a(t)$, $b(t)$, $d(t)$ are step-size sequences with $a(t) \gg b(t)$ (the actor runs on the faster timescale),
  • $\Psi(s, a)$ is the compatible feature for the actor parameterization,
  • $\mathcal{P}_\Theta$ and $\Gamma$ are projections that keep the iterates bounded and stable.
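
A minimal, self-contained sketch of this recursion is given below. It assumes a small randomly generated tabular MDP, a softmax actor with state-indexed parameters, and a linear critic; the feature map, projection radii, and step-size exponents are illustrative choices, and the Fisher preconditioner $G_t^{-1}$ is replaced by the identity (i.e., a vanilla rather than natural policy gradient). It sketches the update pattern, not the cited algorithms themselves.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Toy MDP and critic features (placeholders for a real environment) -------
n_states, n_actions, dim = 5, 3, 4
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state distribution
R = rng.random((n_states, n_actions))                              # rewards in [0, 1]
feat = rng.standard_normal((n_states, dim)) / np.sqrt(dim)         # critic features f(s)

def policy(theta, s):
    """Softmax policy over actions for a tabular, state-indexed actor."""
    logits = theta[s]
    z = np.exp(logits - logits.max())
    return z / z.sum()

# --- Step-size schedules: actor fast, critic slow (exponents illustrative) ---
a = lambda t: (t + 1) ** -0.55     # actor (fast)
b = lambda t: (t + 1) ** -0.9      # critic (slow)
dlr = lambda t: (t + 1) ** -1.0    # average-reward estimate

theta = np.zeros((n_states, n_actions))   # actor parameters
v = np.zeros(dim)                         # critic parameters
L = 0.0                                   # running average-reward estimate
s = 0

for t in range(50_000):
    pi = policy(theta, s)
    act = rng.choice(n_actions, p=pi)
    r = R[s, act]
    s_next = rng.choice(n_states, p=P[s, act])

    # TD error for the average-reward criterion.
    delta = r - L + v @ (feat[s_next] - feat[s])

    # Actor update on the FAST timescale (G_t^{-1} dropped: vanilla gradient).
    grad_log_pi = -pi
    grad_log_pi[act] += 1.0                 # d/d theta[s] of log pi(act | s)
    theta[s] += a(t) * delta * grad_log_pi
    np.clip(theta, -10.0, 10.0, out=theta)  # crude stand-in for the projection P_Theta

    # Critic update on the SLOW timescale, projected onto a ball (Gamma).
    v += b(t) * delta * feat[s]
    norm = np.linalg.norm(v)
    if norm > 100.0:
        v *= 100.0 / norm

    # Average-reward tracking.
    L += dlr(t) * (r - L)
    s = s_next

print("estimated average reward:", L)
```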

In constrained settings, a Lagrange multiplier γ is typically introduced and updated on an even slower, third timescale (Panda et al., 5 Oct 2025).
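
For the constrained case, a hedged sketch of this extra (slowest) recursion might look as follows; the single-constraint form, the threshold, and the step-size exponent are assumptions for illustration only.

```python
# Hypothetical slowest-timescale update for a single inequality constraint
# of the form E[cost] <= C_MAX, following the Lagrangian pattern described above.
C_MAX = 0.5                        # assumed constraint threshold
c = lambda t: (t + 1) ** -1.0      # slowest step-size schedule (illustrative)

def lagrange_step(gamma_lag: float, t: int, observed_cost: float) -> float:
    """Projected ascent on the multiplier: grow it when the observed cost
    exceeds the threshold, shrink it (never below zero) otherwise."""
    return max(0.0, gamma_lag + c(t) * (observed_cost - C_MAX))
```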

2. Convergence Theory and Finite-Time Analysis

Non-asymptotic stochastic approximation analysis underpins the theoretical guarantees for these algorithms:

  • For unconstrained critic-actor with linear function approximation, the MSE of the critic is shown to satisfy

$$
\frac{1}{T} \sum_{k=0}^{T-1} \mathbb{E}\left[\left\|v_k - v^*(\theta_k)\right\|^2\right] = O\!\left(\log^2 T \cdot T^{\sigma - 2\nu} + T^{2\sigma - \nu - 1} + \log^2 T \cdot T^{2\sigma - 3\nu}\right)
$$

with step-size exponents optimally chosen as $\nu \approx 0.5$, $\sigma \approx 0.51$ (Panda et al., 2 Feb 2024).

  • Resulting sample complexity for the critic to reach mean squared error below ε is

$$
T = \tilde{O}\!\left(\epsilon^{-(2+\delta)}\right)
$$

with δ arbitrarily small, approaching the $O(\epsilon^{-2})$ rate for strongly convex stochastic approximation (Panda et al., 5 Oct 2025; Panda et al., 2 Feb 2024). This improves over two-timescale actor-critic, whose corresponding rate is $O(\epsilon^{-2.5})$ (Wu et al., 2020; Chen et al., 2022); a worked evaluation of the exponents follows this list.

  • For constrained problems with a three-timescale architecture (critic, actor, Lagrange multiplier), the same MSE rate is maintained, and with modified learning rates (e.g., logarithmic factors in the step sizes) the sample complexity for the critic reaches $\tilde{O}(\epsilon^{-2})$ (Panda et al., 5 Oct 2025).
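
As a quick, informal check that the stated exponents yield the unconstrained rate above, evaluate the exponents of the critic bound at $\nu \approx 0.5$, $\sigma \approx 0.51$ and treat logarithmic factors as $\tilde{O}$:

$$
\sigma - 2\nu \approx -0.49, \qquad 2\sigma - \nu - 1 \approx -0.48, \qquad 2\sigma - 3\nu \approx -0.48,
$$

so the bound behaves as $\tilde{O}(T^{-0.48})$; requiring $\tilde{O}(T^{-0.48}) \le \epsilon$ gives $T = \tilde{O}(\epsilon^{-1/0.48}) \approx \tilde{O}(\epsilon^{-2.08})$, i.e., $\epsilon^{-(2+\delta)}$ for a small $\delta > 0$.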

The analyses use tools such as ordinary differential equation (ODE) tracking, telescoping sums, mixing-time estimates (for Markovian sampling), and Lyapunov techniques; the key assumptions are uniform ergodicity of the induced Markov chains, bounded features, and Lipschitz policy parameterizations.

3. Function Approximation and Algorithm Design Considerations

Linear function approximation is the canonical setting for critic-actor/actor-critic analysis. The value function is parameterized as $v(s) \approx f(s)^\top v$ for the fixed feature mapping $f(\cdot)$ introduced above. The policy (actor) can be parameterized linearly or nonlinearly; when natural gradients are used, the actor's update direction is preconditioned by an estimate of the Fisher information matrix (Panda et al., 5 Oct 2025).
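
A small sketch of that preconditioning step is shown below; the running Fisher estimate via an exponentially weighted outer product, the ridge term, and the mixing constant are illustrative choices, not the cited papers' exact construction.

```python
import numpy as np

def update_fisher(G: np.ndarray, psi: np.ndarray, rho: float = 1e-2) -> np.ndarray:
    """Running estimate of the Fisher information matrix from compatible
    features psi(s, a), as an exponentially weighted outer-product average."""
    return (1.0 - rho) * G + rho * np.outer(psi, psi)

def natural_direction(delta: float, psi: np.ndarray, G: np.ndarray,
                      eps: float = 1e-3) -> np.ndarray:
    """Natural-gradient actor direction G^{-1} * delta * psi; a small ridge
    term keeps the linear solve well-posed when G is near-singular."""
    k = psi.shape[0]
    return np.linalg.solve(G + eps * np.eye(k), delta * psi)
```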

Stability and convergence hinge on several design elements:

  • Learning rates are chosen so that the actor timescale is strictly faster than the critic's, with the ratio of the slower to the faster step size decreasing to zero.
  • Projections (e.g., Γ) are applied to keep the iterates bounded (a minimal example follows this list).
  • Regularization, such as path regularization (Dai et al., 2017), or compatible-feature construction for the policy gradient, controls the bias-variance tradeoff under function approximation.
  • In constrained settings, Lagrange multipliers are updated with a further separated, slow stepsize (Panda et al., 5 Oct 2025).
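
The projection Γ referenced in the list above can be as simple as truncation onto a Euclidean ball; a minimal sketch follows, where the radius is a design choice rather than a value from the cited papers.

```python
import numpy as np

def project_l2_ball(x: np.ndarray, radius: float) -> np.ndarray:
    """Projection onto the Euclidean ball of the given radius, used to keep
    critic (or actor) iterates bounded, in the role of the Gamma operator above."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x
```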

In practice, the critic-actor paradigm can be extended to nonlinear function approximation (e.g., neural networks); however, theoretical analysis and guarantees are currently established only for linear critics, with neural network-based implementations reported to perform competitively in experiments (Panda et al., 2 Feb 2024, Panda et al., 5 Oct 2025).

4. Empirical Performance and Applications

Empirical validation spans classic tabular and continuous control benchmarks:

  • In classical problems (Frozen Lake, Blackjack, Acrobot), the critic-actor algorithm exhibits higher average rewards and more rapid convergence relative to standard actor-critic and PPO variants (Panda et al., 2 Feb 2024).
  • In constrained reinforcement learning settings (e.g., Safety-Gym’s SafetyAntCircle, SafetyCarGoal, SafetyPointPush), constrained natural critic-actor with optimized learning rates achieves comparable or superior average reward and better constraint satisfaction than constrained actor-critic or natural actor-critic methods (Panda et al., 5 Oct 2025).
  • These performance trends hold for both discrete (tabular) and large-scale settings with function approximation.

Tables reporting performance metrics typically include average reward and constraint costs, with the proposed critic-actor or its constrained/natural extensions performing best or on par with competitive baselines.

5. Algorithmic Implications and Theoretical Significance

The two-timescale critic-actor framework provides several advantages:

  • Sample efficiency: Superior or nearly optimal sample complexity for the critic and actor under mild function approximation conditions.
  • Robustness: Decoupling the actor and critic across timescales confers robustness to approximation errors, since the fast update sees the slowly varying parameter as nearly constant while the slow update effectively tracks an equilibrated fast variable. This is reflected in rigorous finite-time analysis and improved empirical stability.
  • Flexibility: Naturally accommodates additional components (baselines, constraints, Lagrange multipliers) by introducing further timescales, without loss of theoretical tractability (Panda et al., 5 Oct 2025).
  • Extension potential: Encourages algorithmic innovation; for example, employing target networks or multi-step lookahead operators as in recent deep RL, or addressing off-policy learning (Dai et al., 2017, Zhang et al., 2019).

From value-iteration emulation in critic-actor (faster actor update) (Bhatnagar et al., 2022) to policy-iteration emulation in standard actor-critic (faster critic update), the choice of which update runs on the faster timescale is algorithmically meaningful, and is often dictated by problem structure, scalability constraints, and stability considerations.

6. Future Directions and Open Problems

Active research directions, as indicated by the literature, include:

  • Further reduction of sample complexity, possibly approaching the information-theoretic lower bounds for reinforcement learning.
  • Non-asymptotic and high-probability analysis under non-linear function approximation and non-Markovian sampling processes.
  • Extension to decentralized or multi-agent settings where actor and critic “roles” can be distributed and the separation of timescales becomes a design parameter.
  • Evaluation in more realistic, high-dimensional, constrained environments potentially with safety, resource, or risk constraints explicitly modeled—where timescale separation may mitigate instability.
  • Improved theoretical understanding of the effects of step size adaptation and regularization on convergence rates, practical stability, and safe learning.

The two-timescale critic-actor algorithm is thus central in the modern theoretical and practical landscape of reinforcement learning, synthesizing developments from stochastic optimization, bilevel learning, and approximate dynamic programming across classic and modern RL domains.
