
Single-Timescale Actor-Critic

Updated 21 October 2025
  • The algorithm is a reinforcement learning method where the actor and critic are updated concurrently with the same step size, enabling rapid adaptation.
  • It leverages a shared temporal difference signal to drive both policy and value updates, keeping the two modules coherent despite an inherent residual "dynamic error".
  • Convergence is analyzed through ODE and Lyapunov methods, which establish stability to a neighborhood of a local optimum; the framework also has potential relevance for biological modeling.

A single-timescale actor-critic algorithm is a reinforcement learning (RL) method in which both the actor (policy optimizer) and the critic (value function estimator) are updated concurrently on the same time scale using updates of the same or proportional step sizes. Unlike traditional two-timescale schemes that require the critic to converge (or nearly converge) between successive actor updates, single-timescale methods interleave both updates in a single loop, reflecting the paradigm most often used in empirical deep RL systems. The single-timescale framework enables rapid adaptation, straightforward implementation, and possible biological relevance but has historically been more challenging to analyze from a theory perspective.

1. Algorithmic Structure

Classic single-timescale actor-critic algorithms update the policy and value function in an alternating fashion using samples from an ongoing trajectory. Consider an average-reward MDP with state space $\mathcal{X}$, action space $\mathcal{U}$, and parameterized policy $\mu(u \mid x, \theta)$ with parameters $\theta$:

  • Critic update: Estimates the (differential) value function via temporal difference (TD) learning with linear function approximation. The value function is represented as $h(x, w) = \phi(x)^\top w$, where $\phi(x) \in \mathbb{R}^{L}$ is a feature vector and $w \in \mathbb{R}^{L}$ is a learnable weight vector.
  • Actor update: The policy parameters $\theta$ are updated along the gradient of expected return, estimated using the same TD signal that drives the critic:
$$
\begin{aligned}
\tilde{\eta}_{n+1} &= \tilde{\eta}_{n} + \gamma_n \Gamma_\eta \left[ r(x_n) - \tilde{\eta}_n \right] \\
w_{n+1} &= w_n + \gamma_n \Gamma_w \, d(x_n, x_{n+1}, w_n)\, e_n \\
\theta_{n+1} &= \theta_n + \gamma_n \, \psi(x_n, u_n, \theta_n)\, d(x_n, x_{n+1}, w_n)
\end{aligned}
$$
where $d(x, y, w) = r(x) - \eta(\theta) + h(y, w) - h(x, w)$ is the TD error, $e_n$ is the TD($\lambda$) eligibility trace, and $\psi(x, u, \theta) = \nabla_\theta \mu(u \mid x, \theta) / \mu(u \mid x, \theta)$ is the likelihood-ratio gradient.

The critical distinction is that all updates run with the same learning-rate schedule $\gamma_n$ (fixed scaling gains $\Gamma_\eta$ and $\Gamma_w$ can be used for normalization), yielding concurrent adaptation of all components (0909.2934). This structure obviates inner loops or separate step-size hierarchies between actor and critic.
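
As a concrete illustration of these recursions, the sketch below performs one concurrent update of $\tilde{\eta}$, $w$, and $\theta$ with a shared step size. It is a minimal sketch, assuming a softmax policy over a finite action set with parameters $\theta \in \mathbb{R}^{|\mathcal{U}| \times L}$, a linear critic $h(x, w) = \phi(x)^\top w$, and a TD($\lambda$) eligibility trace; the environment interface (`env.step`), the feature map `phi`, the substitution of the running estimate $\tilde{\eta}_n$ for $\eta(\theta)$ inside the TD error, and all hyperparameter values are illustrative assumptions rather than details taken from the source.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def single_timescale_ac_step(x, theta, w, eta_tilde, e, env, phi,
                             gamma_n=0.01, Gamma_eta=1.0, Gamma_w=1.0, lam=0.9):
    """One concurrent actor/critic/average-reward update (average-reward setting).

    theta     : (num_actions, L) parameters of a softmax-in-features policy
    w         : (L,) critic weights, h(x, w) = phi(x) @ w
    eta_tilde : running estimate of the average reward
    e         : (L,) TD(lambda) eligibility trace
    """
    feat = phi(x)                              # phi(x) in R^L
    probs = softmax(theta @ feat)              # mu(u | x, theta)
    u = np.random.choice(len(probs), p=probs)

    x_next, r = env.step(x, u)                 # hypothetical environment interface
    feat_next = phi(x_next)

    # Shared TD error d(x_n, x_{n+1}, w_n); eta_tilde stands in for eta(theta).
    d = r - eta_tilde + feat_next @ w - feat @ w

    # Likelihood-ratio term psi(x, u, theta) = grad_theta log mu(u | x, theta).
    psi = -np.outer(probs, feat)
    psi[u] += feat

    # All three updates share the same step size gamma_n (single timescale).
    eta_tilde = eta_tilde + gamma_n * Gamma_eta * (r - eta_tilde)
    e = lam * e + feat                         # eligibility trace update
    w = w + gamma_n * Gamma_w * d * e          # critic (TD(lambda)) update
    theta = theta + gamma_n * d * psi          # actor update driven by the same d

    return x_next, theta, w, eta_tilde, e
```

Note that the step contains no inner loop: the critic is not run to convergence before the actor moves, which is precisely the feature the convergence analysis below must accommodate.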

2. Convergence Properties

The convergence of single-timescale actor-critic algorithms is established via ordinary differential equation (ODE) methods that analyze the joint stochastic recursion for $(\theta, w, \tilde{\eta})$. The principal findings include:

  • The stochastic process tracks an ODE system coupled via the TD error and the stationary distribution under $\theta$:
$$
\begin{aligned}
\dot{\theta} &= \nabla_\theta \eta(\theta) + \text{correction terms} \\
\dot{w} &= \Gamma_w \left[ A(\theta)\, w + b(\theta) + G(\theta)\,\big(\eta(\theta) - \tilde{\eta}\big) \right] \\
\dot{\tilde{\eta}} &= \Gamma_\eta \big( \eta(\theta) - \tilde{\eta} \big)
\end{aligned}
$$
with $A(\theta)$, $b(\theta)$, $G(\theta)$ defined in terms of TD($\lambda$) averages and stationary distributions.
  • Using Lyapunov analysis, it is shown that the system converges to an invariant set, specifically a neighborhood of a local maximum of the average reward; a schematic version of this argument is sketched after this list. The size of this neighborhood is controlled by the interaction of the step-size gains ($\Gamma_\eta$, $\Gamma_w$) and the critic's function approximation error.
  • This contrasts with two-timescale approaches, where, ideally, the critic “tracks” the actor perfectly so that convergence happens to the local optimum itself. In the single-timescale setting, the final iterate may suffer from persistent “dynamic error” proportional to the degree of function approximation and relative adaptation rates.
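
The mechanism behind this neighborhood convergence can be summarized schematically. This is a simplified sketch consistent with the ODE above, not the exact Lyapunov function or constants used in the source analysis: take $\eta(\theta)$ itself as a Lyapunov-type function for the policy component and write the magnitude of the correction terms as $\varepsilon(\Gamma_\eta, \Gamma_w, \varepsilon_{\mathrm{app}})$. Then
$$
\frac{d}{dt}\,\eta(\theta_t)
= \nabla_\theta \eta(\theta_t)^\top \dot{\theta}_t
\;\ge\; \big\| \nabla_\theta \eta(\theta_t) \big\|^2
- \big\| \nabla_\theta \eta(\theta_t) \big\| \, \varepsilon(\Gamma_\eta, \Gamma_w, \varepsilon_{\mathrm{app}}),
$$
so the average reward increases whenever $\|\nabla_\theta \eta(\theta_t)\| > \varepsilon(\cdot)$, and trajectories are driven into the set $\{\theta : \|\nabla_\theta \eta(\theta)\| \le \varepsilon(\cdot)\}$, a neighborhood of stationary points whose radius shrinks as the critic's tracking and approximation errors shrink.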

3. Temporal Difference Signal as a Unifying Mechanism

A salient feature of the single-timescale approach is that both the actor and critic updates are driven by the same TD signal:

$$
d(x, y, w) = r(x) - \eta(\theta) + h(y, w) - h(x, w)
$$

  • The critic uses the TD error in its TD($\lambda$) updates, driving the weights toward the solution of the projected Bellman equation.
  • The actor uses the TD error as an advantage-like signal, scaled by the likelihood-ratio term $\psi(x, u, \theta)$, to adjust the policy.
  • The shared signal ensures that both modules operate coherently—functioning more like tightly coupled biological processes—and reduces the need for auxiliary baselines or multiple critics.

This tight coupling is both responsible for rapid empirical convergence and a theoretical source of the “dynamic error”, since neither subsystem is ever perfectly optimized with respect to the other (0909.2934).
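
The sense in which the TD error is "advantage-like" can be made explicit with a standard identity not spelled out above. It is stated here under the idealization that the critic is exact, i.e. $h(\cdot, w)$ equals the true differential value function $h$, with $Q$ and $A$ denoting the usual action-value and advantage functions (neither defined in the source):
$$
\mathbb{E}\big[ d(x_n, x_{n+1}, w) \,\big|\, x_n = x,\, u_n = u \big]
= \underbrace{r(x) - \eta(\theta) + \mathbb{E}\big[ h(x_{n+1}) \mid x_n = x,\, u_n = u \big]}_{Q(x, u)} - h(x)
= A(x, u).
$$
With an approximate critic this identity holds only up to the approximation error, which is one route by which the dynamic error discussed above enters the actor update.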

4. Linear Function Approximation and Error Bounds

The critic employs linear function approximation $h(x, w) = \phi(x)^\top w$ and optimizes the squared difference to the (unknown) true differential value $h(x, \theta)$ via the cost:

$$
I(w, \theta) = \frac{1}{2} \big\| h(\theta) - \Phi w \big\|^2_{\Pi(\theta)}
$$

where $\Pi(\theta)$ denotes the stationary distribution induced by the current policy and $\Phi$ is the matrix of stacked feature vectors. The error incurred due to linear approximation ($\varepsilon_{\mathrm{app}}$) directly influences:

  • The width of the invariant set (distance to the local optimum),
  • The residual performance gap,
  • The tightness of the Lyapunov stability region proved in the ODE analysis.

Precise bounds and convergence neighborhoods are thus determined by the expressiveness of the critic's features and explicitly by the gain parameter $\Gamma_w$, which can be tuned to reduce, but not eliminate, the final error ball.
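
One natural way to make $\varepsilon_{\mathrm{app}}$ concrete, stated here as an assumption consistent with the text rather than as the source's exact definition, is the best weighted approximation error achievable by the feature class, uniformly over policies:
$$
\varepsilon_{\mathrm{app}}
\;=\; \sup_{\theta} \; \min_{w \in \mathbb{R}^{L}} \big\| h(\theta) - \Phi w \big\|_{\Pi(\theta)},
$$
under which $\varepsilon_{\mathrm{app}} = 0$ exactly when every policy's differential value function lies in the span of the features, and the convergence neighborhood of the ODE analysis collapses toward the local optimum.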

5. Biological and Computational Relevance

A key motivation for single-timescale actor-critic is its potential for modeling biological reinforcement learning. Empirical findings in neuroscience suggest that separate neural populations for policy and value learning do not operate at grossly separated timescales; rather, both respond on similar timescales, and phasic dopamine activity may serve as a TD error signal communicated between cortical/striatal circuits (actor) and dopaminergic pathways (critic) (0909.2934).

From a computational perspective, removing inner loops and step-size management simplifies implementation, aligns with common practice in modern deep RL, and can result in more adaptive and responsive algorithms. However, the guarantee of convergence only to a neighborhood (rather than to a precise optimum) may be a limiting tradeoff in highly sensitive applications.

6. Limitations and Research Directions

While the single-timescale method offers implementation and modeling advantages, current limitations include:

  • Convergence only to a neighborhood of a local optimum (unless step sizes are driven toward zero or one reverts to a two-timescale variant).
  • The critical dependence on function approximation error, which, if large, results in subpar performance regardless of further tuning.
  • Analytical error bounds in the ODE stability analysis remain somewhat loose; tighter constants and explicit convergence rates are identified as open problems.
  • Extensions to nonlinear function approximation (e.g., deep neural critics) are not covered in the original proof but are suggested as crucial for application to complex domains.
  • Integration of regularization, natural-gradient approaches, or staged adaptation (starting with a single timescale for fast mixing, then switching to separate schedules for final convergence) is proposed as a promising improvement.

Future work will likely focus on reducing the dynamic error, extending convergence proofs to more general architectures, and refining the theoretical link between biological plausibility and algorithmic efficiency.

Summary Table: Key Features of the Single-Timescale Actor-Critic Algorithm

| Feature | Description | Limitation/Tradeoff |
|---|---|---|
| Update schedule | Actor and critic updated with the same step size | Converges only to a neighborhood |
| TD error usage | Actor and critic share the same TD signal | Dynamic error persists |
| Critic approximation | Linear (feature-based), TD(λ) learning | Sensitive to approximation error |
| Convergence proof technique | ODE analysis, Lyapunov argument | Loose error constants, no explicit rates |
| Real-world justification | Matches deep RL practice and biological models | Theoretical optimality unattainable |
| Parameter sensitivity | Neighborhood size tuned by gain parameters | Tradeoff between speed and accuracy |

This body of evidence substantiates the design, theoretical foundations, operational characteristics, and open challenges of single-timescale actor-critic algorithms as presented in the online temporal difference-based framework (0909.2934).

References

  • arXiv:0909.2934