Natural Critic-Actor Algorithm in CMDPs
- Natural Critic-Actor Algorithm is an online reinforcement learning method that reverses conventional update timescales and leverages natural policy gradients for CMDPs.
- It employs a three-timescale structure with linear value approximation, average cost estimation, and Lagrange multiplier updates to ensure both stability and constraint satisfaction.
- The algorithm achieves improved non-asymptotic sample complexity and robust performance on safety-critical tasks compared to traditional actor-critic frameworks.
The Natural Critic-Actor Algorithm is a class of online reinforcement learning algorithms for long-run average cost constrained Markov decision processes (CMDPs) that innovates on the standard actor-critic architecture by reversing the conventional update timescale ordering and incorporating natural policy gradient methods. Employing linear function approximation, online average cost estimation, and Lagrange multiplier updates for inequality constraints, this algorithm exhibits a three-timescale "critic-actor" structure and achieves improved non-asymptotic sample complexity while ensuring stability and constraint satisfaction. The design supports practical deployment in large-scale, safety-critical or constraint-dominated RL domains.
1. Algorithm Structure and Key Features
The Constrained Natural Critic-Actor (C–NCA) algorithm proceeds at each step by updating four sets of variables:
- Average cost estimate and estimates for each constraint cost,
- Critic parameter (for the linear function approximator of the differential value function),
- Actor parameter (parameterizing the stochastic policy),
- Lagrange multipliers (for inequality constraints).
The distinguishing element is the critic-actor timescale reversal: the actor (policy) parameters receive the fastest updates (on the same timescale as the average cost estimates), followed by the critic on an intermediate timescale, and finally the Lagrange multipliers on the slowest timescale. This ordering is formalized by diminishing step-size sequences $\{\alpha_t\}$ for the actor and the average cost estimates, $\{\beta_t\}$ for the critic (with $\beta_t/\alpha_t \to 0$, so the critic evolves on a slower timescale), and $\{\gamma_t\}$ for the multipliers (the slowest, with $\gamma_t/\beta_t \to 0$); an illustrative schedule is sketched below.
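As a concrete illustration of such a schedule (the decay exponents below are placeholder assumptions, not the paper's prescribed values), step sizes respecting this three-timescale ordering can be generated as follows:

```python
import numpy as np

def three_timescale_stepsizes(t, a=0.55, b=0.7, c=0.85):
    """Illustrative diminishing step sizes with the required timescale separation.

    The exponents a < b < c are hypothetical choices; any 0.5 < a < b < c <= 1
    yields beta_t/alpha_t -> 0 and gamma_t/beta_t -> 0, i.e. the actor and
    average-cost updates are fastest, the critic is intermediate, and the
    Lagrange multipliers are slowest, while keeping each sequence
    non-summable with summable squares.
    """
    alpha_t = 1.0 / (t + 1) ** a   # actor and average-cost estimates (fastest)
    beta_t = 1.0 / (t + 1) ** b    # critic (intermediate)
    gamma_t = 1.0 / (t + 1) ** c   # Lagrange multipliers (slowest)
    return alpha_t, beta_t, gamma_t
```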
Update Rules (primary recursions)
- Critic update:
  $$v_{t+1} = \Gamma_v\big(v_t + \beta_t\,\delta_t\,\phi(s_t)\big),$$
  where $\Gamma_v$ projects onto a compact set, $\phi(\cdot)$ are fixed state features, and $\delta_t$ is the temporal-difference (TD) error
  $$\delta_t = \bar{c}_t - \bar{J}_t + v_t^{\top}\phi(s_{t+1}) - v_t^{\top}\phi(s_t),$$
  with $\bar{c}_t$ the Lagrangian-weighted single-stage cost (the cost combined with the constraint costs through the current multipliers) and $\bar{J}_t$ its running average, assembled from the average cost and constraint-cost estimates below.
- Actor update (natural gradient):
  $$\theta_{t+1} = \theta_t - \alpha_t\,G_t^{-1}\,\delta_t\,\psi(s_t,a_t),$$
  with $\psi(s_t,a_t) = \nabla_\theta \log \pi_{\theta_t}(a_t \mid s_t)$ the compatible features and $G_t$ a recursive estimate of the Fisher information matrix,
  $$G_{t+1} = (1-\alpha_t)\,G_t + \alpha_t\,\psi(s_t,a_t)\,\psi(s_t,a_t)^{\top}.$$
- Average cost and constraint cost update:
  $$\hat{J}_{t+1} = \hat{J}_t + \alpha_t\big(c(s_t,a_t) - \hat{J}_t\big),$$
  with analogous updates $\hat{J}^{(i)}_{t+1} = \hat{J}^{(i)}_t + \alpha_t\big(c_i(s_t,a_t) - \hat{J}^{(i)}_t\big)$ for each constraint-cost estimate.
- Lagrange multiplier:
  $$\lambda^{(i)}_{t+1} = \Gamma_\lambda\big(\lambda^{(i)}_t + \gamma_t\,(\hat{J}^{(i)}_t - b_i)\big),$$
  with $\Gamma_\lambda$ a projection operator that keeps each multiplier in a bounded interval $[0, \lambda_{\max}]$ and $b_i$ the threshold of the $i$-th inequality constraint.
The combination of these recursions forms a multi-timescale stochastic approximation system in which the actor's natural-gradient increment exploits the curvature of the policy manifold for greater robustness, while the critic's TD error is computed with respect to the linear value-function approximator over the fixed features $\phi$. A schematic single-step implementation of these recursions is sketched below.
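The following Python sketch assembles the four recursions into one update step. It is a minimal illustration under explicit assumptions: a generic critic feature map `phi`, compatible features `psi`, and the Lagrangian-weighted single-stage cost as one common way of folding the inequality constraints into the TD error; all names, projection radii, and the ridge term are illustrative rather than taken from a reference implementation.

```python
import numpy as np

def cnca_step(s, a, cost, con_costs, s_next,
              J_hat, J_con, v, theta, lam, G,
              alpha_t, beta_t, gamma_t,
              phi, psi, thresholds,
              v_radius=100.0, lam_max=100.0, ridge=1e-3):
    """One illustrative iteration of a constrained natural critic-actor update.

    phi(s): critic features; psi(s, a): compatible features grad_theta log pi(a|s).
    J_hat / J_con: running estimates of the average cost / constraint costs (arrays).
    v, theta, lam, G: critic, actor, Lagrange multipliers, Fisher estimate.
    """
    # Fastest timescale: average cost and constraint-cost estimates.
    J_hat = J_hat + alpha_t * (cost - J_hat)
    J_con = J_con + alpha_t * (con_costs - J_con)

    # Lagrangian-weighted single-stage cost and its running average
    # (an assumed, but standard, way of feeding constraints into the critic).
    lagr_cost = cost + lam @ (con_costs - thresholds)
    lagr_avg = J_hat + lam @ (J_con - thresholds)

    # TD error of the differential value function v^T phi(.).
    delta = lagr_cost - lagr_avg + v @ phi(s_next) - v @ phi(s)

    # Intermediate timescale: critic update, projected onto an l2 ball.
    v = v + beta_t * delta * phi(s)
    v_norm = np.linalg.norm(v)
    if v_norm > v_radius:
        v = v * (v_radius / v_norm)

    # Fast timescale: Fisher estimate and natural-gradient actor step
    # (descent, since the long-run average cost is being minimized).
    psi_sa = psi(s, a)
    G = (1.0 - alpha_t) * G + alpha_t * np.outer(psi_sa, psi_sa)
    nat_grad = np.linalg.solve(G + ridge * np.eye(len(theta)), delta * psi_sa)
    theta = theta - alpha_t * nat_grad

    # Slowest timescale: projected multiplier ascent on estimated constraint violation.
    lam = np.clip(lam + gamma_t * (J_con - thresholds), 0.0, lam_max)

    return J_hat, J_con, v, theta, lam, G
```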
2. Function Approximation and Compatibility
The algorithm adopts linear function approximation both for the differential value function (critic) and for computing the compatible features needed in the natural gradient update:
- Differential value function: $\hat{V}(s) = v^{\top}\phi(s)$, where $\phi(s)$ is a fixed feature vector and $v$ is the critic parameter.
- Compatible features: actor updates use $\psi(s,a) = \nabla_\theta \log \pi_\theta(a \mid s)$, aligning the function class of the advantage estimate (a linear function of these features) with the policy gradient.
This compatibility is critical: it ensures the natural gradient step direction aligns with the true policy gradient under certain regularity conditions, yields unbiased updates, and underpins the theoretical analysis of both convergence and constraint satisfaction.
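As an example of how these compatible features arise in practice, the sketch below assumes a linearly parameterized softmax (Gibbs) policy over hypothetical state-action features `x(s, a)`; for this family, $\psi(s,a) = \nabla_\theta \log \pi_\theta(a \mid s)$ has a simple closed form.

```python
import numpy as np

def softmax_policy(theta, x, s, actions):
    """pi_theta(a|s) proportional to exp(theta^T x(s, a)) for a linear softmax policy."""
    logits = np.array([theta @ x(s, a) for a in actions])
    logits -= logits.max()                  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def compatible_features(theta, x, s, a, actions):
    """psi(s, a) = grad_theta log pi_theta(a|s) = x(s, a) - E_{a'~pi}[x(s, a')]."""
    probs = softmax_policy(theta, x, s, actions)
    expected_x = sum(p * x(s, ap) for p, ap in zip(probs, actions))
    return x(s, a) - expected_x

# A linear advantage estimate w^T psi(s, a) built on these features is then
# "compatible" with the policy gradient in the sense described above.
```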
3. Convergence and Sample Complexity Analysis
A rigorous non-asymptotic analysis is provided for the convergence behavior of each component:
- Critic: For step-size choices respecting the prescribed timescale separation among the actor, critic, and multiplier recursions, the mean-squared error of the critic parameter $v_t$ is shown to decay at an explicit polynomial rate in the number of iterations.
- Actor and Average Cost: Under the same scaling, the errors in the policy parameters and in the average cost (and constraint-cost) estimates are shown to decay at correspondingly optimized rates.
- Sample complexity improvement: By modifying the step-size schedules with an additional multiplicative factor, the sample complexity of driving the mean-squared critic error below a tolerance $\epsilon$ matches that of the best known single-timescale unconstrained approaches and improves on prior multi-timescale analyses.
- Constraint satisfaction: The projected multiplier recursions are proven to ensure, with high probability, that the long-run average constraints are respected, i.e., the learned policy satisfies $J_i(\pi_\theta) \le b_i$ for every constraint $i$ after sufficiently many iterations.
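For reference, the constrained problem behind these guarantees can be written in the standard Lagrangian saddle-point form used for CMDPs generally (the notation matches the recursions above; this is the generic formulation rather than a result specific to this algorithm):

```latex
% Long-run average-cost CMDP and its Lagrangian relaxation (standard form).
\begin{aligned}
&\min_{\theta}\; J_0(\theta)
  \quad \text{s.t.} \quad J_i(\theta) \le b_i, \quad i = 1,\dots,m,
  \qquad
  J_i(\theta) = \lim_{T\to\infty} \frac{1}{T}\,
    \mathbb{E}_{\pi_\theta}\!\Big[\sum_{t=0}^{T-1} c_i(s_t, a_t)\Big],\\
&\max_{\lambda \ge 0}\; \min_{\theta}\;
  L(\theta, \lambda) = J_0(\theta) + \sum_{i=1}^{m} \lambda_i \big(J_i(\theta) - b_i\big)
  \qquad \text{(the saddle point targeted by the multiplier recursion).}
\end{aligned}
```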
4. Comparison with Related Architectures
The C–NCA algorithm generalizes both the two-timescale critic-actor (CA) architecture (where the policy is updated at a faster rate than the value function, emulating value iteration) and the standard actor-critic (AC) approach (where the critic tracks a slow policy). The multi-timescale ordering used here inherits the critic-actor stability properties established for unconstrained discounted and average reward problems (Bhatnagar et al., 2022).
Distinguishing features compared to actor-critic include:
- Stability of natural gradient updates for fast policy improvement,
- Lower practical complexity for policy parameter tuning,
- Matched or improved sample complexity, with additional flexibility obtained through step-size modification,
- Capacity for effective handling of safety or inequality constraints in CMDPs.
5. Empirical Performance and Safety-Critical Applications
Empirical evaluation on Safety-Gym benchmarks (including SafetyAntCircle1-v0, SafetyCarGoal1-v0, and SafetyPointPush1-v0) demonstrates that the C–NCA (especially the "Modified" version employing the improved step-size schedule) is competitive with, and in several cases outperforms, state-of-the-art constrained RL algorithms (C–AC, C–NAC, C–CA, C–DQN, etc.) in terms of maximizing average reward while maintaining rigorous satisfaction of constraint thresholds. These experiments validate the theoretical findings and reinforce the utility of natural critic-actor approaches for constraint-dominated, safety-oriented RL settings.
6. Mathematical Formulation and Algorithmic Details
A representative subset of the algorithm's update rules, collecting the recursions above, is
$$\hat{J}_{t+1} = \hat{J}_t + \alpha_t\big(c(s_t,a_t) - \hat{J}_t\big), \qquad v_{t+1} = \Gamma_v\big(v_t + \beta_t\,\delta_t\,\phi(s_t)\big),$$
$$\theta_{t+1} = \theta_t - \alpha_t\,G_t^{-1}\,\delta_t\,\psi(s_t,a_t), \qquad \lambda^{(i)}_{t+1} = \Gamma_\lambda\big(\lambda^{(i)}_t + \gamma_t\,(\hat{J}^{(i)}_t - b_i)\big),$$
where $\Gamma_\lambda$ denotes the projection onto $[0,\lambda_{\max}]$ for bounded multipliers, $\Gamma_v$ projects the critic parameters onto a compact set, $\{\beta_t\}$ are the intermediate step sizes, and $\{\alpha_t\}$, $\{\gamma_t\}$ are the fast and slow step sizes, respectively.
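As a small companion to these equations, the projections $\Gamma_v$ and $\Gamma_\lambda$ can be realized as simple clipping operations; the radius and bound below are arbitrary placeholders, not values prescribed by the analysis.

```python
import numpy as np

def project_critic(v, radius=100.0):
    """Gamma_v: project the critic parameters onto an l2 ball of the given radius."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def project_multipliers(lam, lam_max=100.0):
    """Gamma_lambda: keep each Lagrange multiplier inside [0, lam_max]."""
    return np.clip(lam, 0.0, lam_max)
```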
7. Significance and Future Directions
The Natural Critic-Actor algorithm with constraint-handling and function approximation advances reinforcement learning theory by:
- Establishing the first non-asymptotic convergence and sample complexity guarantees for the critic-actor update ordering combined with natural policy gradients in constrained average cost MDPs,
- Demonstrating that step-size refinement can achieve nearly optimal sample complexity, a key milestone for RL algorithms with complex constraints,
- Confirming, via benchmark tasks, that the algorithm delivers stable and competitive performance alongside strict constraint satisfaction.
Potential extensions include exploring nonlinear function approximation (e.g., neural network critics), application to large-scale multi-agent systems, and further variance reduction techniques to relax technical conditions on Markovian noise or step-size scheduling. The approach may serve as a foundation for scalable, deployable RL in high-stakes domains where constraint satisfaction and safety are paramount.