Natural Critic-Actor Algorithm in CMDPs

Updated 12 October 2025
  • Natural Critic-Actor Algorithm is an online reinforcement learning method that reverses conventional update timescales and leverages natural policy gradients for CMDPs.
  • It employs a three-timescale structure with linear value approximation, average cost estimation, and Lagrange multiplier updates to ensure both stability and constraint satisfaction.
  • The algorithm achieves improved non-asymptotic sample complexity and robust performance on safety-critical tasks compared to traditional actor-critic frameworks.

The Natural Critic-Actor Algorithm is a class of online reinforcement learning algorithms for long-run average cost constrained Markov decision processes (CMDPs) that innovates on the standard actor-critic architecture by reversing the conventional update timescale ordering and incorporating natural policy gradient methods. Employing linear function approximation, online average cost estimation, and Lagrange multiplier updates for inequality constraints, this algorithm exhibits a three-timescale "critic-actor" structure and achieves improved non-asymptotic sample complexity while ensuring stability and constraint satisfaction. The design supports practical deployment in large-scale, safety-critical or constraint-dominated RL domains.

1. Algorithm Structure and Key Features

The Constrained Natural Critic-Actor (C–NCA) algorithm proceeds at each step by updating four sets of variables:

  • Average cost estimate $L_n$ and estimates $U_{k,n}$ for each constraint cost,
  • Critic parameter $v_n \in \mathbb{R}^d$ (for the linear function approximator of the differential value function),
  • Actor parameter $\theta_n \in \mathbb{R}^{d_\theta}$ (parameterizing the stochastic policy),
  • Lagrange multipliers $\gamma_n \in \mathbb{R}^m$ (for $m$ inequality constraints).

The distinguishing element is the critic–actor timescale reversal: the actor (policy) parameters $\theta_n$ receive the fastest updates (at the same rate as the average cost estimates), followed by the critic $v_n$ on an intermediate scale, and finally the Lagrange multipliers $\gamma_n$ on the slowest scale. This ordering is formalized by diminishing step-size sequences $a(n)$ (actor and average cost, e.g., $a(n) \sim n^{-\nu}$ with $\nu > 0.5$), $b(n)$ (critic, on a slower timescale than $a(n)$), and $c(n)$ (multiplier, the slowest).
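To make the ordering concrete, the sketch below implements one possible set of polynomially decaying schedules. The exponent values and unit scale factors are illustrative assumptions, not values prescribed by the algorithm; only the ordering $\nu < \sigma < \beta$ matters.

```python
# Illustrative three-timescale step-size schedules (exponents are assumptions;
# only the ordering nu < sigma < beta matters for the timescale separation).
def a(n, nu=0.55):      # actor / average-cost step size (fastest timescale)
    return 1.0 / (n + 1) ** nu

def b(n, sigma=0.7):    # critic step size (intermediate timescale)
    return 1.0 / (n + 1) ** sigma

def c(n, beta=0.9):     # Lagrange-multiplier step size (slowest timescale)
    return 1.0 / (n + 1) ** beta

# The ratios b(n)/a(n) and c(n)/b(n) vanish as n grows, which is what
# "faster" and "slower" timescales mean in stochastic approximation.
for n in (10, 1_000, 100_000):
    print(n, b(n) / a(n), c(n) / b(n))
```

Only the relative decay rates matter for the multi-timescale argument; the absolute constants can be tuned per problem.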

Update Rules (primary recursions)

  • Critic update:

$$v_{n+1} = \Gamma\left( v_n + b(n)\,\delta_n\,\phi(s_n) \right)$$

where $\Gamma$ projects onto a compact set, $\phi(s)$ are fixed features, and $\delta_n$ is the temporal-difference (TD) error:

$$\delta_n = r(s_n, a_n, \gamma_n) - L_n + v_n^\top \left(\phi(s_{n+1}) - \phi(s_n)\right).$$

  • Actor update (natural gradient):

$$\theta_{n+1} = \theta_n + a(n)\,\delta_n\,G(n)^{-1}\Psi(s_n, a_n)$$

with $\Psi(s,a) = \nabla_\theta \log \pi_\theta(a \mid s)$ and $G(n)$ the Fisher information matrix estimate:

$$G(n+1) = (1 - a(n))\,G(n) + a(n)\,\Psi(s_n, a_n)\Psi(s_n, a_n)^\top.$$

  • Average cost and constraint cost update:

$$L_{n+1} = L_n + d(n)\left[ r(s_n, a_n, \gamma_n) - L_n \right].$$

Analogous updates are applied to each constraint cost estimate $U_{k,n}$.

  • Lagrange multiplier:

$$\gamma_{k,n+1} = \hat{\Gamma}\left( \gamma_{k,n} + c(n)\left[ U_{k,n} - \alpha_k \right] \right)$$

with $\hat{\Gamma}$ a projection operator.

The combination of these recursions forms a multi-timescale stochastic approximation system in which the actor's natural gradient increment leverages the curvature of the policy manifold for greater robustness, and the critic's TD error is computed with respect to a linear value-function approximator using features.
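The recursions can be collected into a single online iteration. The NumPy sketch below is a minimal reading of the stated update rules, not a reference implementation: the feature map `phi`, score function `psi`, per-stage cost, constraint costs `h`, thresholds `alpha`, the exact Lagrangian form of $r(s,a,\gamma)$, the projection radii, and the small Fisher regularizer are all assumptions supplied for illustration.

```python
import numpy as np

def cnca_step(params, sample, fns, steps):
    """One C-NCA iteration (a sketch under assumed interfaces, not reference code).

    params: dict with v (critic weights), theta (policy parameters),
            G (Fisher estimate), L (running relaxed-cost average),
            U (constraint-cost estimates, shape (m,)), gamma (multipliers, shape (m,)).
    sample: observed transition (s, a, s_next).
    fns:    problem-specific callables (assumed): phi(s), psi(s, a, theta),
            cost(s, a), h(s, a) -> (m,), plus the thresholds alpha (m,).
    steps:  step sizes {"a_n", "b_n", "c_n", "d_n"} for this iteration.
    """
    s, a, s_next = sample
    phi, psi = fns["phi"], fns["psi"]
    cost, h, alpha = fns["cost"], fns["h"], fns["alpha"]

    # Relaxed (Lagrangian) single-stage cost r(s, a, gamma); the exact form of
    # the relaxation is an assumption here.
    r = cost(s, a) + params["gamma"] @ h(s, a)

    # TD error for the average-cost differential value function.
    delta = r - params["L"] + params["v"] @ (phi(s_next) - phi(s))

    # Critic update on the intermediate timescale, projected onto a ball
    # (radius 10.0 is an arbitrary stand-in for the compact set).
    v = params["v"] + steps["b_n"] * delta * phi(s)
    v *= min(1.0, 10.0 / (np.linalg.norm(v) + 1e-12))

    # Fisher-information estimate and natural-gradient actor update (fastest timescale).
    score = psi(s, a, params["theta"])
    G = (1.0 - steps["a_n"]) * params["G"] + steps["a_n"] * np.outer(score, score)
    nat_dir = np.linalg.solve(G + 1e-6 * np.eye(score.size), score)  # small ridge for stability
    theta = params["theta"] + steps["a_n"] * delta * nat_dir

    # Running averages of the relaxed cost and of each constraint cost.
    L = params["L"] + steps["d_n"] * (r - params["L"])
    U = params["U"] + steps["d_n"] * (h(s, a) - params["U"])

    # Projected multiplier ascent on the slowest timescale (bound M = 10.0 assumed).
    gamma = np.clip(params["gamma"] + steps["c_n"] * (U - alpha), 0.0, 10.0)

    return {"v": v, "theta": theta, "G": G, "L": L, "U": U, "gamma": gamma}
```

Run in a loop with the diminishing schedules from above, all four groups of variables are updated from a single streamed transition per step, which is the defining online, multi-timescale character of the method.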

2. Function Approximation and Compatibility

The algorithm adopts linear function approximation both for the differential value function (critic) and for computing the compatible features needed in the natural gradient update:

  • Differential value function: $V(s) \approx v^\top \phi(s)$, where $\phi(s) \in \mathbb{R}^d$.
  • Compatible features: actor updates use $\Psi(s,a) = \nabla_\theta \log \pi_\theta(a \mid s)$, aligning the function class of the advantage estimate with the policy gradient.

This compatibility is critical: it ensures the natural gradient step direction aligns with the true policy gradient under certain regularity conditions, yields unbiased updates, and underpins the theoretical analysis of both convergence and constraint satisfaction.
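For concreteness, here is a minimal sketch of the score function $\Psi(s,a)$ for a softmax policy that is linear in state-action features. The feature map $x(s,a)$ and its shape conventions are hypothetical; the softmax score identity itself is standard.

```python
import numpy as np

def softmax_score(x_sa, theta, a_idx):
    """Psi(s, a) = grad_theta log pi_theta(a | s) for a linear softmax policy.

    x_sa:  (num_actions, d_theta) array of state-action features x(s, a') for
           every action available in the current state (a hypothetical feature map).
    theta: policy parameters, shape (d_theta,).
    a_idx: index of the action actually taken.
    """
    logits = x_sa @ theta
    logits = logits - logits.max()                 # stabilize the softmax numerically
    probs = np.exp(logits) / np.exp(logits).sum()
    # Softmax identity: grad_theta log pi(a|s) = x(s, a) - sum_a' pi(a'|s) x(s, a')
    return x_sa[a_idx] - probs @ x_sa
```

Feeding this $\Psi(s,a)$ into a linear advantage approximator is precisely what the compatibility condition requires: the advantage estimate is constrained to the span of the policy's score function, so the resulting update direction stays aligned with the true policy gradient.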

3. Convergence and Sample Complexity Analysis

A rigorous non-asymptotic analysis is provided for the convergence behavior of each component:

  • Critic: For step-size choices respecting the prescribed timescale separation (e.g., $a(n) \sim n^{-\nu}$, $b(n) \sim n^{-\sigma}$, $c(n) \sim n^{-\beta}$ with $0 < \nu < \sigma < \beta < 1$), it is established that the average squared error in the critic parameter, $z_n = v_n - v^*(\theta_n, \gamma_n)$, admits the bound:

$$\frac{1}{t-\tau} \sum_{k=\tau}^{t} \mathbb{E}\|z_k\|^2 = O\!\left( \log^2 t \cdot t^{2\sigma - 2\nu}/t \right) + \ldots$$

  • Actor and average cost: For similar scaling, the error in the policy and in the average cost estimate, $\mathbb{E}[(L_k - L(\theta_k, \gamma_k))^2]$, is shown to decay at optimized rates (e.g., $O(\log^2 t \cdot t^{-\nu}) + O(t^{\nu - \beta})$).
  • Sample complexity improvement: By modifying the step sizes with a $\sqrt{\log(t+1)}$ factor (for example, $a(n) = \sqrt{\log(n+1)}/n^{\nu}$), the sample complexity of driving the mean-squared critic error below $\varepsilon$ is $T = \tilde{O}(\varepsilon^{-2})$, matching that of the best known single-timescale unconstrained approaches and improving on prior multi-timescale analyses (a small schedule comparison follows this list).
  • Constraint satisfaction: The projected multiplier recursions are proven to ensure, with high probability, that the long-run average constraints are respected, i.e., the learned policy satisfies $\mathbb{E}[U_{k,n}] \leq \alpha_k$ after sufficiently many iterations.
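The baseline and $\sqrt{\log}$-modified schedules can be compared directly; the exponent $\nu = 0.55$ below is an arbitrary admissible choice, and the snippet only compares magnitudes rather than reproducing the complexity analysis.

```python
import math

# Baseline vs. sqrt(log)-modified schedules; nu = 0.55 is an illustrative
# admissible exponent, and indexing starts at n = 1 so both schedules are finite.
def a_baseline(n, nu=0.55):
    return 1.0 / n ** nu

def a_modified(n, nu=0.55):
    return math.sqrt(math.log(n + 1)) / n ** nu

for n in (10, 1_000, 100_000):
    ratio = a_modified(n) / a_baseline(n)   # grows only like sqrt(log n)
    print(f"n={n:>7}  baseline={a_baseline(n):.6f}  modified={a_modified(n):.6f}  ratio={ratio:.2f}")
```

The modified schedule decays only marginally more slowly than the baseline, yet, per the analysis summarized above, this logarithmic inflation is what yields the $\tilde{O}(\varepsilon^{-2})$ sample complexity.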

4. Relationship to Actor-Critic and Critic-Actor Architectures

The C–NCA algorithm generalizes both the two-timescale critic-actor (CA) architecture (where the policy is updated at a faster rate than the value function, emulating value iteration) and the standard actor-critic (AC) approach (where the critic tracks a slowly varying policy). The multi-timescale ordering used here inherits the critic-actor stability properties established for unconstrained discounted and average reward problems (Bhatnagar et al., 2022).

Distinguishing features compared to actor-critic include:

  • Stability of natural gradient updates for fast policy improvement,
  • Lower practical complexity for policy parameter tuning,
  • Matched or superior sample complexity, with further gains available through step-size modification,
  • Capacity for effective handling of safety or inequality constraints in CMDPs.

5. Empirical Performance and Safety-Critical Applications

Empirical evaluation on Safety–Gym benchmarks (including SafetyAntCircle1–v0, SafetyCarGoal1–v0, and SafetyPointPush1–v0) demonstrates that the C–NCA (especially the "Modified" version employing the improved step-size schedule) is competitive with, and in several cases outperforms, state-of-the-art constrained RL algorithms (C–AC, C–NAC, C–CA, C–DQN, etc.) in terms of maximizing average reward while maintaining rigorous satisfaction of constraint thresholds. These experiments validate the theoretical findings and reinforce the utility of natural critic–actor approaches for constraint-dominated, safety-oriented RL settings.

6. Mathematical Formulation and Algorithmic Details

A representative subset of the algorithm's update rules is:

$$\begin{aligned}
&\text{Critic:} && v_{n+1} = \Gamma\left( v_n + b(n)\,\delta_n\,\phi(s_n) \right) \\
&\text{TD error:} && \delta_n = r(s_n, a_n, \gamma_n) - L_n + v_n^\top\left(\phi(s_{n+1}) - \phi(s_n)\right) \\
&\text{Actor (natural gradient):} && \theta_{n+1} = \theta_n + a(n)\,\delta_n\,G(n)^{-1}\Psi(s_n, a_n) \\
&\text{Fisher update:} && G(n+1) = (1 - a(n))\,G(n) + a(n)\,\Psi(s_n, a_n)\Psi(s_n, a_n)^\top \\
&\text{Average cost:} && L_{n+1} = L_n + d(n)\left[ r(s_n, a_n, \gamma_n) - L_n \right] \\
&\text{Constraint cost:} && U_{k,n+1} = U_{k,n} + d(n)\left[ h_k(s_n, a_n) - U_{k,n} \right] \\
&\text{Multiplier:} && \gamma_{k,n+1} = \hat{\Gamma}\left( \gamma_{k,n} + c(n)\left[ U_{k,n} - \alpha_k \right] \right)
\end{aligned}$$

where $\hat{\Gamma}$ denotes the projection onto $[0, M]$ for bounded multipliers, $b(n)$ and $d(n)$ are intermediate step sizes, and $a(n)$ and $c(n)$ are the fast and slow step sizes, respectively.
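As a small numerical illustration of the projected multiplier recursion on its own, the toy run below uses hypothetical values (threshold $\alpha_k = 0.1$, projection bound $M = 10$, and a synthetic sequence of constraint-cost estimates $U_{k,n}$): the multiplier ramps up while the estimate exceeds the threshold and slowly relaxes once it drops below.

```python
import numpy as np

alpha_k, M = 0.1, 10.0                        # hypothetical threshold and projection bound
gamma_k = 0.0
c = lambda n: 1.0 / (n + 1) ** 0.9            # slow multiplier step size (illustrative)

# Synthetic constraint-cost estimates: violating for 200 steps, feasible afterwards.
U_k = np.concatenate([np.full(200, 0.3), np.full(200, 0.05)])

for n, u in enumerate(U_k):
    gamma_k = np.clip(gamma_k + c(n) * (u - alpha_k), 0.0, M)   # projected ascent
print(f"final multiplier: {gamma_k:.3f}")
```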

7. Significance and Future Directions

The Natural Critic-Actor algorithm with constraint-handling and function approximation advances reinforcement learning theory by:

  • Establishing the first non-asymptotic convergence and sample complexity guarantees for the critic-actor order with natural policy gradient in constrained average cost MDPs,
  • Demonstrating that step-size refinement can achieve nearly optimal $\tilde{O}(\varepsilon^{-2})$ sample complexity, a key milestone for RL algorithms with complex constraints,
  • Confirming, via benchmark tasks, that the algorithm delivers stable and competitive performance alongside strict constraint satisfaction.

Potential extensions include exploring nonlinear function approximation (e.g., neural network critics), application to large-scale multi-agent systems, and further variance reduction techniques to relax technical conditions on Markovian noise or step-size scheduling. The approach may serve as a foundation for scalable, deployable RL in high-stakes domains where constraint satisfaction and safety are paramount.
