Residual & Stackelberg Actor-Critic RL

Updated 16 March 2026

Residual and Stackelberg actor-critic are advanced RL methods that reformulate the actor-critic framework as a bilevel game, incorporating corrections for the critic’s best-response behavior.
The residual approach explicitly estimates the gradient gap via an auxiliary residual critic, ensuring convergence to the true policy gradient under ideal conditions.
The Stackelberg method employs hypergradient updates obtained by differentiating through the critic’s optimal response, which the cited works report improves stability; in high-dimensional settings it depends on scalable approximations of the hypergradient.

Residual and Stackelberg actor-critic (AC) algorithms extend the standard actor-critic paradigm in reinforcement learning by adopting a game-theoretic bilevel or Stackelberg perspective on the actor–critic interaction, leading to new update rules, optimization frameworks, and theoretical properties. Both approaches introduce corrections to standard AC gradient methods: "residual" schemes estimate and add an explicit correction to the policy gradient, while "Stackelberg" or bilevel methods replace the actor’s update with the total derivative (hypergradient) that accounts for the critic’s response. This article provides a detailed exposition of these frameworks, their mathematical foundations, algorithmic instantiations, theoretical guarantees, and empirical performance.

1. Mathematical Formulation: Actor–Critic as Bilevel and Stackelberg Games

Actor-critic methods optimize two interlinked objectives: the actor's policy parameters $\theta$ are chosen to maximize expected return, while the critic's parameters (for a value or $Q$ function, written $w$ or $q$) are chosen to give accurate value estimates. The standard actor-critic update

$\theta_{k+1} = \theta_k + \alpha_\theta\, \nabla_\theta J(\theta_k, w_k), \qquad w_{k+1} = w_k - \alpha_w\, \nabla_w L(\theta_k, w_k)$

treats the actor and critic as simultaneously adapting, ignoring how the critic’s "best response" to the actor evolves.

The Stackelberg or bilevel view instead models actor and critic as sequential players in a general-sum game:

Actor as leader: $\max_\theta J(\theta, w^*(\theta))$, where $w^*(\theta) = \arg\min_w L(\theta, w)$.
Critic as follower: $\min_w L(\theta, w)$.

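By the implicit function theorem, which requires $\nabla_w^2 L$ to be invertible at $w = w^*(\theta)$, the total derivative of the leader's objective, with every term on the right evaluated at $w = w^*(\theta)$, is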

$\frac{d}{d\theta} J(\theta, w^*(\theta)) = \nabla_\theta J(\theta, w) - \nabla_{w\theta}^\top L(\theta, w)\,[\nabla_w^2 L(\theta, w)]^{-1}\,\nabla_w J(\theta, w).$

This incorporates both the direct policy gradient and an implicit correction term reflecting the critic’s "reaction" to changes in $\theta$ (Zheng et al., 2021, Wen et al., 2021).
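As a quick sanity check of the total-derivative formula, the sketch below evaluates it on a scalar toy problem and compares it with a finite-difference derivative of $J(\theta, w^*(\theta))$. The quadratic forms of `J` and `L` and the constants `A` and `C` are illustrative assumptions, not objectives taken from the cited papers.

```python
# Toy scalar bilevel problem; every function and constant here is an illustrative assumption.
A, C = 2.0, 3.0  # follower's best response is w*(theta) = A * theta; C is the curvature of L in w


def J(theta, w):
    """Leader objective (to be maximized)."""
    return -(theta - 1.0) ** 2 - 0.5 * (w - 1.0) ** 2


def best_response(theta):
    """Minimizer over w of L(theta, w) = 0.5 * C * (w - A * theta) ** 2."""
    return A * theta


def hypergradient(theta):
    """Total derivative of J(theta, w*(theta)) from the implicit-function formula."""
    w = best_response(theta)
    dJ_dtheta = -2.0 * (theta - 1.0)  # partial derivative of J in theta
    dJ_dw = -(w - 1.0)                # partial derivative of J in w
    d2L_dw_dtheta = -C * A            # mixed second derivative of L
    d2L_dw2 = C                       # second derivative (Hessian) of L in w
    return dJ_dtheta - d2L_dw_dtheta * (1.0 / d2L_dw2) * dJ_dw


def finite_difference(theta, eps=1e-6):
    """Central difference of the composed objective theta -> J(theta, w*(theta))."""
    composed = lambda t: J(t, best_response(t))
    return (composed(theta + eps) - composed(theta - eps)) / (2.0 * eps)


theta0 = 0.3
hg = hypergradient(theta0)
fd = finite_difference(theta0)
print(hg, fd)  # the two values should agree up to finite-difference error
```

In this toy the correction term equals `A * dJ_dw`, so dropping it would ignore the entire effect of $\theta$ on the leader's payoff through the critic.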

2. Residual Actor-Critic: Explicit Correction Between Actor-Critic and True Policy Gradient

Standard actor-critic methods do not, in general, recover the exact gradient of the expected cumulative reward due to the off-policy effect and critic approximation error. Wen et al. derive a closed-form expression for the gap between the actor-critic gradient and the true policy gradient:

$\nabla_\theta J(\theta) - \partial_\theta J_\pi = \partial_\theta E_{(s,a)\sim d_\theta}[\delta_{\theta, \phi}(s,a)]$

where $\delta_{\theta, \phi}(s,a)$ is the Bellman residual of the critic (Wen et al., 2021).
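To make the Bellman residual concrete, the following sketch computes it for a tabular critic on a two-state, two-action problem. It assumes the common on-policy form $\delta(s,a) = r(s,a) + \gamma\, E_{a' \sim \pi}[Q(s', a')] - Q(s,a)$; the MDP, the policy, and the perturbation are hypothetical, and the exact definition used by Wen et al. may differ in detail.

```python
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)
# Hypothetical deterministic MDP: NEXT[s][a] is the next state, REWARD[s][a] the reward.
NEXT = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}
REWARD = {0: {0: 0.0, 1: 1.0}, 1: {0: 2.0, 1: 0.0}}
# Fixed stochastic policy pi(a | s), also an assumption.
POLICY = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.2, 1: 0.8}}


def bellman_residual(q):
    """delta(s, a) = r(s, a) + gamma * E_{a' ~ pi}[q(s', a')] - q(s, a)."""
    delta = {}
    for s in STATES:
        for a in ACTIONS:
            s_next = NEXT[s][a]
            v_next = sum(POLICY[s_next][b] * q[(s_next, b)] for b in ACTIONS)
            delta[(s, a)] = REWARD[s][a] + GAMMA * v_next - q[(s, a)]
    return delta


def evaluate_policy(num_sweeps=2000):
    """Iterate the Bellman operator; q + delta is one application of it."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(num_sweeps):
        delta = bellman_residual(q)
        q = {key: q[key] + delta[key] for key in q}
    return q


q_true = evaluate_policy()
# A critic that overestimates one state-action value by 0.5.
q_biased = {key: val + (0.5 if key == (0, 1) else 0.0) for key, val in q_true.items()}

max_res_true = max(abs(d) for d in bellman_residual(q_true).values())
res_biased = bellman_residual(q_biased)
print(max_res_true)
print(res_biased)
```

The converged critic has a residual of essentially zero everywhere, while the perturbed critic has a nonzero residual both at the perturbed entry and at every state-action pair that bootstraps from it.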

The residual actor-critic (Res-AC) algorithm addresses this by introducing a residual-critic network $w_\psi$ to estimate the value of the residual reward, then updating the actor according to

$\theta \leftarrow \theta + \alpha \left[ \nabla_\theta^{\phi} J + \nabla_\theta^{\psi} J_\delta \right]$

where $\nabla_\theta^\phi J$ is the standard actor-critic gradient and $\nabla_\theta^\psi J_\delta$ is the policy gradient computed from the residual-critic’s value function. This yields bias-free gradient updates in the limit of ideal critic and residual-critic tracking, and recovers the true policy gradient (Wen et al., 2021).

3. Stackelberg Actor-Critic and Bilevel Hypergradients

The Stackelberg AC framework applies the total derivative or "hypergradient" with respect to $\theta$ in the actor update. The update step for the actor (with critic as follower) is:

$\theta_{k+1} = \theta_k + \alpha_\theta \left[\nabla_\theta J(\theta_k, w_k) - \nabla_{w\theta}^\top L(\theta_k, w_k) [\nabla^2_w L(\theta_k, w_k)]^{-1} \nabla_w J(\theta_k, w_k)\right]$

while the critic update remains

$w_{k+1} = w_k - \alpha_w \nabla_w L(\theta_k, w_k).$

This "implicit gradient" update follows the total derivative of the leader's objective and provably converges to a local Stackelberg equilibrium under standard stochastic-approximation assumptions, including suitable step-size conditions and regularity of $J$ and $L$ (Zheng et al., 2021).
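The two update rules can be compared on a scalar toy game. The sketch below runs the simultaneous update from Section 1 and the Stackelberg update above on hypothetical quadratic objectives; the objectives, step sizes, and iteration count are assumptions chosen only so that both iterations converge.

```python
# Toy scalar game; objectives, constants, and step sizes are illustrative assumptions.
A, C = 2.0, 3.0
MIXED_L = -C * A  # mixed second derivative of L(theta, w) = 0.5 * C * (w - A * theta) ** 2
HESS_L = C        # second derivative of L in w


def J(theta, w):
    """Leader (actor) objective, maximized."""
    return -(theta - 1.0) ** 2 - 0.5 * (w - 1.0) ** 2


def grad_theta_J(theta, w):
    return -2.0 * (theta - 1.0)


def grad_w_J(theta, w):
    return -(w - 1.0)


def grad_w_L(theta, w):
    return C * (w - A * theta)


def run(stackelberg, steps=5000, lr_theta=0.01, lr_w=0.1):
    theta, w = 0.0, 0.0
    for _ in range(steps):
        g = grad_theta_J(theta, w)
        if stackelberg:
            # Implicit correction: minus mixed-derivative * inverse Hessian * grad_w J.
            g -= MIXED_L * (1.0 / HESS_L) * grad_w_J(theta, w)
        # Both players update from the same iterate (theta_k, w_k).
        theta, w = theta + lr_theta * g, w - lr_w * grad_w_L(theta, w)
    return theta, w


theta_sim, w_sim = run(stackelberg=False)
theta_st, w_st = run(stackelberg=True)
print(theta_sim, w_sim, J(theta_sim, w_sim))
print(theta_st, w_st, J(theta_st, w_st))
```

In this toy the simultaneous update settles where the actor's partial gradient vanishes, whereas the Stackelberg update settles at the maximizer of $J(\theta, w^*(\theta))$ and leaves the leader with a higher payoff.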

For high-dimensional nonlinear critics, the hypergradient requires an inverse-Hessian-vector product, which is made tractable by approximations such as the Nyström method (Prakash et al., 16 May 2025). The BLPO algorithm (bilevel policy optimization with Nyström hypergradients) uses such techniques to remain scalable and stable in practice.
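The general idea behind a Nyström inverse-Hessian-vector product can be illustrated in its simplest rank-one form. The sketch below is not the BLPO implementation: it assumes a single sampled Hessian column, a regularizer `rho`, and a hypothetical 3×3 Hessian, and it uses the Sherman–Morrison identity so that the full Hessian is never inverted.

```python
def nystrom_rank1_inverse_hvp(hess_column, k, v, rho):
    """Apply (rho * I + c c^T / c[k])^{-1} to v, where c is column k of the Hessian.

    c c^T / c[k] is the rank-one Nystrom approximation built from column k;
    the inverse is applied in closed form via the Sherman-Morrison identity.
    """
    c = hess_column
    h_kk = c[k]
    c_dot_v = sum(ci * vi for ci, vi in zip(c, v))
    c_dot_c = sum(ci * ci for ci in c)
    scale = c_dot_v / (rho * h_kk + c_dot_c)
    return [(vi - scale * ci) / rho for ci, vi in zip(c, v)]


# Hypothetical symmetric positive-definite critic Hessian (illustrative values).
H = [[4.0, 1.0, 0.5],
     [1.0, 3.0, 0.2],
     [0.5, 0.2, 2.0]]
k, rho = 0, 0.1
column = [row[k] for row in H]  # in practice one Hessian-vector product with a basis vector
v = [1.0, -2.0, 0.5]
x = nystrom_rank1_inverse_hvp(column, k, v, rho)

# Sanity check: multiplying x by the regularized approximation should return v.
n = len(v)
approx_H = [[rho * (i == j) + column[i] * column[j] / column[k] for j in range(n)]
            for i in range(n)]
recovered = [sum(approx_H[i][j] * x[j] for j in range(n)) for i in range(n)]
print(x)
print(recovered)
```

Using more sampled columns replaces the scalar division with a small matrix solve (the Woodbury identity), trading extra Hessian-vector products for a more accurate approximation.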

4. Penalty-Based Single-Loop and Residual-Corrected Stackelberg Actor-Critic

Traditional bilevel/Stackelberg optimization in RL typically uses nested loops, solving the lower-level problem (the critic’s best response) approximately for each actor update. Recent advances propose single-loop schemes with penalty or residual-based correction to avoid the inefficiencies of nested optimization. A prominent approach introduces a penalty-augmented surrogate objective:

$L_{w, \tau}(x, \pi) = f(x, \pi) + \frac{1}{w} (J_\tau(x, \pi_\tau^*(x)) - J_\tau(x, \pi))$

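where $J_\tau$ is an entropy-regularized value, and the penalty parameter $w$ and regularization weight $\tau$ are annealed over training (Zeng et al., 23 Jan 2026). In this section $x$ denotes the upper-level variable, $\pi$ the lower-level policy, and $f$ the upper-level objective; $w$ here is the penalty parameter, not the critic parameters of Sections 1 and 3. The policy residual is tracked explicitly to control the deviation between the current policy and the optimal regularized policy $\pi_\tau^*(x)$ without requiring a full lower-level solve.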

The resulting hyper-gradient for the upper-level update includes a residual correction:

$\nabla_x \Phi_{w,\tau}(x) = \nabla_x f(x, \pi_{w,\tau}^*(x)) + \frac{1}{w}\left[\nabla_x J_\tau(x,\pi_\tau^*(x)) - \nabla_x J_\tau(x, \pi_{w,\tau}^*(x))\right]$

Provable finite-time and finite-sample convergence rates are established under Polyak–Lojasiewicz conditions on the lower-level problem (Zeng et al., 23 Jan 2026).
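The penalty surrogate and its gradient formula can be checked by hand on a scalar toy problem. In the sketch below the objectives `f` and `J` are hypothetical quadratics, the entropy term is dropped (so $\tau$ plays no role), and the penalized policy $\pi_{w,\tau}^*(x)$ has a closed form; none of this is taken from Zeng et al.

```python
def f(x, pi):
    """Upper-level objective (minimized); illustrative assumption."""
    return (pi - 1.0) ** 2 + x ** 2


def J(x, pi):
    """Lower-level objective (maximized over pi); its maximizer is pi = x."""
    return -(pi - x) ** 2


def penalized_policy(x, w):
    """Closed-form minimizer over pi of f(x, pi) + (1/w) * (J(x, x) - J(x, pi))."""
    return (w + x) / (w + 1.0)


def surrogate(x, w):
    """Penalty surrogate evaluated at the penalized policy."""
    pi_w = penalized_policy(x, w)
    return f(x, pi_w) + (J(x, x) - J(x, pi_w)) / w


def surrogate_grad(x, w):
    """partial_x f + (1/w) * [partial_x J at pi* - partial_x J at pi_w]."""
    pi_w = penalized_policy(x, w)
    dJ_dx = lambda pi: 2.0 * (pi - x)  # partial derivative of J in x at fixed pi
    return 2.0 * x + (dJ_dx(x) - dJ_dx(pi_w)) / w


x0, w0, eps = 0.4, 0.1, 1e-6
grad_formula = surrogate_grad(x0, w0)
grad_fd = (surrogate(x0 + eps, w0) - surrogate(x0 - eps, w0)) / (2.0 * eps)
x_star_w = 1.0 / (w0 + 2.0)  # stationary point of the surrogate for this toy
print(grad_formula, grad_fd, x_star_w)
```

For this toy the surrogate's stationary point is $1/(w+2)$, which approaches the bilevel solution $x = 1/2$ as the penalty parameter $w$ is annealed toward zero.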

5. Algorithm Comparison and Empirical Performance

Empirical evaluations in both tabular and continuous control domains reveal non-trivial trade-offs between vanilla actor-critic, residual, and Stackelberg actor-critic approaches. Key findings (Zheng et al., 2021, Wen et al., 2021):

Method	FourRoom	Pend	Reach	Chee
Actor_o–C	0.85	−1.7	−4.5	2000
Actor_g–C	1.00	−1.3	−3.8	2800
Stack–AC	1.10	−1.4	−3.7	3100
Res–AC	2.40	−0.8	−2.1	4200

Residual AC achieves notably superior sample efficiency and asymptotic return in the evaluated settings, especially when the critic is near-optimal. Stackelberg actor-critic demonstrates improved stability and convergence, particularly in problems with pronounced bilinear or quadratic structure, and is less sensitive to oscillatory dynamics that impede standard AC learning. In nonlinear/high-dimensional settings, effective hypergradient estimation (e.g., via the Nyström scheme in BLPO) is crucial for stable Stackelberg optimization (Prakash et al., 16 May 2025).

6. Extensions to Offline RL and Adversarial Stackelberg Formulations

Stackelberg actor-critic principles have also been extended to offline RL settings with limited data coverage. The Adversarially Trained Actor Critic (ATAC) algorithm casts offline RL as a Stackelberg game in which the actor maximizes a relative, data-consistent value and the critic adversarially induces worst-case evaluations subject to Bellman consistency (Cheng et al., 2022). This yields a robust policy improvement guarantee: the learned policy will not underperform the behavior policy for a wide range of critic regularization strengths ($\beta$). The Stackelberg structure enables direct finite-sample consistency proofs and competitiveness with any well-covered policy.

7. Theoretical Guarantees and Limitations

Stackelberg and residual actor-critic algorithms enjoy a suite of theoretical guarantees:

For Res-AC, the correction to the policy gradient becomes exact as the critic and residual critic converge (Wen et al., 2021).
For Stackelberg AC, convergence to a local (differential) Stackelberg equilibrium is established under stochastic-approximation regimes (Zheng et al., 2021).
Under strong convexity or PL conditions, polynomial-time convergence to $\varepsilon$-stationary points is established for BLPO and related methods (Prakash et al., 16 May 2025, Zeng et al., 23 Jan 2026).
Residual-based single-loop algorithms match the best sample complexities of nested approaches for strongly convex/PL lower-level problems (Zeng et al., 23 Jan 2026).

Limitations include reliance on accurate second-order (Hessian) information or efficient low-rank approximations for Stackelberg-type hypergradients, and the necessity of careful time-scale separation or penalty annealing for convergence in nonconvex or highly stochastic environments.

In summary, residual and Stackelberg actor-critic algorithms formalize and extend AC methods within a bilevel game-theoretic framework, supplying bias-free gradient corrections and hypergradient updates, and yielding provable stability and accelerated convergence via explicit accounting for actor–critic interdependence. Both methodologies have inspired further innovations in practical RL, including in first-order single-loop schemes, scalable bilevel optimization, robust offline RL, and the design of sample-efficient, stable deep RL algorithms (Wen et al., 2021, Zheng et al., 2021, Prakash et al., 16 May 2025, Zeng et al., 23 Jan 2026, Cheng et al., 2022).

References (5)

Stackelberg Actor-Critic: Game-Theoretic Reinforcement Learning Algorithms (2021)

Characterizing the Gap Between Actor-Critic and Policy Gradient (2021)

Bi-Level Policy Optimization with Nyström Hypergradients (2025)

A Regularized Actor-Critic Algorithm for Bi-Level Reinforcement Learning (2026)

Adversarially Trained Actor Critic for Offline Reinforcement Learning (2022)
