Asymmetric Actor-Critic Architecture

Updated 21 March 2026

Asymmetric actor-critic architecture is a reinforcement learning framework that equips the critic with privileged, high-fidelity inputs to improve policy evaluation and address partial observability.
It leverages differences in input, network capacity, and optimization roles to enhance training efficiency, sim-to-real transfer, and hierarchical learning.
Theoretical analyses and empirical results demonstrate faster convergence, reduced aliasing errors, and robust performance across various complex RL tasks.

An asymmetric actor-critic architecture is a reinforcement learning (RL) design principle in which the actor and critic modules receive different sources, modalities, or amounts of input information, and/or are instantiated with different network structures or update rules. This contrasts with conventional symmetric actor-critic architectures, where both actor and critic operate on the same input and often share network encoders. Asymmetry can be used to incorporate privileged or high-fidelity information into the critic (for example, full simulator state during training), to give the actor and critic different representational capacities, or to establish hierarchical leader–follower dynamics in the optimization process. The approach has both theoretical and empirical motivation: it mitigates the effects of partial observability and aliasing errors, improves training efficiency, and enables practical RL in complex or real-world domains.

1. Taxonomy and Key Instantiations

Asymmetric actor-critic encompasses several architectural and algorithmic instantiations:

Input Asymmetry (Privileged Critic Information): The critic is given privileged or high-fidelity observations unavailable to the actor at deployment, such as simulator state versus agent observations, or environment context variables during training in transfer learning scenarios. Notable implementations include asymmetric state-based critics in sim-to-real RL (Pinto et al., 2017), contextual critics in adaptive RL (Yue et al., 2022), and history–state critics in POMDPs (Baisero et al., 2021).
Capacity Asymmetry: The actor and critic are assigned neural networks of different sizes or expressiveness, for example a small, low-latency actor paired with a large, overparameterized critic (Mastikhina et al., 1 Jun 2025).
Representation Asymmetry (Decoupled or Specialized Encoders): The actor and critic maintain separate representation pipelines, which enables specialization: actor encodings concentrate on action-relevant features, while critic encodings capture value and transition information (Garcin et al., 8 Mar 2025).
Optimization Role Asymmetry (Game-Theoretic Hierarchy): The actor and critic are treated as leader and follower in a Stackelberg or saddle-point optimization—e.g., Stackelberg actor-critic (Zheng et al., 2021), and Dual Actor-Critic with primal-dual Bellman objectives (Dai et al., 2017).
Information Level Asymmetry: Critics use arbitrary privileged signals rather than the full ground-truth state, which generalizes the asymmetric paradigm to settings with partial, noisy, or structured side information (Ebi et al., 30 Sep 2025).

2. Privileged Information in Critics and Partial Observability

A canonical scenario for input asymmetry arises in environments with partial observability (POMDPs). During training, the critic can exploit privileged simulator state while the actor, constrained to observation histories, remains deployment-ready. Formally, the critic’s value function becomes $V(h, s)$, where $h$ is the agent-observed history and $s$ is the true latent state, whereas standard actor-critic would use only $V(h)$ or the ill-defined $V(s)$ under partial observability.

The “unbiased asymmetric actor-critic” (AA2C-HS) demonstrates that augmenting the critic with privileged state keeps the policy gradient unbiased while reducing its variance. The policy-gradient theorem is preserved when the history–state critic supplies the action values:

$\hat{g} = \sum_{t=0}^{T-1} \gamma^t Q^{\pi}(h_t, s_t, a_t) \nabla_\theta \log \pi_\theta(a_t \mid h_t)$

and variance can be further reduced using TD error baselines (Baisero et al., 2021). Theoretical analysis shows that such asymmetric critics do not introduce bias in the policy gradient, unlike naïve state-critic variants. Empirical studies indicate faster and higher-quality policy learning—especially on hard POMDPs—compared to symmetric or biased asymmetric approaches.
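The estimator can be illustrated with a minimal tabular sketch, written as an ascent direction on expected return. Everything here is an assumption made for illustration: the softmax policy over per-history logits, the dictionary-based critic `q_critic`, and the function names are hypothetical and are not taken from Baisero et al. The point is only that the policy term depends on the history $h_t$ alone, while the weighting term may also read the latent state $s_t$.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_pi(logits, action):
    """Gradient of log softmax(logits)[action] with respect to the logits."""
    probs = softmax(logits)
    return [(1.0 if a == action else 0.0) - p for a, p in enumerate(probs)]


def asymmetric_pg_estimate(trajectory, theta, q_critic, gamma):
    """Single-trajectory policy-gradient estimate with a history-state critic.

    trajectory: list of (history, state, action) tuples.
    theta:      dict mapping history -> list of action logits (actor sees h only).
    q_critic:   dict mapping (history, state, action) -> value (critic sees h and s).
    """
    grad = {h: [0.0] * len(row) for h, row in theta.items()}
    for t, (h, s, a) in enumerate(trajectory):
        weight = (gamma ** t) * q_critic[(h, s, a)]
        for i, g in enumerate(grad_log_pi(theta[h], a)):
            grad[h][i] += weight * g
    return grad


# Hypothetical one-step example: two actions, uniform initial policy.
theta = {"h0": [0.0, 0.0]}
q_critic = {("h0", "s0", 1): 2.0}
example_grad = asymmetric_pg_estimate([("h0", "s0", 1)], theta, q_critic, gamma=0.9)
# example_grad["h0"] is [-1.0, 1.0]: the logit of the taken action is pushed up.
```

Only the critic lookup uses the state; at deployment the actor needs nothing beyond `theta` and the history.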

Recent generalizations relax the assumption of full-state access: "informed asymmetric actor-critic" conditions the critic on arbitrary privileged signals, not just state, and provides information-theoretic diagnostics to assess the informativeness and expected benefit of signals provided to the critic (Ebi et al., 30 Sep 2025).

3. Game-Theoretic and Optimization Perspectives

Asymmetric architectures also arise by re-casting actor-critic optimization as a hierarchical, game-theoretic process:

Stackelberg Actor-Critic: The actor (leader) chooses its parameters $\theta$, anticipating the best-response mapping $w^*(\theta)$ of the critic (follower), which minimizes its own loss given the actor’s strategy. The actor therefore optimizes $J(\theta, w^*(\theta))$, accounting for the implicit dependence of the critic on $\theta$:

$\frac{dJ}{d\theta} = \nabla_\theta J(\theta, w) + \left( \frac{\partial w^*}{\partial \theta} \right)^T \nabla_w J(\theta, w)$

where $\partial w^* / \partial \theta$ is computed via implicit differentiation of the critic’s optimality condition. Compared with standard simultaneous gradient updates, which drop the second term, this Stackelberg (leader–follower) update reflects the hierarchical structure of the problem, and it is reported to reduce cycling in the learning dynamics and to accelerate convergence to equilibrium (Zheng et al., 2021).
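The total derivative can be checked on a scalar toy problem. The sketch below is not the algorithm of Zheng et al.; the quadratic critic loss $L(\theta, w) = \tfrac{1}{2}(w - c\theta)^2$, the actor objective $J(\theta, w) = -(\theta - 1)^2 + \theta w$, and the constant `C` are all hypothetical choices made so that every quantity has a closed form.

```python
C = 0.5  # hypothetical coupling between actor parameter and critic optimum


def critic_best_response(theta):
    """w*(theta) = argmin_w 0.5 * (w - C * theta)**2."""
    return C * theta


def actor_objective(theta, w):
    """Toy actor objective J(theta, w)."""
    return -(theta - 1.0) ** 2 + theta * w


def best_response_jacobian(theta):
    """dw*/dtheta by implicit differentiation of dL/dw = w - C * theta = 0."""
    d2L_dw2 = 1.0          # second derivative of L in w
    d2L_dw_dtheta = -C     # mixed derivative of L
    return -d2L_dw_dtheta / d2L_dw2


def simultaneous_gradient(theta):
    """Partial derivative of J in theta only, with w held at w*(theta)."""
    w = critic_best_response(theta)
    return -2.0 * (theta - 1.0) + w


def stackelberg_gradient(theta):
    """Total derivative: partial in theta plus (dw*/dtheta) * partial in w."""
    dJ_dw = theta
    return simultaneous_gradient(theta) + best_response_jacobian(theta) * dJ_dw


def finite_difference(theta, eps=1e-5):
    """Central difference of theta -> J(theta, w*(theta)) as a sanity check."""
    hi = actor_objective(theta + eps, critic_best_response(theta + eps))
    lo = actor_objective(theta - eps, critic_best_response(theta - eps))
    return (hi - lo) / (2.0 * eps)
```

At `theta = 1.0` the simultaneous gradient is 0.5 while the Stackelberg gradient is 1.0, and only the latter matches the finite-difference slope of $J(\theta, w^*(\theta))$.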

Dual Actor-Critic: Framed via the Lagrangian dual of the Bellman equation, the critic (dual) minimizes a Lagrangian, while the actor (primal) maximizes over policy occupancy. Both modules are updated within a single saddle-point optimization, as opposed to alternating separate objectives. This coordinated, asymmetric interaction delivers unbiased gradients, clearer theoretical guarantees, and improved empirical performance (Dai et al., 2017).
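
For orientation, one standard saddle-point form comes from the linear-programming formulation of the Bellman optimality equation. The notation below is assumed for illustration (initial-state distribution $\mu$, reward $R$, transition kernel $P$, nonnegative multipliers $\rho$), and the objective actually optimized in Dai et al. (2017) may differ in its details:

$\min_{V} \max_{\rho \ge 0} \; (1-\gamma)\, \mathbb{E}_{s \sim \mu}[V(s)] + \sum_{s,a} \rho(s,a) \left( R(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}[V(s')] - V(s) \right)$

In this form the value function $V$ is the minimizing player, and the multipliers $\rho(s,a)$ can be read as an unnormalized state-action occupancy, which is the sense in which the maximizing player corresponds to the actor.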

4. Architectural Asymmetry: Capacity and Representation

Empirical RL pipelines frequently instantiate actors and critics with different network architectures. Motivations include reducing inference costs for the actor, or maximizing critic expressivity for stable value estimation:

Small Actor, Large Critic: Reducing the actor size (number of parameters, hidden units) relative to the critic can accelerate inference and deployment. However, empirical evidence indicates that naïvely shrinking the actor leads to a performance drop and critic overfitting, largely because of data-collection pathologies: conservative policies produce low-quality experience, induce Q-value underestimation, and explore poorly (Mastikhina et al., 1 Jun 2025). Performance can be largely recovered by correcting the value-target bias (e.g., replacing SAC's $\min$-target with mean or max aggregation), which compensates for the pessimism induced by combining a small actor with an under-explored state space.
Decoupled Representation Pipelines: With separate actor and critic encoders, both networks specialize: actor encoders tend to extract action-relevant features, while critic encoders represent value and transition dynamics. Decoupling, as opposed to parameter sharing, yields measurable improvements: increased mutual information between critic latents and value, reduced overfitting in actor latents, improved sample efficiency, and greater policy generalization (Garcin et al., 8 Mar 2025). Critically, the separate critic pipeline also implicitly shapes exploration by assigning meaningful advantage signals to under-explored or value-uncertain states.
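
The value-target correction mentioned for the small-actor case can be sketched as a choice of aggregator over an ensemble of next-state Q estimates. This is a simplified illustration, not the implementation of Mastikhina et al.: the function name `td_target` and its parameters are hypothetical, and SAC's entropy term is deliberately omitted to keep the aggregation step visible.

```python
from statistics import mean


def td_target(reward, gamma, next_q_values, aggregate="min", done=False):
    """One-step TD target from several critics' next-state Q estimates.

    aggregate: "min" (pessimistic, as in clipped double-Q), "mean", or "max".
    The entropy bonus used by SAC is left out of this sketch.
    """
    aggregators = {"min": min, "mean": mean, "max": max}
    if aggregate not in aggregators:
        raise ValueError(f"unknown aggregate: {aggregate!r}")
    bootstrap = 0.0 if done else aggregators[aggregate](next_q_values)
    return reward + gamma * bootstrap


# Two critics disagree about the next state's value.
pessimistic = td_target(1.0, 0.5, [2.0, 4.0], aggregate="min")   # 2.0
averaged = td_target(1.0, 0.5, [2.0, 4.0], aggregate="mean")     # 2.5
optimistic = td_target(1.0, 0.5, [2.0, 4.0], aggregate="max")    # 3.0
```

Moving from `"min"` to `"mean"` or `"max"` raises the target whenever the critics disagree, which is the direction needed to offset underestimation.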

5. Asymmetric Actor-Critic in Domain Adaptation and Transfer

The asymmetric paradigm is fundamental in heterogeneous training/deployment regimes, such as sim-to-real transfer and generalization to novel environments:

Sim-to-real Robotic RL: In simulation, the critic can be conditioned on ground-truth state while the actor is restricted to raw images (or other partial sensory signals). Empirical evidence from robotic manipulation and navigation tasks indicates that this design, especially when combined with data augmentation (domain randomization), yields better sample efficiency and successful real-world transfer. The learned policies are robust to noise, distractors, and calibration errors, which fully symmetric architectures handle less efficiently (Pinto et al., 2017).
Contextual RL and Environmental Generalization: In domain adaptation settings formalized as contextual MDPs (CMDPs), the critic is given access to environment context variables (e.g., physical parameters, wind, friction) during training, which supports more precise value estimates and advantage shaping, while the actor is constrained to raw sensory observations. This separation is reported to be essential for robust zero-shot generalization to unseen test distributions. The "AACC" (Asymmetric Actor-Critic in Contextual RL) architecture demonstrates significant gains over domain randomization and symmetric-context baselines (Yue et al., 2022).
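
The separation between training-time and deployment-time inputs in both settings can be sketched as a small data-flow example. The names `TrainingSample`, `actor_input`, and `critic_input` are hypothetical, and concatenating the observation with the privileged signal is only one possible design; a state-only critic, for instance, would drop the observation entirely.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class TrainingSample:
    observation: Sequence[float]  # what the deployed actor will also see
    privileged: Sequence[float]   # simulator state or context; training only


def actor_input(sample: TrainingSample) -> List[float]:
    """The actor never touches the privileged field, so it stays deployable."""
    return list(sample.observation)


def critic_input(sample: TrainingSample) -> List[float]:
    """The critic may read both, since it is discarded after training."""
    return list(sample.observation) + list(sample.privileged)


# Hypothetical sample: two sensor readings plus one context variable.
sample = TrainingSample(observation=[0.1, 0.2], privileged=[9.8])
```

Here `actor_input(sample)` yields `[0.1, 0.2]` and `critic_input(sample)` yields `[0.1, 0.2, 9.8]`, so removing the `privileged` field at deployment leaves the actor's interface unchanged.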

6. Theoretical Guarantees and Convergence

Asymmetric actor-critic methods have robust theoretical foundations under both tabular and function approximation regimes:

Elimination of Aliasing Error: Classic symmetric critics in partially observable environments suffer from "aliasing": distinct true states map to the same agent state, which injects irreducible error into the value approximation and the resulting policy gradients. The asymmetric critic, by conditioning on privileged state during training, provably eliminates this aliasing term. Finite-time error analysis shows that the TD fitting error bound for the asymmetric critic lacks the aliasing term present in the symmetric case (Lambrechts et al., 31 Jan 2025), giving a tighter convergence bound under partial observability.
Unbiasedness and Variance: Several works prove that augmenting the critic to $V(h, s)$ or $Q(h, i, a)$, where $i$ denotes a privileged signal, does not introduce bias into the policy gradient, provided the advantage estimates are constructed appropriately (Baisero et al., 2021, Ebi et al., 30 Sep 2025). For privileged signals beyond pure state (for example, partial "distance to goal" cues or other task-relevant information), informativeness diagnostics can quantify the expected benefit; these diagnostics have been empirically linked to improved learning efficiency (Ebi et al., 30 Sep 2025).
Game-Theoretic and Saddle-Point Convergence: Leader–follower updates in Stackelberg and saddle-point dual actor-critic methods admit local convergence guarantees under two-timescale stochastic approximation theory and convexity assumptions (Zheng et al., 2021, Dai et al., 2017).
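
The aliasing effect can be made concrete with a toy regression example. This is a Monte Carlo least-squares illustration under assumed data, not the finite-time TD analysis of Lambrechts et al.: two hypothetical latent states share one observation but have different returns, and all names are invented for the sketch.

```python
from collections import defaultdict


def fit_tabular_critic(samples, key):
    """Least-squares tabular fit: the mean observed return for each key."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for sample in samples:
        k = key(sample)
        sums[k] += sample["ret"]
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}


def mean_squared_error(samples, values, key):
    """Average squared gap between observed returns and fitted values."""
    return sum((s["ret"] - values[key(s)]) ** 2 for s in samples) / len(samples)


def symmetric_key(sample):
    """Symmetric critic: indexed by the observation only."""
    return sample["obs"]


def asymmetric_key(sample):
    """Asymmetric critic: indexed by observation and privileged state."""
    return (sample["obs"], sample["state"])


# Two latent states that look identical to the agent but differ in return.
samples = [
    {"obs": "o", "state": "s0", "ret": 0.0},
    {"obs": "o", "state": "s1", "ret": 1.0},
]

symmetric_error = mean_squared_error(
    samples, fit_tabular_critic(samples, symmetric_key), symmetric_key
)  # 0.25: the best single value for "o" is 0.5
asymmetric_error = mean_squared_error(
    samples, fit_tabular_critic(samples, asymmetric_key), asymmetric_key
)  # 0.0: each (obs, state) pair gets its own value
```

No amount of additional data removes the 0.25 error of the observation-only critic, whereas the state-conditioned critic fits both returns exactly.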

7. Extensions: Multiple Critics and Model-Based Signals

Further generalizations of asymmetric actor-critic architectures incorporate multiple critics endowed with heterogeneous sources of value information:

Model-Based Critic Integration: The "actor-critic-2" framework fuses a reward-based critic (model-free, long-term credit assignment) with a model-based critic grounded in potential-field heuristics (encoding geometric or task priors). The actor update adaptively interpolates between these critics using state-dependent weights, biasing learning toward guided exploration in high-potential regions (e.g., far from the goal or near obstacles) and reverting to standard policy gradients in ambiguous states. This multiple-critic asymmetric fusion accelerates convergence, especially in tasks with a strong structural prior, but it requires mechanisms for weight interpolation and variance balancing (Ren, 2020).
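
The state-dependent interpolation can be sketched as a convex combination of two advantage signals. The weighting rule below is a hypothetical stand-in chosen for simplicity (it grows with distance to the goal); the actual weighting scheme of the cited framework is not specified here, and both function names are invented for the sketch.

```python
def potential_weight(distance_to_goal, scale=1.0):
    """Hypothetical weight in [0, 1) that increases with distance to the goal."""
    if distance_to_goal < 0.0 or scale <= 0.0:
        raise ValueError("distance must be non-negative and scale positive")
    return distance_to_goal / (distance_to_goal + scale)


def fused_advantage(adv_reward, adv_model, weight):
    """Convex combination of model-free and model-based advantage signals.

    weight = 0 recovers the standard (reward-critic) policy gradient signal;
    weight = 1 relies entirely on the model-based critic.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * adv_model + (1.0 - weight) * adv_reward


# At the goal the model-based critic is ignored; farther away it contributes.
at_goal = fused_advantage(2.0, 4.0, potential_weight(0.0))       # 2.0
mid_range = fused_advantage(2.0, 4.0, potential_weight(1.0))     # 3.0
```

Keeping the weight inside $[0, 1]$ bounds the fused signal between the two critics' estimates, which is one simple way to limit the variance introduced by mixing them.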

In summary, the asymmetric actor-critic architecture leverages principled discrepancies between actor and critic, whether in input information, network capacity, or optimization dynamics, to improve sample efficiency, generalization, and learning stability in RL. Both theoretical analyses and empirical benchmarks substantiate the benefits of this paradigm across POMDPs, sim-to-real transfer, domain adaptation, and hierarchical RL settings. Active lines of research explore how to optimally select and encode privileged signals for the critic, design adaptive fusion of multiple evaluators, and further exploit game-theoretic formulations for robust RL in complex environments.