Centralized-Critic Actor-Critic Methods
- Centralized-critic actor-critic methods are defined as a MARL approach where a global critic guides locally executed policies.
- They utilize various architectures—state-based, history-based, or hybrid—to balance bias and variance in partially observable settings.
- Empirical studies show that the critic design critically affects learning efficiency and policy optimality in diverse multi-agent tasks.
Centralized-critic actor-critic methods constitute a pivotal approach in multi-agent reinforcement learning (MARL), particularly under the Centralized Training with Decentralized Execution (CTDE) paradigm. In these frameworks, each agent’s policy (actor) is conditioned only on its own observation or history and operates independently at execution time, while training leverages a centralized critic with access to global information, such as the true system state or the joint action-observation histories. This separation enables the exploitation of centralized information during training for accelerated and stabilized learning, while ensuring decentralized, scalable operation at test time. Despite their empirical popularity, recent theoretical and empirical findings indicate that centralized critics, especially those based solely on state information, can introduce nontrivial bias and variance in policy gradient estimates in partially observable environments, and their benefits are environment-dependent (Lyu et al., 2022, Lyu et al., 26 Aug 2024, Lyu et al., 2021).
1. Formal Framework for Centralized-Critic Actor-Critic Methods
The standard setting for centralized-critic actor-critic methods in MARL is the Decentralized Partially Observable Markov Decision Process (Dec-POMDP). At each timestep $t$, the environment is in an unobserved global state $s_t$; each agent $i$ receives a private observation $o_t^i$ and executes an action $a_t^i$ according to its decentralized policy $\pi^i(a_t^i \mid h_t^i; \theta^i)$, where $h_t^i$ is the agent's local action-observation history. Typically, the agents collaboratively optimize a shared, discounted cumulative reward:

$$J(\boldsymbol{\theta}) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t\, r_t\right],$$

where $\boldsymbol{\theta} = (\theta^1, \ldots, \theta^n)$ collects all actor parameters.
During centralized training, the critic is parameterized as either a value function $V_\phi(\cdot)$ or a state-action value function $Q_\phi(\cdot, \mathbf{a}_t)$, having access to the full system state $s_t$ or the joint history $\mathbf{h}_t$. Actor update rules, such as those following the policy gradient theorem, use the critic to evaluate actions and compute advantages, often via estimators of the form:

$$\nabla_{\theta^i} J(\boldsymbol{\theta}) \approx \mathbb{E}\!\left[\nabla_{\theta^i} \log \pi^i(a_t^i \mid h_t^i; \theta^i)\,\big(Q_\phi(s_t, \mathbf{a}_t) - b_t\big)\right],$$

where $b_t$ is a baseline derived from the central critic (Lyu et al., 2022, Lyu et al., 26 Aug 2024).
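To make this update rule concrete, the following is a minimal sketch of a single agent's actor loss with a state-based central critic serving as the baseline. It assumes PyTorch; the module names, layer sizes, and tensor shapes are illustrative assumptions, not a reference implementation from the cited papers.

```python
# Minimal sketch: decentralized actor, state-based centralized critic baseline.
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized actor: conditions only on the agent's own history embedding."""
    def __init__(self, hist_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hist_dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_actions))

    def forward(self, hist_embed):
        return torch.distributions.Categorical(logits=self.net(hist_embed))

class CentralCritic(nn.Module):
    """Centralized critic: conditions on the global state (the SC variant)."""
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, state):
        return self.net(state).squeeze(-1)

def actor_loss(actor, critic, hist_embed, state, action, ret):
    """Policy-gradient loss using the central critic's value as a baseline b_t."""
    dist = actor(hist_embed)
    advantage = (ret - critic(state)).detach()  # return-to-go minus V(s_t)
    return -(dist.log_prob(action) * advantage).mean()
```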
2. Critic Architectures and Variants
Centralized critics can leverage various forms of available information, leading to several architecture variants:
| Critic Variant | Input Features | Example Usage |
|---|---|---|
| State-based (SC) | Global state $s_t$ | SMAC, particle worlds (deterministic or nearly fully observable) (Lyu et al., 2022, Lyu et al., 26 Aug 2024) |
| History-based (HC) | Joint action-observation histories $\mathbf{h}_t$ | Dec-Tiger, Box-Pushing (partial observability dominates) (Lyu et al., 2022, Lyu et al., 26 Aug 2024) |
| Joint-observation | Joint current observations | Cooperative navigation (Lyu et al., 2022, Lyu et al., 26 Aug 2024) |
| Hybrid (HSC) | Global state combined with joint histories | Combines information sources for an improved bias/variance trade-off (Lyu et al., 2022) |
Architecturally, SC typically employs an MLP embedding of the global state $s_t$; HC applies RNNs to full joint histories; HSC concatenates RNN history embeddings and global state embeddings to inform value estimation (Lyu et al., 2022).
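A schematic of the three critic variants, written as a hedged PyTorch sketch under assumed layer sizes and input conventions; the cited papers' exact encoders may differ.

```python
# Sketch of the three critic input architectures (hypothetical dimensions).
import torch
import torch.nn as nn

class StateCritic(nn.Module):          # SC: MLP over the global state s_t
    def __init__(self, state_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, state):
        return self.mlp(state)

class HistoryCritic(nn.Module):        # HC: RNN over joint action-observation histories
    def __init__(self, obs_act_dim):
        super().__init__()
        self.rnn = nn.GRU(obs_act_dim, 64, batch_first=True)
        self.head = nn.Linear(64, 1)

    def forward(self, joint_history):  # shape: (batch, time, obs_act_dim)
        _, h_n = self.rnn(joint_history)
        return self.head(h_n[-1])

class HybridCritic(nn.Module):         # HSC: concatenate history embedding and state
    def __init__(self, obs_act_dim, state_dim):
        super().__init__()
        self.rnn = nn.GRU(obs_act_dim, 64, batch_first=True)
        self.head = nn.Sequential(nn.Linear(64 + state_dim, 64), nn.ReLU(),
                                  nn.Linear(64, 1))

    def forward(self, joint_history, state):
        _, h_n = self.rnn(joint_history)
        return self.head(torch.cat([h_n[-1], state], dim=-1))
```

In this sketch the GRU's final hidden state serves as the history embedding; the cited works may use different recurrent encoders or share encoders with the actors.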
3. Bias and Variance Analysis in Centralized-Critic Gradients
Substantial bias and variance effects emerge in the use of state-based centralized critics under partial observability:
- Bias: When multiple distinct joint histories $\mathbf{h}$ map onto the same state $s$ but yield different values-to-go, the state value used in the policy gradient, which averages over the histories consistent with $s$, does not match the correct history-based value $V(\mathbf{h})$. As established by rigorous analysis (Lyu et al., 2022, Lyu et al., 26 Aug 2024),

$$\mathbb{E}_{s \sim \Pr(\cdot \mid \mathbf{h})}\big[V(s)\big] \;\neq\; V(\mathbf{h}) \quad \text{in general},$$

leading to policy gradient bias (a minimal numerical illustration follows this list).
- Variance: Even when bias is eliminated (e.g., if all histories consistent with a given state are value-equivalent), the state-based estimator's variance is provably greater than or equal to the history-based estimator's variance, due to the additional randomness of the history-to-state mapping (Lyu et al., 2022):

$$\mathrm{Var}\big(\hat{g}_{\mathrm{state}}\big) \;\geq\; \mathrm{Var}\big(\hat{g}_{\mathrm{history}}\big),$$

where $\hat{g}_{\mathrm{state}}$ and $\hat{g}_{\mathrm{history}}$ denote policy gradient estimates computed with the state-based and history-based critics, respectively.
- Empirical effects: Domains such as Dec-Tiger and Box-Pushing (with strong information-gathering requirements and deep history dependencies) expose detrimental bias and variance effects, often causing state-based critics to converge to suboptimal policies, whereas history- or hybrid-based critics yield better performance and stability (Lyu et al., 2022, Lyu et al., 26 Aug 2024).
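The bias can be illustrated with a deliberately tiny, constructed example (not taken from the cited papers): two histories that alias to the same state but carry different values-to-go, so the state-based critic can only learn their mixture.

```python
# Toy illustration of state aliasing bias (values chosen for this article).
p_h1, p_h2 = 0.5, 0.5            # probability of each history given the aliased state
v_h1, v_h2 = 1.0, 0.0            # history-based values-to-go V(h1), V(h2)

v_s = p_h1 * v_h1 + p_h2 * v_h2  # best a state-based critic can learn: E[V(h) | s] = 0.5
bias_h1 = v_s - v_h1             # -0.5: gradient signal for trajectories through h1 is biased
bias_h2 = v_s - v_h2             # +0.5: likewise for h2

# The same aliasing is also the source of extra variance: the estimator inherits
# the randomness of the history-to-state mapping on top of the policy's own.
print(v_s, bias_h1, bias_h2)
```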
4. Empirical Evidence and Environment Properties
Empirical comparisons indicate that the architectural choice of the critic must match the environment's observability and history structure:
| Domain | Information Dependency | SC (State Critic) | HC (History Critic) | HSC (Hybrid) |
|---|---|---|---|---|
| SMAC (full/near-full obs) | Low (reactive optimal) | Best or tied | Tied | Tied |
| Meeting-in-a-Grid | Low | Fastest, optimal | Tied | Tied |
| Dec-Tiger, Box-Pushing, Cleaner | High (history critical) | Suboptimal | Outperforms SC | Best |
On tasks like the StarCraft Multi-Agent Challenge (SMAC) or cooperative navigation, where history dependence is minimal, state-based critics are effective and efficient. In contrast, in environments with necessary information-gathering or latent variable inference, only history- or hybrid-based critics avoid bias and attain optimal (or near-optimal) returns (Lyu et al., 2022, Lyu et al., 26 Aug 2024).
5. Centralized Critics in Asynchronous and Structured Multi-Agent Learning
The centralized-critic methodology is extended to domains that require asynchronous execution, temporally extended actions, or structured multi-agent cooperation. In asynchronous MARL, e.g., Mac-IAICC (Xiao et al., 2022), each agent executes macro-actions using its own decentralized actor but leverages a centralized, global-information critic at training time, with updates triggered by individual macro-action termination events. This design circumvents the need for agent synchronization and supports scalability in complex domains.
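As a rough illustration of this termination-triggered update pattern, the following is a simplified sketch under hypothetical environment and agent interfaces (env, agents, buffers); it is not the Mac-IAICC reference code.

```python
# Simplified sketch of asynchronous, termination-triggered updates:
# each agent updates only when its own macro-action ends, so agents
# need not stay synchronized. All interfaces here are hypothetical.
def run_episode(env, agents, central_critic):
    obs = env.reset()
    macro_actions = [agent.select_macro_action(o) for agent, o in zip(agents, obs)]
    done = False
    while not done:
        obs, reward, done, terminated_flags = env.step(macro_actions)
        for i, agent in enumerate(agents):
            agent.buffer.store(obs[i], macro_actions[i], reward)
            if terminated_flags[i]:              # agent i's macro-action terminated
                agent.update(central_critic)     # per-agent update with the central critic
                macro_actions[i] = agent.select_macro_action(obs[i])
```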
In structured domains, such as knowledge base editing (STACKFEED (Gupta et al., 14 Oct 2024)), a multi-actor, centralized-critic schema assigns each "document" to a separate actor, while a centralized critic decomposes global feedback into targeted per-actor updates. This design ensures credit assignment and mitigates nonstationarity among independently learning agents.
6. Design Guidelines, Practical Recommendations, and Limitations
The fundamental trade-offs summarized in recent analyses (Lyu et al., 2022, Lyu et al., 26 Aug 2024, Lyu et al., 2021) are as follows:
- When to Favor State-based Critics (SC)
- Observability is strong (local observations are almost sufficient to reconstruct state).
- Optimal policies are reactive (history does not provide additional value).
- Compact state representations accelerate critic representation learning.
- When to Favor History-based Critics (HC) or Hybrid Critics (HSC)
- Strong partial observability, information-gathering, or long-horizon dependencies.
- Environments with latent state, hidden variables, or necessity for belief-state tracking.
- Hybrid Approaches
- Combining state and history encodings, or interpolating between the state-based value $V(s)$ and the history-based value $V(\mathbf{h})$, can reduce both bias and variance.
- Potential Pitfalls
- Centralized critics are not universally beneficial; in some benchmarks, excessive variance or systematic bias in partial observability leads to degraded sample efficiency or failure to converge to optimal coordination.
- Scalability issues can arise in scenarios with large numbers of agents or high-dimensional state-action spaces.
In practice, it is advisable to ablate critic variants (SC, HC, HSC) on simplified instances of the target environment to estimate the degree to which histories beyond the current state affect optimal behavior and learning efficiency (Lyu et al., 2022); a hypothetical ablation harness is sketched below.
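A possible shape for such an ablation, assuming hypothetical helpers make_env and train_marl (neither taken from the cited papers) that build a simplified task instance and return an average episodic return:

```python
# Hypothetical ablation harness: train identical actors under each critic
# variant on a simplified task and compare mean returns across seeds.
critic_variants = ["SC", "HC", "HSC"]
results = {}
for variant in critic_variants:
    returns = [train_marl(make_env(simplified=True), critic=variant, seed=s)
               for s in range(5)]                 # a few seeds per variant
    results[variant] = sum(returns) / len(returns)

# A large SC-vs-HC/HSC gap suggests histories matter beyond the current state.
print(results)
```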
7. Open Challenges and Future Directions
Despite their centrality in current MARL practice, centralized-critic actor-critic architectures remain an area of active research:
- Systematic analysis of state-based critic bias and variance in large-scale, realistic partial observability scenarios is ongoing (Lyu et al., 2022, Lyu et al., 26 Aug 2024).
- Advanced network architectures—e.g., attention mechanisms, structured critics (see (Garrido-Lestache et al., 30 Jul 2025))—are being developed to address the limitations of existing methods, including improved scalability, communication, and credit assignment.
- Application-specific design, e.g., asynchronously updated critics or knowledge-editing with centralized critics (Xiao et al., 2022, Gupta et al., 14 Oct 2024), is expanding the operational capabilities and applicability of these methods.
- The development and empirical validation of hybrid critics, importance-sampling corrections, and variance-reduction techniques remain priority open problems to close the gap between theory and large-scale, high-dimensional MARL practice (Lyu et al., 2022, Lyu et al., 26 Aug 2024).
Centralized-critic actor-critic methods are thus characterized by a nuanced set of trade-offs; their design and deployment require careful matching between agent architecture, critic structure, and the partial observability properties of the target domain.