Centralized-Critic Actor-Critic Methods

Updated 10 December 2025
  • Centralized-critic actor-critic methods are defined as a MARL approach where a global critic guides locally executed policies.
  • They utilize various architectures—state-based, history-based, or hybrid—to balance bias and variance in partially observable settings.
  • Empirical studies show that the critic design critically affects learning efficiency and policy optimality in diverse multi-agent tasks.

Centralized-critic actor-critic methods constitute a pivotal approach in multi-agent reinforcement learning (MARL), particularly under the Centralized Training with Decentralized Execution (CTDE) paradigm. In these frameworks, each agent's policy (actor) is conditioned only on its own observation or history and operates independently at execution time, while training leverages a centralized critic with access to global information, such as the true system state or the joint action-observation histories. This separation enables the exploitation of centralized information during training for accelerated and stabilized learning, while ensuring decentralized, scalable operation at test time. Despite their empirical popularity, recent theoretical and empirical findings indicate that centralized critics—especially those based solely on state information—can introduce nontrivial bias and variance in policy gradient estimates in partially observable environments, and their benefits are environment-dependent (Lyu et al., 2022, Lyu et al., 26 Aug 2024, Lyu et al., 2021).

1. Formal Framework for Centralized-Critic Actor-Critic Methods

The standard setting for centralized-critic actor-critic methods in MARL is the Decentralized Partially Observable Markov Decision Process (Dec-POMDP). At each timestep $t$, the environment is in an unobserved global state $s_t \in \mathcal{S}$; each agent $i$ receives a private observation $o_{i,t}$ and executes action $a_{i,t}$ according to its decentralized policy $\pi_{\theta_i}(a_i \mid h_i)$, where $h_i$ is the agent's local action-observation history. Typically, the agents collaboratively optimize a shared, discounted cumulative reward:

$$J(\theta) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]$$

where $\theta$ collects all actor parameters.

During centralized training, the critic is parameterized as either a value function $V_\varphi(s)$ or a state-action value function $Q_\varphi(s, \vec{a})$, having access to the full system state $s$ or joint histories. Actor update rules—such as those following the policy gradient theorem—use the critic to evaluate actions and compute advantages, often via estimators of the form:

$$\nabla_{\theta_i} J(\theta) = \mathbb{E}\left[ \nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid h_i) \left( Q^\pi(s, \vec{a}) - b(s) \right) \right]$$

where $b(s)$ is a baseline derived from the central critic (Lyu et al., 2022, Lyu et al., 26 Aug 2024).
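
A minimal PyTorch sketch of one such update for a single agent, using a state-value baseline $b(s) = V_\varphi(s)$. The class and function names, layer sizes, and the specific advantage estimator are illustrative assumptions for exposition, not the exact algorithms of the cited papers.

```python
# Illustrative centralized-critic actor-critic update for one agent (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecentralizedActor(nn.Module):
    """Policy pi_theta_i(a_i | h_i): conditioned only on the agent's local history embedding."""
    def __init__(self, hist_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hist_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, h):
        # Returns a categorical policy over the agent's discrete actions.
        return torch.distributions.Categorical(logits=self.net(h))

class CentralizedCritic(nn.Module):
    """State-value baseline V_phi(s): sees the global state, used during training only."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, s):
        return self.net(s).squeeze(-1)

def update(actor, critic, actor_opt, critic_opt, batch, gamma=0.99):
    """One actor-critic step for a single agent on a batch of transitions."""
    h, a, r, s, s_next, done = batch  # local histories, actions, team rewards, global states
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * critic(s_next)   # bootstrapped return
    v = critic(s)
    advantage = (target - v).detach()                        # uses baseline b(s) = V_phi(s)
    actor_loss = -(actor(h).log_prob(a) * advantage).mean()  # policy gradient estimator
    critic_loss = F.mse_loss(v, target)

    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    return actor_loss.item(), critic_loss.item()
```

At execution time only the actor is used, conditioned on local information; the critic (and hence the global state) is required only inside `update`.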

2. Critic Architectures and Variants

Centralized critics can leverage various forms of available information, leading to several architecture variants:

| Critic Variant | Input Features | Example Usage |
| --- | --- | --- |
| State-based (SC) | Global state $s$ | SMAC, particle worlds (deterministic or nearly fully observable) (Lyu et al., 2022, Lyu et al., 26 Aug 2024) |
| History-based (HC) | Joint histories $h$ | Dec-Tiger, Box-Pushing (partial observability dominates) (Lyu et al., 2022, Lyu et al., 26 Aug 2024) |
| Joint-observation | $(o_1, \ldots, o_{|I|})$ | Cooperative navigation (Lyu et al., 2022, Lyu et al., 26 Aug 2024) |
| Hybrid (HSC) | $(h, s, a)$ | Combines information for improved bias/variance (Lyu et al., 2022) |

Architecturally, SC typically employs an MLP embedding of $s$; HC applies RNNs to full histories; HSC concatenates RNN and global state embeddings to inform value estimation (Lyu et al., 2022).
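
A schematic of the three critic variants in PyTorch. The GRU-based history encoder, layer sizes, and class names are assumptions for exposition rather than the exact architectures used in the cited papers.

```python
# Schematic SC / HC / HSC critics; dimensions and encoders are illustrative assumptions.
import torch
import torch.nn as nn

class StateCritic(nn.Module):               # SC: MLP over the global state s
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, s):                   # s: (batch, state_dim)
        return self.net(s).squeeze(-1)

class HistoryCritic(nn.Module):             # HC: RNN over the joint action-observation history
    def __init__(self, step_dim, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(step_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, h_seq):               # h_seq: (batch, time, step_dim)
        _, last = self.rnn(h_seq)           # last hidden state summarizes the history
        return self.head(last.squeeze(0)).squeeze(-1)

class HybridCritic(nn.Module):              # HSC: concatenated history and state embeddings
    def __init__(self, step_dim, state_dim, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(step_dim, hidden, batch_first=True)
        self.state_emb = nn.Linear(state_dim, hidden)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, h_seq, s):
        _, last = self.rnn(h_seq)
        z = torch.cat([last.squeeze(0), torch.relu(self.state_emb(s))], dim=-1)
        return self.head(z).squeeze(-1)
```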

3. Bias and Variance Analysis in Centralized-Critic Gradients

Substantial bias and variance effects emerge in the use of state-based centralized critics under partial observability:

  • Bias: When multiple distinct histories $h$ map onto the same state $s$ but yield different values-to-go, the expectation over $Q^\pi(s, a)$ used in the policy gradient does not match the correct history-based value $Q^\pi(h, a)$ (a toy numeric illustration follows this list). As established by rigorous analysis (Lyu et al., 2022, Lyu et al., 26 Aug 2024):

$$Q^\pi(h, a) \neq \mathbb{E}_{s \sim \rho(s \mid h)}\left[ Q^\pi(s, a) \right]$$

leading to policy gradient bias.

  • Variance: Even when bias is eliminated (e.g., if all histories mapping to a state are value-equivalent), the state-based estimator's variance is provably greater than or equal to that of the history-based estimator, due to the random mapping from histories to states (Lyu et al., 2022):

$$\mathrm{Var}\left[ Q^\pi(s,a)\, \nabla \log \pi(a \mid h) \right] \ \ge\ \mathrm{Var}\left[ Q^\pi(h,a)\, \nabla \log \pi(a \mid h) \right]$$

  • Empirical effects: Domains such as Dec-Tiger and Box-Pushing (with strong information-gathering requirements and deep history dependencies) expose detrimental bias and variance effects, often causing state-based critics to converge to suboptimal policies, whereas history- or hybrid-based critics yield better performance and stability (Lyu et al., 2022, Lyu et al., 26 Aug 2024).
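
The following toy calculation, with made-up values, illustrates the aliasing bias: two histories reach the same underlying state but carry different values-to-go, so even a perfectly trained state-based critic can only return their occupancy-weighted average.

```python
# Toy illustration (fabricated numbers) of state-aliasing bias in a state-based critic.
q_h = {"h1": 10.0, "h2": 0.0}            # true history-based values Q^pi(h, a)
p_h_given_s = {"h1": 0.5, "h2": 0.5}     # occupancy of each history given the aliased state s

# A perfectly trained state-based critic learns only the occupancy-weighted mixture value:
q_s = sum(p_h_given_s[h] * q_h[h] for h in q_h)   # = 5.0

for h in q_h:
    bias = q_s - q_h[h]
    print(f"history {h}: true Q(h,a) = {q_h[h]:.1f}, "
          f"state critic gives {q_s:.1f}, bias = {bias:+.1f}")

# The gradient at h1 is scaled by 5.0 instead of 10.0 (and at h2 by 5.0 instead of 0.0),
# so the state-based estimator is biased whenever aliased histories differ in value-to-go.
```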

4. Empirical Evidence and Environment Properties

Empirical comparisons indicate that the architectural choice of the critic must match the environment's observability and history structure:

| Domain | Information Dependency | SC (State Critic) | HC (History Critic) | HSC (Hybrid) |
| --- | --- | --- | --- | --- |
| SMAC (full/near-full observability) | Low (reactive policies optimal) | Best or tied | Tied | Tied |
| Meeting-in-a-Grid | Low | Fastest, optimal | Tied | Tied |
| Dec-Tiger, Box-Pushing, Cleaner | High (history critical) | Suboptimal | Outperforms SC | Best |

On tasks like the StarCraft Multi-Agent Challenge (SMAC) or cooperative navigation with minimal history dependence, state-based critics are effective and efficient. In contrast, in environments that require information-gathering or latent-variable inference, only history- or hybrid-based critics avoid bias and attain optimal (or near-optimal) returns (Lyu et al., 2022, Lyu et al., 26 Aug 2024).

5. Centralized Critics in Asynchronous and Structured Multi-Agent Learning

The centralized-critic methodology is extended to domains that require asynchronous execution, temporally extended actions, or structured multi-agent cooperation. In asynchronous MARL, e.g., Mac-IAICC (Xiao et al., 2022), each agent executes macro-actions using its own decentralized actor but leverages a centralized, global-information critic at training time, with updates triggered by individual macro-action termination events. This design circumvents the need for agent synchronization and supports scalability in complex domains.
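
The control flow below sketches this asynchronous update trigger. It is a hedged illustration of the general idea only: the environment and agent interfaces (`env.step`, `agent.select_macro_action`, `agent.update`) are hypothetical placeholders, not the Mac-IAICC implementation.

```python
# Sketch of asynchronous, per-agent updates triggered by macro-action termination.
# All environment/agent APIs here are hypothetical placeholders.
def run_episode(env, agents, critic, gamma=0.99):
    obs = env.reset()
    macro = [agent.select_macro_action(o) for agent, o in zip(agents, obs)]
    acc_reward = [0.0 for _ in agents]   # reward accumulated within each agent's macro-action
    elapsed = [0 for _ in agents]        # primitive steps since each macro-action began
    done = False
    while not done:
        obs, state, reward, terminated, done = env.step(macro)
        for i, agent in enumerate(agents):
            acc_reward[i] += (gamma ** elapsed[i]) * reward
            elapsed[i] += 1
            if terminated[i] or done:    # this agent's macro-action has ended
                # Per-agent update uses the shared centralized critic (global information),
                # so no synchronization across agents is required.
                agent.update(obs[i], macro[i], acc_reward[i], state, critic)
                macro[i] = agent.select_macro_action(obs[i])
                acc_reward[i], elapsed[i] = 0.0, 0
```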

In structured domains, such as knowledge base editing (STACKFEED (Gupta et al., 14 Oct 2024)), a multi-actor, centralized-critic schema assigns each "document" to a separate actor, while a centralized critic decomposes global feedback into targeted per-actor updates. This design ensures credit assignment and mitigates nonstationarity among independently learning agents.

6. Design Guidelines, Practical Recommendations, and Limitations

The fundamental trade-offs summarized in recent analyses (Lyu et al., 2022, Lyu et al., 26 Aug 2024, Lyu et al., 2021) are as follows:

  • When to Favor State-based Critics (SC)
    • Observability is strong (local observations are almost sufficient to reconstruct state).
    • Optimal policies are reactive (history does not provide additional value).
    • Accelerated representation learning via compact state representations.
  • When to Favor History-based Critics (HC) or Hybrid Critics (HSC)
    • Strong partial observability, information-gathering, or long-horizon dependencies.
    • Environments with latent state, hidden variables, or necessity for belief-state tracking.
  • Hybrid Approaches
    • Combining state and history encodings or interpolating between $V(s)$ and $V(h)$ can reduce both bias and variance.
  • Potential Pitfalls
    • Centralized critics are not universally beneficial; in some benchmarks, excessive variance or systematic bias in partial observability leads to degraded sample efficiency or failure to converge to optimal coordination.
    • Scalability issues can arise in scenarios with large numbers of agents or high-dimensional state-action spaces.

In practice, it is advisable to ablate critic variants (SC, HC, HSC) on simplified instances of the target environment to estimate how much histories beyond the current state affect optimal behavior and learning efficiency (Lyu et al., 2022).

7. Open Challenges and Future Directions

Despite their centrality in current MARL practice, centralized-critic actor-critic architectures remain an area of active research:

  • Systematic analysis of state-based critic bias and variance in large-scale, realistic partial observability scenarios is ongoing (Lyu et al., 2022, Lyu et al., 26 Aug 2024).
  • Advanced network architectures—e.g., attention mechanisms and structured critics (see Garrido-Lestache et al., 30 Jul 2025)—are being developed to address the limitations of existing methods, including improved scalability, communication, and credit assignment.
  • Application-specific design, e.g., asynchronously updated critics or knowledge-editing with centralized critics (Xiao et al., 2022, Gupta et al., 14 Oct 2024), is expanding the operational capabilities and applicability of these methods.
  • The development and empirical validation of hybrid critics, importance-sampling corrections, and variance-reduction techniques remain priority open problems to close the gap between theory and large-scale, high-dimensional MARL practice (Lyu et al., 2022, Lyu et al., 26 Aug 2024).

Centralized-critic actor-critic methods are thus characterized by a nuanced set of trade-offs; their design and deployment require careful matching between agent architecture, critic structure, and the partial observability properties of the target domain.
