Centralized Critic Techniques in MARL
- Centralized Critic Techniques are methods in multi-agent reinforcement learning that use a global critic to provide joint value estimations under the CTDE framework.
- They incorporate diverse critic architectures such as state-based, history-based, and attention-based models to effectively assign credit and guide policy updates.
- While enhancing coordination and credit assignment across agents, these techniques also introduce challenges like higher gradient variance and scalability issues.
A centralized critic is a component in multi-agent reinforcement learning (MARL) architectures that, during training, is granted access to the global system state, the actions of all agents, or the joint observation–action history. It provides joint value estimation or advantage information for optimizing decentralized agents’ policies. Centralized critic techniques are central to the “Centralized Training for Decentralized Execution” (CTDE) paradigm, allowing the critic to exploit global context offline while actors retain decentralized policies for online execution. Under this design, the centralized critic need not be present at execution; only the learned decentralized actors operate using their local observations or histories.
1. Formal Problem Setting and Definitions
Centralized critic techniques are most commonly studied in Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs), where each agent $i$ executes actions based solely on its local observation history $h_i$. The environment state evolves according to a transition kernel $T(s' \mid s, \mathbf{a})$ under the joint action $\mathbf{a} = (a_1, \dots, a_n)$, and agents receive a shared or individual reward $r(s, \mathbf{a})$ at each step (Lyu et al., 2024).
In the CTDE setup, the critic is trained by conditioning on additional global information (the joint observation history $\mathbf{h}$, the full global state $s$, or all agents' actions $\mathbf{a}$), while each agent's policy is constrained to depend only on local observations or histories (Lowe et al., 2017, Iqbal et al., 2018, Xiao et al., 2021).
Centralized critics can be further classified by the domain on which they operate:
- State-based centralized critic: $Q(s, \mathbf{a})$, where $s$ is the true global state and $\mathbf{a}$ the joint action.
- History-based centralized critic: $Q(\mathbf{h}, \mathbf{a})$, where $\mathbf{h}$ is the full joint observation–action history.
- Hybrid or decomposed critics: functions such as $Q_{\mathrm{tot}} = f\big(Q_1(h_1, a_1), \dots, Q_n(h_n, a_n)\big)$, which are monotonic or nonlinear decompositions over per-agent critics (Wang et al., 24 Nov 2025).
The standard actor–critic gradient objective under a centralized critic is

$$\nabla_{\theta_i} J(\theta_i) = \mathbb{E}\!\left[ \nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid h_i)\, Q(\mathbf{x}, \mathbf{a}) \right],$$

where $\mathbf{x}$ summarizes the full joint information available during centralized training.
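For a discrete action space with a softmax actor, the per-sample gradient term above can be sketched as follows. This is a minimal illustration, not any specific published implementation: the logits, chosen action, and critic value are hypothetical placeholders, and a real system would obtain the gradient via an autodiff framework rather than the closed-form softmax derivative used here.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def centralized_pg_term(logits, action, q_value):
    """Per-sample policy-gradient term for one agent: the gradient of
    log pi(a | h) w.r.t. the logits, weighted by the centralized
    critic's value Q(x, a)."""
    probs = softmax(logits)
    # d/d logit_k of log pi(action) = 1[k == action] - pi(k)
    grad_logp = [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]
    return [q_value * g for g in grad_logp]

# Uniform two-action policy, centralized critic says Q = 2.0 for this joint action.
grad = centralized_pg_term([0.0, 0.0], action=1, q_value=2.0)
# -> [-1.0, 1.0]: the sampled action's logit is pushed up, the other down.
```

A positive centralized value raises the probability of the sampled local action; the same local update rule is shared by state-based and history-based critics, which differ only in what $Q$ conditions on.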
2. Theoretical Properties: Bias, Variance, and Validity
Critic centralization presents nontrivial effects on the bias and variance of gradient estimators in partially observable environments. Key results (Lyu et al., 2024, Lyu et al., 2021) are as follows:
- Unbiasedness and Existence:
- In fully observable settings or when all agents are reactive (policies depend only on the current observation), state-based critics yield unbiased policy gradients and coincide with history-based critics.
- In general partially observable environments with non-reactive (history-dependent) agents, $Q(s, \mathbf{a})$ may be ill-defined; future transitions depend on the hidden history, so $Q(\mathbf{h}, \mathbf{a})$ is required for unbiasedness.
- Employing $Q(s, \mathbf{a})$ in these settings introduces bias: it erases dependencies beyond the current state $s$, systematically misweighting policy contributions.
- Variance Trade-offs:
- Centralized critics have strictly higher variance in the per-sample policy gradient as joint histories/actions must be sampled, increasing the number of stochastic variables and, hence, estimator variance (Lyu et al., 2021).
- Decentralized critics—though potentially more biased due to stale or inaccurate local value approximations—provide lower-variance gradients for each actor.
- Practical Implications:
- Use history-based critics whenever agents’ policies operate over non-Markovian joint histories.
- State-based centralized critics are justified (unbiased) in fully observable or fully reactive settings; in all other cases, they risk introducing bias and extra variance.
3. Architectural Patterns and Algorithmic Realizations
Centralized critic techniques manifest in multiple major MARL algorithmic families, often in coordination with decentralized actors. The most prominent patterns include:
- Independent Actor Centralized Critic (IACC): Each agent has a decentralized policy $\pi_{\theta_i}(a_i \mid h_i)$, while the critic $Q_\phi(s, \mathbf{a})$ is trained over the full global state and joint action. Each policy is updated via the centralized advantage information (Amato, 2024, Lowe et al., 2017).
- Counterfactual Credit Assignment (COMA-style): A centralized critic estimates $Q(s, \mathbf{a})$, and an advantage $A_i(s, \mathbf{a}) = Q(s, \mathbf{a}) - \sum_{a_i'} \pi_i(a_i' \mid h_i)\, Q(s, (a_i', \mathbf{a}_{-i}))$ is computed by marginalizing or sampling counterfactual actions for each agent, providing fine-grained local credit assignment (Amato, 2024).
- Attention-based Centralized Critics: Mechanisms such as MAAC and TAAC integrate multi-headed attention within the critic to focus on crucial inter-agent relations, efficiently handling joint spaces (Garrido-Lestache et al., 30 Jul 2025, Iqbal et al., 2018).
- Decomposition-based and Value Factorization Critics: Techniques like VDN, QMIX, and MCEM-NCD decompose the joint critic into per-agent value functions with monotonic mixing or nonlinear transformations, balancing expressivity, decentralized policy gradients, and sample efficiency (Wang et al., 24 Nov 2025).
- Centralized Option-Critic: Involves joint option-evaluation using global belief states and decentralized intra-option policy improvement to capture temporal abstractions in cooperative Dec-POMDPs (Chakravorty et al., 2019).
Pseudocode for these architectures typically alternates between critic updates using global (joint) transitions and actor updates using policy gradients weighted by the centralized advantage or value signal (Lowe et al., 2017, Amato, 2024, Xiao et al., 2021).
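The counterfactual credit-assignment pattern can be made concrete with a minimal sketch. The joint-action values and policy probabilities below are hypothetical; in COMA proper the critic network outputs $Q(s, (a_i', \mathbf{a}_{-i}))$ for all of agent $i$'s actions in one forward pass, which this toy version represents as a precomputed list.

```python
def coma_advantage(q_counterfactuals, pi_i, agent_action):
    """COMA-style counterfactual advantage for agent i.
    q_counterfactuals[a] = Q(s, (a, a_{-i})) with the other agents'
    actions held fixed; pi_i[a] = agent i's policy probability for a."""
    # Counterfactual baseline: expected Q under agent i's own policy.
    baseline = sum(p * q for p, q in zip(pi_i, q_counterfactuals))
    return q_counterfactuals[agent_action] - baseline

# Agent i took action 1; swapping only its action changes Q from 1.0 to 3.0.
adv = coma_advantage(q_counterfactuals=[1.0, 3.0], pi_i=[0.5, 0.5], agent_action=1)
# -> 1.0 (Q of the taken joint action minus the 2.0 counterfactual baseline)
```

Because only agent $i$'s action is varied in the baseline, the advantage isolates that agent's marginal contribution to the joint value, which is the credit-assignment property the bullet above describes.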
| Critic Type | Input | Advantage/Update Target |
|---|---|---|
| State-based | $(s, \mathbf{a})$ | TD($\lambda$) target on $Q(s, \mathbf{a})$ |
| History-based | $(\mathbf{h}, \mathbf{a})$ | TD($\lambda$) target on $Q(\mathbf{h}, \mathbf{a})$ |
| Attention-based | Attention over per-agent encodings | Per-agent $Q_i(\mathbf{o}, \mathbf{a})$ |
| Decomposition (QMIX, MCEM-NCD) | Per-agent utilities $Q_i(h_i, a_i)$ | Monotonic/nonlinear mixing function |
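The decomposition row can be illustrated with a minimal monotonic mixing sketch. This is a simplification: QMIX generates the non-negative mixing weights with state-conditioned hypernetworks, whereas the toy version below takes fixed weights and enforces non-negativity with an absolute value.

```python
def qmix_mix(per_agent_qs, weights, bias):
    """Monotonic mixing of per-agent utilities into a joint value:
    Q_tot = sum_i |w_i| * Q_i + b. Forcing the weights non-negative
    guarantees dQ_tot/dQ_i >= 0, so each agent's local argmax is
    consistent with the joint argmax (the IGM property)."""
    return sum(abs(w) * q for w, q in zip(weights, per_agent_qs)) + bias

q_tot = qmix_mix([1.0, 2.0], weights=[-0.5, 1.0], bias=0.5)
# -> 3.0, i.e. 0.5 * 1.0 + 1.0 * 2.0 + 0.5

# Monotonicity check: raising any per-agent utility never lowers Q_tot.
q_tot_up = qmix_mix([1.5, 2.0], weights=[-0.5, 1.0], bias=0.5)
```

This monotonicity is what lets each agent act greedily on its own $Q_i(h_i, a_i)$ at execution time while training against the joint target.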
4. Practical Implementation: Representation, Scalability, and Training
Critical practical challenges include:
- Representation Learning Under Partial Observability: Accurately approximating $Q(\mathbf{h}, \mathbf{a})$ in partially observed environments requires powerful recurrent or memory-augmented architectures to encode long, high-dimensional joint histories. The need for expressivity often justifies architectures with recurrent networks or belief-state estimators (Lyu et al., 2024).
- Critic Network Scaling: State/joint-history-based critics must process high-dimensional global input, causing computational and statistical challenges as the number of agents grows. Attention mechanisms alleviate enumeration costs by focusing on salient agent dependencies (Garrido-Lestache et al., 30 Jul 2025, Iqbal et al., 2018).
- Variance and Bias from Distributional Mismatch: The weighting of state-action pairs arising from the geometric discount can mismatch empirical sample frequencies, subtly shifting gradient scales (Lyu et al., 2024).
- Decentralized Execution: Only decentralized actor policies are retained after training; the critic operates solely during training (Amato, 2024, Lowe et al., 2017).
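The distributional-mismatch point can be made concrete: the discounted objective weights the timestep-$t$ state by $(1-\gamma)\gamma^t$, while a replay buffer samples all stored transitions uniformly. A small sketch of the two weightings over a single episode (the horizon and discount are arbitrary illustrative values):

```python
def discounted_weights(T, gamma):
    """Normalized discounted time weights, proportional to gamma^t,
    over a T-step episode. Contrast with the uniform 1/T weighting
    implied by uniformly sampling a replay buffer."""
    raw = [gamma ** t for t in range(T)]
    z = sum(raw)
    return [w / z for w in raw]

w = discounted_weights(T=4, gamma=0.9)
# Early timesteps carry more weight than the uniform 1/4 = 0.25,
# late timesteps less, so uniform replay subtly rescales the gradient.
```

Whether to correct for this (e.g. by importance-weighting samples by $\gamma^t$) is a design decision most implementations leave implicit.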
Implementation (sample pseudocode) for centralized critic plus decentralized actors typically consists of:
- Sampling transitions with all agents acting per their policies.
- Computing joint value/advantage via the centralized critic.
- Updating actor parameters using policy gradients with the centralized advantage.
- Updating the critic via temporal-difference regression over the replay buffer.
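The four steps above can be sketched end-to-end on a one-shot cooperative matrix game. Everything here is an illustrative toy, not any specific published algorithm: the reward matrix, learning rates, and tabular centralized critic are hypothetical choices, and the critic is regressed directly to the reward because the task is single-step.

```python
import math, random

random.seed(0)

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    z = sum(e)
    return [x / z for x in e]

def sample(probs):
    r, c = random.random(), 0.0
    for a, p in enumerate(probs):
        c += p
        if r < c:
            return a
    return len(probs) - 1

# Cooperative one-shot game: reward 1 only if both agents pick action 1.
R = [[0.0, 0.0], [0.0, 1.0]]

logits = [[0.0, 0.0], [0.0, 0.0]]   # decentralized actors (local parameters)
Q = [[0.0, 0.0], [0.0, 0.0]]        # centralized critic over joint actions
alpha_q, alpha_pi = 0.5, 0.2

for _ in range(2000):
    # 1) Sample a transition with all agents acting per their policies.
    pis = [softmax(l) for l in logits]
    a = [sample(p) for p in pis]
    r = R[a[0]][a[1]]
    # 2) + 4) Critic update: regress Q(a1, a2) toward the observed reward.
    Q[a[0]][a[1]] += alpha_q * (r - Q[a[0]][a[1]])
    # 3) Actor updates: policy gradient weighted by the centralized value.
    for i in (0, 1):
        for k in range(2):
            grad_logp = (1.0 if k == a[i] else 0.0) - pis[i][k]
            logits[i][k] += alpha_pi * Q[a[0]][a[1]] * grad_logp

final = [softmax(l) for l in logits]
# Both decentralized actors concentrate on action 1, the cooperative optimum.
```

Note that only `logits` would be retained for execution; `Q` exists purely as a training-time artifact, matching the decentralized-execution bullet above.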
5. Empirical Results and Benchmark Findings
A variety of empirical studies demonstrate the advantages and limitations of centralized critic techniques compared to fully decentralized critics across benchmark tasks:
- Centralized critics confer major benefits on coordination-heavy tasks, especially where agents must align actions for joint objectives (e.g., cooperative navigation, physical deception, multi-agent communication) (Lowe et al., 2017, Iqbal et al., 2018).
- Attention-based centralized critics (MAAC, TAAC) yield superior collaboration metrics and higher win rates in multi-agent competitive domains by efficiently focusing value estimation on relevant agent interactions (Garrido-Lestache et al., 30 Jul 2025, Iqbal et al., 2018).
- In settings where individual agent rewards are weakly aligned or credit-assignment is ambiguous, centralized critics with counterfactual or attention-based mechanisms solve previously intractable scenarios (Iqbal et al., 2018, Chakravorty et al., 2019).
- Critic centralization does not universally improve stability; empirical evaluation shows that high sample variance and run-to-run instability can manifest under centralized critic updates—a cost that grows with the number of agents or under partial observability if the critic fails to capture all dependencies (Lyu et al., 2021).
- Hybrid methods that distill the centralized critic into local value functions (e.g., ROLA) achieve lower variance and higher robustness, exploiting both central information and local credit isolation (Xiao et al., 2021).
Empirical limitations include increased input dimensionality for the critic as scale grows, potential privacy bottlenecks from centralized data requirements, and degraded performance when train–test observability mismatches occur (Lee et al., 2020, Amato, 2024).
6. Extensions, Decompositions, and Specialized Techniques
Recent research has introduced extensions and refinements to address the sample complexity, scalability, and expressivity challenges of centralized critic techniques:
- Value Decomposition Networks (VDN/QMIX/MCEM-NCD): These methods decompose $Q_{\mathrm{tot}}$ into a sum or nonlinear monotonic mix of per-agent values, enabling decentralized execution and policy updates, and resolving the centralized–decentralized mismatch (CDM) (Wang et al., 24 Nov 2025).
- Cross-Entropy and Elite Sampling: MCEM-NCD uses percentile-greedy cross-entropy updates to select agent actions that contribute to high-value joint behavior, mitigating deleterious gradient effects from suboptimal team members (Wang et al., 24 Nov 2025).
- Quantum Centralized Critics: QMACN replaces classical neural networks with parameterized quantum circuits for the centralized value estimator, achieving higher capacity-to-parameter ratios and superior robustness in noisy or adversarial environments (Park et al., 2023).
- Application in Structured Non-Standard Domains: STACKFEED adapts centralized critic techniques to multi-document knowledge base editing, with the critic aggregating global feedback across documents and mediating low-variance, reflection-based credit assignment to ReACT-style document-specific actors (Gupta et al., 2024).
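The elite-sampling idea can be sketched as a percentile-greedy cross-entropy update over a categorical distribution on joint actions. The value table, sample count, and elite fraction below are hypothetical, and MCEM-NCD's full machinery (per-agent factorization, critic learning) is omitted; only the percentile-greedy refit step is shown.

```python
import random

random.seed(1)

def sample_cat(probs):
    r, c = random.random(), 0.0
    for a, p in enumerate(probs):
        c += p
        if r < c:
            return a
    return len(probs) - 1

def cem_step(values, probs, n_samples=200, elite_frac=0.2):
    """One percentile-greedy cross-entropy update: sample joint actions,
    keep the highest-value fraction (the elites), and refit the
    categorical distribution to the elite set. A small count floor
    keeps every action's probability strictly positive."""
    samples = [sample_cat(probs) for _ in range(n_samples)]
    samples.sort(key=lambda a: values[a], reverse=True)
    elites = samples[: max(1, int(elite_frac * n_samples))]
    counts = [1e-3] * len(probs)
    for a in elites:
        counts[a] += 1.0
    z = sum(counts)
    return [c / z for c in counts]

values = [0.1, 0.5, 2.0]     # hypothetical Q(s, a) over three joint actions
probs = [1 / 3] * 3
for _ in range(10):
    probs = cem_step(values, probs)
# The distribution concentrates on the highest-value joint action,
# screening out updates driven by low-value (suboptimal) team behavior.
```

Fitting only to elites is what insulates the update from gradients contributed by poor joint actions, the effect attributed to MCEM-NCD above.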
7. Design Guidelines, Limitations, and Open Problems
Centralized critic techniques are not universally optimal. Key guidelines for practitioners, drawn from both theory and experiment, include (Lyu et al., 2024, Lyu et al., 2021, Amato, 2024):
- Prefer history-based centralized critics when actor policies depend on extended observation histories; these preserve unbiased gradients and reflect the true decision process.
- State-based centralized critics are only justified when policies are fully reactive or the global state is Markov (i.e., fully observable).
- Centralized critics provide improved credit assignment and reduce non-stationarity, but their gradients are higher-variance than decentralized critics, especially as the agent population scales or under deep PO.
- In large systems or those with decentralized privacy/data constraints, value-decomposition or attention-based architectures offer tractable alternatives.
- Implementations must account for the mismatch between discount-driven state weighting and empirical sampling frequencies, a subtlety in the practical optimization landscape.
- Open questions persist regarding the formal construction of valid state-based critics in non-Markovian settings, the design of hybrid critics that maintain correctness and mitigate variance, and scalable, expressive, and robust critic architectures for domains with complex, high-dimensional histories.
A plausible implication is that despite their ubiquity and high empirical performance, centralized critic techniques demand careful alignment between problem observability, policy structure, and representation power to avoid subtle biases and excess variance in multi-agent reinforcement learning.