
Centralized Value Learning in MARL

Updated 31 January 2026
  • Centralized value learning is a methodology in multi-agent reinforcement learning where centralized critics leverage global information to optimize decentralized policies under CTDE.
  • It addresses the bias–variance tradeoff by comparing state-based and history-based critics, with design choices impacting sample efficiency and coordination.
  • Algorithmic frameworks such as VDN, QMIX, and SMIX(λ) showcase its practical application and performance in various cooperative multi-agent tasks.

Centralized value learning refers to the family of methodologies in multi-agent reinforcement learning (MARL) in which value functions (typically state- or action-value functions) are learned with access to global information during training, in order to coordinate the optimization of decentralized policies. The paradigm is most prominently instantiated in Centralized Training with Decentralized Execution (CTDE), now the standard setting for cooperative MARL. Within CTDE, the value function (critic) may condition on the full joint state, joint action-observation histories, or all agents' actions, and can be decomposed, mixed, or distilled to support tractable and scalable decentralized execution.

1. Formal Foundations and Value Function Structures

Centralized value learning operates over the formalism of Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs), involving $N$ agents, a global state space $S$, a joint action space $A = A_1 \times \dots \times A_N$, and agent-specific observation processes. Each agent's local policy is $\pi^i_\theta(a^i_t \mid h^i_t)$, where $h^i_t$ is its own action-observation history; the joint policy is $\pi_\theta(a_t \mid h_t) = \prod_{i=1}^N \pi^i_\theta(a^i_t \mid h^i_t)$. Centralized value functions, trained with global information, may condition on either:

  • Full state: $V(s_t)$ or $Q(s_t, a_t)$
  • Joint history: $Q(h_t, a_t)$

History-based critics $Q(h, a)$ are unbiased but high-variance estimators; state-based critics $Q(s, a)$ can be biased under partial observability but generally offer lower variance. During execution, decentralized policies $\pi^i_\theta$ are used, constrained to local histories $h^i_t$ (Lyu et al., 2024).
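The joint-policy factorization above is what makes decentralized execution possible: each agent acts from its own history, yet the joint action probability is a simple product. A minimal tabular NumPy sketch (all shapes and parameter tables are hypothetical, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

def agent_policy(theta_i, h_i):
    """Softmax policy pi^i_theta(a^i | h^i) over a discrete action set.

    theta_i: (n_histories, n_actions) logits table; h_i: local history index.
    """
    logits = theta_i[h_i]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def joint_policy_prob(thetas, histories, joint_action):
    """pi_theta(a | h) = prod_i pi^i_theta(a^i | h^i): the Dec-POMDP
    factorization that enables decentralized execution."""
    prob = 1.0
    for theta_i, h_i, a_i in zip(thetas, histories, joint_action):
        prob *= agent_policy(theta_i, h_i)[a_i]
    return prob

# Two agents, 3 local histories each, 2 actions each (hypothetical sizes).
thetas = [rng.normal(size=(3, 2)) for _ in range(2)]
p = joint_policy_prob(thetas, histories=[0, 2], joint_action=(1, 0))
assert 0.0 < p < 1.0
```

A centralized critic would additionally see the global state or joint history during training, but only these local policies are deployed at execution time.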

Value decomposition schemes such as VDN hypothesize an additive decomposition $Q_{tot}(\tau, a) \approx \sum_{i=1}^N Q_i(\tau^i, a^i)$, facilitating decentralized execution by enabling local argmax action selection (Sunehag et al., 2017).
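The key consequence of additivity can be seen in a tiny tabular sketch (hypothetical per-agent utilities, not the paper's network architecture): because $Q_{tot}$ is a sum of per-agent terms, the joint greedy action decomposes into independent local argmaxes:

```python
import numpy as np

# Hypothetical per-agent utilities Q_i(tau^i, a^i) for 2 agents, 3 actions each,
# at a fixed pair of local histories.
q1 = np.array([1.0, 3.0, 2.0])
q2 = np.array([0.5, 0.0, 4.0])

# Decentralized greedy selection: each agent maximizes its own utility.
local_greedy = (int(np.argmax(q1)), int(np.argmax(q2)))

# Centralized check: argmax over the joint action space of Q_tot = Q_1 + Q_2.
q_tot = q1[:, None] + q2[None, :]
joint_greedy = tuple(int(x) for x in np.unravel_index(np.argmax(q_tot), q_tot.shape))

# Under additive decomposition the two coincide: (1, 2) here.
assert local_greedy == joint_greedy
```

This is exactly why VDN agents can act greedily on local information at execution time without evaluating the exponentially large joint action space.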

2. Theoretical Bias–Variance Tradeoffs of Centralized Critics

A central theoretical insight is the bias–variance tradeoff inherent in centralized value learning under CTDE. The true policy gradient uses a critic $Q^*(h_t, a_t)$ conditioned on the information actually available to the policy, while state-based centralized critics $Q^c(s_t, a_t)$ can be biased if $h_t$ does not resolve $s_t$, i.e., under partial observability. The gradient bias is quantified as:

B=Eht,at[θlogπθ(atht)ΔQ(ht,at)],ΔQ(h,a)=E[Qc(s,a)h,a]Q(h,a)B = \mathbb{E}_{h_t,a_t}\left[\nabla_\theta\log\pi_\theta(a_t|h_t)\Delta_Q(h_t,a_t)\right], \quad \Delta_Q(h,a) = \mathbb{E}[Q^c(s,a)|h,a] - Q^*(h,a)

Variance analysis shows that centralized critics generally entail lower estimator variance, as conditioning on the global state $s$ provides more information than conditioning on the history $h$, but this comes at the cost of potential bias, which may impact convergence and final policy quality (Lyu et al., 2024).
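A toy tabular illustration of the bias term $\Delta_Q$ (all numbers hypothetical, chosen only to make the aliasing concrete): when two latent states map to the same history $h$, the posterior average of a state-based critic can disagree with the history-based value:

```python
import numpy as np

# Two latent states alias to one history h; posterior Pr(s | h, a).
p_s_given_h = np.array([0.7, 0.3])

# Hypothetical learned state-based critic values Q^c(s, a) for one action a.
q_c = np.array([2.0, -1.0])

# Hypothetical true history-based value Q^*(h, a) for the same (h, a).
q_star = 1.4

# Delta_Q(h, a) = E[Q^c(s, a) | h, a] - Q^*(h, a)
delta_q = float(p_s_given_h @ q_c) - q_star  # approx -0.3 for these numbers
```

A nonzero $\Delta_Q$, weighted by the score function $\nabla_\theta \log \pi_\theta$, is what produces the gradient bias $B$; with full observability each history pins down one state and the term vanishes.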

Empirical evaluation confirms:

  • In domains with mild partial observability, state-based critics accelerate learning due to reduced variance.
  • Under severe partial observability (e.g., SMAC 5m_vs_6m), centralized critics incur bias, leading to poor coordination and lower returns compared to unbiased history-based critics.
  • Hybrid solutions, such as regularizing the state-based critic towards posterior consistency over $\Pr(s \mid h)$, strike a balance between bias and variance (Lyu et al., 2024).

3. Algorithmic Realizations in CTDE, Decomposition, and Mixing

Centralized value learning is implemented in multiple algorithmic frameworks:

  • CTDE Actor–Critic: Actors maintain decentralized policies, while the critic (e.g., in MADDPG or COMA) is centralized. Training uses replay buffers and target networks for stability; only the decentralized actors are used in execution (Lyu et al., 2024).
  • Value-Decomposition Networks (VDN): Joint Q-values are approximated additively, enabling decentralized greedy action selection at execution. Architectural extensions incorporate shared weights, role IDs, and communication channels to improve credit assignment and robustness (Sunehag et al., 2017).
  • Mixing Networks (QMIX, SMIX($\lambda$)): QMIX employs monotonic mixing, making the centralized $Q$ monotonically increasing in each agent's utility, while SMIX($\lambda$) leverages multi-step $\lambda$-returns to stabilize off-policy learning under the curse of dimensionality, reducing bias and variance without explicit importance sampling (Yao et al., 2019).
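The monotonicity constraint behind QMIX-style mixing can be sketched with a single hypothetical mixing layer that forces its weights to be nonnegative (real QMIX generates state-dependent weights via hypernetworks; this is only the core constraint):

```python
import numpy as np

def monotonic_mix(agent_qs, w, b):
    """One mixing layer: Q_tot = |w| . agent_qs + b.

    Taking the absolute value of the weights enforces
    dQ_tot/dQ_i = |w_i| >= 0, the QMIX monotonicity constraint.
    """
    return float(np.abs(w) @ np.asarray(agent_qs) + b)

w = np.array([-0.5, 1.2])   # raw weights may be negative...
b = 0.1

q_low  = monotonic_mix([1.0, 2.0], w, b)
q_high = monotonic_mix([1.5, 2.0], w, b)  # raise one agent's utility
assert q_high >= q_low  # ...but Q_tot never decreases in any Q_i
```

Monotonicity is what guarantees that the argmax of the mixed $Q_{tot}$ is consistent with per-agent argmaxes, preserving decentralized greedy execution.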

A significant advancement is the use of off-policy critics that generalize across policies. Centralized value learning via probing-state fingerprints trains a single critic that predicts the return of any policy by embedding the policy’s actions on a learned set of probing states. This supports sample-efficient learning, architecture invariance, and meta-RL applications (Faccio et al., 2022).
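A schematic sketch of the fingerprint idea (shapes, names, and the linear critic are all hypothetical stand-ins for the learned components): a policy is embedded by the actions it produces on a fixed set of probing states, and the critic maps that fingerprint to a predicted return, independent of the policy's architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

N_PROBE, STATE_DIM, ACTION_DIM = 4, 3, 2
probing_states = rng.normal(size=(N_PROBE, STATE_DIM))  # learned in the real method

def fingerprint(policy):
    """Embed a policy by concatenating its actions on the probing states."""
    return np.concatenate([policy(s) for s in probing_states])

def critic(fp, W, b):
    """Linear stand-in for the return predictor V(fingerprint)."""
    return float(W @ fp + b)

# Two hypothetical deterministic policies. The critic only ever sees the
# fingerprint, never the policy parameters, so any architecture works.
policy_a = lambda s: np.tanh(s[:ACTION_DIM])
policy_b = lambda s: -np.tanh(s[:ACTION_DIM])

W = rng.normal(size=N_PROBE * ACTION_DIM)
b = 0.0
v_a = critic(fingerprint(policy_a), W, b)
v_b = critic(fingerprint(policy_b), W, b)
```

Because the critic is a function of behavior on probing states rather than of parameters, a single critic can score arbitrary candidate policies, which is the source of the sample-efficiency and meta-RL benefits claimed above.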

4. Empirical Findings and Domain-Specific Performance

Benchmark evaluation spans particle environments, SMAC, tabular gridworlds, and multi-robot navigation:

  • Centralized critics (COMA, MADDPG) improve sample efficiency and credit assignment in small-scale or mildly partially-observable domains, but may not outperform decomposed approaches (VDN, QMIX) as team size and environment complexity grow (Amato, 2024, Sunehag et al., 2017).
  • CTDS (Centralized Teacher-Decentralized Student) distillation robustly transfers global information to decentralized agents, yielding faster convergence and higher win rates over baseline mixing methods in StarCraft II micromanagement tasks (Zhao et al., 2022).
  • SMIX($\lambda$) achieves state-of-the-art win rates on SMAC, scaling to 25 agents, and can be applied plug-and-play to improve existing CTDE algorithms by enhancing the central critic update (Yao et al., 2019).
  • In tabular settings with explicit embodiment constraints, centralized value learning can underperform fully independent (decentralized) Q-learning, especially under asymmetric agent roles or tight kinematic constraints. Mixed centralized-independent schemes manifest persistent coordination failures (Atif et al., 24 Jan 2026).
  • Centralized state-value learning in dueling networks (GDQ) outperforms QMIX/VDN in mapless multi-robot navigation, demonstrating improved sample efficiency and lower collision rates (Marchesini et al., 2021).
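The multi-step $\lambda$-return that SMIX($\lambda$) builds its critic targets from can be computed by a backward recursion; a minimal sketch on hypothetical reward and value sequences (array conventions are this sketch's own, not the paper's):

```python
import numpy as np

def lambda_returns(rewards, next_values, gamma, lam):
    """G^lam_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G^lam_{t+1}).

    next_values[t] holds V(s_{t+1}); at t = T-1 it is the bootstrap value.
    lam = 0 recovers one-step TD targets; lam = 1 the bootstrapped MC return.
    """
    T = len(rewards)
    G = np.empty(T)
    G[-1] = rewards[-1] + gamma * next_values[-1]
    for t in reversed(range(T - 1)):
        G[t] = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * G[t + 1])
    return G

rewards = np.array([1.0, 0.0, 2.0])
next_values = np.array([0.5, 1.0, 0.0])
targets = lambda_returns(rewards, next_values, gamma=0.9, lam=0.8)
```

Interpolating between the two extremes with $\lambda$ is precisely the bias/variance knob the method uses in place of explicit importance sampling.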

5. Architectural Design Choices and Practical Guidance

Key design choices for centralized value learning include:

  • Selection between state-based and history-based critics, guided by observability and coordination demands.
  • Degree of value factorization—additive (VDN), monotonic mixing (QMIX), non-linear (QTRAN, QPLEX)—to balance expressiveness, tractability, and decentralized executability (Amato, 2024).
  • Deployment of communication channels, parameter sharing, and attention mechanisms (e.g., MAAC) to mitigate non-stationarity and variance (Lyu et al., 2021, Sunehag et al., 2017).

Practical recommendations:

  • Favor state-based centralized critics when partial observability is mild and variance is the learning bottleneck.
  • For severe partial observability or critical coordination, prefer history-based critics or regularized centralized critics.
  • Monitor the posterior consistency error $\mathbb{E}\left[\operatorname{Var}_{s \mid h} Q(s, a)\right]$ to detect bias.
  • Utilize replay buffers, off-policy multi-step returns, and decentralized execution architectures for scalable training (Lyu et al., 2024, Yao et al., 2019).
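The posterior consistency diagnostic in the recommendations above can be estimated directly in a tabular setting (hypothetical posterior and critic values; real systems would estimate $\Pr(s \mid h)$ from a model or samples): high variance of $Q^c(s, a)$ under $\Pr(s \mid h)$ flags histories where a state-based critic is likely biased:

```python
import numpy as np

def posterior_consistency_error(posteriors, q_c):
    """E_h[ Var_{s|h} Q^c(s, a) ] for a fixed action a.

    posteriors: (n_histories, n_states) rows Pr(s | h); q_c: (n_states,)
    critic values. Histories are weighted uniformly for simplicity.
    """
    means = posteriors @ q_c
    second_moments = posteriors @ q_c**2
    var_per_h = second_moments - means**2
    return float(var_per_h.mean())

# History 0 pins down the state; history 1 aliases two states whose
# critic values disagree, so it contributes all of the error.
posteriors = np.array([[1.0, 0.0, 0.0],
                       [0.0, 0.5, 0.5]])
q_c = np.array([1.0, 0.0, 4.0])
err = posterior_consistency_error(posteriors, q_c)
```

A value near zero means the critic is (nearly) a function of the history alone, so the state-based gradient is (nearly) unbiased; large values suggest switching to a history-based or regularized critic.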

Table: Algorithmic Schemes for Centralized Value Learning

Method          | Critic Structure              | Execution Mode
----------------|-------------------------------|----------------------
VDN             | Additive per-agent            | Decentralized greedy
QMIX            | Monotonic mixing              | Decentralized greedy
COMA/MADDPG     | Full joint state              | Decentralized actor
SMIX($\lambda$) | Mixing with $\lambda$-returns | Plug-and-play
CTDS            | Distilled teacher             | Decentralized student

6. Limitations, Failure Modes, and Controversies

Recent analysis cautions against viewing centralized critics as universally beneficial. Primary limitations include:

  • Bias under partial observability when conditioning on unobservable states.
  • Increased gradient variance (multi-action and multi-observation variance) with centralized critics as team size or unobserved information grows (Lyu et al., 2021).
  • Poor scalability in domains with high-dimensional joint action spaces; decomposition architectures help but do not eliminate exponential complexity.
  • In tabular domains, centralized learning can fail under embodiment constraints, and in mixed centralized-independent configurations, coordination breakdowns are persistent rather than transient (Atif et al., 24 Jan 2026).

A plausible implication is that the degree of centralization should be matched to the problem's structure: "increased coordination" through a universal central critic is not free, and indiscriminate centralization may hinder learning, particularly under role or kinematic asymmetries.

7. Summary and Outlook

Centralized value learning is an essential methodology in MARL, providing theoretical and practical benefits for credit assignment, sample efficiency, and coordinated training. Its efficacy, however, is domain- and architecture-dependent, requiring judicious bias–variance management, thoughtful algorithmic design, and sensitivity to observability and agent structure. Ongoing developments focus on hybrid critics, efficient distillation mechanisms, scalable mixing architectures, and principled criteria for centralization vs. decentralization—reflecting a nuanced balance between global optimality and practical robustness (Lyu et al., 2024, Sunehag et al., 2017, Zhao et al., 2022, Atif et al., 24 Jan 2026).
