Centrally-Weighted QMIX in Multi-Agent RL
- The paper introduces a weighted projection mechanism that prioritizes high-value joint actions to overcome representational limits in standard QMIX.
- CW-QMIX refines the monotonic mixing framework by incorporating a central critic and tailored TD losses for improved cooperative policy learning.
- Empirical results indicate CW-QMIX achieves up to 40 percentage points higher win rates in complex environments like SMAC under high exploration.
Centrally-Weighted QMIX (CW-QMIX) is a cooperative multi-agent reinforcement learning (MARL) algorithm designed to address representational limitations in standard monotonic value function factorisation. It builds on the QMIX framework, which factorises the joint action-value function using a monotonic mixing network so that agents can be trained centrally and still act in a decentralised way. CW-QMIX introduces a weighted projection mechanism that prioritises accuracy on high-value joint actions when projecting learning targets onto the representable subspace, thus enhancing performance in cooperative settings where joint optimality is critical (Rashid et al., 2020).
1. Background: QMIX and Monotonic Value Function Factorisation
QMIX formulates the joint action-value function as a monotonic mixing of per-agent utility values:
$$Q_{tot}(s, \mathbf{u}) = f_s\big(Q_1(\tau_1, u_1), \dots, Q_n(\tau_n, u_n)\big),$$
subject to the monotonicity constraint
$$\frac{\partial f_s}{\partial Q_a} \ge 0$$
for each agent $a$. This restriction guarantees that the decentralised greedy maximisation
$$\arg\max_{\mathbf{u}} Q_{tot}(s, \mathbf{u}) = \big(\arg\max_{u_1} Q_1(\tau_1, u_1), \dots, \arg\max_{u_n} Q_n(\tau_n, u_n)\big)$$
can be performed by each agent independently choosing actions to maximise its own local $Q_a$, resulting in linear scalability with the number of agents (Rashid et al., 2018). However, this monotonic constraint means that QMIX can only represent joint value functions that are non-decreasing in each agent's utility. Consequently, QMIX cannot express interactions where an agent's optimal local action depends non-monotonically on the actions of others. In such situations, the optimal joint action may not be representable in the form $f_s(Q_1, \dots, Q_n)$, and the projection performed by QMIX may misalign the optimal action or underestimate its value (Rashid et al., 2020).
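The argmax property above can be checked on a toy example. The sketch below is illustrative only: the utility values, the mixing weights, and the `mix` function are hypothetical and stand in for a trained mixing network. It compares a brute-force search over all joint actions with the per-agent greedy choice.

```python
import itertools
import math

# Hypothetical per-agent utilities Q_a(tau_a, u_a): 3 agents, 3 actions each.
agent_qs = [
    [0.2, 1.5, -0.3],
    [2.0, 0.1, 0.4],
    [-1.0, -0.5, 0.7],
]
N_ACTIONS = 3


def elu(x):
    return x if x > 0 else math.exp(x) - 1.0


def mix(qs, weights=(0.5, 2.0, 1.0), bias=-0.8):
    """Toy monotonic mixer: non-negative weights followed by an increasing ELU."""
    return elu(sum(w * q for w, q in zip(weights, qs)) + bias)


# Centralised greedy action: enumerate every joint action (exponential in agents).
central_greedy = max(
    itertools.product(range(N_ACTIONS), repeat=len(agent_qs)),
    key=lambda u: mix([agent_qs[a][u_a] for a, u_a in enumerate(u)]),
)

# Decentralised greedy action: each agent maximises its own utility (linear in agents).
decentral_greedy = tuple(
    max(range(N_ACTIONS), key=lambda u_a: q[u_a]) for q in agent_qs
)

print(central_greedy, decentral_greedy)  # both are (1, 0, 2)
```

Because the mixer is non-decreasing in every utility, the two searches agree, while the decentralised one inspects only 9 values instead of 27 joint actions.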
2. Motivation for Centrally-Weighted Projection
The standard QMIX projection minimises the unweighted loss between the "target" joint-action value $Q$ (e.g., Bellman targets or a central critic estimate) and the monotonic representable set $\mathcal{Q}^{mix}$:
$$\Pi_{Qmix} Q := \arg\min_{q \in \mathcal{Q}^{mix}} \sum_{\mathbf{u}} \big(Q(s, \mathbf{u}) - q(s, \mathbf{u})\big)^2.$$
This treats all joint actions equally, causing the projection to prioritise reducing errors on the majority of suboptimal actions rather than preserving accuracy on the optimal joint action. When performance is determined by identifying the optimal joint action, this approach can "crowd out" the best action, leading to suboptimal policies.
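CW-QMIX addresses this issue by introducing a weighting $w(s, \mathbf{u})$ and minimising the weighted squared error:
$$\Pi_w Q := \arg\min_{q \in \mathcal{Q}^{mix}} \sum_{\mathbf{u}} w(s, \mathbf{u}) \big(Q(s, \mathbf{u}) - q(s, \mathbf{u})\big)^2,$$
with a typical choice ("Idealised Central Weighting"):
$$w(s, \mathbf{u}) = \begin{cases} 1 & \text{if } \mathbf{u} = \mathbf{u}^*, \\ \alpha & \text{otherwise}, \end{cases}$$
where $\mathbf{u}^* = \arg\max_{\mathbf{u}} Q(s, \mathbf{u})$ and $\alpha \in (0, 1]$ is a tunable factor. This scheme downweights all but the best joint action, ensuring that the projection's maximiser matches that of the underlying target for sufficiently small $\alpha$ (Rashid et al., 2020).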
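The effect of the weighting can be seen on a small matrix game. The sketch below makes two simplifying assumptions that are not taken from the source: the payoff values are illustrative, and the projection is onto additive functions $a_i + b_j$ (a subset of the monotonic class) so that the weighted least-squares fit can be computed by simple alternating updates. The helper names are hypothetical.

```python
# Payoff table for a one-step, two-agent game (rows: agent 1, columns: agent 2).
# Illustrative values: the optimum (0, 0) is surrounded by miscoordination penalties.
Q = [
    [8.0, -12.0, -12.0],
    [-12.0, 0.0, 0.0],
    [-12.0, 0.0, 0.0],
]
N = 3


def project_additive(Q, w, iters=200):
    """Weighted least-squares fit Q[i][j] ~ a[i] + b[j] via alternating exact updates."""
    a, b = [0.0] * N, [0.0] * N
    for _ in range(iters):
        for i in range(N):
            a[i] = sum(w[i][j] * (Q[i][j] - b[j]) for j in range(N)) / sum(w[i])
        for j in range(N):
            col_weight = sum(w[i][j] for i in range(N))
            b[j] = sum(w[i][j] * (Q[i][j] - a[i]) for i in range(N)) / col_weight
    return [[a[i] + b[j] for j in range(N)] for i in range(N)]


def greedy(table):
    """Joint action with the highest value in a table."""
    cells = ((i, j) for i in range(N) for j in range(N))
    return max(cells, key=lambda u: table[u[0]][u[1]])


def central_weights(Q, alpha):
    """Weight 1 on the best joint action of the target, alpha everywhere else."""
    best = greedy(Q)
    return [[1.0 if (i, j) == best else alpha for j in range(N)] for i in range(N)]


unweighted = project_additive(Q, central_weights(Q, alpha=1.0))
weighted = project_additive(Q, central_weights(Q, alpha=0.1))

print(greedy(unweighted))  # not (0, 0): the optimum is crowded out
print(greedy(weighted))    # (0, 0): the optimum is recovered
```

With uniform weights the fit spends its capacity on the many penalised cells and ranks the optimal joint action last; with a small weight on everything except the best cell, the fitted table keeps its maximum at the optimum.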
3. CW-QMIX Algorithmic Workflow and Network Architecture
CW-QMIX uses the core monotonic mixing architecture of QMIX: per-agent utility networks, a monotonic mixing network with non-negative weights, and state-conditioned hypernetworks to generate the mixing weights and biases.
Per-Agent Networks:
- GRU (64 hidden units) stacked after a fully-connected layer, taking as input the agent’s observation, previous action, and agent-ID.
- Final linear layer outputs $Q_a(\tau_a, \cdot)$ across available actions.
Mixing Network:
- Input: vector of per-agent $Q_a$ values.
- Layer 1: weighted sum via $W_1$, $b_1$ (state-dependent), ELU activation, size 32.
- Layer 2: weighted sum via $W_2$, $b_2$, outputting scalar $Q_{tot}$.
Hypernetworks:
- $W_1$, $W_2$: generated by feed-forward MLPs with elementwise absolute value to ensure non-negativity.
- $b_1$, $b_2$: MLP outputs; $b_2$ is produced by a two-layer ReLU MLP.
Central Critic $\hat{Q}^*$ (unconstrained):
- Utilises the same per-agent features as above.
- Outputs are concatenated and input to a three-layer feed-forward MLP (256 units per layer, ReLU activations), with no monotonicity constraints.
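The contrast between the monotonic mixer and the unconstrained critic can be sketched in plain Python. This is a simplified illustration, not the reference implementation: weights are random rather than trained, the hypernetworks are single linear maps where the text describes MLPs, the critic omits biases, "three-layer" is read as three hidden layers plus a linear output, and the agent count and state size are arbitrary. All names are hypothetical.

```python
import math
import random

random.seed(0)
N_AGENTS, STATE_DIM, EMBED, CRITIC_HIDDEN = 3, 4, 32, 256


def linear(n_in, n_out):
    """Random (n_out x n_in) weight matrix standing in for a trained layer."""
    return [[random.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def elu(x):
    return x if x > 0 else math.exp(x) - 1.0


def relu_vec(x):
    return [max(0.0, v) for v in x]


# Hypernetworks: map the global state to the mixer's weights and biases.
hyper_w1 = linear(STATE_DIM, N_AGENTS * EMBED)
hyper_b1 = linear(STATE_DIM, EMBED)
hyper_w2 = linear(STATE_DIM, EMBED)
hyper_b2_hidden = linear(STATE_DIM, EMBED)
hyper_b2_out = linear(EMBED, 1)


def mixer(agent_qs, state):
    """Q_tot = W2 . ELU(W1 q + b1) + b2, with W1 and W2 forced non-negative."""
    w1 = [abs(v) for v in matvec(hyper_w1, state)]  # flattened N_AGENTS x EMBED
    b1 = matvec(hyper_b1, state)
    w2 = [abs(v) for v in matvec(hyper_w2, state)]
    b2 = matvec(hyper_b2_out, relu_vec(matvec(hyper_b2_hidden, state)))[0]
    hidden = [
        elu(sum(agent_qs[a] * w1[a * EMBED + k] for a in range(N_AGENTS)) + b1[k])
        for k in range(EMBED)
    ]
    return sum(h * w for h, w in zip(hidden, w2)) + b2


# Central critic: ordinary MLP over the concatenated per-agent outputs.
critic_layers = [
    linear(N_AGENTS, CRITIC_HIDDEN),
    linear(CRITIC_HIDDEN, CRITIC_HIDDEN),
    linear(CRITIC_HIDDEN, CRITIC_HIDDEN),
]
critic_out = linear(CRITIC_HIDDEN, 1)


def central_critic(agent_qs):
    """No sign constraint on any weight, so the output need not be monotonic."""
    h = agent_qs
    for W in critic_layers:
        h = relu_vec(matvec(W, h))
    return matvec(critic_out, h)[0]


state = [0.3, -1.2, 0.5, 0.8]
qs = [0.1, -0.4, 1.0]
print(mixer(qs, state), central_critic(qs))
```

Taking the absolute value of the hypernetwork outputs is what enforces monotonicity: raising any single agent's utility can never lower the mixer's output, whereas the critic carries no such guarantee.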
Training Details:
- Replay buffer, batch size 32, RMSProp optimiser, discount factor $\gamma$.
- Target network updates every 200 training steps.
- Exploration via $\epsilon$-greedy scheduling (annealing detailed per environment).
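A minimal sketch of the scheduling pieces in this list follows. The 200-step target update interval comes from the text; the linear shape of the schedule and the default start value, end value, and annealing length are assumptions chosen for illustration, since the source only says annealing is set per environment. Function names are hypothetical.

```python
TARGET_UPDATE_INTERVAL = 200  # training steps between target-network syncs (from the text)


def epsilon(step, start=1.0, end=0.05, anneal_steps=50_000):
    """Linearly anneal epsilon from `start` to `end`, then hold it at `end`.

    The defaults are illustrative assumptions, not values from the source.
    """
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return end + (1.0 - frac) * (start - end)


def should_sync_target(train_step):
    """True on the training steps where the target network is refreshed."""
    return train_step % TARGET_UPDATE_INTERVAL == 0


# A longer schedule keeps exploration high for more of training.
print(epsilon(100_000), epsilon(100_000, anneal_steps=1_000_000))
```

Lengthening `anneal_steps` is the knob that produces the "high exploration" regime discussed in the experiments.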
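The core innovation is the TD loss for $Q_{tot}$, using the per-sample weight $w(s, \mathbf{u})$:
$$\mathcal{L} = \sum_{i=1}^{b} w(s_i, \mathbf{u}_i) \big(Q_{tot}(s_i, \mathbf{u}_i) - y_i\big)^2,$$
where $y_i = r_i + \gamma \hat{Q}^*\big(s'_i, \arg\max_{\mathbf{u}'} Q_{tot}(s'_i, \mathbf{u}')\big)$ uses the central critic estimate for the target.
Weights $w(s, \mathbf{u})$ are assigned via:
- $w = 1$ if $y_i > \hat{Q}^*(s, \hat{\mathbf{u}}^*)$ or if $\mathbf{u} = \hat{\mathbf{u}}^*$, where $\hat{\mathbf{u}}^* = \arg\max_{\mathbf{u}} Q_{tot}(s, \mathbf{u})$ ("central" actions),
- $w = \alpha$ for remaining joint actions (Rashid et al., 2020).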
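The weighting rule and loss can be sketched as follows, assuming the rule is: weight 1 when the sampled joint action equals the mixer's current greedy joint action, or when the target exceeds the central critic's value for that greedy action, and weight $\alpha$ otherwise. The dictionary keys, function names, and the value $\alpha = 0.5$ are hypothetical choices for illustration.

```python
def cw_weight(y, q_hat_greedy, action, greedy_action, alpha):
    """Weight 1 for 'central' samples, alpha for all other joint actions."""
    if action == greedy_action or y > q_hat_greedy:
        return 1.0
    return alpha


def weighted_td_loss(batch, alpha=0.5):
    """Sum over the batch of w * (Q_tot - y)^2."""
    total = 0.0
    for s in batch:
        w = cw_weight(s["y"], s["q_hat_greedy"], s["action"], s["greedy_action"], alpha)
        total += w * (s["q_tot"] - s["y"]) ** 2
    return total


batch = [
    # Sampled action equals the greedy joint action: weight 1.
    {"q_tot": 1.0, "y": 2.0, "q_hat_greedy": 3.0, "action": (0, 1), "greedy_action": (0, 1)},
    # Target exceeds the critic's value of the greedy action: weight 1.
    {"q_tot": 0.0, "y": 4.0, "q_hat_greedy": 3.0, "action": (1, 1), "greedy_action": (0, 1)},
    # Neither condition holds: weight alpha.
    {"q_tot": 2.0, "y": 0.0, "q_hat_greedy": 3.0, "action": (1, 0), "greedy_action": (0, 1)},
]

print(weighted_td_loss(batch, alpha=0.5))  # 1*1 + 1*16 + 0.5*4 = 19.0
```

In a real training loop the weight would be treated as a constant (no gradient flows through it), and the central critic would be trained separately with an ordinary unweighted TD loss.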
4. Theoretical Guarantees
For the idealised tabular setting, it is proven that the centrally-weighted projection $\Pi_w$ with sufficiently small $\alpha$ produces a maximiser matching the true argmax of the target $Q$ (Theorem 4.1). This result implies that, as long as the weighted projection is used for Bellman updates (with an unrestricted critic supplying the target), the learned policy converges to the optimal one as in fitted Q-iteration. The proof operates by showing that any monotonic factorisation which misplaces the maximum incurs a penalty (at least the squared action gap) and that this penalty dominates as suboptimal actions are downweighted by $\alpha$. A direct corollary is that CW-QMIX with weighted projection and a central critic avoids the representational bottleneck of standard QMIX in recovering the optimal joint action, even when the true $Q^*$ is not monotonic (Rashid et al., 2020).
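For reference, the guarantee can be written compactly as below. This is a paraphrase under assumed notation ($\mathcal{Q}^{mix}$ for the set of monotonically representable functions, $\mathbf{u}^*$ for the target's best joint action), not a verbatim copy of the theorem.

```latex
% Assumed notation: Q is the target, Q^{mix} the monotonic function class.
\Pi_w Q := \arg\min_{q \in \mathcal{Q}^{mix}}
    \sum_{\mathbf{u}} w(s, \mathbf{u}) \big( Q(s, \mathbf{u}) - q(s, \mathbf{u}) \big)^2,
\qquad
w(s, \mathbf{u}) =
\begin{cases}
  1      & \mathbf{u} = \mathbf{u}^* = \arg\max_{\mathbf{u}} Q(s, \mathbf{u}), \\
  \alpha & \text{otherwise}.
\end{cases}

% Claim: a small enough alpha preserves the greedy joint action.
\exists\, \alpha > 0 \;:\;
\arg\max_{\mathbf{u}} (\Pi_w Q)(s, \mathbf{u}) = \arg\max_{\mathbf{u}} Q(s, \mathbf{u}).
```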
5. Empirical Validation and Benchmark Results
CW-QMIX is evaluated on two principal benchmarks: the Predator-Prey task and the StarCraft Multi-Agent Challenge (SMAC). Key findings:
- Predator-Prey (8 agents):
  - Requires simultaneous capture by two agents for a positive return; a lone agent attempting a capture incurs a penalty.
  - QMIX, MADDPG, MASAC, and QPLEX fail to achieve a positive return.
  - CW-QMIX (and OW-QMIX, the optimistically-weighted variant) rapidly solve the task, outperforming QTRAN in both learning speed and final performance.
- SMAC (StarCraft):
  - Under high exploration ($\epsilon$ annealed over 1M steps), QMIX's performance degrades, while CW-QMIX sustains high win rates (>90%) on the 3s5z and 5m_vs_6m maps.
  - On hard maps (6h_vs_8z), only CW-QMIX and OW-QMIX recover winning policies; QMIX fails.
  - For 3s5z_vs_3s6z, all methods converge to ∼80% win rate, indicating factorisation is not the limiting factor.
  - On corridor, vanilla CW-QMIX underperforms unless the central critic is modified to use a hypernetwork with column-softmax, after which it matches or exceeds QMIX.
Quantitatively, CW-QMIX achieves up to 40 percentage points higher win rates than QMIX in high-exploration settings and is able to solve previously unsolved coordination tasks (Rashid et al., 2020).
6. Practical Tradeoffs and Limitations
CW-QMIX maintains the central tradeoff that underpins monotonic factorisation: only monotonic joint-action value functions are exactly representable, so inherently non-monotonic value landscapes (where the optimal joint action is achievable only via non-monotonic interaction) remain approximate. Nevertheless, the weighted projection is designed to keep the greedy joint action aligned with that of the fitted critic; the formal guarantee of recovering the optimal policy holds in the idealised tabular setting, and in practical training regimes it is supported by the empirical results rather than by proof. CW-QMIX also requires somewhat more computation, since it maintains both the monotonic mixing network and a full central critic.
The practical success of CW-QMIX in robust learning scenarios (e.g., under high exploration noise) suggests that prioritising projection accuracy on high-value joint actions is a critical mechanism for overcoming representational bottlenecks observed in earlier monotonic factorisation approaches. Architecturally, CW-QMIX is sensitive to the design of the central critic; performance on certain maps improves when the critic utilises additional hypernetworks (e.g., with column-softmax) (Rashid et al., 2020).