Heterogeneous League Training (HLT)

Updated 3 June 2026

HLT is a reinforcement learning paradigm for heterogeneous multi-agent systems that leverages a dynamic league of current and historical policies to address cooperation and non-stationarity challenges.
It employs episodic partner mixing, hypernetwork-based adaptation, and prioritized policy gradients to manage imbalanced agent roles and enhance scalability in complex environments.
Empirical evaluations show HLT achieving over 90% win-rate in large-scale benchmarks, outperforming traditional methods in adaptability and cross-version compatibility.

Heterogeneous League Training (HLT) is a population-based reinforcement learning paradigm developed to address the challenges posed by cooperative multi-agent systems composed of diverse agent types. Extending classical league training, HLT structures policy optimization around a dynamic league of current and historical heterogeneous policy groups, which diversifies training partners, stabilizes learning against non-stationarity, and improves robustness to policy-version drift. By combining episodic partner mixing, hypernetwork-based adaptation, and prioritized update mechanisms, HLT achieves strong empirical performance and scalability in cooperative environments with large, imbalanced agent populations and division of labor across multiple roles (Fu et al., 2024, Fu et al., 2022, Han et al., 2020).

1. Mathematical Formulation and Objectives

HLT models the heterogeneous multi-agent problem as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP):

$H = \langle A,\,\Delta,\,U,\,S,\,P_t,\,O,\,P_o,\,R,\,\gamma\rangle$

where

$A = \{a_1,\ldots,a_N\}$ denotes agents,
$\Delta = \{\delta_1,\ldots,\delta_M\}$ the agent types,
$U = U_1 \times \ldots \times U_N$ the joint action space,
$S$ the global state,
$O$ joint observations,
$P_t, P_o$ transition and observation dynamics, and
$R: S \rightarrow \mathbb R$ the shared team reward with discount $\gamma$ .

Agents of a given type $\delta_m \in \Delta$ share parameters and act via a type-specific policy $\pi_{\delta_m}$, yielding a policy group $\Pi = \{\pi_{\delta_1},\ldots,\pi_{\delta_M}\}$. The HLT objective is threefold:

Mitigate policy non-stationarity arising from simultaneous adaptation of distinct agent types.
Compensate for sampling imbalance between abundant and rare roles using priority weights.
Enable scalable, robust cooperation under the centralized training, decentralized execution (CTDE) paradigm (Fu et al., 2024).

2. League Architecture and Policy Management

The fundamental architectural innovation in HLT is the construction and management of a heterogeneous policy league:

Frontier group $\Pi_f$: Parameters actively updated by reinforcement learning.
League $\mathcal{L}$: A buffer of frozen policy groups periodically snapshotted from the frontier, pruned by a diversity or performance-based rule. The league is capped at size $L_{\max}$ (Fu et al., 2024, Fu et al., 2022).
Policy mixing: At the start of each episode, HLT samples (1) a league member $\Pi_l \in \mathcal{L}$ and (2) a type $\delta_m \in \Delta$. The temporary mixed policy $\Pi_{\mathrm{mix}}$ replaces the type-$\delta_m$ policy $\pi_{\delta_m}$ in $\Pi_f$ with its counterpart from $\Pi_l$; all agents of that type deploy the league policy, while the others use the frontier (Fu et al., 2024, Fu et al., 2022).

This strategy exposes the active policies to a range of historical partner behaviors, stabilizing learning and increasing compatibility across policy versions. That property is critical in real-world deployments where policy upgrades must remain interoperable (Fu et al., 2022).
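The episodic mixing step can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the `PolicyGroup` class, the `sample_mixed_group` helper, the string placeholders standing in for policies, and the uniform sampling of both the league member and the replaced type are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class PolicyGroup:
    """One policy per agent type; `version` is a hypothetical snapshot label."""
    version: int
    policies: dict  # type name -> policy object (strings used as placeholders)


def sample_mixed_group(frontier, league, rng):
    """Return (mixed_policies, replaced_type, league_version) for one episode.

    Assumption: the league member and the replaced type are drawn uniformly.
    Falls back to the pure frontier while the league is still empty.
    """
    mixed = dict(frontier.policies)  # copy, so the frontier is not mutated
    if not league:
        return mixed, None, None
    member = rng.choice(league)
    replaced_type = rng.choice(sorted(frontier.policies))
    # Every agent of the chosen type will act with the frozen league policy.
    mixed[replaced_type] = member.policies[replaced_type]
    return mixed, replaced_type, member.version


types = ("drone", "missile", "gun")
frontier = PolicyGroup(3, {t: f"{t}_v3" for t in types})
league = [PolicyGroup(v, {t: f"{t}_v{v}" for t in types}) for v in (1, 2)]
mixed, replaced, version = sample_mixed_group(frontier, league, random.Random(0))
```

Returning the replaced type and the league version alongside the mixed group lets the rollout record which partner was used, which the update step needs later.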

3. Optimization Objectives and Prioritized Policy Gradients

HLT uses an actor-critic framework, typically with Proximal Policy Optimization (PPO) or another actor-critic objective. Key features include:

Prioritized weighting: To address class imbalance, a sample-specific weight $w_{\delta_m}$ adjusts the gradient contribution. It is computed from $p_{\delta_m}$, the expected performance (win-rate) of the policy group with type $\delta_m$ replaced from the league, and $\bar{p}$, the averaged performance across all types (Fu et al., 2024).

Policy surrogate loss (PPO-style):

$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]$

with $r_t(\theta) = \pi_\theta(u_t \mid o_t)/\pi_{\theta_{\mathrm{old}}}(u_t \mid o_t)$ and advantage estimate $\hat{A}_t$ (Fu et al., 2022). The advantage is further modulated by the prioritized weight $w_{\delta_m}$ of the agent's type (Fu et al., 2024).
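To make the interaction between the clipped surrogate and the priority weight concrete, the sketch below computes a standard PPO clipped loss in which each advantage is first scaled by a per-sample weight. The function name `weighted_ppo_loss` and its arguments are hypothetical, the weights are taken as given inputs (the rule that produces them is not reproduced here), and they are assumed to be non-negative.

```python
import math


def weighted_ppo_loss(logp_new, logp_old, advantages, weights, clip_eps=0.2):
    """Negative PPO clipped surrogate with per-sample priority weights.

    Assumption: each weight is non-negative and simply rescales the
    advantage of its sample before clipping is applied.
    """
    total = 0.0
    for ln, lo, adv, w in zip(logp_new, logp_old, advantages, weights):
        ratio = math.exp(ln - lo)  # pi_new / pi_old for the taken action
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        scaled_adv = w * adv
        # Pessimistic (minimum) of the unclipped and clipped objectives.
        total += min(ratio * scaled_adv, clipped * scaled_adv)
    return -total / len(advantages)


# With identical old and new log-probabilities the ratio is 1, so the loss
# reduces to the negative mean of the weighted advantages.
loss_same = weighted_ppo_loss([0.0, 0.0], [0.0, 0.0], [1.0, -2.0], [2.0, 0.5])
print(loss_same)  # -0.5
```

Samples from a rare type can thus carry a larger weight and contribute more to the gradient, without changing the clipping range itself.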

4. Hypernetwork Conditioning and Adaptation

Each agent's policy comprises a main network modulated by a hypernetwork that conditions not only on the agent's observation, but also on a policy-identity vector encoding (1) the one-hot agent type and (2) the mix of frontier vs. league policy participation for each type in the current episode (Fu et al., 2022, Fu et al., 2024). This conditioning signal is input into the hypernetwork, which generates the parameterization for the type-specific policy. This design enables agents to adapt their cooperation strategy to the current team composition, improving robustness and transferability across diverse team configurations.
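A toy version of this conditioning is sketched below in plain Python. It is a deliberate simplification, not the published architecture: the hypernetwork here is a single linear map that sees only the identity vector (the observation enters through the generated layer), the identity vector is assumed to be a one-hot type concatenated with one binary league flag per type, and the names `identity_vector` and `HyperPolicy` are hypothetical.

```python
import math
import random


def identity_vector(agent_type, league_types, all_types):
    """One-hot agent type, followed by a flag per type marking league control."""
    one_hot = [1.0 if t == agent_type else 0.0 for t in all_types]
    league_flags = [1.0 if t in league_types else 0.0 for t in all_types]
    return one_hot + league_flags


class HyperPolicy:
    """Linear hypernetwork that emits the weights of a linear policy layer."""

    def __init__(self, obs_dim, n_actions, id_dim, rng):
        self.obs_dim, self.n_actions = obs_dim, n_actions
        n_out = n_actions * obs_dim + n_actions  # main-layer weights + biases
        self.hyper = [[rng.gauss(0.0, 0.1) for _ in range(id_dim)]
                      for _ in range(n_out)]

    def action_probs(self, obs, ident):
        # Generate the main layer's parameters from the identity vector.
        flat = [sum(h * z for h, z in zip(row, ident)) for row in self.hyper]
        split = self.n_actions * self.obs_dim
        logits = []
        for a in range(self.n_actions):
            w = flat[a * self.obs_dim:(a + 1) * self.obs_dim]
            logits.append(sum(wi * oi for wi, oi in zip(w, obs)) + flat[split + a])
        # Numerically stable softmax over the action logits.
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        s = sum(exps)
        return [e / s for e in exps]


types = ["drone", "missile", "gun"]
policy = HyperPolicy(obs_dim=4, n_actions=3, id_dim=2 * len(types), rng=random.Random(0))
obs = [0.5, -1.0, 0.25, 2.0]
p_frontier = policy.action_probs(obs, identity_vector("drone", set(), types))
p_mixed = policy.action_probs(obs, identity_vector("drone", {"missile"}, types))
```

The same observation yields different action distributions in `p_frontier` and `p_mixed`, because the generated layer changes once the missile type is flagged as league-controlled.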

5. Algorithmic Loop and League Maintenance

HLT alternates between policy optimization and league maintenance:

Rollout phase: For each episode, sample a mixed policy group by replacing one agent type with a league member, gather transitions, and record partner indices (Fu et al., 2024, Fu et al., 2022).
Update phase: Using the aggregated rollouts, compute PPO updates on the frontier policy, with prioritized weight adjustment for type imbalance.
League update: Every $K$ iterations, clone the current frontier group, append it to the league $\mathcal{L}$, and, if capacity is exceeded, prune by diversity or similarity in performance space, often removing the more recent of a close pair (Fu et al., 2024, Fu et al., 2022).

Empirical and practical studies indicate that the diversity and update cadence of the league are critical hyperparameters: too small a league impairs robustness; too large increases computational overhead and can slow learning (Fu et al., 2022, Han et al., 2020).
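One way to realize the pruning rule described in the league update step is sketched below. It rests on stated assumptions: each league entry is a hypothetical `(version, performance_vector)` pair, similarity is squared Euclidean distance between performance vectors, a larger version number means a more recent snapshot, and `update_league` is an illustrative name rather than a function from the cited work.

```python
def update_league(league, snapshot, max_size):
    """Append a snapshot and, if over capacity, drop one of the closest pair.

    Each entry is (version, performance_vector). Of the two entries that are
    closest in performance space, the more recent one is removed.
    """
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    league = league + [snapshot]  # new list; the caller's league is untouched
    if len(league) <= max_size:
        return league
    best = None  # (distance, i, j) of the closest pair found so far
    for i in range(len(league)):
        for j in range(i + 1, len(league)):
            d = sum((a - b) ** 2 for a, b in zip(league[i][1], league[j][1]))
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    drop = i if league[i][0] > league[j][0] else j
    return league[:drop] + league[drop + 1:]


league = [(1, [0.50, 0.50]), (2, [0.50, 0.51]), (3, [0.90, 0.90])]
league = update_league(league, (4, [0.10, 0.10]), max_size=3)
print([version for version, _ in league])  # [1, 3, 4]
```

Here versions 1 and 2 behave almost identically, so version 2 is pruned and the behaviorally distinct new snapshot is kept. The pairwise search is quadratic in league size, which is acceptable for the small caps discussed above.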

6. Empirical Evaluation and Benchmarks

HLT has been evaluated extensively in both simulated multi-agent cooperation environments and high-dimensional competitive domains:

LSMO benchmark (Unreal Engine): Teams of up to 102 agents per side (drone, missile, gun vehicle types). In LSOP-66×2, Prioritized HLT achieved >90% win-rate within 60k episodes, outperforming CTDE baselines such as QTRAN (≈30%) and QPLEX (≈45%). HLT maintained >80% win-rate in all tested scales up to 102v102 (Fu et al., 2024).
Compatibility and role analysis: In scenarios with repeated replacement of active types with past policies, HLT maintains >80–90% win-rate, confirming high cross-version compatibility (Fu et al., 2022).
StarCraft II full game (TStarBot-X): League organization with main agent, exploiters, and evolutionary branches substantially outperformed homogeneous self-play or reduced league variants, achieving performance competitive with high-ranking human players (Han et al., 2020).

Benchmark	HLT Result	Main Baseline	Baseline Result
LSOP-66×2	>90% win-rate	QPLEX	≈45%
UHMP (2u-4m-4k)	98.09% win-rate	FT-Qmix-D	91.79%
SC2 (TStarBot-X)	Competitive with GMs	Surrogate League	Substantially lower

7. Strengths, Limitations, and Extensions

HLT provides robustness to teammate non-stationarity, fair optimization across role imbalances, and empirical scalability. However, it introduces additional overhead for league management and hyperparameter sensitivity in league size and update cadence (Fu et al., 2024, Fu et al., 2022). Centralized computation remains necessary for league sampling and evaluation during training, though execution is decentralized. Extensions include adaptive league sizes, advanced diversity-based pruning, and potential theoretical convergence analysis under prioritized sampling.

A plausible implication is that domains featuring highly non-transitive competition (as in SC2) or extreme type imbalance stand to benefit most, especially when domain knowledge can guide design of specialized exploiters or critical state rules (Han et al., 2020). For applications with vast or continuous type spaces, clustering of agent types or selective frontier updates may be necessary to preserve tractability and diversity (Fu et al., 2022).

Key References:

"Prioritized League Reinforcement Learning for Large-Scale Heterogeneous Multiagent Systems" (Fu et al., 2024)
"Learning Heterogeneous Agent Cooperation via Multiagent League Training" (Fu et al., 2022)
"TStarBot-X: An Open-Sourced and Comprehensive Study for Efficient League Training in StarCraft II Full Game" (Han et al., 2020)