Heterogeneous Agent Collaborative RL

Updated 4 July 2026

Heterogeneous Agent Collaborative Reinforcement Learning leverages the distinct observations, actions, and dynamics of non-identical agents to achieve cooperative performance.
Core methodologies include sequential policy updates, grouped factorization, and fairness-aware reweighting that preserve agent-specific advantages while ensuring convergence.
Empirical benchmarks demonstrate that specialized communication, adaptive fairness, and structured coordination can mitigate negative transfer and instability in complex multi-agent settings.

Heterogeneous Agent Collaborative Reinforcement Learning (HACRL) denotes a class of collaborative reinforcement learning settings in which non-identical agents learn in ways that exploit complementarity rather than suppress it. In the formulation used by "$\alpha$-fair heterogeneous agent reinforcement learning" (Xu et al., 11 Jun 2026), HACRL aims to learn decentralized policies for agents with heterogeneous observations, action spaces, and capabilities, while collaborating to achieve socially desirable outcomes. Across the broader literature, the same core idea appears in cooperative Markov games and Dec-POMDPs with distinct capabilities, action spaces, and dynamics (Yu et al., 2023), in actor-critic methods that avoid parameter sharing and instead coordinate sequential heterogeneous updates (Zhong et al., 2023), and in training-time collaboration schemes where agents share verified rollouts or models while retaining independent execution at inference (Zhang et al., 3 Mar 2026). The field therefore spans a common problem: how to preserve the statistical and functional advantages of heterogeneity without incurring instability, negative transfer, or coordination failure.

1. Formal settings and sources of heterogeneity

A large part of HACRL is formulated as either a cooperative Markov game or a heterogeneous Dec-POMDP. In the HARL formulation, a cooperative Markov game is defined by $\langle \mathcal{N}, \mathcal{S}, \boldsymbol{\mathcal{A}}, r, P, \gamma, d\rangle$, with factored decentralized policies and a joint return $J(\boldsymbol{\pi})$ (Zhong et al., 2023). In GHQ, the environment is a cooperative Dec-POMDP under CTDE, with agent-specific observation spaces $\Omega_i$ and action spaces $A_i$ (Yu et al., 2023). HeMAC makes this heterogeneity explicit at the benchmark level by defining a Dec-POMDP with agent-specific observation and action spaces, mixed discrete and continuous interfaces, and asymmetric dynamics (Dansereau et al., 23 Sep 2025). These formulations share the same structural premise: collaboration is required, but symmetry assumptions are not.

The literature treats heterogeneity as a multidimensional property rather than a single deviation in action count. GHQ characterizes heterogeneity in SMAC through Local Transition Heterogeneity, decomposed into Local Functionality Heterogeneity and Local Dynamic Heterogeneity (Yu et al., 2023). PHLRL models heterogeneous systems with a type set $\Delta=\{d_1,\dots,d_M\}$, where types differ in abilities and action spaces, and where type imbalance is itself a learning problem (Fu et al., 2024). GRILL extends ad hoc teamwork to heterogeneous goals by assigning each agent a goal subset $\mathcal{G}^i \subset \mathcal{G}$, so that overlap between agents’ goals may be full, partial, or absent (Taylor-Davies et al., 7 Mar 2026). In LLM-oriented HACRL, heterogeneity is further partitioned into heterogeneous state, heterogeneous size, and heterogeneous model, including tokenizer mismatch and architectural mismatch (Zhang et al., 3 Mar 2026).

This broader view matters because several recurring simplifications are rejected by the literature. HARL argues that full parameter sharing confines learning to homogeneous-agent settings and can be exponentially suboptimal (Zhong et al., 2023). HeMAC likewise cautions against zero-padding or discretization-to-match tricks, arguing that such homogenization increases dimensionality and harms learning efficiency and expressivity (Dansereau et al., 23 Sep 2025). A plausible implication is that HACRL is best understood not as “MARL plus role IDs,” but as a family of formulations in which heterogeneity changes both the optimization geometry and the admissible communication structure.

2. Core algorithmic foundations

The most developed theoretical line in HACRL is the HARL/HATRL/HAML family. HARL introduces the multi-agent advantage decomposition lemma and a sequential update scheme in which agents are updated one-by-one rather than simultaneously (Zhong et al., 2023). The key decomposition writes the joint advantage of an ordered subset of agents as a sum of conditioned per-agent advantages, which makes trust-region optimization tractable in heterogeneous settings. From this construction, HATRL generalizes trust-region learning to cooperative multi-agent systems with heterogeneous agents, while HATRPO and HAPPO provide practical constrained and clipped approximations. HAML then abstracts these algorithms into a mirror-learning template and proves monotonic improvement of joint return and convergence to Nash equilibrium for all algorithms derived within that template (Zhong et al., 2023).
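The advantage decomposition described above can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: a single fixed state, three agents with 2, 3, and 4 actions, a random joint Q-table, and random factored policies. The helper names `prefix_q` and `conditioned_advantage` are hypothetical and do not come from the HARL codebase, and the agents are taken in their natural order for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: one fixed state, three agents with different
# numbers of actions, a random joint Q-table and random factored policies.
n_actions = (2, 3, 4)
Q = rng.normal(size=n_actions)                       # Q(s, a1, a2, a3)
policies = [rng.dirichlet(np.ones(n)) for n in n_actions]


def prefix_q(Q, policies, prefix_actions):
    """Q-value of the first k agents' actions, averaging the remaining
    agents' actions under their own policies."""
    k = len(prefix_actions)
    q = Q[tuple(prefix_actions)]                     # fix the first k actions
    for pi in reversed(policies[k:]):                # marginalize last axis first
        q = q @ pi
    return float(q)


def conditioned_advantage(Q, policies, prefix_actions, action):
    """Advantage of one agent's action, conditioned on the actions already
    chosen by the agents that precede it in the ordering."""
    extended = list(prefix_actions) + [action]
    return prefix_q(Q, policies, extended) - prefix_q(Q, policies, prefix_actions)


joint_action = (1, 2, 0)
value = prefix_q(Q, policies, [])                    # V(s)
joint_advantage = float(Q[joint_action]) - value     # A(s, a)

# Sum of per-agent advantages, each conditioned on the preceding actions.
summed = sum(
    conditioned_advantage(Q, policies, joint_action[:j], joint_action[j])
    for j in range(len(joint_action))
)

print(joint_advantage, summed)                       # the two values coincide
```

The sum telescopes, so the identity holds for any Q-table and any factored policies. This is what allows each agent's update to be evaluated against the actions already committed by agents earlier in the sequence.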

The latest extension of this line is fairness-aware HACRL. "$\alpha$-fair heterogeneous agent reinforcement learning" replaces the utilitarian objective with an $\alpha$-fair welfare objective, using

$U_\alpha(r)=\frac{r^{1-\alpha}}{1-\alpha}\quad (\alpha\neq 1), \qquad U_1(r)=\log r,$

and a state-based welfare built on this utility. Its fair advantage function reweights gradients toward lower-return agents without modifying the environment reward (Xu et al., 11 Jun 2026). This distinction matters because the paper explicitly contrasts its approach with inequity-aversion reward shaping, arguing that reward shaping may break the Markov property, whereas fair advantage reweighting preserves stationarity and the theoretical guarantees of trust-region optimization (Xu et al., 11 Jun 2026). The resulting $\alpha$-fair HATRPO and $\alpha$-fair HAPPO preserve monotonic improvement and convergence to Nash equilibria under the stated assumptions.
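To see why an $\alpha$-fair objective shifts weight toward lower-return agents, it helps to evaluate $U_\alpha$ and its derivative $U_\alpha'(r) = r^{-\alpha}$ directly. The sketch below is a minimal illustration, not the paper's fair advantage function: the function names, the normalization of the weights, and the example returns are all assumptions made here for clarity.

```python
import numpy as np


def alpha_fair_utility(r, alpha):
    """U_alpha(r) for positive returns r; alpha = 1 is the logarithmic case."""
    r = np.asarray(r, dtype=float)
    if np.any(r <= 0):
        raise ValueError("alpha-fair utility requires positive returns")
    if np.isclose(alpha, 1.0):
        return np.log(r)
    return r ** (1.0 - alpha) / (1.0 - alpha)


def marginal_weights(r, alpha):
    """Derivative dU_alpha/dr = r ** (-alpha), normalized to sum to 1.

    The normalization is an illustrative choice, not taken from the source.
    """
    r = np.asarray(r, dtype=float)
    w = r ** (-alpha)
    return w / w.sum()


returns = np.array([1.0, 4.0, 16.0])     # hypothetical per-agent returns
for alpha in (0.0, 1.0, 2.0):
    print(alpha, marginal_weights(returns, alpha))
```

With $\alpha = 0$ the weights are uniform, which recovers the utilitarian case. As $\alpha$ grows, the agent with the lowest return receives an increasingly large share of the weight.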

A separate but complementary algorithmic line uses grouped factorization. GHQ partitions agents by Local Transition Grouping, learns per-group Q-functions, and enforces Grouped Individual-Global-Max Consistency (GIGM) through monotonic mixing constraints (Yu et al., 2023). Its hybrid structure combines group-level temporal-difference learning with group and global monotonicity, while Inter-group Mutual Information maximization encourages coordination between groups of different types. This suggests a distinct design axis in HACRL: instead of sequential trust-region updates over heterogeneous policies, one may instead impose structure on heterogeneous value factorization and on inter-group representation coupling.
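The role of monotonic mixing can be illustrated with a deliberately simplified stand-in. The sketch below assumes linear mixers with strictly positive weights at both the group and global level, whereas GHQ uses learned mixing networks. The group names, action counts, and utilities are hypothetical. It checks by brute force that decentralized greedy actions also maximize the mixed joint value, which is the consistency property that monotonicity is meant to guarantee.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: two groups whose agents have different action counts.
groups = {"marines": [3, 3], "medivac": [2]}          # actions per agent
agent_q = {g: [rng.normal(size=n) for n in sizes] for g, sizes in groups.items()}

# Strictly positive mixing weights keep each mixer monotonic in its inputs.
group_w = {g: np.abs(rng.normal(size=len(sizes))) + 0.1 for g, sizes in groups.items()}
global_w = {g: abs(rng.normal()) + 0.1 for g in groups}


def q_total(joint):
    """joint maps a group name to a tuple of action indices for its agents."""
    total = 0.0
    for g, actions in joint.items():
        q_group = sum(w * q[a] for w, q, a in zip(group_w[g], agent_q[g], actions))
        total += global_w[g] * q_group
    return total


# Decentralized greedy choice: every agent maximizes its own utility.
greedy = {g: tuple(int(np.argmax(q)) for q in agent_q[g]) for g in groups}

# Brute-force maximizer of the mixed value over all joint actions.
names = list(groups)
per_group = [list(itertools.product(*[range(n) for n in groups[g]])) for g in names]
best = max(
    (dict(zip(names, combo)) for combo in itertools.product(*per_group)),
    key=q_total,
)

print(greedy == best)
```

If any mixing weight were negative, the greedy per-agent choice could disagree with the joint maximizer, which is why the monotonicity constraints are imposed at both levels.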

3. Collaboration mechanisms beyond sequential trust regions

Not all HACRL methods rely on CTDE trust-region policy improvement. A second family stabilizes heterogeneity by maintaining policy populations or leagues. Heterogeneous League Training stores a frontier policy group and a league of frozen past policy groups, samples mixed teammates during training, and conditions actors through a hyper-network on team composition and teammate skill (Fu et al., 2022). Prioritized Heterogeneous League Reinforcement Learning adopts a similar league principle but combines it with prioritized advantage coefficients to compensate for agent-type imbalance in large-scale systems (Fu et al., 2024). In both cases, frozen historical policies mitigate non-stationarity by exposing learners to stable yet diverse teammate behaviors.
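The league principle can be sketched as a teammate-sampling routine. The example below is a minimal illustration under assumptions that are not taken from the cited papers: the mixing probability `p_frontier`, the dictionary layout of policy groups, and the policy names are all hypothetical, and policies are represented by plain strings.

```python
import random


def sample_team(frontier, league, team_types, p_frontier=0.7, rng=None):
    """Assemble a mixed team from the frontier group and frozen snapshots.

    frontier: dict mapping agent type -> current (trainable) policy
    league:   list of past policy groups, each a dict of frozen policies
    Only the frontier policies would receive gradient updates; frozen
    teammates only supply stable, diverse behavior.
    """
    rng = rng or random.Random()
    team = []
    for agent_type in team_types:
        if not league or rng.random() < p_frontier:
            team.append(("frontier", frontier[agent_type]))
        else:
            snapshot = rng.choice(league)
            team.append(("frozen", snapshot[agent_type]))
    return team


frontier = {"drone": "drone_v3", "vehicle": "vehicle_v3"}
league = [
    {"drone": "drone_v1", "vehicle": "vehicle_v1"},
    {"drone": "drone_v2", "vehicle": "vehicle_v2"},
]

team = sample_team(frontier, league, ["drone", "drone", "vehicle"],
                   rng=random.Random(0))
print(team)
```

Sampling per slot rather than per team is one possible design. It exposes each learner to teams in which current and historical policies are mixed within the same episode.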

A third family emphasizes collaborative exploration and replay rather than direct coupling of policies. CHDRL combines an off-policy global agent with on-policy and evolutionary local agents through cooperative exploration, local-global memory relay, and distinctive updates (Zheng et al., 2020). DiCE replaces a single learner with a team of independent policies that share experience and remain behaviorally distinct through diversity-regularized gradient fusion (Peng et al., 2020). In sparse-reward ViZDoom, collaborative training of heterogeneous PPO agents shares a centralized UVFA critic and, in some variants, centralized count-based intrinsic motivation; action-conditioned curiosity and reproducibility filtering are used to prevent negative transfer when only one agent can execute the critical OPEN action (Andres et al., 2022). These methods do not require a common joint environment trajectory, yet still fall within HACRL because collaboration changes what is explored and what is learned from.

A fourth family centers on model or rollout exchange across heterogeneous learners. Collaborative Deep Reinforcement Learning introduces deep knowledge distillation through a deep alignment network that maps teacher logits to a student’s action space, enabling cA3C to transfer across heterogeneous tasks and action spaces (Lin et al., 2017). HFDRL instead performs semantic-aware collaborator selection over a wireless cellular network, using a combined structural-semantic score and HeteroFL-style aggregation under communication and delay constraints (Lotfi et al., 2021). GARL studies asynchronous heterogeneous groups consisting of A2C, PPO, and ACER agents, and lets agents learn from one another through action-choice aggregation and full model adoption when trust criteria are met (Wu et al., 21 Jan 2025). HACPO generalizes this idea to RL with verifiable rewards for LLM agents: verified rollouts are reused across heterogeneous policies with sequence-level importance sampling, capability-aware baselines, exponential reweighting, and asymmetric stepwise clipping, while preserving independent inference after training (Zhang et al., 3 Mar 2026).
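Sequence-level importance sampling for rollout reuse can be sketched as follows. This is a hedged illustration rather than HACPO's actual estimator: the function name, the clipping bounds, and the choice to clip the ratio itself are assumptions. One convenient property of a sequence-level ratio is that it compares only total log-probabilities of the same response, so the per-token arrays may differ in length when the two policies use different tokenizers.

```python
import numpy as np


def sequence_importance_weight(logp_learner, logp_source,
                               clip_low=0.8, clip_high=1.3):
    """Clipped sequence-level ratio pi_learner(y | x) / pi_source(y | x).

    Each argument holds per-token log-probabilities of the SAME response
    under one policy. Only the sums are compared, so the two arrays need
    not have the same length. The bounds are hypothetical and need not be
    symmetric around 1.
    """
    log_ratio = float(np.sum(logp_learner)) - float(np.sum(logp_source))
    ratio = float(np.exp(log_ratio))
    return min(max(ratio, clip_low), clip_high)


# Same total log-probability under both policies: weight of 1.
print(sequence_importance_weight([-1.0, -0.5], [-0.7, -0.8]))

# Learner finds the rollout far more likely: clipped from above.
print(sequence_importance_weight([-0.1], [-2.0, -1.0]))

# Learner finds the rollout far less likely: clipped from below.
print(sequence_importance_weight([-5.0], [-1.0]))
```

Clipping bounds how strongly a rollout produced by a very different policy can influence the learner, which is one way to limit the off-policy drift that cross-policy reuse introduces.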

Taken together, these mechanisms show that collaboration in HACRL need not mean joint-action coupling. It may instead mean sequential policy improvement, grouped factorization, league-based partner randomization, shared replay, curiosity sharing, distillation, federated aggregation, or verified-rollout reuse.

4. Communication, goals, and multimodal coordination

Communication occupies a special place in HACRL because heterogeneity often creates information asymmetry. One line of work specializes communication explicitly through type structure. "Specializing Inter-Agent Communication in Heterogeneous Multi-Agent Reinforcement Learning using Agent Class Information" represents communication as a directed labeled heterogeneous agent graph and parameterizes message transforms by sender-receiver class pairs using relational graph convolutions (Meneghetti et al., 2020). The communication relation type is determined by the sender-receiver class pair, so message passing is specialized at the class-pair level rather than being shared across all agents. This is a concrete answer to the problem of class-dependent semantics.
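Class-pair specialization of messages can be sketched as a single relational message-passing step. The example below is a simplified illustration, not the cited architecture: it assumes two hypothetical agent classes, random weights, mean aggregation over all incoming edges, and a shared self-transform.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: three agents, two classes (0 and 1), 4-d features.
agent_class = [0, 0, 1]
dim = 4
h = rng.normal(size=(3, dim))                 # one feature vector per agent
edges = [(0, 2), (1, 2), (2, 0)]              # directed (sender, receiver)

# One weight matrix per (sender class, receiver class) relation.
W = rng.normal(size=(2, 2, dim, dim))
W_self = rng.normal(size=(dim, dim))          # shared self-connection


def relational_message_pass(h, agent_class, edges, W, W_self):
    """One step in which each message is transformed by the weight matrix
    selected by the sender's and receiver's classes."""
    incoming = np.zeros_like(h)
    counts = np.zeros(len(h))
    for s, r in edges:
        incoming[r] += h[s] @ W[agent_class[s], agent_class[r]]
        counts[r] += 1
    has_msgs = counts > 0
    incoming[has_msgs] /= counts[has_msgs, None]    # mean over received messages
    return np.maximum(h @ W_self + incoming, 0.0)   # ReLU


out = relational_message_pass(h, agent_class, edges, W, W_self)
print(out.shape)
```

Agent 1 receives no messages in this graph, so its output depends only on its own features. Agent 2 receives two messages that both use the class-0-to-class-1 transform, while the message it sends to agent 0 uses a different matrix.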

CH-MARL extends this communication problem to multimodal cooperation. It defines a two-robot household benchmark in which a humanoid with egocentric RGB and manipulation capability collaborates with a drone that has top-down RGB but cannot manipulate (Sharma et al., 2022). The task is pick-and-place, the observations may be visual or scene-graph based, and the benchmark includes language feedback. Communication is implemented as symbolic room-level messages about the object and the receptacle. The reported results show that simple communication materially improves decentralized performance, and that scene-graph inputs are substantially easier than raw RGB. A plausible implication is that, in heterogeneous multimodal systems, the hardest problem is often not coordination alone but joint coordination-perception.

Language itself can be the sole communication channel. In collaborative multi-agent dialogue training, two conversational agents with different roles, states, and reward shaping interact only through self-generated language and learn concurrently with WoLF-PHC in a stochastic collaborative game (Papangelis et al., 2019). This work treats NLU and NLG uncertainty as part of the environment dynamics rather than as an auxiliary pre-processing nuisance. The same general issue appears in GRILL, but at the level of goal arbitration: the high-level policy decides which goal to pursue, while a low-level goal-conditioned controller executes it, allowing the agent to decide when cooperation is worthwhile under heterogeneous goals (Taylor-Davies et al., 7 Mar 2026). GRILL-M adds an auxiliary teammate-modeling module and reports that its contribution increases as observable information about teammate goals decreases (Taylor-Davies et al., 7 Mar 2026).

These works collectively undermine a common misconception that HACRL is only about assigning specialized controllers to fixed roles. In many settings, the central challenge is not merely how different agents cooperate, but how they infer whether cooperation is currently useful, feasible, or even aligned with their local objective structure.

5. Benchmarks, applications, and empirical regularities

Empirical HACRL research spans synthetic coordination games, continuous control, sparse-reward navigation, wireless systems, and multimodal embodied environments. GHQ augments SMAC with seven new heterogeneous maps involving Marines and Medivacs and reports superior win rate, convergence speed, and variance relative to several value-based baselines as heterogeneity increases (Yu et al., 2023). PHLRL introduces LSOP, an Unreal Engine benchmark with air drones, missile vehicles, and gun vehicles, and reports that it achieves more than 90% win rate against a strong expert baseline within about 60k episodes in many runs (Fu et al., 2024). HeMAC systematizes this benchmarking agenda with Simple Fleet, Fleet, and Complex Fleet challenges, showing that IPPO is comparatively robust while MAPPO’s advantage erodes and QMIX struggles as heterogeneity increases (Dansereau et al., 23 Sep 2025).

A notable empirical regularity is that parameter sharing becomes less reliable as heterogeneity deepens. HARL reports that imposing parameter sharing on HAPPO harms performance, particularly with more heterogeneous agents (Zhong et al., 2023). HeMAC reports that even parameter sharing restricted to agents of the same type does not improve performance in its settings (Dansereau et al., 23 Sep 2025). This does not imply that all sharing is harmful; rather, it suggests that indiscriminate sharing is a poor substitute for structured coordination.

Another recurring pattern is that communication and information shaping can dominate raw algorithm choice. In CH-MARL, visual Test-unseen success for QMIX improves from 2.33% to 12.03% with messages, and scene-graph Test-unseen success improves from 18.67% to 28.95% (Sharma et al., 2022). In the UAV-enabled collaborative beamforming problem, HATRPO-UCB augments HATRPO with observation enhancement, agent-specific global states for critics, and Beta policies for bounded actions, yielding faster convergence and a stronger rate-energy trade-off than several MADRL baselines (Liu et al., 2024). In sparse-reward ViZDoom, action-based centralized curiosity with reproducibility filtering produces the fastest convergence among the reported heterogeneous collaborative variants (Andres et al., 2022).
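The use of Beta policies for bounded actions can be illustrated with a short sampling routine. This is a minimal sketch, not the HATRPO-UCB implementation: the function name, the shape parameters, and the action range are hypothetical. A Beta distribution has support on the unit interval, so an affine rescaling yields actions that always respect the bounds, whereas a Gaussian policy has unbounded support and typically needs clipping.

```python
import numpy as np


def sample_bounded_action(a, b, low, high, rng, size=None):
    """Draw from Beta(a, b) on [0, 1] and rescale affinely to [low, high]."""
    if a <= 0 or b <= 0:
        raise ValueError("Beta shape parameters must be positive")
    unit = rng.beta(a, b, size=size)
    return low + (high - low) * unit


rng = np.random.default_rng(0)
# In a policy network, a and b would be state-dependent outputs; here they
# are fixed hypothetical values.
actions = sample_bounded_action(2.0, 5.0, low=-1.0, high=1.0, rng=rng, size=10_000)

print(actions.min() >= -1.0, actions.max() <= 1.0)
print(actions.mean())          # close to -1 + 2 * 2 / (2 + 5)
```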

Fairness is also now an empirical dimension rather than a purely normative one. In sequential social dilemmas such as CleanUp and CommonHarvest, $\alpha$-fair HATRPO and HAPPO reduce the Gini index while maintaining competitive or slightly higher utilitarian welfare proxies relative to utilitarian HATRL baselines (Xu et al., 11 Jun 2026). More broadly, this suggests that “collaboration quality” in HACRL is not exhausted by return maximization; fairness, resource sustainability, and role utilization are increasingly treated as first-class outcome variables.
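The Gini index used as a fairness metric can be computed from per-agent returns as sketched below. The formula is the standard sorted-rank form for non-negative values. The example returns are hypothetical and are not results from the cited experiments.

```python
import numpy as np


def gini(values):
    """Gini index of non-negative values: 0 means perfect equality,
    (n - 1) / n means one agent holds everything."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float(2.0 * (ranks * x).sum() / (n * total) - (n + 1) / n)


print(gini([5.0, 5.0, 5.0, 5.0]))     # equal returns
print(gini([0.0, 0.0, 0.0, 20.0]))    # one agent collects everything
print(gini([2.0, 4.0, 6.0, 8.0]))     # intermediate case
```

Comparing this index alongside the sum of returns separates the two questions the paragraph raises: how much total welfare a method achieves, and how evenly that welfare is distributed across agents.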

6. Limitations and open directions

Despite rapid diversification, much of HACRL remains method-fragmented. The strongest theoretical guarantees are concentrated in HARL/HATRL/HAML and the $\alpha$-fair extension, and these guarantees rely on assumptions such as finite spaces, bounded positive rewards, and $\epsilon$-soft policies (Zhong et al., 2023, Xu et al., 11 Jun 2026). By contrast, league methods, federated semantic collaboration, and several large-scale heterogeneous actor-critic methods are supported primarily by empirical evidence rather than convergence theory (Fu et al., 2024, Lotfi et al., 2021, Fu et al., 2022). This leaves a gap between practical performance and formal understanding.

Scalability is another unresolved issue. GHQ still reports difficulty on larger heterogeneous SMAC maps (Yu et al., 2023). HeMAC identifies credit assignment under asymmetric dynamics, role specialization, and constrained communication as open design points rather than solved problems (Dansereau et al., 23 Sep 2025). PHLRL reduces computational overhead by training only the frontier policy group, but its guarantees remain empirical (Fu et al., 2024). In LLM HACRL, verified rewards enable principled rollout reuse, yet tokenizer mismatch, capability disparity, and off-policy drift require intricate importance-weighting and clipping machinery (Zhang et al., 3 Mar 2026).

Several open directions recur across papers. One is dynamic structure discovery: dynamic grouping in GHQ, learned role discovery in HeMAC-like settings, and adaptive class assignment in communication-specialized GNNs are all explicit future directions (Yu et al., 2023, Dansereau et al., 23 Sep 2025, Meneghetti et al., 2020). A second is adaptive fairness and exploration control: $\alpha$ scheduling and adaptive trust regions are proposed in fair HACRL (Xu et al., 11 Jun 2026), while intrinsic-reward scheduling is shown to determine whether optimal routes are ever discovered in sparse-reward heterogeneous exploration (Andres et al., 2022). A third is richer interaction models: mixed cooperative-competitive extensions, explicit communication under stationarity constraints, and online adaptation to changing teammates remain active frontiers (Xu et al., 11 Jun 2026, Taylor-Davies et al., 7 Mar 2026).

A final, broader implication emerges across the literature. HACRL is no longer a niche variant of MARL concerned only with typed agents. It has become a unifying label for methods that treat diversity in capabilities, observations, goals, policies, and model classes as something to coordinate rather than eliminate. The central technical question is therefore not whether agents are heterogeneous, but which aspects of heterogeneity should be preserved, aligned, reweighted, or shared in order to make collaboration both effective and stable.