HACPO: Collaborative Reinforcement Learning

Updated 4 July 2026

HACPO is a reinforcement learning framework in which heterogeneous agents collaboratively optimize their policies by sharing rollouts during training while executing independently at inference.
It employs four key mechanisms (capability-aware advantage estimation, discrepancy coefficients, exponential importance sampling, and stepwise clipping) to balance cross-agent learning.
Empirical results indicate that HACPO outperforms baselines across diverse heterogeneity settings, improving sample efficiency and gradient alignment while reducing rollout costs.

HACPO is an acronym that denotes two distinct research objects in recent arXiv literature. In reinforcement learning, HACPO refers to Heterogeneous Agent Collaborative Policy Optimization, the first algorithm proposed within the Heterogeneous Agent Collaborative Reinforcement Learning (HACRL) paradigm introduced in "Heterogeneous Agent Collaborative Reinforcement Learning" (Zhang et al., 3 Mar 2026). In ranking and preference modeling, HCPO is the Hierarchical Clustering Partial-Order model from "Hierarchical Partial-Order Models for Ranking" (Li et al., 23 Jun 2026), and the detailed specification notes that it is "sometimes called 'HACPO' when treated as a clustering model" (Li et al., 23 Jun 2026). The reinforcement-learning sense is the dominant use of HACPO, but the acronym is polysemous and requires contextual disambiguation.

1. Reinforcement-learning meaning of HACPO

Within HACRL, HACPO is defined as Heterogeneous Agent Collaborative Policy Optimization (Zhang et al., 3 Mar 2026). HACRL addresses a setting in which multiple Large-Language-Model agents, "possibly differing in architecture, size, or training state," are trained toward a common, verifiable reward function, share rollouts during training, and are "deployed independently" at inference time (Zhang et al., 3 Mar 2026). The paper characterizes this arrangement as "collaborative optimization with independent execution" (Zhang et al., 3 Mar 2026).

This positioning is important because HACPO is explicitly distinguished from two neighboring traditions. First, unlike LLM-based multi-agent reinforcement learning, it "does not require coordinated deployment" (Zhang et al., 3 Mar 2026). Second, unlike knowledge distillation and related on-/off-policy distillation approaches, it enables "bidirectional mutual learning among heterogeneous agents rather than one-directional teacher-to-student transfer" (Zhang et al., 3 Mar 2026). The algorithm therefore occupies a specific niche: collaborative on-policy optimization across heterogeneous agents, with training-time sharing but inference-time independence.

The paper states that HACPO is "the first algorithm to instantiate HACRL" (Zhang et al., 3 Mar 2026). In the provided description, each agent updates from both its own on-policy data ("homo") and other agents' rollouts ("hete"), with the stated aim of improving "sample efficiency and cross-agent knowledge transfer" (Zhang et al., 3 Mar 2026). This suggests that HACPO is best understood not merely as a modified PPO-style objective, but as a formalization of cross-policy sample reuse under heterogeneous tokenizers, parameterizations, and capabilities.

2. Algorithmic structure and training workflow

The HACPO training loop is given in explicit stepwise form for $n$ agents (Zhang et al., 3 Mar 2026). Its inputs are agents $\pi^{(k)}_{\theta_k}$, a shared prompt distribution $D$, a verifiable reward function $R(y)$, rollouts per agent per prompt $G$, clipping hyperparameters $\epsilon_{\text{low}}, \epsilon_{\text{high}}, \delta, \delta_{\text{step}}$, and an importance-sampling exponent $\alpha$ (Zhang et al., 3 Mar 2026). At each training step, old policies are saved, a minibatch of prompts is sampled, each agent generates $G$ responses per prompt, rewards are computed, and a shared reward pool $\mathcal{R}_t$ is formed (Zhang et al., 3 Mar 2026).
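The data-collection part of this loop can be sketched in a few lines. This is only an illustration under stated assumptions: the toy agents, the `collect_shared_rollouts` helper, and the two-argument reward function are hypothetical stand-ins, not the paper's implementation.

```python
import random

def collect_shared_rollouts(agents, prompts, reward_fn, G):
    """Each agent samples G responses per prompt; all rewards go into one shared pool."""
    pool = []  # entries: (agent_id, prompt, response, reward)
    for agent_id, generate in agents.items():
        for prompt in prompts:
            for _ in range(G):
                response = generate(prompt)
                pool.append((agent_id, prompt, response, reward_fn(prompt, response)))
    return pool

def make_agent(error_rate, seed):
    """Hypothetical toy agent: adds two numbers, erring with a fixed probability."""
    rng = random.Random(seed)
    def generate(prompt):
        a, b = prompt
        return a + b if rng.random() > error_rate else a + b + 1
    return generate

# Two agents of unequal capability stand in for heterogeneous LLMs.
agents = {"A": make_agent(0.2, seed=0), "B": make_agent(0.6, seed=1)}
prompts = [(1, 2), (3, 4)]

# Verifiable reward: 1.0 if the answer checks out, else 0.0.
reward_fn = lambda prompt, response: 1.0 if response == sum(prompt) else 0.0

pool = collect_shared_rollouts(agents, prompts, reward_fn, G=4)
```

Every agent can then read the whole pool, which is what makes cross-agent reuse possible; how those foreign samples are weighted is the job of the mechanisms described in the following sections.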

The algorithm then computes a rolling capability estimate for each agent $k$ and uses these estimates to define a capability ratio between agents. A capability-aware baseline, mixed across agents and weighted by the capability ratio, is then formed, and the corresponding advantage is normalized by the joint reward standard deviation. These definitions are given directly in the algorithm specification (Zhang et al., 3 Mar 2026).
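A minimal numerical sketch may help make the capability-aware baseline concrete. The specific forms below are assumptions chosen for illustration, not the paper's equations: the capability estimate is taken to be a rolling mean reward, the ratio is the receiving agent's capability divided by the source agent's, and the baseline is the mean of ratio-scaled rewards over all agents' samples.

```python
from statistics import fmean, pstdev

def capability_aware_advantages(rewards, capability, k):
    """Advantages as seen by agent k (assumed form, for illustration only).

    rewards:    dict agent_id -> list of rewards for one prompt
    capability: dict agent_id -> rolling mean reward (must be nonzero)
    """
    # Assumed ratio: receiving agent's capability over source agent's capability.
    ratio = {j: capability[k] / capability[j] for j in rewards}
    # Baseline mixes every agent's rewards, each scaled by its ratio.
    scaled = [ratio[j] * r for j, rs in rewards.items() for r in rs]
    baseline = fmean(scaled)
    # Normalise by the standard deviation of the pooled (joint) rewards.
    joint_std = pstdev([r for rs in rewards.values() for r in rs])
    advantages = {
        j: [(r - baseline) / (joint_std + 1e-8) for r in rs]
        for j, rs in rewards.items()
    }
    return baseline, advantages

rewards = {"A": [1.0, 1.0, 0.0, 0.0], "B": [1.0, 0.0, 0.0, 0.0]}
capability = {"A": 0.5, "B": 0.25}  # hypothetical rolling mean rewards
baseline, adv = capability_aware_advantages(rewards, capability, k="A")
```

In this toy case the weaker agent's rewards are scaled up by a factor of two, so the mixed baseline lands at agent A's own mean reward of 0.5 rather than at the pooled mean of 0.375.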

For cross-agent updates, the advantage is further rescaled: when a sample comes from another agent, it is multiplied by a capability discrepancy coefficient, whereas on an agent's own samples it is left unchanged (Zhang et al., 3 Mar 2026). HACPO also computes a sequence-level importance-sampling ratio, followed by exponential reweighting and asymmetric clipping for cross-agent samples (Zhang et al., 3 Mar 2026). The final surrogate loss for agent $k$ combines the terms from its own rollouts and from other agents' rollouts, and parameters are updated by gradient descent (Zhang et al., 3 Mar 2026).
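The overall shape of such a loss can be sketched with a generic PPO-style clipped objective. This is an assumption-laden illustration rather than the paper's loss: the function names, the default clipping values, and the simple sum of the two terms are all hypothetical.

```python
def clipped_surrogate(ratios, advantages, eps_low, eps_high):
    """Generic PPO-style clipped objective (to be maximised), averaged over samples."""
    terms = []
    for r, a in zip(ratios, advantages):
        r_clipped = min(max(r, 1.0 - eps_low), 1.0 + eps_high)
        terms.append(min(r * a, r_clipped * a))
    return sum(terms) / len(terms)

def agent_loss(own, cross, coeff, eps_low=0.2, eps_high=0.2):
    """own / cross: (ratios, advantages) tuples for own and foreign rollouts.

    coeff is a hypothetical capability discrepancy coefficient that
    rescales the advantages of cross-agent samples only.
    """
    own_ratios, own_adv = own
    cross_ratios, cross_adv = cross
    homo = clipped_surrogate(own_ratios, own_adv, eps_low, eps_high)
    hete = clipped_surrogate(cross_ratios, [coeff * a for a in cross_adv],
                             eps_low, eps_high)
    return -(homo + hete)  # negated so that gradient descent maximises the objective

loss = agent_loss(own=([1.0, 1.5], [1.0, 1.0]), cross=([1.0], [2.0]), coeff=0.5)
```

Here the own-sample ratio of 1.5 is clipped to 1.2, and the foreign advantage of 2.0 is halved by the coefficient before entering the objective.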

A notable structural feature is that the algorithm is written to accommodate agents with "its own tokenizer & parameterization" (Zhang et al., 3 Mar 2026). This is one of the clearest indicators that HACPO is not limited to lightly perturbed replicas of a single base model.

3. The four tailored mechanisms

The HACPO paper organizes its methodological contribution around four named mechanisms (Zhang et al., 3 Mar 2026). These mechanisms are introduced to "mitigate capability discrepancies and policy distribution shifts" while preserving "unbiased advantage estimation and optimization correctness" (Zhang et al., 3 Mar 2026).

Mechanism	Definition in the paper	Intended role
Agent-Capability-Aware Advantage Estimation	Capability-weighted baseline, with the advantage normalized by the joint reward standard deviation	Adjusts shared reward normalization across unequal agents
Model Capability Discrepancy Coefficient	Cross-agent scaling of the advantage by a capability-based coefficient when the sample comes from another agent	Reweights transferred signal by relative capability
Exponential Importance Sampling	Reweighting of the sequence-level ratio through an exponent $\alpha$ for cross-agent samples	Alters cross-agent ratio behavior under mismatch
Stepwise Clipping	Asymmetric clipping of the cross-agent ratio, with a lower bound that is tightened after each mini-batch update	Tightens trust region within the step

The Agent-Capability-Aware Advantage Estimation replaces standard single-model group-relative normalization with a baseline mixed across agents but weighted by capability ratio (Zhang et al., 3 Mar 2026). The detailed description explicitly contrasts this with "Standard single-model group-relative advantage" and then defines the mixed baseline and joint-standard-deviation normalization (Zhang et al., 3 Mar 2026).

The Model Capability Discrepancy Coefficient applies only in the heterogeneous case, when the sample comes from another agent, and multiplies the transferred advantage by a capability-based coefficient (Zhang et al., 3 Mar 2026). This indicates that transferred gradients are not treated as equally informative across agents of unequal performance.

The Exponential Importance Sampling mechanism defines a sequence-level importance-sampling ratio and then reweights it for cross-agent samples through the exponent $\alpha$ (Zhang et al., 3 Mar 2026). The paper presents this as a specific correction for off-policy mismatch between the receiving agent’s current policy and the source agent’s old policy.
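As a rough illustration of a sequence-level ratio with an exponent applied, consider the sketch below. Both functional forms are assumptions: the ratio is taken as the exponential of the difference in mean per-token log-probabilities (so the two policies may tokenize the same response into different lengths), and the reweighting is taken to be a plain power `ratio ** alpha`.

```python
import math

def sequence_ratio(logp_new_tokens, logp_old_tokens):
    """Assumed sequence-level ratio: exp(mean new log-prob - mean old log-prob).

    The two lists may differ in length when the receiving and source
    agents use different tokenizers for the same response.
    """
    mean_new = sum(logp_new_tokens) / len(logp_new_tokens)
    mean_old = sum(logp_old_tokens) / len(logp_old_tokens)
    return math.exp(mean_new - mean_old)

def exponential_reweight(ratio, alpha):
    """Assumed reweighting: with 0 < alpha < 1 the ratio is pulled toward 1."""
    return ratio ** alpha

# Receiving agent's current policy vs. source agent's old policy (toy numbers).
s = sequence_ratio([-1.0, -1.0], [-2.0, -2.0, -2.0])
damped = exponential_reweight(s, alpha=0.5)
```

Under this assumed form, an exponent below one shrinks large ratios toward one, which limits how strongly a badly mismatched foreign sample can move the update.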

The Stepwise Clipping mechanism uses asymmetric clipping for cross-agent samples, clamping the ratio to a bounded range, and then tightens the lower bound after each mini-batch update within the step (Zhang et al., 3 Mar 2026). A plausible implication is that HACPO treats negative policy drift on foreign samples more conservatively than symmetric PPO-style trust-region heuristics would.
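One way to picture a lower bound that tightens across mini-batches is the schedule below. The linear schedule, the fixed upper bound of 1.0, and the way `delta` and `delta_step` enter are assumptions made for this sketch; the paper's exact rule is not reproduced here.

```python
def stepwise_clip_bounds(delta, delta_step, num_minibatches, upper=1.0):
    """Assumed schedule: lower bound starts at 1 - delta and rises by
    delta_step after each mini-batch update, never exceeding the upper bound."""
    bounds = []
    for m in range(num_minibatches):
        lower = min(1.0 - delta + m * delta_step, upper)
        bounds.append((lower, upper))
    return bounds

def clip_ratio(ratio, lower, upper):
    """Clamp a cross-agent importance ratio into [lower, upper]."""
    return max(lower, min(ratio, upper))

bounds = stepwise_clip_bounds(delta=0.2, delta_step=0.05, num_minibatches=3)
first_lower, first_upper = bounds[0]
clipped_low = clip_ratio(0.5, first_lower, first_upper)   # pulled up to the lower bound
clipped_high = clip_ratio(1.3, first_lower, first_upper)  # pulled down to the upper bound
```

The allowed range narrows with each mini-batch in this sketch, so later updates within the same training step are permitted less deviation on foreign samples than earlier ones.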

4. Theoretical guarantees

The paper states two classes of theoretical results: unbiased advantage estimation and gradient consistency (Zhang et al., 3 Mar 2026).

For unbiased advantage estimation, Theorem 1 is stated under standard assumptions: the reward model is shared, and the capability ratio is treated as independent of current-batch reward noise. Under these assumptions, the expectation of the capability-aware baseline equals the agent's own expected reward. Corollary 1 then carries this result over to the advantage estimate (Zhang et al., 3 Mar 2026).

The proof sketch in the description expands the expected baseline over all agents’ samples and assumes an ideal capability ratio, so that the ratio can be factored out and shown to cancel to the single-agent expected reward (Zhang et al., 3 Mar 2026).
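The cancellation described in the proof sketch can be illustrated with a short derivation. The notation is assumed for illustration and is not taken from the paper: $\mu_j > 0$ is the expected reward of agent $j$, $\omega^{(k,j)} = \mu_k / \mu_j$ is an idealized capability ratio, and $b^{(k)}$ is a baseline that averages ratio-scaled rewards over the $G$ samples of each of the $n$ agents.

$$
\mathbb{E}\big[b^{(k)}\big]
= \frac{1}{nG}\sum_{j=1}^{n}\sum_{i=1}^{G}\omega^{(k,j)}\,\mathbb{E}\big[R(y^{(j)}_i)\big]
= \frac{1}{nG}\sum_{j=1}^{n} G\,\frac{\mu_k}{\mu_j}\,\mu_j
= \mu_k .
$$

The ratio can be pulled out of the expectation only because it is treated as fixed with respect to the current batch, which is where the independence assumption enters.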

For gradient consistency, Theorem 2 considers the gradients of the homogeneous and heterogeneous objectives and states that the two are positively aligned (Zhang et al., 3 Mar 2026).

The accompanying proof sketch rewrites the heterogeneous expectation using an importance-sampling lemma and introduces a positive weight term (Zhang et al., 3 Mar 2026). The statement then depends on a mild assumption that this weight, which encodes the capability ratio and the importance-sampling ratio, is positively correlated with the alignment of the heterogeneous gradient term to the homogeneous direction (Zhang et al., 3 Mar 2026).

These results are narrower than a full global convergence theory. The paper claims "unbiased advantage, positive gradient alignment" as strengths (Zhang et al., 3 Mar 2026), but the assumptions built into Theorem 1 and Theorem 2 delimit the scope of those guarantees. This suggests that HACPO’s theory is aimed at validating the correctness of its collaborative surrogate construction rather than establishing end-to-end optimality.

5. Experimental design and empirical results

The experiments use 7.5K high-quality MATH problems for training, with $G$ rollouts generated per prompt, and employ verifiable unit tests or formal checks per math problem as the reward (Zhang et al., 3 Mar 2026). Evaluation is performed on MATH-500, full MATH, GSM8K, AIME2025, AMC23, Minerva, and OlympiadBench, with accuracy defined as the fraction of correctly verified solutions; AIME2025 uses best@30, while the others use avg@1 (Zhang et al., 3 Mar 2026).

The paper studies three heterogeneity settings: Heterogeneous State with Qwen3-4B vs Qwen3-4B-Instruct; Heterogeneous Size with Qwen3-1.7B-Base vs Qwen3-4B-Base; and Heterogeneous Model with Qwen3-4B-Base vs Llama3.2-3B-Instruct (Zhang et al., 3 Mar 2026). Baselines are GRPO, GSPO, GSPO×2, and Naive, where Naive is "two-agent rollout share without HACPO’s four mechanisms" (Zhang et al., 3 Mar 2026).

The summary results reported in the paper are as follows.

Heterogeneity	Method	AVG(A,B)
State (4B vs 4B-Instr)	GSPO	0.742
State (4B vs 4B-Instr)	GSPO×2	0.745
State (4B vs 4B-Instr)	Naive	0.637
State (4B vs 4B-Instr)	HACPO	0.784
Size (1.7B vs 4B-Base)	GSPO	0.523
Size (1.7B vs 4B-Base)	GSPO×2	0.525
Size (1.7B vs 4B-Base)	Naive	0.476
Size (1.7B vs 4B-Base)	HACPO	0.547
Model (Qwen3-4B vs Llama3.2-3B)	GSPO	0.465
Model (Qwen3-4B vs Llama3.2-3B)	GSPO×2	0.455
Model (Qwen3-4B vs Llama3.2-3B)	Naive	0.410
Model (Qwen3-4B vs Llama3.2-3B)	HACPO	0.494
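The gaps between HACPO and GSPO can be read off the table with a few lines of arithmetic. The snippet below uses only the tabulated AVG(A,B) values; the dictionary layout and variable names are illustrative.

```python
# AVG(A,B) values copied from the summary table above.
results = {
    "State": {"GSPO": 0.742, "HACPO": 0.784},
    "Size":  {"GSPO": 0.523, "HACPO": 0.547},
    "Model": {"GSPO": 0.465, "HACPO": 0.494},
}

# Absolute gain of HACPO over GSPO, in accuracy points.
gains = {
    setting: round((row["HACPO"] - row["GSPO"]) * 100, 1)
    for setting, row in results.items()
}
mean_gain = sum(gains.values()) / len(gains)
```

The per-setting gaps come out to 4.2, 2.4, and 2.9 points, for a mean of roughly 3.2 points. This is close to, though slightly below, the 3.3 figure the paper reports; the summary table alone does not show how that figure was aggregated.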

The paper states that HACPO "consistently improves all participating agents" and "outperform[s] GSPO by an average of 3.3% while using only half the rollout cost" (Zhang et al., 3 Mar 2026). In the tabulated summary, the overall average gain is given as +3.3 points over GSPO (Zhang et al., 3 Mar 2026). It further states that HACPO samples $G$ rollouts per agent, for a total of $2G$ across the two agents, but each agent reuses all $2G$ samples, making it "equivalent to GSPO×2’s data budget but at half physical sampling cost" (Zhang et al., 3 Mar 2026).

An important negative result is that the Naive rollout-sharing baseline performs substantially worse in every reported heterogeneity setting (Zhang et al., 3 Mar 2026). This supports the paper’s claim that raw rollout pooling is insufficient and that the four mechanisms are not incidental engineering details.

6. Strengths, limitations, and plausible extensions

The strengths section in (Zhang et al., 3 Mar 2026) lists four properties. Sample Efficiency is attributed to the fact that "each rollout is reused across agents (up to $n$ times)." Bidirectional Transfer is motivated by the claim that "even weaker agents contribute unique signals (errors, alternative proofs)." Theoretically Sound refers to "unbiased advantage, positive gradient alignment." Heterogeneity-Robust refers to operation across "state, size, and model heterogeneity" (Zhang et al., 3 Mar 2026).

The limitations section is equally explicit. Hyperparameter Sensitivity is attached to three of the method's hyperparameters, which "must be tuned per agent pair" (Zhang et al., 3 Mar 2026). IS Variance is identified as a constraint under large policy gaps, which may force very conservative hyperparameter settings or heavy clipping (Zhang et al., 3 Mar 2026). Scalability to Many Agents is described as a concern because controlling pairwise coefficients and importance sampling might become complex as the number of agents grows (Zhang et al., 3 Mar 2026).

The paper also enumerates possible extensions: adaptive hyperparameters per agent pair, based on online drift estimates; Hierarchical Collaboration, using factor graphs to select the most informative cross-agent rollouts; Single-Agent Distillation, using HACPO to derive a single "ensemble" policy after collaborative training; and application to standard continuous-control MARL, replacing verifiable rewards with learned critics while retaining heterogeneous trajectory sharing (Zhang et al., 3 Mar 2026).

These proposals remain prospective in the source. A plausible implication is that HACPO is presently most mature in settings with verifiable rewards and moderate numbers of heterogeneous agents, rather than in large-agent systems with highly nonstationary reward estimation.

7. Alternative meaning: HCPO/HACPO in hierarchical partial-order ranking

A second, unrelated use of the acronym appears in "Hierarchical Partial-Order Models for Ranking" (Li et al., 23 Jun 2026). That paper introduces hierarchical partial order (HPO) models and a clustering extension called HCPO, with the note that HCPO is "sometimes called 'HACPO' when treated as a clustering model" (Li et al., 23 Jun 2026). This usage belongs to Bayesian ranking and preference aggregation rather than reinforcement learning.

In that framework, a partial order over a universe of items is a binary relation satisfying irreflexivity, antisymmetry, and transitivity, and each partial order has an associated set of linear extensions (Li et al., 23 Jun 2026). HCPO constructs a hierarchy consisting of a global poset and cluster-specific posets, coupled through latent utilities, with shrinkage governed by a model parameter (Li et al., 23 Jun 2026). The model includes a noise-free poset-extension likelihood as well as faster alternatives such as weighted queue-jump and frontier-softmax (Li et al., 23 Jun 2026).
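For readers unfamiliar with linear extensions, a brute-force enumeration on a tiny poset shows what the term means. This is a generic illustration of the concept, not code from the paper; the `linear_extensions` helper and the three-item example are hypothetical.

```python
from itertools import permutations

def linear_extensions(items, relations):
    """All total orders of `items` consistent with `relations`.

    relations is a set of (a, b) pairs meaning "a precedes b".
    Brute force over permutations, so only suitable for small item sets.
    """
    extensions = []
    for perm in permutations(items):
        position = {x: i for i, x in enumerate(perm)}
        if all(position[a] < position[b] for a, b in relations):
            extensions.append(perm)
    return extensions

# Partial order: "a" precedes both "b" and "c"; "b" and "c" are incomparable.
exts = linear_extensions(["a", "b", "c"], {("a", "b"), ("a", "c")})
```

The two incomparable items can appear in either order, so this poset has exactly two linear extensions. The number of extensions grows quickly with the number of incomparable items, which is one reason faster approximate likelihoods are attractive.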

Inference is performed by MCMC, using Gibbs steps for root utilities, truncated-Gaussian or MH updates for leaf utilities, MH updates for the remaining model parameters, and Gibbs/CRP updates for cluster labels under a Pitman–Yor or Dirichlet-process-style prior (Li et al., 23 Jun 2026). The paper states that experiments on synthetic and real-world data, "including pairwise acoustic preference data and LLM agent traces," show that HPO and HCPO "outperform existing approaches in both predictive performance and structural interpretability" (Li et al., 23 Jun 2026).

Because this ranking-model usage is methodologically independent from Heterogeneous Agent Collaborative Policy Optimization, the acronym HACPO should not be interpreted without domain context. In contemporary arXiv usage, HACPO most directly denotes the collaborative reinforcement-learning algorithm of (Zhang et al., 3 Mar 2026), while (Li et al., 23 Jun 2026) documents a separate naming collision arising from HCPO in hierarchical partial-order modeling.

References

Zhang et al. (3 Mar 2026). Heterogeneous Agent Collaborative Reinforcement Learning.

Li et al. (23 Jun 2026). Hierarchical Partial-Order Models for Ranking.