
HACPO: Collaborative Reinforcement Learning

Updated 4 July 2026
  • HACPO is a reinforcement learning framework enabling heterogeneous agents to collaboratively optimize policies by sharing rollouts during training with independent execution at inference.
  • It employs four key mechanisms—capability-aware advantage estimation, discrepancy coefficients, exponential importance sampling, and stepwise clipping—to balance cross-agent learning.
  • Empirical results indicate HACPO outperforms baselines across diverse heterogeneity settings, improving sample efficiency and gradient alignment while reducing rollout costs.

HACPO is an acronym that denotes two distinct research objects in recent arXiv literature. In reinforcement learning, HACPO refers to Heterogeneous Agent Collaborative Policy Optimization, the first algorithm proposed within the Heterogeneous Agent Collaborative Reinforcement Learning (HACRL) paradigm introduced in "Heterogeneous Agent Collaborative Reinforcement Learning" (Zhang et al., 3 Mar 2026). In ranking and preference modeling, HCPO is the Hierarchical Clustering Partial-Order model from "Hierarchical Partial-Order Models for Ranking" (Li et al., 23 Jun 2026), and the detailed specification notes that it is "sometimes called 'HACPO' when treated as a clustering model" (Li et al., 23 Jun 2026). The reinforcement-learning sense is the dominant one and is the main subject here, but the acronym is polysemous and requires contextual disambiguation.

1. Reinforcement-learning meaning of HACPO

Within HACRL, HACPO is defined as Heterogeneous Agent Collaborative Policy Optimization (Zhang et al., 3 Mar 2026). HACRL addresses a setting in which multiple Large-Language-Model agents, "possibly differing in architecture, size, or training state," are trained toward a common, verifiable reward function, share rollouts during training, and are "deployed independently" at inference time (Zhang et al., 3 Mar 2026). The paper characterizes this arrangement as "collaborative optimization with independent execution" (Zhang et al., 3 Mar 2026).

This positioning is important because HACPO is explicitly distinguished from two neighboring traditions. First, unlike LLM-based multi-agent reinforcement learning, it "does not require coordinated deployment" (Zhang et al., 3 Mar 2026). Second, unlike knowledge distillation and related on-/off-policy distillation approaches, it enables "bidirectional mutual learning among heterogeneous agents rather than one-directional teacher-to-student transfer" (Zhang et al., 3 Mar 2026). The algorithm therefore occupies a specific niche: collaborative on-policy optimization across heterogeneous agents, with training-time sharing but inference-time independence.

The paper states that HACPO is "the first algorithm to instantiate HACRL" (Zhang et al., 3 Mar 2026). In the provided description, each agent updates from both its own on-policy data ("homo") and other agents' rollouts ("hete"), with the stated aim of improving "sample efficiency and cross-agent knowledge transfer" (Zhang et al., 3 Mar 2026). This suggests that HACPO is best understood not merely as a modified PPO-style objective, but as a formalization of cross-policy sample reuse under heterogeneous tokenizers, parameterizations, and capabilities.

2. Algorithmic structure and training workflow

The HACPO training loop is given in explicit stepwise form for $n$ agents (Zhang et al., 3 Mar 2026). Its inputs are agents $\pi^{(k)}_{\theta_k}$, a shared prompt distribution $D$, a verifiable reward function $R(y)$, the number of rollouts per agent per prompt $G$, clipping hyperparameters $\epsilon_{\text{low}}, \epsilon_{\text{high}}, \delta, \delta_{\text{step}}$, and an importance-sampling exponent $\alpha$ (Zhang et al., 3 Mar 2026). At each training step, old policies are saved, a minibatch of prompts is sampled, each agent generates $G$ responses per prompt, rewards are computed, and a shared reward pool $\mathcal{R}_t$ is formed (Zhang et al., 3 Mar 2026).
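The data flow of one collection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `collect_shared_pool` and `split_for_agent`, the row layout, and the toy lambda "policies" are all hypothetical stand-ins for real LLM agents and a real verifier.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical row layout: (source_agent, prompt, response, reward).
Row = Tuple[str, str, str, float]


def collect_shared_pool(
    agents: Dict[str, Callable[[str], str]],
    prompts: List[str],
    reward_fn: Callable[[str, str], float],
    G: int,
) -> List[Row]:
    """Each agent samples G responses per prompt; all rewards go into one pool."""
    pool: List[Row] = []
    for prompt in prompts:
        for name, policy in agents.items():
            for _ in range(G):
                response = policy(prompt)
                pool.append((name, prompt, response, reward_fn(prompt, response)))
    return pool


def split_for_agent(pool: List[Row], agent_name: str) -> Tuple[List[Row], List[Row]]:
    """Partition the pool into an agent's own ("homo") and foreign ("hete") rows."""
    homo = [row for row in pool if row[0] == agent_name]
    hete = [row for row in pool if row[0] != agent_name]
    return homo, hete


# Toy stand-ins: agent A always answers "4", agent B always answers "5".
agents = {"A": lambda p: p + " -> 4", "B": lambda p: p + " -> 5"}
reward = lambda prompt, response: 1.0 if response.endswith("4") else 0.0

pool = collect_shared_pool(agents, ["2+2"], reward, G=2)
homo, hete = split_for_agent(pool, "A")
```

With two agents and `G=2`, the pool holds four rows per prompt, and each agent sees two of its own rows and two foreign rows.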

The algorithm then computes rolling capability estimates. For agent $k$, these are

[equation missing in source]

which are used to define the capability ratio

[equation missing in source]

A capability-aware baseline is then formed: [equation missing in source] and the corresponding advantage uses the joint reward standard deviation,

[equation missing in source]

These definitions are given directly in the algorithm specification (Zhang et al., 3 Mar 2026).

For cross-agent updates, the advantage is further rescaled. When the sample comes from another agent,

[equation missing in source]

whereas on an agent’s own samples it remains unscaled (Zhang et al., 3 Mar 2026). HACPO also computes a sequence-level importance-sampling ratio

[equation missing in source]

followed by exponential reweighting and asymmetric clipping for cross-agent samples (Zhang et al., 3 Mar 2026). The final surrogate loss for each agent is

[equation missing in source]

and parameters are updated by gradient descent (Zhang et al., 3 Mar 2026).

A notable structural feature is that the algorithm is written so that each agent has "its own tokenizer & parameterization" (Zhang et al., 3 Mar 2026). This is one of the clearest indicators that HACPO is not limited to lightly perturbed replicas of a single base model.

3. The four tailored mechanisms

The HACPO paper organizes its methodological contribution around four named mechanisms (Zhang et al., 3 Mar 2026). These mechanisms are introduced to "mitigate capability discrepancies and policy distribution shifts" while preserving "unbiased advantage estimation and optimization correctness" (Zhang et al., 3 Mar 2026).

Mechanism | Definition in the paper | Intended role
Agent-Capability-Aware Advantage Estimation | Capability-weighted baseline, with the advantage normalized by the joint reward standard deviation | Adjusts shared reward normalization across unequal agents
Model Capability Discrepancy Coefficient | Cross-agent scaling of the advantage, applied only to samples from another agent | Reweights transferred signal by relative capability
Exponential Importance Sampling | Reweights the sequence-level ratio through the importance-sampling exponent for cross-agent samples | Alters cross-agent ratio behavior under mismatch
Stepwise Clipping | Asymmetric clipping of the cross-agent ratio, with a lower bound that is tightened stepwise | Tightens trust region within the step

The Agent-Capability-Aware Advantage Estimation replaces standard single-model group-relative normalization with a baseline mixed across agents but weighted by capability ratio (Zhang et al., 3 Mar 2026). The detailed description explicitly contrasts this with "Standard single-model group-relative advantage" and then defines the mixed baseline and joint-standard-deviation normalization (Zhang et al., 3 Mar 2026).
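The contrast between the two normalizations can be illustrated with a short sketch. The first function is the standard group-relative advantage (reward minus group mean, divided by group standard deviation). The second is only an assumed form of a capability-weighted baseline: it rescales each source agent's rewards by a ratio of hypothetical capability scores before averaging, and divides by the standard deviation of the pooled rewards. The exact formula in the paper may differ; the names `capability` and `capability_weighted_advantages` are illustrative.

```python
import statistics
from typing import Dict, List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standard single-model group-relative advantage: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def capability_weighted_advantages(
    rewards_by_agent: Dict[str, List[float]],
    capability: Dict[str, float],
    k: str,
    eps: float = 1e-8,
) -> List[float]:
    """Assumed form: rescale agent j's rewards by capability[k] / capability[j],
    average them into one baseline, and normalize by the joint reward std.
    Capability scores must be nonzero."""
    scaled = [
        capability[k] / capability[j] * r
        for j, rewards in rewards_by_agent.items()
        for r in rewards
    ]
    baseline = statistics.fmean(scaled)
    joint = [r for rewards in rewards_by_agent.values() for r in rewards]
    joint_std = statistics.pstdev(joint)
    return [(r - baseline) / (joint_std + eps) for r in rewards_by_agent[k]]


single = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

rewards_by_agent = {"a": [1.0, 1.0, 0.0, 0.0], "b": [1.0, 0.0, 0.0, 0.0]}
capability = {"a": 0.5, "b": 0.25}  # hypothetical rolling accuracy estimates
mixed = capability_weighted_advantages(rewards_by_agent, capability, k="a")
```

In this toy case the capability scores equal each agent's mean reward, so the rescaled baseline for agent `a` coincides with its own mean reward of 0.5, which is the kind of cancellation the unbiasedness argument relies on.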

The Model Capability Discrepancy Coefficient applies only in the heterogeneous case, that is, to samples drawn from another agent, and multiplies the transferred advantage by a capability-dependent coefficient (Zhang et al., 3 Mar 2026). This indicates that transferred gradients are not treated as equally informative across agents of unequal performance.

The Exponential Importance Sampling mechanism defines the sequence-level ratio and then reweights it for cross-agent samples through the importance-sampling exponent (Zhang et al., 3 Mar 2026). The paper presents this as a specific correction for off-policy mismatch between the receiving agent’s current policy and the source agent’s old policy.
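A sequence-level ratio with exponent reweighting can be sketched as below. This is a hedged illustration only: it assumes the ratio is the exponential of the difference between the receiving agent's sequence log-probability and the source agent's old sequence log-probability, and that reweighting raises the ratio to a power `alpha`. The paper's exact definition (for example, any length normalization) is not reproduced here, and the function names are hypothetical.

```python
import math


def sequence_ratio(receiver_seq_logp: float, source_old_seq_logp: float) -> float:
    """Ratio of sequence probabilities, computed from total log-probabilities.

    Each log-probability is assumed to be summed over the response tokens by
    the respective model using its own tokenizer.
    """
    return math.exp(receiver_seq_logp - source_old_seq_logp)


def exponent_reweight(ratio: float, alpha: float) -> float:
    """Assumed reweighting: ratio ** alpha. For 0 < alpha < 1 this pulls the
    ratio toward 1, damping large cross-agent mismatches."""
    return ratio ** alpha


same = sequence_ratio(-10.0, -10.0)   # identical likelihoods
damped_high = exponent_reweight(4.0, 0.5)
damped_low = exponent_reweight(0.25, 0.5)
```

Under this assumed form, an exponent below one shrinks both very large and very small ratios toward one, which reduces the variance contributed by foreign samples at the cost of some bias.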

The Stepwise Clipping mechanism uses asymmetric clipping for cross-agent samples, with the ratio clamped to an asymmetric interval, and then tightens the lower bound after each mini-batch update using

[equation missing in source]

(Zhang et al., 3 Mar 2026). A plausible implication is that HACPO treats negative policy drift on foreign samples more conservatively than symmetric PPO-style trust-region heuristics would.
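The general shape of asymmetric clipping with a tightening lower bound can be sketched as follows. The linear schedule, the ceiling of 1.0, and the parameter names (`initial_lower`, `step_size`) are assumptions made for illustration; the paper's actual update rule for the bound is not reproduced here.

```python
from typing import List


def asymmetric_clip(ratio: float, lower: float, upper: float) -> float:
    """Clamp the ratio to [lower, upper]; the two bounds need not be
    symmetric around 1."""
    return max(lower, min(ratio, upper))


def stepwise_lower_bounds(
    initial_lower: float,
    step_size: float,
    num_minibatches: int,
    ceiling: float = 1.0,
) -> List[float]:
    """Assumed linear schedule: the lower bound rises by step_size after each
    mini-batch update and never exceeds the ceiling."""
    return [
        min(ceiling, initial_lower + i * step_size)
        for i in range(num_minibatches)
    ]


bounds = stepwise_lower_bounds(initial_lower=0.8, step_size=0.05, num_minibatches=4)
# The same raw ratio is clipped more aggressively as the bound tightens.
clipped = [asymmetric_clip(0.82, lower, upper=1.0) for lower in bounds]
```

Raising the lower bound over successive mini-batches narrows the interval in which a foreign sample's ratio passes through unclipped, so later updates within a step rely less on samples that have drifted from the receiving policy.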

4. Theoretical guarantees

The paper states two classes of theoretical results: unbiased advantage estimation and gradient consistency (Zhang et al., 3 Mar 2026).

For unbiased advantage estimation, Theorem 1 is stated as follows: under "standard assumptions (reward model shared, […] treated independent of current-batch reward noise)," the capability-aware baseline satisfies

[equation missing in source]

Corollary 1 then states

[equation missing in source]

The proof sketch in the description expands the relevant expectation over all agents’ samples and uses an "Assumption (ideal […])" so that the capability ratio can be factored out and shown to cancel to the single-agent expected reward (Zhang et al., 3 Mar 2026).

For gradient consistency, Theorem 2 considers the gradients of the homogeneous and heterogeneous objectives and states that

[equation missing in source]

The accompanying proof sketch rewrites the heterogeneous expectation using an importance-sampling lemma and introduces a positive weight term (Zhang et al., 3 Mar 2026). The statement then depends on a "mild assumption that […] (which encodes capability ratio and IS¹) is positively correlated with […]’s alignment to the homo-direction" (Zhang et al., 3 Mar 2026).

These results are narrower than a full global convergence theory. The paper claims "unbiased advantage, positive gradient alignment" as strengths (Zhang et al., 3 Mar 2026), but the assumptions built into Theorem 1 and Theorem 2 delimit the scope of those guarantees. This suggests that HACPO’s theory is aimed at validating the correctness of its collaborative surrogate construction rather than establishing end-to-end optimality.

5. Experimental design and empirical results

The experiments use 7.5K high-quality MATH problems for training, with each prompt yielding a fixed number $G$ of rollouts, and employ verifiable unit tests or formal checks per math problem as the reward (Zhang et al., 3 Mar 2026). Evaluation is performed on MATH-500, full MATH, GSM8K, AIME2025, AMC23, Minerva, and OlympiadBench, with accuracy defined as the fraction of correctly verified solutions; AIME2025 uses best@30, while the others use avg@1 (Zhang et al., 3 Mar 2026).

The paper studies three heterogeneity settings: Heterogeneous State with Qwen3-4B vs Qwen3-4B-Instruct; Heterogeneous Size with Qwen3-1.7B-Base vs Qwen3-4B-Base; and Heterogeneous Model with Qwen3-4B-Base vs Llama3.2-3B-Instruct (Zhang et al., 3 Mar 2026). Baselines are GRPO, GSPO, GSPO×2, and Naive, where Naive is "two-agent rollout share without HACPO’s four mechanisms" (Zhang et al., 3 Mar 2026).

The summary results reported in the paper are as follows.

Heterogeneity | Method | AVG(A,B)
State (4B vs 4B-Instr) | GSPO | 0.742
State (4B vs 4B-Instr) | GSPO×2 | 0.745
State (4B vs 4B-Instr) | Naive | 0.637
State (4B vs 4B-Instr) | HACPO | 0.784
Size (1.7B vs 4B-Base) | GSPO | 0.523
Size (1.7B vs 4B-Base) | GSPO×2 | 0.525
Size (1.7B vs 4B-Base) | Naive | 0.476
Size (1.7B vs 4B-Base) | HACPO | 0.547
Model (Qwen3-4B vs Llama3.2-3B) | GSPO | 0.465
Model (Qwen3-4B vs Llama3.2-3B) | GSPO×2 | 0.455
Model (Qwen3-4B vs Llama3.2-3B) | Naive | 0.410
Model (Qwen3-4B vs Llama3.2-3B) | HACPO | 0.494

The paper states that HACPO "consistently improves all participating agents" and "outperform[s] GSPO by an average of 3.3% while using only half the rollout cost" (Zhang et al., 3 Mar 2026). In the tabulated summary, the overall average gain is given as +3.3 points over GSPO (Zhang et al., 3 Mar 2026). It further states that each agent generates only its own rollouts but reuses all samples in the shared pool, making it "equivalent to GSPO×2’s data budget but at half physical sampling cost" (Zhang et al., 3 Mar 2026).
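The "half physical sampling cost" claim follows from simple counting, sketched below. The sketch assumes that GSPO×2 denotes independent GSPO training in which each agent generates twice as many rollouts per prompt, and it uses a hypothetical value `G = 8`; neither detail is confirmed by the text above.

```python
from typing import Dict, Tuple


def sampling_costs(num_agents: int, G: int) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Per-prompt rollout counts for shared vs. independent training at an
    equal per-agent data budget of num_agents * G samples."""
    shared = {
        "generated_total": num_agents * G,       # each agent samples G
        "seen_per_agent": num_agents * G,        # each agent reuses the pool
    }
    independent = {
        "generated_total": num_agents * (num_agents * G),  # each samples its full budget
        "seen_per_agent": num_agents * G,
    }
    return shared, independent


shared, independent = sampling_costs(num_agents=2, G=8)  # G = 8 is hypothetical
cost_ratio = shared["generated_total"] / independent["generated_total"]
```

For two agents the ratio is one half: both schemes give each agent the same number of training samples per prompt, but the shared pool requires generating only half as many rollouts.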

An important negative result is that the Naive rollout-sharing baseline performs substantially worse in every reported heterogeneity setting (Zhang et al., 3 Mar 2026). This supports the paper’s claim that raw rollout pooling is insufficient and that the four mechanisms are not incidental engineering details.
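The gaps discussed above can be recomputed from the tabulated AVG(A,B) values. The sketch below copies those values verbatim; the dictionary layout and variable names are illustrative. A plain average of the three HACPO minus GSPO gaps comes to roughly 3.2 points, close to the reported 3.3; the small difference may reflect rounding or a different aggregation in the paper, which the summary does not specify.

```python
# AVG(A,B) values copied from the summary table.
results = {
    "State": {"GSPO": 0.742, "GSPO×2": 0.745, "Naive": 0.637, "HACPO": 0.784},
    "Size": {"GSPO": 0.523, "GSPO×2": 0.525, "Naive": 0.476, "HACPO": 0.547},
    "Model": {"GSPO": 0.465, "GSPO×2": 0.455, "Naive": 0.410, "HACPO": 0.494},
}

# Gain of HACPO over single-agent GSPO in each heterogeneity setting.
gains = {s: round(r["HACPO"] - r["GSPO"], 3) for s, r in results.items()}
mean_gain = sum(gains.values()) / len(gains)

# Gap of naive rollout sharing relative to GSPO (negative means worse).
naive_gaps = {s: round(r["Naive"] - r["GSPO"], 3) for s, r in results.items()}

for setting in results:
    print(f"{setting}: HACPO {gains[setting]:+.3f}, Naive {naive_gaps[setting]:+.3f}")
print(f"mean HACPO gain over GSPO: {mean_gain:.4f}")
```

In every setting the naive gap is negative and the HACPO gain is positive, which is the pattern the negative result rests on.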

6. Strengths, limitations, and plausible extensions

The strengths section in (Zhang et al., 3 Mar 2026) lists four properties. Sample Efficiency is attributed to the fact that "each rollout is reused across agents (up to […] times)." Bidirectional Transfer is motivated by the claim that "even weaker agents contribute unique signals (errors, alternative proofs)." Theoretically Sound refers to "unbiased advantage, positive gradient alignment." Heterogeneity-Robust refers to operation across "state, size, and model heterogeneity" (Zhang et al., 3 Mar 2026).

The limitations section is equally explicit. Hyperparameter Sensitivity is attached to three of the method's hyperparameters, which "must be tuned per agent pair" (Zhang et al., 3 Mar 2026). IS Variance is identified as a constraint under large policy gaps, which "may force very conservative […] or heavy clipping" (Zhang et al., 3 Mar 2026). Scalability to Many Agents is described as a concern because controlling pairwise coefficients and importance sampling "might become complex as […] grows" (Zhang et al., 3 Mar 2026).

The paper also enumerates possible extensions: adaptive hyperparameters per agent pair, based on online drift estimates; hierarchical collaboration, using factor graphs to select the most informative cross-agent rollouts; single-agent distillation, using HACPO to derive a single "ensemble" policy after collaborative training; and application to standard continuous-control MARL, replacing verifiable rewards with learned critics while retaining heterogeneous trajectory sharing (Zhang et al., 3 Mar 2026).

These proposals remain prospective in the source. A plausible implication is that HACPO is presently most mature in settings with verifiable rewards and moderate numbers of heterogeneous agents, rather than in many-agent systems with highly nonstationary reward estimation.

7. Alternative meaning: HCPO/HACPO in hierarchical partial-order ranking

A second, unrelated use of the acronym appears in "Hierarchical Partial-Order Models for Ranking" (Li et al., 23 Jun 2026). That paper introduces hierarchical partial order (HPO) models and a clustering extension called HCPO, with the note that HCPO is "sometimes called 'HACPO' when treated as a clustering model" (Li et al., 23 Jun 2026). This usage belongs to Bayesian ranking and preference aggregation rather than reinforcement learning.

In that framework, items are drawn from a universe of items, a partial order is a binary relation satisfying irreflexivity, antisymmetry, and transitivity, and each partial order has an associated set of linear extensions (Li et al., 23 Jun 2026). HCPO constructs a hierarchy consisting of a global poset and cluster-specific posets, coupled through latent utilities and a shrinkage parameter (Li et al., 23 Jun 2026). The model includes a noise-free poset-extension likelihood

[equation missing in source]

as well as faster alternatives such as weighted queue-jump and frontier-softmax (Li et al., 23 Jun 2026).
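The notions of partial order and linear extension can be made concrete with a brute-force sketch. The code encodes a relation as a set of pairs `(a, b)` meaning "a precedes b", which is a convention chosen here. The likelihood function assumes the common noise-free form in which an observed ranking is uniform over the linear extensions of the poset; whether the paper uses exactly this form is an assumption. Enumeration over all permutations is only feasible for very small item sets.

```python
from itertools import permutations
from typing import List, Sequence, Set, Tuple

Relation = Set[Tuple[str, str]]  # (a, b) means a precedes b


def is_strict_partial_order(items: Sequence[str], relation: Relation) -> bool:
    """Check irreflexivity, antisymmetry, and transitivity."""
    irreflexive = all((a, a) not in relation for a in items)
    antisymmetric = all((b, a) not in relation for (a, b) in relation if a != b)
    transitive = all(
        (a, d) in relation
        for (a, b) in relation
        for (c, d) in relation
        if b == c
    )
    return irreflexive and antisymmetric and transitive


def linear_extensions(items: Sequence[str], relation: Relation) -> List[Tuple[str, ...]]:
    """All total orders of the items that respect every pair in the relation."""
    extensions = []
    for perm in permutations(items):
        position = {item: i for i, item in enumerate(perm)}
        if all(position[a] < position[b] for a, b in relation):
            extensions.append(perm)
    return extensions


def noise_free_likelihood(ranking: Sequence[str], items: Sequence[str], relation: Relation) -> float:
    """Assumed form: uniform over linear extensions, zero elsewhere."""
    extensions = linear_extensions(items, relation)
    return 1.0 / len(extensions) if tuple(ranking) in extensions else 0.0


items = ["a", "b", "c"]
relation = {("a", "b")}  # only constraint: a precedes b
extensions = linear_extensions(items, relation)
```

With the single constraint that `a` precedes `b`, three of the six permutations are linear extensions, so a compatible ranking receives likelihood one third and an incompatible one receives zero.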

Inference is performed by MCMC, using Gibbs steps for root utilities, truncated-Gaussian or MH updates for leaf utilities, MH updates for two further model parameters, and Gibbs/CRP updates for cluster labels under a Pitman–Yor or Dirichlet-process-style prior (Li et al., 23 Jun 2026). The paper states that experiments on synthetic and real-world data, "including pairwise acoustic preference data and LLM agent traces," show that HPO and HCPO "outperform existing approaches in both predictive performance and structural interpretability" (Li et al., 23 Jun 2026).

Because this ranking-model usage is methodologically independent from Heterogeneous Agent Collaborative Policy Optimization, the acronym HACPO should not be interpreted without domain context. In contemporary arXiv usage, HACPO most directly denotes the collaborative reinforcement-learning algorithm of (Zhang et al., 3 Mar 2026), while (Li et al., 23 Jun 2026) documents a separate naming collision arising from HCPO in hierarchical partial-order modeling.
