
HRPO: Efficient Multi-Hop Policy Optimization

Updated 13 January 2026
  • HRPO is an on-policy policy-gradient optimizer that clusters multi-hop question-answer pairs by hop count to compute low-variance group-level baselines.
  • It reduces computational effort by about 75% compared to GRPO by eliminating nested sampling and expensive tool rollouts.
  • HRPO demonstrates comparable or superior performance to GRPO and PPO, offering improved efficiency and stability in self-evolving, data-free LLM frameworks.

Hop-Grouped Relative Policy Optimization (HRPO) is an on-policy policy-gradient optimizer developed to address the computational inefficiencies of training “proposer” LLMs that autonomously generate complex, multi-hop search questions. HRPO is designed specifically for self-evolving agents in data-free environments, where multi-turn search-and-reasoning pipelines present significant computational cost due to expensive tool rollouts. By leveraging the natural “hop” structure inherent in generated reasoning chains—where a “hop” refers to an interleaved search→reasoning step—HRPO clusters questions by their hop count and computes low-variance group-level baselines. This approach eliminates the need for computationally prohibitive nested sampling, enabling substantial reductions in rollout costs while preserving or surpassing the performance and stability of prior policy-gradient optimizers such as Grouped Relative Policy Optimization (GRPO) and Proximal Policy Optimization (PPO) (Yue et al., 11 Jan 2026).

1. Motivation and Underlying Principles

HRPO was introduced to overcome the bottleneck of nested sampling required by standard group-based reinforcement learning methods in multi-hop search agents. In conventional algorithms like GRPO, estimating a low-variance reward baseline demands generating multiple candidate questions per prompt and executing several solver rollouts per candidate, producing a multiplicative blow-up in compute as the complexity of the reasoning pipeline grows. Given that each rollout through the search agent pipeline invokes a costly multi-turn interaction, the cumulative computational demands quickly become unsustainable, especially for large-scale or data-free self-evolution settings.

HRPO circumvents these limitations by grouping the generated QA pairs solely according to their hop count. Within each hop-count cluster, a group-level baseline is computed, enabling robust variance reduction without requiring multiple candidates per prompt. This strategy minimizes the sampling overhead for evaluating an individual query's difficulty and solvability and reduces rollouts by a factor of roughly $m$, the number of candidate questions GRPO requires per prompt (typically 4), yielding roughly 75% savings in compute.

2. Mathematical Formulation

Let $\theta$ denote the proposer's policy parameters, $H$ the finite set of hop counts (e.g., $H = \{1,2,3,4\}$), and $N$ the batch size of QA pairs $(x_i, y_i)$, each with an associated hop count $h_i \in H$ and scalar reward $r_i$ (derived from the solver's pass rate on the generated question). HRPO's objective leverages group-level baselines:

$$J(\theta) = \mathbb{E}_{(x_i, y_i)\sim \pi_\theta(\cdot\mid R),\; r_i = r(x_i, y_i)}\left[\frac{1}{N}\sum_{h\in H}\sum_{i:\,h_i=h}\log\pi_\theta(y_i\mid x_i, R)\, A_{i,h}\right] - \beta\, D_{\mathrm{KL}}\bigl[\pi_\theta\,\|\,\pi_{\theta_\text{old}}\bigr]$$

Here, $A_{i,h}$ is the group-normalized advantage within hop group $h$:

$$A_{i,h} = \frac{r_i - \mathbb{E}_{j:\,h_j=h}[r_j]}{\sqrt{\mathrm{Var}_{j:\,h_j=h}[r_j]} + \varepsilon}$$

where $\varepsilon$ is a small stabilizer constant (e.g., $10^{-8}$). KL regularization with weight $\beta$ enforces conservative policy updates. There is no explicit clustering loss; the only additional cost is the KL divergence term.
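
As a concrete illustration, the following minimal Python sketch computes the per-hop-group advantages $A_{i,h}$ exactly as defined above; the function name, NumPy dependency, and example values are assumptions for illustration, not code from the paper.

```python
# Minimal sketch (not the authors' code): hop-grouped advantage normalization.
import numpy as np

def hop_grouped_advantages(rewards, hops, eps=1e-8):
    """Standardize each reward against the mean/std of its hop-count group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    hops = np.asarray(hops)
    advantages = np.zeros_like(rewards)
    for h in np.unique(hops):
        idx = hops == h                      # group G_h = {i : h_i = h}
        mu = rewards[idx].mean()             # group-level baseline
        sigma = rewards[idx].std()           # sqrt of group variance
        advantages[idx] = (rewards[idx] - mu) / (sigma + eps)
    return advantages

# Example: six QA pairs with hop counts 1-3 and solver pass rates as rewards.
print(hop_grouped_advantages([0.9, 0.6, 0.5, 0.2, 0.4, 0.1], [1, 1, 2, 2, 3, 3]))
```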

3. HRPO Algorithmic Workflow

The HRPO training loop can be summarized as:

  1. Data Collection: For each $i$ in $1$ to $N$, sample prompt $x_i$, generate $y_i$ via a single rollout, compute reward $r_i$ using the solver's pass rate, and record hop count $h_i$.
  2. Grouping: Partition the batch into groups $G_h$ indexed by hop count $h$, where $G_h = \{i : h_i = h\}$.
  3. Baseline Computation: For each non-empty group $G_h$, compute the mean $\mu_h = \mathrm{mean}_{i \in G_h}(r_i)$ and variance $\sigma^2_h = \mathrm{Var}_{i \in G_h}(r_i)$.
  4. Advantage Calculation: For each sample, compute the standardized advantage $A_{i, h_i}$ as above.
  5. Policy Gradient Update: Maximize the objective $J(\theta)$ (in practice, by minimizing its negation) across all hop groups via a policy-gradient step, then update the reference policy $\theta_\text{old}$.

No value function (“critic”) is used, and there is no ratio-clipping as employed by PPO. Grouping by hop count is computationally negligible since hop counts are byproducts of question generation.
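
For concreteness, here is a hypothetical PyTorch sketch of one such update step. The tensor inputs (per-sample sequence log-probabilities under the current and rollout-time policies) and the sample-based KL estimator are assumptions layered on top of the workflow above, not the authors' implementation.

```python
import torch

def hrpo_step(logp_new, logp_old, rewards, hops, beta=1e-3, eps=1e-8):
    """One HRPO update: hop-grouped advantages plus a KL penalty; no critic, no clipping."""
    # Steps 2-4: partition the batch by hop count and standardize rewards within each group.
    adv = torch.zeros_like(rewards)
    for h in torch.unique(hops):
        idx = hops == h
        mu = rewards[idx].mean()
        sigma = rewards[idx].std(unbiased=False)
        adv[idx] = (rewards[idx] - mu) / (sigma + eps)

    # Step 5: REINFORCE-style surrogate with hop-grouped advantages.
    pg_term = (logp_new * adv.detach()).mean()

    # Sample-based estimate of KL(pi_theta || pi_old) ("k3" form often used in
    # GRPO-style code); this estimator choice is an assumption, not from the paper.
    log_ratio = logp_old.detach() - logp_new
    kl_term = (log_ratio.exp() - log_ratio - 1).mean()

    # Maximize J(theta) by minimizing its negation.
    return -(pg_term - beta * kl_term)

# Illustrative usage with dummy tensors (batch of four QA pairs).
logp_new = torch.randn(4, requires_grad=True)
loss = hrpo_step(logp_new, logp_new.detach() + 0.1,
                 rewards=torch.tensor([1.0, 0.5, 0.0, 0.75]),
                 hops=torch.tensor([1, 1, 2, 2]))
loss.backward()
```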

4. Comparison to GRPO and PPO

A summary of efficiency and structural distinctions among HRPO and comparable methods:

| Optimizer | Sampling Cost per Prompt | Variance Reduction | Critic Network | Ratio-Clipping |
|---|---|---|---|---|
| GRPO | $O(m \cdot n)$ | Group-level (per-prompt) | None | Optional |
| PPO | $O(n)$ | Global baseline | Used | Yes |
| HRPO | $O(n)$ | Group-level (per-hop) | None | No |
  • Sampling Efficiency: HRPO reduces rollout cost to roughly $1/m$ of GRPO's, where $m$ is the number of candidate questions GRPO generates per prompt. With $m=4$, this amounts to about a $75\%$ reduction.
  • Complexity: HRPO requires $O(N \cdot n)$ solver calls per batch, while GRPO requires $O(N \cdot m \cdot n)$, where $n$ is the number of solver rollouts per generated question (see the cost sketch after this list).
  • Variance and Stability: HRPO avoids high variance associated with global REINFORCE and obviates the need for value function fitting required in actor–critic variants, contributing to robust convergence even in the absence of PPO-style ratio clipping.
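
As a back-of-the-envelope check of these cost claims, the snippet below tallies solver calls under both schemes; the specific values of $N$, $m$, and $n$ are illustrative, not reported settings.

```python
# Solver-call budget per batch: N = batch size, m = candidate questions per
# prompt (GRPO only), n = solver rollouts per question. Values are illustrative.
N, m, n = 128, 4, 2

grpo_calls = N * m * n   # nested sampling: every candidate gets its own solver rollouts
hrpo_calls = N * n       # one question per prompt; grouping by hop count is free

print(f"GRPO: {grpo_calls} solver calls, HRPO: {hrpo_calls} solver calls")
print(f"savings: {1 - hrpo_calls / grpo_calls:.0%}")  # 75% when m = 4
```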

5. Empirical Outcomes

Empirical results with a $3$B-parameter backbone show that HRPO matches or exceeds the performance of GRPO:

  • Average exact match (EM) score across seven QA benchmarks: HRPO $0.326$ vs. GRPO $0.320$.
  • Rollout cost ratio: HRPO/GRPO $\approx 1/4$.
  • HRPO offers superior performance on single-hop benchmarks (e.g., NQ: $0.397$ vs. $0.361$).
  • GRPO retains a slight advantage on four-hop (most complex) benchmarks, likely due to the additional samples per prompt used in GRPO.
  • Overall, HRPO reduces expensive tool-based rollouts by $\approx 75\%$ without compromising accuracy or stability.

6. Limitations, Hyperparameters, and Practical Considerations

  • Granularity of Grouping: Grouping strictly by hop count provides only a coarse partition; within-group difficulty variance remains.
  • Variance in Small Groups: Hop groups with few samples can result in high-variance baseline estimates, especially for rare multi-hop cases.
  • Hyperparameters: Key settings include $\beta$ (KL weight, usually $10^{-3}$), $\epsilon$ (advantage stabilizer, typically $0.2$), and learning rates ($5 \times 10^{-7}$ to $1 \times 10^{-6}$ for the proposer, $1 \times 10^{-6}$ for the solver). Default hop-group proportions in synthetic curricula are $4:3:2:1$ for hops $1$ through $4$ (see the configuration sketch after this list).
  • Compute Requirements: HRPO’s cost savings are particularly significant in settings where tool-invocation rollouts are much costlier than policy updates.
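
A hypothetical configuration object collecting these settings might look as follows; the class and field names are invented for illustration, and only the default values mirror the figures listed above.

```python
from dataclasses import dataclass

@dataclass
class HRPOConfig:
    """Illustrative container for the hyperparameters listed in this section."""
    kl_beta: float = 1e-3            # KL penalty weight (beta)
    adv_eps: float = 0.2             # advantage stabilizer (epsilon), value as listed above
    proposer_lr: float = 5e-7        # proposer learning rate (reported range 5e-7 to 1e-6)
    solver_lr: float = 1e-6          # solver learning rate
    hop_mix: tuple = (4, 3, 2, 1)    # synthetic-curriculum proportions for hops 1..4

print(HRPOConfig())
```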

7. Extensions and Potential Applications

HRPO’s design admits potential generalizations:

  • Finer-grained structural clustering could be achieved by applying graph embeddings or edit distances on entire reasoning chains, rather than only hop count equality.
  • Adaptive grouping, where clusters are periodically updated using a learned similarity metric, may further refine variance reduction.
  • A hybrid actor–critic architecture could be introduced, with per-hop group critics for further variance minimization.
  • HRPO is applicable to other multi-turn tool-use domains, including code generation and dialogue pipelines, whenever a natural notion of stepwise decision complexity (“hop structure”) exists.

In summary, Hop-Grouped Relative Policy Optimization enables scalable, low-variance training of autonomous question-proposing agents, offering a substantial reduction in compute without sacrificing stability or accuracy, and is particularly well suited for data-free, self-evolving LLM agent frameworks (Yue et al., 11 Jan 2026).

References (1)
