HRPO: Efficient Multi-Hop Policy Optimization
- HRPO is an on-policy policy-gradient optimizer that clusters multi-hop question-answer pairs by hop count to compute low-variance group-level baselines.
- It reduces computational effort by roughly 75% relative to GRPO by eliminating the nested sampling that drives expensive tool rollouts.
- HRPO demonstrates comparable or superior performance to GRPO and PPO, offering improved efficiency and stability in self-evolving, data-free LLM frameworks.
Hop-Grouped Relative Policy Optimization (HRPO) is an on-policy policy-gradient optimizer developed to address the computational inefficiencies of training “proposer” LLMs that autonomously generate complex, multi-hop search questions. HRPO is designed specifically for self-evolving agents in data-free environments, where multi-turn search-and-reasoning pipelines present significant computational cost due to expensive tool rollouts. By leveraging the natural “hop” structure inherent in generated reasoning chains—where a “hop” refers to an interleaved search→reasoning step—HRPO clusters questions by their hop count and computes low-variance group-level baselines. This approach eliminates the need for computationally prohibitive nested sampling, enabling substantial reductions in rollout costs while preserving or surpassing the performance and stability of prior policy-gradient optimizers such as Grouped Relative Policy Optimization (GRPO) and Proximal Policy Optimization (PPO) (Yue et al., 11 Jan 2026).
1. Motivation and Underlying Principles
HRPO was introduced to overcome the bottleneck of nested sampling required by standard group-based reinforcement learning methods in multi-hop search agents. In conventional algorithms like GRPO, estimating a low-variance reward baseline demands generating multiple candidate questions per prompt and executing several solver rollouts per candidate, producing a combinatorial explosion in compute as the complexity of the reasoning pipeline grows. Given that each rollout through the search agent pipeline invokes a costly multi-turn interaction, the cumulative computational demands quickly become unsustainable, especially for large-scale or data-free self-evolution settings.
HRPO circumvents these limitations by grouping the generated QA pairs solely according to their hop count. Within each hop-count cluster, a group-level baseline is computed, enabling robust variance reduction without requiring multiple candidates per prompt. This strategy minimizes the sampling overhead for evaluating an individual query’s difficulty and solvability, and it reduces rollouts by a factor of approximately $m$, the number of candidate questions GRPO requires per prompt (typically $4$), yielding roughly 75% compute savings.
2. Mathematical Formulation
Let $\theta$ denote the proposer’s policy parameters, $\mathcal{H}$ be the finite set of hop counts (e.g., $\mathcal{H} = \{1, 2, 3, 4\}$), and $B$ denote the batch size of QA pairs $(p_i, q_i)$, each with an associated hop count $h_i \in \mathcal{H}$ and scalar reward $r_i$ (derived via the solver’s pass rate). HRPO’s objective leverages group-level baselines:

$$\mathcal{L}(\theta) \;=\; -\,\frac{1}{B}\sum_{i=1}^{B} \hat{A}_i \,\log \pi_\theta(q_i \mid p_i) \;+\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)$$

Here, $\hat{A}_i$ is the group-normalized advantage within hop group $h_i$:

$$\hat{A}_i \;=\; \frac{r_i - \mu_{h_i}}{\sigma_{h_i} + \epsilon}, \qquad \mu_h = \frac{1}{|G_h|}\sum_{j \in G_h} r_j, \quad \sigma_h = \Big(\frac{1}{|G_h|}\sum_{j \in G_h} (r_j - \mu_h)^2\Big)^{1/2},$$

where $G_h = \{\, i : h_i = h \,\}$ and $\epsilon$ is a small stabilizer constant (e.g., $\epsilon = 0.2$). KL regularization with weight $\beta$ enforces conservative policy updates. There is no explicit clustering loss; the only additional cost is the KL divergence term.
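A minimal PyTorch sketch of this computation, assuming rewards and hop counts arrive as flat tensors and per-sequence log-probabilities are already summed over tokens; the function names and the $\beta = 0.01$ default are illustrative rather than values from the paper, while $\epsilon = 0.2$ follows the setting reported below:

```python
import torch

def hop_grouped_advantages(rewards: torch.Tensor,
                           hop_counts: torch.Tensor,
                           eps: float = 0.2) -> torch.Tensor:
    """Standardize each reward against the mean/std of its own hop-count group G_h."""
    advantages = torch.zeros_like(rewards)
    for h in hop_counts.unique():
        mask = hop_counts == h                      # members of group G_h
        group = rewards[mask]
        mu = group.mean()
        sigma = group.std(unbiased=False)
        advantages[mask] = (group - mu) / (sigma + eps)
    return advantages

def hrpo_loss(logprobs: torch.Tensor,      # log pi_theta(q_i | p_i), summed over tokens
              advantages: torch.Tensor,
              kl_to_ref: torch.Tensor,     # per-sample KL(pi_theta || pi_ref) estimate
              beta: float = 0.01) -> torch.Tensor:  # beta value is illustrative
    """REINFORCE-style loss with hop-group baselines and a KL penalty (no critic, no clipping)."""
    pg_term = -(advantages.detach() * logprobs).mean()
    return pg_term + beta * kl_to_ref.mean()
```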
3. HRPO Algorithmic Workflow
The HRPO training loop can be summarized as:
- Data Collection: For each $i$ in $1$ to $B$, sample a prompt $p_i$, generate $q_i$ via a single rollout, compute the reward $r_i$ from the solver’s pass rate, and record the hop count $h_i$.
- Grouping: Partition the batch into groups indexed by hop count, $G_h = \{\, i : h_i = h \,\}$ for each $h \in \mathcal{H}$.
- Baseline Computation: For each non-empty group $G_h$, compute the mean $\mu_h$ and standard deviation $\sigma_h$ of its rewards.
- Advantage Calculation: For each sample, compute the standardized advantage $\hat{A}_i$ as above.
- Policy Gradient Update: Minimize the objective $\mathcal{L}(\theta)$ across all hop groups, apply the policy-gradient step, and update the reference policy $\pi_{\mathrm{ref}}$.
No value function (“critic”) is used, and there is no ratio-clipping as employed by PPO. Grouping by hop count is computationally negligible since hop counts are byproducts of question generation.
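The loop can be sketched as follows, reusing the helpers from the previous block; `sample_prompt`, `propose_question`, `solver_pass_rate`, `count_hops`, and `kl_to_reference` are hypothetical stand-ins for the proposer rollout, solver evaluation, hop extraction, and KL estimation, and the optimizer handling is illustrative:

```python
import torch

def hrpo_step(proposer, ref_proposer, optimizer,
              batch_size: int, eps: float = 0.2, beta: float = 0.01):
    """One HRPO update: a single rollout per prompt, baselines computed per hop group."""
    records = []
    for _ in range(batch_size):
        prompt = sample_prompt()                                  # hypothetical data-free prompt source
        question, logprob = propose_question(proposer, prompt)    # single rollout; no m-way candidate sampling
        reward = solver_pass_rate(question)                       # solver pass rate as scalar reward r_i
        records.append((logprob, reward, count_hops(question)))   # hop count is a byproduct of generation

    logprobs = torch.stack([lp for lp, _, _ in records])
    rewards = torch.tensor([r for _, r, _ in records])
    hop_counts = torch.tensor([h for _, _, h in records])

    advantages = hop_grouped_advantages(rewards, hop_counts, eps)  # grouping + per-hop baseline
    kl = kl_to_reference(proposer, ref_proposer)                   # hypothetical per-sample KL estimate
    loss = hrpo_loss(logprobs, advantages, kl, beta)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # pi_ref can be refreshed from the updated proposer on a fixed schedule.
```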
4. Comparison to GRPO and PPO
A summary of efficiency and structural distinctions among HRPO and comparable methods:
| Optimizer | Sampling Cost per Prompt | Variance Reduction | Critic Network | Ratio-Clipping |
|---|---|---|---|---|
| GRPO | $m$ candidate questions (typically $4$) | Group-level (per-prompt) | None | Optional |
| PPO | Single rollout | Global baseline | Used | Yes |
| HRPO | Single rollout | Group-level (per-hop) | None | No |
- Sampling Efficiency: HRPO reduces rollout cost to roughly $1/m$ of GRPO’s, where $m$ is the number of candidate questions per prompt. With $m = 4$, HRPO achieves about a $75\%$ reduction (see the worked calculation after this list).
- Complexity: HRPO requires $O(B)$ solver calls per batch, while GRPO requires $O(mB)$.
- Variance and Stability: HRPO avoids high variance associated with global REINFORCE and obviates the need for value function fitting required in actor–critic variants, contributing to robust convergence even in the absence of PPO-style ratio clipping.
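As a quick sanity check on the savings claim, with one rollout per prompt under HRPO versus $m$ candidates per prompt under GRPO, the batch size $B$ cancels:

$$\frac{\text{HRPO rollouts}}{\text{GRPO rollouts}} \;=\; \frac{B}{m\,B} \;=\; \frac{1}{m} \;\stackrel{m=4}{=}\; 0.25,$$

i.e., roughly $75\%$ of proposer-side tool rollouts are avoided.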
5. Empirical Outcomes
Empirical results for a $3$B parameter backbone demonstrate that HRPO achieves or exceeds the performance of GRPO:
- Average exact match (EM) score across seven QA benchmarks: HRPO $0.326$ vs. GRPO $0.320$.
- Rollout cost ratio: HRPO/GRPO $\approx 0.25$.
- HRPO offers superior performance on single-hop benchmarks (e.g., NQ: $0.397$ vs. $0.361$).
- GRPO retains a slight advantage on four-hop (most complex) benchmarks, likely due to the additional samples per prompt used in GRPO.
- Overall, HRPO reduces expensive tool-based rollouts by roughly $75\%$ without compromising accuracy or stability.
6. Limitations, Hyperparameters, and Practical Considerations
- Granularity of Grouping: Grouping strictly by hop count provides only a coarse partition; within-group difficulty variance remains.
- Variance in Small Groups: Hop groups with few samples can result in high-variance baseline estimates, especially for rare multi-hop cases.
- Hyperparameters: Key settings include the KL weight $\beta$, the advantage stabilizer $\epsilon$ (typically $0.2$), and separate learning rates for the proposer and solver. Default hop group proportions in synthetic curricula are $4:3:2:1$ for hops $1$ through $4$.
- Compute Requirements: HRPO’s cost savings are particularly significant in settings where tool-invocation rollouts are much costlier than policy updates.
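As an illustration of the curriculum setting above, a minimal sampler that draws target hop counts in the stated $4{:}3{:}2{:}1$ proportion (the function name and the use of `random.choices` are assumptions, not details from the paper):

```python
import random

# Default curriculum proportions from the paper: hops 1-4 sampled 4:3:2:1.
HOP_WEIGHTS = {1: 4, 2: 3, 3: 2, 4: 1}

def sample_target_hop(rng: random.Random) -> int:
    """Draw a target hop count for the next proposed question."""
    hops, weights = zip(*HOP_WEIGHTS.items())
    return rng.choices(list(hops), weights=list(weights), k=1)[0]

# Example usage:
rng = random.Random(0)
print(sample_target_hop(rng))
```

Because four-hop questions are drawn least often under this curriculum, their hop group is the smallest, which is exactly where the small-group variance concern noted above is most pronounced.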
7. Extensions and Potential Applications
HRPO’s design admits potential generalizations:
- Finer-grained structural clustering could be achieved by applying graph embeddings or edit distances on entire reasoning chains, rather than only hop count equality.
- Adaptive grouping, where clusters are periodically updated using a learned similarity metric, may further refine variance reduction.
- A hybrid actor–critic architecture could be introduced, with per-hop group critics for further variance minimization.
- HRPO is applicable to other multi-turn tool-use domains, including code generation and dialogue pipelines, whenever a natural notion of stepwise decision complexity (“hop structure”) exists.
In summary, Hop-Grouped Relative Policy Optimization enables scalable, low-variance training of autonomous question-proposing agents, offering a substantial reduction in compute without sacrificing stability or accuracy, and is particularly well suited for data-free, self-evolving LLM agent frameworks (Yue et al., 11 Jan 2026).