HRPO: Efficient Multi-Hop Policy Optimization
- HRPO is an on-policy policy-gradient optimizer that clusters multi-hop question-answer pairs by hop count to compute low-variance group-level baselines.
- It reduces computational effort by roughly 75% relative to GRPO by eliminating the nested sampling that drives expensive tool rollouts.
- HRPO demonstrates comparable or superior performance to GRPO and PPO, offering improved efficiency and stability in self-evolving, data-free LLM frameworks.
Hop-Grouped Relative Policy Optimization (HRPO) is an on-policy policy-gradient optimizer developed to address the computational inefficiencies of training “proposer” LLMs that autonomously generate complex, multi-hop search questions. HRPO is designed specifically for self-evolving agents in data-free environments, where multi-turn search-and-reasoning pipelines present significant computational cost due to expensive tool rollouts. By leveraging the natural “hop” structure inherent in generated reasoning chains—where a “hop” refers to an interleaved search→reasoning step—HRPO clusters questions by their hop count and computes low-variance group-level baselines. This approach eliminates the need for computationally prohibitive nested sampling, enabling substantial reductions in rollout costs while preserving or surpassing the performance and stability of prior policy-gradient optimizers such as Grouped Relative Policy Optimization (GRPO) and Proximal Policy Optimization (PPO) (Yue et al., 11 Jan 2026).
1. Motivation and Underlying Principles
HRPO was introduced to overcome the bottleneck of nested sampling required by standard group-based reinforcement learning methods in multi-hop search agents. In conventional algorithms like GRPO, estimating a low-variance reward baseline demands generating multiple candidate questions per prompt and executing several solver rollouts per candidate, producing a combinatorial explosion in compute as the complexity of the reasoning pipeline grows. Given that each rollout through the search agent pipeline invokes a costly multi-turn interaction, the cumulative computational demands quickly become unsustainable, especially for large-scale or data-free self-evolution settings.
HRPO circumvents these limitations by grouping the generated QA pairs solely according to their hop count. Within each hop-count cluster, a group-level baseline is computed, enabling robust variance reduction without requiring multiple candidates per prompt. This strategy minimizes the sampling overhead for evaluating an individual query’s difficulty and solvability, and it reduces rollouts by a factor of approximately $m$, the number of candidate questions GRPO requires per prompt (typically $4$), yielding roughly 75% compute savings.
2. Mathematical Formulation
Let $\theta$ denote the proposer’s policy parameters, $\mathcal{H}$ be the finite set of hop counts (e.g., $\mathcal{H} = \{1, 2, 3, 4\}$), and $B$ denote the batch size of QA pairs $(p_i, q_i)$, each with an associated hop count $h_i \in \mathcal{H}$ and scalar reward $r_i$ (derived via the solver’s pass rate). HRPO’s objective leverages group-level baselines:

$$\mathcal{L}(\theta) \;=\; -\,\frac{1}{B}\sum_{i=1}^{B} \hat{A}_i \,\log \pi_\theta(q_i \mid p_i) \;+\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)$$

Here, $\hat{A}_i$ is the group-normalized advantage within hop group $h_i$:

$$\hat{A}_i \;=\; \frac{r_i - \mu_{h_i}}{\sigma_{h_i} + \epsilon}, \qquad \mu_h = \frac{1}{|G_h|}\sum_{j \in G_h} r_j, \quad \sigma_h = \Big(\frac{1}{|G_h|}\sum_{j \in G_h} (r_j - \mu_h)^2\Big)^{1/2},$$

where $G_h = \{\, i : h_i = h \,\}$ and $\epsilon$ is a small stabilizer constant (e.g., $\epsilon = 0.2$). KL regularization with weight $\beta$ enforces conservative policy updates. There is no explicit clustering loss; the only additional cost is the KL divergence term.
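A minimal PyTorch sketch of this computation, assuming rewards and hop counts arrive as flat tensors and per-sequence log-probabilities are already summed over tokens; the function names and the $\beta = 0.01$ default are illustrative rather than values from the paper, while $\epsilon = 0.2$ follows the setting reported below:

```python
import torch

def hop_grouped_advantages(rewards: torch.Tensor,
                           hop_counts: torch.Tensor,
                           eps: float = 0.2) -> torch.Tensor:
    """Standardize each reward against the mean/std of its own hop-count group G_h."""
    advantages = torch.zeros_like(rewards)
    for h in hop_counts.unique():
        mask = hop_counts == h                      # members of group G_h
        group = rewards[mask]
        mu = group.mean()
        sigma = group.std(unbiased=False)
        advantages[mask] = (group - mu) / (sigma + eps)
    return advantages

def hrpo_loss(logprobs: torch.Tensor,      # log pi_theta(q_i | p_i), summed over tokens
              advantages: torch.Tensor,
              kl_to_ref: torch.Tensor,     # per-sample KL(pi_theta || pi_ref) estimate
              beta: float = 0.01) -> torch.Tensor:  # beta value is illustrative
    """REINFORCE-style loss with hop-group baselines and a KL penalty (no critic, no clipping)."""
    pg_term = -(advantages.detach() * logprobs).mean()
    return pg_term + beta * kl_to_ref.mean()
```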
3. HRPO Algorithmic Workflow
The HRPO training loop can be summarized as:
- Data Collection: For each $i$ in $1$ to $B$, sample a prompt $p_i$, generate $q_i$ via a single rollout, compute the reward $r_i$ from the solver’s pass rate, and record the hop count $h_i$.
- Grouping: Partition the batch into groups indexed by hop count, $G_h = \{\, i : h_i = h \,\}$ for each $h \in \mathcal{H}$.
- Baseline Computation: For each non-empty group $G_h$, compute the mean $\mu_h$ and standard deviation $\sigma_h$ of its rewards.
- Advantage Calculation: For each sample, compute the standardized advantage $\hat{A}_i$ as above.
- Policy Gradient Update: Minimize the objective $\mathcal{L}(\theta)$ across all hop groups, apply the policy-gradient step, and update the reference policy $\pi_{\mathrm{ref}}$.
No value function (“critic”) is used, and there is no ratio-clipping as employed by PPO. Grouping by hop count is computationally negligible since hop counts are byproducts of question generation.
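The loop can be sketched as follows, reusing the helpers from the previous block; `sample_prompt`, `propose_question`, `solver_pass_rate`, `count_hops`, and `kl_to_reference` are hypothetical stand-ins for the proposer rollout, solver evaluation, hop extraction, and KL estimation, and the optimizer handling is illustrative:

```python
import torch

def hrpo_step(proposer, ref_proposer, optimizer,
              batch_size: int, eps: float = 0.2, beta: float = 0.01):
    """One HRPO update: a single rollout per prompt, baselines computed per hop group."""
    records = []
    for _ in range(batch_size):
        prompt = sample_prompt()                                  # hypothetical data-free prompt source
        question, logprob = propose_question(proposer, prompt)    # single rollout; no m-way candidate sampling
        reward = solver_pass_rate(question)                       # solver pass rate as scalar reward r_i
        records.append((logprob, reward, count_hops(question)))   # hop count is a byproduct of generation

    logprobs = torch.stack([lp for lp, _, _ in records])
    rewards = torch.tensor([r for _, r, _ in records])
    hop_counts = torch.tensor([h for _, _, h in records])

    advantages = hop_grouped_advantages(rewards, hop_counts, eps)  # grouping + per-hop baseline
    kl = kl_to_reference(proposer, ref_proposer)                   # hypothetical per-sample KL estimate
    loss = hrpo_loss(logprobs, advantages, kl, beta)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # pi_ref can be refreshed from the updated proposer on a fixed schedule.
```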
4. Comparison to GRPO and PPO
A summary of efficiency and structural distinctions among HRPO and comparable methods:
| Optimizer | Sampling Cost per Prompt | Variance Reduction | Critic Network | Ratio-Clipping |
|---|---|---|---|---|
| GRPO | $m$ candidate questions (typically $4$) | Group-level (per-prompt) | None | Optional |
| PPO | Single rollout | Global baseline | Used | Yes |
| HRPO | Single rollout | Group-level (per-hop) | None | No |
- Sampling Efficiency: HRPO reduces rollout cost to roughly $1/m$ of GRPO’s, where $m$ is the number of candidate questions per prompt. With $m = 4$, HRPO achieves about a $75\%$ reduction (see the worked calculation after this list).
- Complexity: HRPO requires $O(B)$ solver calls per batch, while GRPO requires $O(mB)$.
- Variance and Stability: HRPO avoids high variance associated with global REINFORCE and obviates the need for value function fitting required in actor–critic variants, contributing to robust convergence even in the absence of PPO-style ratio clipping.
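As a quick sanity check on the savings claim, with one rollout per prompt under HRPO versus $m$ candidates per prompt under GRPO, the batch size $B$ cancels:

$$\frac{\text{HRPO rollouts}}{\text{GRPO rollouts}} \;=\; \frac{B}{m\,B} \;=\; \frac{1}{m} \;\stackrel{m=4}{=}\; 0.25,$$

i.e., roughly $75\%$ of proposer-side tool rollouts are avoided.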
5. Empirical Outcomes
Empirical results for a $3$B parameter backbone demonstrate that HRPO achieves or exceeds the performance of GRPO:
- Average exact match (EM) score across seven QA benchmarks: HRPO $0.326$ vs. GRPO $0.320$.
- Rollout cost ratio: HRPO/GRPO $\approx 0.25$.
- HRPO offers superior performance on single-hop benchmarks (e.g., NQ: $0.397$ vs. $0.361$).
- GRPO retains a slight advantage on four-hop (most complex) benchmarks, likely due to the additional samples per prompt used in GRPO.
- Overall, HRPO reduces expensive tool-based rollouts by roughly $75\%$ without compromising accuracy or stability.
6. Limitations, Hyperparameters, and Practical Considerations
- Granularity of Grouping: Grouping strictly by hop count provides only a coarse partition; within-group difficulty variance remains.
- Variance in Small Groups: Hop groups with few samples can result in high-variance baseline estimates, especially for rare multi-hop cases.
- Hyperparameters: Key settings include the KL weight $\beta$, the advantage stabilizer $\epsilon$ (typically $0.2$), and separate learning rates for the proposer and solver. Default hop group proportions in synthetic curricula are $4:3:2:1$ for hops $1$ through $4$.
- Compute Requirements: HRPO’s cost savings are particularly significant in settings where tool-invocation rollouts are much costlier than policy updates.
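As an illustration of the curriculum setting above, a minimal sampler that draws target hop counts in the stated $4{:}3{:}2{:}1$ proportion (the function name and the use of `random.choices` are assumptions, not details from the paper):

```python
import random

# Default curriculum proportions from the paper: hops 1-4 sampled 4:3:2:1.
HOP_WEIGHTS = {1: 4, 2: 3, 3: 2, 4: 1}

def sample_target_hop(rng: random.Random) -> int:
    """Draw a target hop count for the next proposed question."""
    hops, weights = zip(*HOP_WEIGHTS.items())
    return rng.choices(list(hops), weights=list(weights), k=1)[0]

# Example usage:
rng = random.Random(0)
print(sample_target_hop(rng))
```

Because four-hop questions are drawn least often under this curriculum, their hop group is the smallest, which is exactly where the small-group variance concern noted above is most pronounced.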
7. Extensions and Potential Applications
HRPO’s design admits potential generalizations:
- Finer-grained structural clustering could be achieved by applying graph embeddings or edit distances on entire reasoning chains, rather than only hop count equality.
- Adaptive grouping, where clusters are periodically updated using a learned similarity metric, may further refine variance reduction.
- A hybrid actor–critic architecture could be introduced, with per-hop group critics for further variance minimization.
- HRPO is applicable to other multi-turn tool-use domains, including code generation and dialogue pipelines, whenever a natural notion of stepwise decision complexity (“hop structure”) exists.
In summary, Hop-Grouped Relative Policy Optimization enables scalable, low-variance training of autonomous question-proposing agents, offering a substantial reduction in compute without sacrificing stability or accuracy, and is particularly well suited for data-free, self-evolving LLM agent frameworks (Yue et al., 11 Jan 2026).