
R-HORIZON: Multi-Step Reasoning Benchmark

Updated 13 October 2025
  • R-HORIZON is a benchmark paradigm that rigorously tests large reasoning models by composing interdependent queries to simulate extended multi-step problem solving.
  • It systematically increases reasoning depth by linking seed problems, revealing a sharp accuracy drop from 87.3% to 24.6% as chain length grows.
  • Training with reinforcement learning on R-HORIZON tasks boosts multi-step reasoning performance by +17.4% and improves single-step accuracy by +7.5%.

R-HORIZON, in the context of large reasoning models (LRMs), is a methodology and benchmark paradigm designed to evaluate and improve the breadth and depth of multi-step, long-horizon reasoning in advanced LLMs. It departs from conventional single-horizon evaluation by employing query composition to link independent tasks into complex, interdependent multi-stage reasoning chains, thereby simulating the kinds of long-horizon cognitive demands encountered in real-world problem solving scenarios (Lu et al., 9 Oct 2025).

1. Motivation and Definition

R-HORIZON targets a core limitation in the evaluation and training of modern LRMs—namely, the inability of standard benchmarks and datasets to rigorously assess multi-step, compositional, or “long-horizon” reasoning. Most existing tests sample isolated, immediate reasoning problems (single-horizon), failing to capture the challenges imposed by complex inference chains characterized by deep interdependence between subtasks. R-HORIZON defines a “reasoning horizon” as the chained depth (number of linked steps or problems) that a model must traverse while maintaining accuracy and context integrity. The R-HORIZON method systematically generates and curates benchmarks in which each reasoning step depends nontrivially on the previous, thereby exposing the limits of current model architectures along both the breadth and depth axes.

2. Benchmark Construction via Query Composition

R-HORIZON operationalizes its goal through a “query composition” process designed to yield sequences of interdependent problems from existing “seed” benchmarks. For example, in mathematical tasks, relevant integer or key variables are extracted from a base problem and injected into subsequent tasks, so that the answer to problem i explicitly constrains the solvability of problem i+1. The process generalizes to other domains (such as code synthesis or agentic workflow execution) by enforcing variable binding or state propagation across problem boundaries. The resulting composed problem set—the R-HORIZON benchmark—systematically increases the reasoning horizon, allowing precise control over chain length and dependency complexity. Formally, the dependency between consecutive queries is captured via a function f_i(x), e.g.,

f_i(x) = x + (m_{i+1} - a_i)

where a_i is the answer to the i-th subproblem and m_{i+1} is a placeholder constant in the (i+1)-th subproblem. The benchmark thus captures not merely a test of recall or local reasoning but a “path-dependent” inference scenario.
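To make the composition step concrete, the following is a minimal, hypothetical Python sketch of this variable-binding procedure. The data layout (question/m/answer fields) and the helper name compose_queries are illustrative assumptions, not the paper's released tooling; the sketch only demonstrates the f_i(x) = x + (m_{i+1} - a_i) rewriting described above.

```python
def compose_queries(seed_problems):
    """Sketch of R-HORIZON-style query composition for math seed problems.

    Each seed problem states a constant m_{i+1} in its question (written as
    the {m} placeholder) and has a known ground-truth answer a_i. Composition
    replaces the stated constant with f_i(x) = x + (m_{i+1} - a_i), where x is
    the previous problem's answer, so problem i+1 is only solvable once
    problem i has been solved correctly.
    """
    # The first problem is left unchanged; its constant is stated directly.
    composed = [seed_problems[0]["question"].format(m=seed_problems[0]["m"])]
    for prev, cur in zip(seed_problems, seed_problems[1:]):
        offset = cur["m"] - prev["answer"]  # m_{i+1} - a_i
        composed.append(cur["question"].format(
            m=f"(the previous answer plus {offset})"
        ))
    return "\n\n".join(f"Problem {k + 1}: {q}" for k, q in enumerate(composed))


# Example: two seed problems chained into a horizon-2 query.
seeds = [
    {"question": "What is 7 * {m}?", "m": 6, "answer": 42},
    {"question": "What is {m} + 10?", "m": 5, "answer": 15},
]
print(compose_queries(seeds))
# Problem 2 recovers its original constant (5) only if Problem 1 was solved
# correctly: 42 + (5 - 42) = 5.
```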

3. Empirical Characterization of LRM Reasoning Limits

Extensive evaluation of 25 mainstream LRMs using the R-HORIZON benchmark reveals a steep decline in model performance as the reasoning horizon increases. While models achieve near-ceiling performance on single-step (horizon-1) queries (e.g., 87.3%), accuracy collapses to far lower values (e.g., 24.6%) when solving chains of five interdependent problems, a hallmark of limited effective reasoning length. Further analysis uncovers qualitative defects:

  • Poor allocation of “thinking budget”: models devote disproportionate resources to early steps, failing to maintain context for later stages;
  • Localized reflection: the models rarely initiate global error correction or self-reflective strategies spanning multiple tasks;
  • Horizon-specific bottlenecks: performance degradation can be traced to compounding error at intermediate steps, confirming the absence of robust “memory” or planning capacity beyond a short context window.

These findings constitute strong evidence that current LRMs are not reliably compositional in their reasoning and cannot maintain consistent performance or calibration over extended multi-step inference chains.

4. Training Enhancements: Reinforcement Learning with Verified Rewards

To move beyond diagnosis, R-HORIZON is leveraged as a data generation paradigm for training. Specifically, the composed multi-horizon tasks are used as input for reinforcement learning with verified rewards (RLVR), where the model receives feedback either at the final step only (R_{\text{last}}) or at every intermediate stage (R_{\text{all}}). Optimization is performed via a GRPO-style policy objective:

\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\{o_i\}} \left[ \frac{1}{\sum_i |o_i|} \sum_{i,t} \left\{ \min\left( r_{i,t} \hat{A}_{i,t},\ \operatorname{clip}(r_{i,t}, 1-\epsilon, 1+\epsilon)\, \hat{A}_{i,t} \right) - \beta\, D_{\text{KL}}\left[ \pi_{\theta} \,\|\, \pi_{\text{ref}} \right] \right\} \right]

where r_{i,t} is the policy ratio and \hat{A}_{i,t} the advantage estimate. Training on R-HORIZON-composed data confers substantial improvements: a gain of +17.4 on challenging composed benchmarks and, notably, a +7.5 gain on standard (single-horizon) benchmarks such as AIME2024, demonstrating enhanced general reasoning capability.
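For orientation, here is a minimal PyTorch sketch of such a GRPO-style objective. It assumes sequence-level verified rewards have already been computed (under either the R_last or R_all scheme); the function names, tensor layout, and the k3 KL estimator are implementation assumptions, not details specified by the paper.

```python
import torch

def group_advantages(rewards, eps=1e-8):
    # GRPO-style advantage: normalize verified rewards within a group of
    # rollouts sampled for the same composed query. `rewards` has shape
    # [num_rollouts]; the result is broadcast over tokens below.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logprobs, old_logprobs, ref_logprobs, advantages, mask,
              clip_eps=0.2, kl_beta=0.01):
    """Clipped surrogate objective with a KL penalty to the reference policy.

    logprobs, old_logprobs, ref_logprobs, mask: [num_rollouts, max_len]
    advantages: [num_rollouts, 1] (sequence-level, broadcast over tokens)
    Returns a scalar loss (negative objective) suitable for gradient descent.
    """
    ratio = torch.exp(logprobs - old_logprobs)            # r_{i,t}
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.minimum(unclipped, clipped)
    # k3 estimator of D_KL[pi_theta || pi_ref], a common practical choice.
    log_r = ref_logprobs - logprobs
    kl = torch.exp(log_r) - log_r - 1.0
    # Average over all valid response tokens: the 1 / sum_i |o_i| factor.
    objective = ((surrogate - kl_beta * kl) * mask).sum() / mask.sum()
    return -objective

# Usage sketch: adv = group_advantages(rewards).unsqueeze(-1), then
# loss = grpo_loss(logp, old_logp, ref_logp, adv, mask); loss.backward().
```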

5. Evaluation Metrics and Theoretical Models

R-HORIZON adopts an “all-or-nothing” evaluation metric: a composed chain is scored as correct only if every subproblem's answer exactly matches the ground truth. Theoretical baseline accuracy is computed assuming independent step success as

\text{Acc}_{\text{expected}}(\mathcal{Q}) = \prod_{i=1}^{n} p_i

for a chain of n subproblems with atomic pass rates p_i. Empirically, actual performance falls below this bound as the horizon increases, reflecting cumulative error propagation and a lack of robust cross-problem context management.
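As a quick sanity check on these quantities, the short sketch below computes the independence baseline and the all-or-nothing chain metric (the helper names are illustrative, not from the paper). Using the figures reported above, a model with 87.3% single-step accuracy would be expected under independence to solve a five-problem chain roughly 50% of the time, so the observed 24.6% falls well below that bound.

```python
from math import prod

def expected_chain_accuracy(step_pass_rates):
    # Independence baseline: Acc_expected(Q) = product of atomic pass rates p_i.
    return prod(step_pass_rates)

def observed_chain_accuracy(chain_results):
    # All-or-nothing metric: a chain counts as correct only if every
    # subproblem answer matches the ground truth.
    correct = sum(all(step_ok for step_ok in chain) for chain in chain_results)
    return correct / len(chain_results)

print(expected_chain_accuracy([0.873] * 5))  # ~0.507, vs. 0.246 observed
```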

6. Scalability, Control, and Cost-Efficiency

A central strength of R-HORIZON is its scalability and controllability. Because benchmark generation is accomplished by composition from existing single-horizon datasets, the approach is highly economical; one can vary the chain length or composition pattern to modulate difficulty while avoiding the overhead of domain-specific handcrafting or synthetic dataset engineering. This makes R-HORIZON an attractive approach for stress-testing and augmenting LRMs in diverse, resource-constrained research and development settings.

7. Implications and Future Directions

The results obtained via R-HORIZON indicate clear research priorities:

  • Model architecture improvements targeting long-range dependency tracking, global planning, and dynamic resource allocation;
  • Systematic use of R-HORIZON-composed training data in reinforcement learning and curriculum self-improvement loops;
  • Development of richer metrics and diagnostic tools sensitive to context propagation, cross-step reflection, and catastrophic error accumulation.

A plausible implication is that, as multi-step chain-of-thought approaches become central to the next generation of LRMs, benchmarks and training schemes based on R-HORIZON-like methodology will become essential to both evaluation and practical deployment of truly compositional reasoning systems.


Summary Table: R-HORIZON Key Properties

| Property | Description | Significance |
|---|---|---|
| Query Composition | Multi-step tasks constructed from interdependent single-step problems | Enables controlled long-horizon benchmark design |
| Evaluation Metric | All-or-nothing chain accuracy | Assesses true compositional/holistic reasoning ability |
| Empirical Finding | LRM accuracy drops sharply as horizon length increases | Reveals limited effective reasoning length |
| RLVR Training Gain | RL on composed data boosts multi-horizon and standard task accuracy (+7.5) | Demonstrates data-driven performance improvement |
| Cost Profile | Low, due to use of preexisting single-step data for composition | Facilitates scalable and systematic research |

In conclusion, the R-HORIZON paradigm establishes both a rigorous stress test and a scalable improvement pathway for the long-horizon reasoning capabilities of large reasoning models, with demonstrated impact in both systematic evaluation and reinforcement learning-based enhancement (Lu et al., 9 Oct 2025).
