
R-HORIZON: Multi-Step Reasoning Benchmark

Updated 13 October 2025
  • R-HORIZON is a benchmark paradigm that rigorously tests large reasoning models by composing interdependent queries to simulate extended multi-step problem solving.
  • It systematically increases reasoning depth by linking seed problems, revealing a sharp accuracy drop from 87.3% to 24.6% as chain length grows.
  • Training with reinforcement learning on R-HORIZON tasks boosts multi-step reasoning performance by +17.4% and improves single-step accuracy by +7.5%.

R-HORIZON, in the context of large reasoning models (LRMs), is a methodology and benchmark paradigm designed to evaluate and improve the breadth and depth of multi-step, long-horizon reasoning in advanced LLMs. It departs from conventional single-horizon evaluation by employing query composition to link independent tasks into complex, interdependent multi-stage reasoning chains, thereby simulating the kinds of long-horizon cognitive demands encountered in real-world problem solving scenarios (Lu et al., 9 Oct 2025).

1. Motivation and Definition

R-HORIZON targets a core limitation in the evaluation and training of modern LRMs—namely, the inability of standard benchmarks and datasets to rigorously assess multi-step, compositional, or “long-horizon” reasoning. Most existing tests sample isolated, immediate reasoning problems (single-horizon), failing to capture the challenges imposed by complex inference chains characterized by deep interdependence between subtasks. R-HORIZON defines a “reasoning horizon” as the chained depth (number of linked steps or problems) that a model must traverse while maintaining accuracy and context integrity. The R-HORIZON method systematically generates and curates benchmarks in which each reasoning step depends nontrivially on the previous, thereby exposing the limits of current model architectures along both the breadth and depth axes.

2. Benchmark Construction via Query Composition

R-HORIZON operationalizes its goal through a “query composition” process designed to yield sequences of interdependent problems from existing “seed” benchmarks. For example, in mathematical tasks, relevant integer or key variables are extracted from a base problem and injected into subsequent tasks, so that the answer to problem i explicitly constrains the solvability of problem i+1. The process generalizes to other domains (such as code synthesis or agentic workflow execution) by enforcing variable binding or state propagation across problem boundaries. The resulting composed problem set—the R-HORIZON benchmark—systematically increases the reasoning horizon, allowing precise control over chain length and dependency complexity. Formally, the dependency between consecutive queries is captured via a function f_i(x), e.g.,

f_i(x) = x + (m_{i+1} - a_i)

where a_i is the answer to the i-th subproblem and m_{i+1} is a placeholder constant in the (i+1)-th subproblem. The benchmark thus captures not merely a test of recall or local reasoning but a “path-dependent” inference scenario.
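To make the composition step concrete, the following is a minimal, hypothetical Python sketch of this variable-binding procedure. The data layout (question/m/answer fields) and the helper name compose_queries are illustrative assumptions, not the paper's released tooling; the sketch only demonstrates the f_i(x) = x + (m_{i+1} - a_i) rewriting described above.

```python
def compose_queries(seed_problems):
    """Sketch of R-HORIZON-style query composition for math seed problems.

    Each seed problem states a constant m_{i+1} in its question (written as
    the {m} placeholder) and has a known ground-truth answer a_i. Composition
    replaces the stated constant with f_i(x) = x + (m_{i+1} - a_i), where x is
    the previous problem's answer, so problem i+1 is only solvable once
    problem i has been solved correctly.
    """
    # The first problem is left unchanged; its constant is stated directly.
    composed = [seed_problems[0]["question"].format(m=seed_problems[0]["m"])]
    for prev, cur in zip(seed_problems, seed_problems[1:]):
        offset = cur["m"] - prev["answer"]  # m_{i+1} - a_i
        composed.append(cur["question"].format(
            m=f"(the previous answer plus {offset})"
        ))
    return "\n\n".join(f"Problem {k + 1}: {q}" for k, q in enumerate(composed))


# Example: two seed problems chained into a horizon-2 query.
seeds = [
    {"question": "What is 7 * {m}?", "m": 6, "answer": 42},
    {"question": "What is {m} + 10?", "m": 5, "answer": 15},
]
print(compose_queries(seeds))
# Problem 2 recovers its original constant (5) only if Problem 1 was solved
# correctly: 42 + (5 - 42) = 5.
```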

3. Empirical Characterization of LRM Reasoning Limits

Extensive evaluation of 25 mainstream LRMs using the R-HORIZON benchmark reveals a steep decline in model performance as the reasoning horizon increases. While models achieve near-ceiling performance on single-step (horizon-1) queries (e.g., 87.3%), accuracy collapses to far lower values (e.g., 24.6%) when solving chains of five interdependent problems, a hallmark of limited effective reasoning length. Further analysis uncovers qualitative defects:

  • Poor allocation of “thinking budget”: models devote disproportionate resources to early steps, failing to maintain context for later stages;
  • Localized reflection: the models rarely initiate global error correction or self-reflective strategies spanning multiple tasks;
  • Horizon-specific bottlenecks: performance degradation can be traced to compounding error at intermediate steps, confirming the absence of robust “memory” or planning capacity beyond a short context window.

These findings constitute strong evidence that current LRMs are not reliably compositional in their reasoning and cannot maintain consistent performance or calibration over extended multi-step inference chains.

4. Training Enhancements: Reinforcement Learning with Verified Rewards

To move beyond diagnosis, R-HORIZON is leveraged as a data generation paradigm for training. Specifically, the composed multi-horizon tasks are used as input for reinforcement learning with verified rewards (RLVR), where the model receives feedback either at the final step only (R_{\text{last}}) or at every intermediate stage (R_{\text{all}}). Optimization is performed via a GRPO-style policy objective:

\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\{o_i\}} \left[ \frac{1}{\sum_i |o_i|} \sum_{i,t} \left\{ \min\left( r_{i,t} \hat{A}_{i,t},\ \operatorname{clip}(r_{i,t}, 1-\epsilon, 1+\epsilon)\, \hat{A}_{i,t} \right) - \beta\, D_{\text{KL}}\left[ \pi_{\theta} \,\|\, \pi_{\text{ref}} \right] \right\} \right]

where r_{i,t} is the policy ratio and \hat{A}_{i,t} the advantage estimate. Training on R-HORIZON-composed data confers substantial improvements: a gain of +17.4 on challenging composed benchmarks and, notably, a +7.5 gain on standard (single-horizon) benchmarks such as AIME2024, demonstrating enhanced general reasoning capability.
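For orientation, here is a minimal PyTorch sketch of such a GRPO-style objective. It assumes sequence-level verified rewards have already been computed (under either the R_last or R_all scheme); the function names, tensor layout, and the k3 KL estimator are implementation assumptions, not details specified by the paper.

```python
import torch

def group_advantages(rewards, eps=1e-8):
    # GRPO-style advantage: normalize verified rewards within a group of
    # rollouts sampled for the same composed query. `rewards` has shape
    # [num_rollouts]; the result is broadcast over tokens below.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logprobs, old_logprobs, ref_logprobs, advantages, mask,
              clip_eps=0.2, kl_beta=0.01):
    """Clipped surrogate objective with a KL penalty to the reference policy.

    logprobs, old_logprobs, ref_logprobs, mask: [num_rollouts, max_len]
    advantages: [num_rollouts, 1] (sequence-level, broadcast over tokens)
    Returns a scalar loss (negative objective) suitable for gradient descent.
    """
    ratio = torch.exp(logprobs - old_logprobs)            # r_{i,t}
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.minimum(unclipped, clipped)
    # k3 estimator of D_KL[pi_theta || pi_ref], a common practical choice.
    log_r = ref_logprobs - logprobs
    kl = torch.exp(log_r) - log_r - 1.0
    # Average over all valid response tokens: the 1 / sum_i |o_i| factor.
    objective = ((surrogate - kl_beta * kl) * mask).sum() / mask.sum()
    return -objective

# Usage sketch: adv = group_advantages(rewards).unsqueeze(-1), then
# loss = grpo_loss(logp, old_logp, ref_logp, adv, mask); loss.backward().
```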

5. Evaluation Metrics and Theoretical Models

R-HORIZON adopts an “all-or-nothing” evaluation metric: a composed chain is scored as correct only if every subproblem's answer exactly matches the ground truth. Theoretical baseline accuracy is computed assuming independent step success as

\text{Acc}_{\text{expected}}(\mathcal{Q}) = \prod_{i=1}^{n} p_i

for a chain of n subproblems with atomic pass rates p_i. Empirically, actual performance falls below this bound as the horizon increases, reflecting cumulative error propagation and a lack of robust cross-problem context management.
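As a quick sanity check on these quantities, the short sketch below computes the independence baseline and the all-or-nothing chain metric (the helper names are illustrative, not from the paper). Using the figures reported above, a model with 87.3% single-step accuracy would be expected under independence to solve a five-problem chain roughly 50% of the time, so the observed 24.6% falls well below that bound.

```python
from math import prod

def expected_chain_accuracy(step_pass_rates):
    # Independence baseline: Acc_expected(Q) = product of atomic pass rates p_i.
    return prod(step_pass_rates)

def observed_chain_accuracy(chain_results):
    # All-or-nothing metric: a chain counts as correct only if every
    # subproblem answer matches the ground truth.
    correct = sum(all(step_ok for step_ok in chain) for chain in chain_results)
    return correct / len(chain_results)

print(expected_chain_accuracy([0.873] * 5))  # ~0.507, vs. 0.246 observed
```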

6. Scalability, Control, and Cost-Efficiency

A central strength of R-HORIZON is its scalability and controllability. Because benchmark generation is accomplished by composition from existing single-horizon datasets, the approach is highly economical; one can vary the chain length or composition pattern to modulate difficulty while avoiding the overhead of domain-specific handcrafting or synthetic dataset engineering. This makes R-HORIZON an attractive approach for stress-testing and augmenting LRMs in diverse, resource-constrained research and development settings.

7. Implications and Future Directions

The results obtained via R-HORIZON indicate clear research priorities:

  • Model architecture improvements targeting long-range dependency tracking, global planning, and dynamic resource allocation;
  • Systematic use of R-HORIZON-composed training data in reinforcement learning and curriculum self-improvement loops;
  • Development of richer metrics and diagnostic tools sensitive to context propagation, cross-step reflection, and catastrophic error accumulation.

A plausible implication is that, as multi-step chain-of-thought approaches become central to the next generation of LRMs, benchmarks and training schemes based on R-HORIZON-like methodology will become essential to both evaluation and practical deployment of truly compositional reasoning systems.


Summary Table: R-HORIZON Key Properties

| Property | Description | Significance |
|---|---|---|
| Query Composition | Multi-step tasks constructed from interdependent single-step problems | Enables controlled long-horizon benchmark design |
| Evaluation Metric | All-or-nothing chain accuracy | Assesses true compositional/holistic reasoning ability |
| Empirical Finding | LRM accuracy drops sharply as horizon length increases | Reveals limited effective reasoning length |
| RLVR Training Gain | RL on composed data boosts multi-horizon and standard task accuracy (+7.5) | Demonstrates data-driven performance improvement |
| Cost Profile | Low, due to use of preexisting single-step data for composition | Facilitates scalable and systematic research |

In conclusion, the R-HORIZON paradigm establishes both a rigorous stress test and a scalable improvement pathway for the long-horizon reasoning capabilities of large reasoning models, with demonstrated impact in both systematic evaluation and reinforcement learning-based enhancement (Lu et al., 9 Oct 2025).
