Designing high-quality RL training data for long-context reasoning

Determine effective principles and methodologies for designing high-quality reinforcement learning training data tailored to long-context reasoning in large language models, specifying the properties such data must have to reliably elicit advanced reasoning behaviors and support robust evaluation.

Background

The paper surveys prior work on long-context reasoning and notes that QwenLong-L1 extends reinforcement learning to sequences up to 60K tokens, encouraging longer reasoning trajectories. Despite this progress, the authors explicitly point out that critical questions remain unresolved regarding the construction of suitable RL training data for long-context scenarios.

This uncertainty motivates the core contribution of the paper: KeyChain, a synthesis approach that creates high-difficulty, verifiable long-context reasoning tasks from short multi-hop QA. The paper's explicit acknowledgment of open questions about RL data design underscores the need for this data-centric approach.
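To make the general idea concrete, the sketch below shows one way a short multi-hop QA pair could be turned into a verifiable long-context RL example: scatter the supporting passages among distractors until a target context length is reached, and keep the gold answer so a rule-based reward can be computed. This is a minimal illustration under stated assumptions, not the paper's KeyChain procedure; the function names, token budget, and exact-match reward are hypothetical.

```python
import random

def synthesize_long_context_example(question: str,
                                    gold_answer: str,
                                    supporting_passages: list[str],
                                    distractor_passages: list[str],
                                    target_tokens: int = 32_000,
                                    tokens_per_word: float = 1.3) -> dict:
    """Hypothetical sketch: pad a short multi-hop QA pair with distractor
    passages to build a verifiable long-context RL training example."""
    # Start from the passages that actually support the answer.
    passages = list(supporting_passages)

    # Add distractors until a rough word-count budget (token estimate) is met.
    budget = target_tokens / tokens_per_word
    pool = distractor_passages[:]
    random.shuffle(pool)
    while pool and sum(len(p.split()) for p in passages) < budget:
        passages.append(pool.pop())

    # Shuffle so the supporting evidence is scattered among distractors.
    random.shuffle(passages)

    context = "\n\n".join(passages)
    return {
        "prompt": f"{context}\n\nQuestion: {question}",
        "answer": gold_answer,  # retained so the reward stays verifiable
    }

def exact_match_reward(model_output: str, gold_answer: str) -> float:
    """Binary, rule-based reward: 1.0 iff the gold answer appears in the output."""
    return 1.0 if gold_answer.lower() in model_output.lower() else 0.0
```

The key property this illustrates is verifiability: because the gold answer survives the synthesis step, the RL reward can be computed automatically without a learned judge, which is what makes such long-context tasks usable at scale.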

References

"However, it leaves open key questions about how to design high-quality RL training data."

LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts (2510.19363, Wang et al., 22 Oct 2025), in Related Works, subsection "Reasoning and Long-Context Reasoning"