Training methods for long-form deep research models

Determine effective procedures to train language models directly for long-form deep research tasks, including how to structure training objectives, supervision, and reinforcement learning signals so that models reliably learn to produce high-quality, evidence-grounded long-form answers.

Background

The paper introduces DR Tulu-8B and RLER (Reinforcement Learning with Evolving Rubrics), a reinforcement-learning method for training long-form deep research behavior. Although the approach shows strong empirical results, the authors explicitly note that many open questions remain about how best to train models directly on long-form tasks.

This open question encompasses choices such as rubric design, verifier configuration, reward composition, and data mixtures that balance long-form and short-form supervision. Clarifying these design decisions would help standardize training practices for future deep research systems.
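
To make "reward composition" concrete, the following is a minimal sketch of one possible way to combine several signals into a single scalar reward. It is not the RLER formulation from the paper. The signal names (rubric, citation, format), the weights, and the function composite_reward are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; choosing them well is part of the open question.
    rubric: float = 0.7
    citation: float = 0.2
    format: float = 0.1


def composite_reward(
    rubric_scores: Sequence[float],
    citation_score: float,
    format_score: float,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of the mean rubric score and two auxiliary signals.

    All inputs are assumed to lie in [0, 1]. An empty rubric list
    contributes a rubric score of 0.0.
    """
    if rubric_scores:
        rubric_mean = sum(rubric_scores) / len(rubric_scores)
    else:
        rubric_mean = 0.0
    return (
        weights.rubric * rubric_mean
        + weights.citation * citation_score
        + weights.format * format_score
    )
```

Under this sketch, changing the weights or the set of rubric items changes which behaviors the policy is rewarded for. That is one reason the design choices listed above are hard to compare without a shared protocol.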

References

Many open questions remain around how best to train models directly on long-form tasks; to facilitate future research on this topic, we release all of our data, models, and code, including an MCP-based deep research library and evaluation suite with asynchronous tool-calling support.

— DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research (2511.19399 - Shao et al., 24 Nov 2025) in Discussion and Future Work
