
Stabilizing long-horizon training and allocating SFT vs RL steps for optimal performance

Determine principled strategies for stabilizing long-horizon training of data-analytic agents and for allocating training steps between supervised fine-tuning (SFT) and reinforcement learning (RL) so as to achieve maximal performance.


Background

The paper targets training generalist open-source data-analytic agents capable of code-based multi-turn reasoning. While many works adopt an SFT-then-RL paradigm, the authors explicitly note that, in new scenarios, it is unclear how to ensure stability during long-horizon training and how to distribute training effort between SFT and RL for best results.

They propose a dynamically weighted objective that combines the SFT and RL losses and provide empirical analyses, but the general problem of principled stabilization and of allocating steps across SFT and RL is explicitly flagged as remaining unclear.
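For concreteness, the sketch below shows one way a dynamically weighted SFT+RL objective could be implemented, with the SFT weight annealed over training steps. The cosine schedule, the `alpha_min` floor, and the function names are illustrative assumptions for this sketch, not the paper's actual formulation.

```python
# Minimal sketch (assumed form, not the paper's implementation):
# L_t = alpha_t * L_SFT + (1 - alpha_t) * L_RL, with alpha_t annealed
# from ~1.0 toward a floor alpha_min as training progresses.
import math
import torch


def dynamic_weight(step: int, total_steps: int, alpha_min: float = 0.1) -> float:
    """Cosine-decayed SFT weight: starts near 1.0, anneals toward alpha_min."""
    progress = min(step / max(total_steps, 1), 1.0)
    return alpha_min + 0.5 * (1.0 - alpha_min) * (1.0 + math.cos(math.pi * progress))


def combined_loss(sft_loss: torch.Tensor,
                  rl_loss: torch.Tensor,
                  step: int,
                  total_steps: int) -> torch.Tensor:
    """Mix the SFT and RL losses with the step-dependent weight alpha_t."""
    alpha_t = dynamic_weight(step, total_steps)
    return alpha_t * sft_loss + (1.0 - alpha_t) * rl_loss
```

Under this assumed schedule, early updates are dominated by the SFT term (stabilizing behavior cloning), while later updates are dominated by the RL term; how to choose such a schedule in a principled way is exactly the open question stated above.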

References

Yet, in a new scenario, it remains unclear how to stabilize long-horizon agent training and how to allocate training steps across SFT and RL to achieve optimal performance.

Scaling Generalist Data-Analytic Agents (2509.25084 - Qiao et al., 29 Sep 2025) in Section 1 Introduction, Challenges (2) Improper training strategy