Differential gains from "thinking" versus "instruct" models for research plan generation
Determine whether the Qwen-3-4B-Thinking model provides measurable improvements over the Qwen-3-4B-Instruct model for research plan generation under the rubric-based reinforcement learning setup, and identify the training scale or conditions under which such differential gains emerge.
We train only the instruct models in our work, as we observe very similar performance at the scale we can train before overoptimization begins; in our setting, thinking takes roughly 2x more training time per step and requires more memory (which matters for the 30B run). We hypothesize that the Qwen-3 thinking model's training emphasis on mathematical and coding tasks limits its ability to leverage additional test-time compute for this task, and we were unable to elicit any differential gains. We do expect thinking to help research plan generation in future work, though this may require larger scale.