Differential gains from "thinking" versus "instruct" models for research plan generation

Determine whether the Qwen-3-4B-Thinking model can provide measurable improvements over the Qwen-3-4B-Instruct model for research plan generation under the rubric-based reinforcement learning setup, and identify the training scale or conditions under which such differential gains emerge.

Background

The paper compares rubric-RL finetuning of Qwen-3 models in both "instruct" and "thinking" variants for research plan generation. At the scale the authors could train (before overoptimization set in), the two variants performed similarly, while the thinking variant incurred roughly 2x training time per step and higher memory usage, a constraint that mattered for their 30B run.

They hypothesize that the Qwen-3 thinking model's training emphasis on math and coding may limit its ability to leverage additional test-time compute for this particular task. Consistent with this, they were unable to elicit any differential gains from the thinking model, though they expect that benefits may emerge with larger-scale training.
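
To make the comparison at issue concrete, the sketch below scores generated research plans against a rubric with a judge model and measures the gap between the two variants. This is a minimal illustration only: the rubric criteria, judge, and generate callables here are hypothetical placeholders, not the paper's actual reward pipeline.

```python
# Hypothetical sketch of a rubric-reward comparison between model variants.
# RUBRIC, judge, and the generate callables are illustrative placeholders,
# not the criteria or pipeline used in the paper.
from statistics import mean
from typing import Callable, List

RUBRIC = [
    "States a clear, testable hypothesis",
    "Proposes a feasible experimental design",
    "Identifies appropriate baselines and metrics",
    "Discusses risks and confounders",
]

def rubric_reward(plan: str, judge: Callable[[str, str], float]) -> float:
    """Average per-criterion judge scores in [0, 1] into a scalar reward."""
    return mean(judge(plan, criterion) for criterion in RUBRIC)

def evaluate_variant(generate: Callable[[str], str],
                     prompts: List[str],
                     judge: Callable[[str, str], float]) -> float:
    """Mean rubric reward of one model variant over a prompt set."""
    return mean(rubric_reward(generate(p), judge) for p in prompts)

def differential_gain(gen_thinking: Callable[[str], str],
                      gen_instruct: Callable[[str], str],
                      prompts: List[str],
                      judge: Callable[[str, str], float]) -> float:
    """Positive values would indicate a measurable 'thinking' advantage."""
    return (evaluate_variant(gen_thinking, prompts, judge)
            - evaluate_variant(gen_instruct, prompts, judge))
```

In practice this delta would be tracked per training checkpoint; given the roughly 2x per-step cost of the thinking variant, a fair comparison would also report it at matched wall-clock or token budget rather than matched steps.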

References

We only train the instruct models in our work as we observe very similar performance at the scale we can train on before overoptimization begins, with thinking taking ~2x more training time per step for our setting, and more memory (which matters for the 30b run). We hypothesize that the Qwen-3 thinking model's training emphasis on mathematical and coding tasks limits its ability to leverage additional test-time compute for this task, and we were unable to elicit any differential gains. We do expect thinking can help research plan generation in future work, though this may require larger scale.

Training AI Co-Scientists Using Rubric Rewards (2512.23707, Goel et al., 29 Dec 2025), Appendix, Section "Training Qwen-3-4B Instruct vs Thinking"