- The paper introduces NaturalThoughts, a novel dataset for transferring reasoning skills from large teacher models to smaller student models.
- It employs selection strategies based on diversity and difficulty to optimize the use of annotated reasoning traces for improved sample efficiency.
- Experimental results demonstrate that mixed System-1/System-2 distillation outperforms baselines on benchmarks like GPQA-Diamond and MATH-500.
NaturalThoughts: Systematic Selection and Distillation of Reasoning Traces for General Reasoning Tasks
The paper "NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks" (2507.01921) presents a comprehensive study of the data-centric factors that influence the distillation of reasoning capabilities from large teacher models to smaller student models. The authors introduce the NaturalThoughts dataset, constructed by sampling questions from the NaturalReasoning dataset and annotating reasoning traces generated by a strong teacher model (DeepSeek-R1). The work systematically analyzes how the selection and curation of reasoning traces—along axes such as scale, diversity, and difficulty—affect the sample efficiency and scalability of reasoning distillation for general-purpose LLMs.
Key Contributions and Methodology
The paper is motivated by the observation that supervised fine-tuning (SFT) on teacher-generated reasoning traces is more effective than reinforcement learning (RL) alone for imparting reasoning skills to student models. However, prior work has not systematically examined which types of reasoning demonstrations are most beneficial for distillation, especially across diverse domains.
The methodology consists of three main components:
- Data Generation and Annotation:
- Questions are sampled from NaturalReasoning, a large and diverse set of 2.8M questions spanning 13 top-level domains.
- DeepSeek-R1 is used to generate reasoning traces, which are then annotated for domain, reasoning strategies (meta-reasoning primitives), and verbosity using Llama-3.1-70B-Instruct.
- Data Selection Strategies:
- Diversity: Subsets are selected to maximize diversity in question topics, semantic embeddings, and reasoning strategies. Diversity in reasoning strategies is found to be more predictive of downstream performance than diversity in question topics alone.
- Difficulty: Subsets are curated based on reasoning trace length, verbosity, and model disagreement (between teacher models with and without CoT traces), with the hypothesis that more difficult questions elicit richer reasoning traces.
- Distillation Approaches:
- System-2 Distillation: SFT on full reasoning traces (CoT + answer).
- System-1 Distillation: SFT on final answers only.
- Mixed Distillation: Training on a mixture of System-1 and System-2 examples, with mixing ratios determined either randomly or based on question difficulty (e.g., System-2 for difficult questions, System-1 for easy ones).
- Inference-Time Control: Explicit instructions are appended to prompts to control the model's reasoning mode and response length at inference.
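The model-disagreement signal used for difficulty-based selection can be illustrated with a minimal sketch: score each question by whether the teacher's answer with a CoT trace differs from its answer without one, and break ties by trace length. This is an assumption-laden simplification, not the paper's exact pipeline—the field names and exact-string comparison are hypothetical, and a real implementation would use semantic matching or an LLM judge.

```python
# Sketch: rank questions by teacher self-disagreement as a difficulty proxy.
# Assumptions: answers are short normalized strings; `records` fields are
# illustrative, not the paper's actual schema.

def disagreement_score(answer_with_cot: str, answer_without_cot: str) -> int:
    """1 if the teacher's CoT and direct answers disagree (crude difficulty proxy)."""
    norm = lambda s: s.strip().lower()
    return int(norm(answer_with_cot) != norm(answer_without_cot))

def select_hard_questions(records: list[dict], budget: int) -> list[dict]:
    """Keep the `budget` questions where the teacher disagrees with itself,
    breaking ties by longer reasoning traces (a second difficulty signal)."""
    scored = sorted(
        records,
        key=lambda r: (
            disagreement_score(r["answer_cot"], r["answer_direct"]),
            len(r["trace"]),  # longer trace suggests richer reasoning
        ),
        reverse=True,
    )
    return scored[:budget]

records = [
    {"q": "easy", "answer_cot": "4", "answer_direct": "4", "trace": "2+2=4"},
    {"q": "hard", "answer_cot": "17", "answer_direct": "12", "trace": "long derivation ..."},
]
hard = select_hard_questions(records, budget=1)
```

In this toy run, the question where the teacher's two answers diverge is ranked first, which is the intended behavior of disagreement-based difficulty selection.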
Experimental Results
The experiments are conducted on both Llama-3.1-8B-Instruct and Qwen-2.5-7B-Instruct student models, with evaluation on general STEM reasoning benchmarks: GPQA-Diamond, MMLU-Pro, SuperGPQA, and MATH-500.
Main Findings
- Scaling Data Size: Contrary to the "Less is More" hypothesis, simply increasing the number of high-quality reasoning traces (even with random selection) leads to consistent and significant performance gains across all benchmarks. For example, scaling from 1k to 500k examples yields steady improvements, with no observed saturation.
- Selection by Reasoning Strategy Diversity and Difficulty: Selecting examples based on diverse reasoning strategies or model disagreement (as a proxy for difficulty) outperforms random selection, especially at smaller data scales. However, as the dataset size increases, the performance gap narrows.
- Mixed System-1/System-2 Distillation: Difficulty-based mixing (System-2 for hard questions, System-1 for easy ones) achieves a better accuracy-efficiency trade-off than either approach alone or random mixing. This method enables the student model to adaptively allocate compute at inference, achieving higher accuracy with shorter average response lengths.
- Comparison to Baselines: Models distilled with NaturalThoughts outperform those trained on prior curated datasets (LIMO, S1K, OpenThoughts3) on general reasoning tasks, even when using fewer training examples.
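The difficulty-based mixing described above can be sketched as a simple data-construction step: hard questions get full reasoning traces (System-2) as SFT targets, easy ones get final answers only (System-1). The field names, the `is_hard` flag, and the `<think>` delimiter are illustrative assumptions, not the paper's verbatim format.

```python
# Sketch: build a mixed System-1/System-2 SFT dataset, routing targets
# by question difficulty. Schema and delimiters are assumptions.

def make_sft_example(record: dict) -> dict:
    if record["is_hard"]:
        # System-2 target: chain-of-thought followed by the final answer.
        target = f"<think>{record['trace']}</think>\n{record['answer']}"
    else:
        # System-1 target: final answer only, no visible reasoning.
        target = record["answer"]
    return {"prompt": record["question"], "target": target}

data = [
    {"question": "2+2?", "answer": "4", "trace": "2+2=4", "is_hard": False},
    {"question": "Prove X", "answer": "QED", "trace": "step 1 ...", "is_hard": True},
]
sft = [make_sft_example(r) for r in data]
```

Training on such a mixture is what lets the student learn to spend long reasoning only where it pays off, which underlies the accuracy-efficiency trade-off reported above.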
Numerical Highlights
- On Llama-3.1-8B-Instruct, training with 500k NaturalThoughts examples achieves 48.3% on GPQA-Diamond and 72.3% on MATH-500, outperforming OpenThoughts3 (1.2M examples) and DeepSeek-R1-Distill-Llama-8B (800k non-public examples) on most benchmarks.
- On Qwen-2.5-7B-Instruct, 500k NaturalThoughts examples yield 48.6% on GPQA-Diamond and 83.6% on MATH-500, surpassing OpenThoughts3 with 1.2M examples on three out of four benchmarks.
Implications and Theoretical Insights
The results challenge the prevailing notion that small, highly curated datasets are sufficient for general reasoning distillation. Instead, the findings indicate that:
- Diversity of Reasoning Primitives: The breadth of reasoning strategies in the training data is more critical than the diversity of question topics or domains. This suggests that future dataset curation should focus on capturing a wide array of reasoning behaviors rather than merely covering more subject areas.
- Difficulty-Driven Selection: Harder questions, which elicit longer and more complex reasoning traces, are more sample-efficient for transferring reasoning skills. This aligns with the intuition that challenging demonstrations provide richer learning signals.
- Steerable Reasoning Efficiency: Mixed System-1/System-2 distillation enables explicit control over the model's reasoning depth at inference, allowing practitioners to trade off between accuracy and computational cost dynamically.
Practical Applications
- Building Small, Generalist Reasoning Models: The NaturalThoughts approach provides a scalable recipe for distilling reasoning into smaller models, making them more suitable for deployment in resource-constrained environments.
- Efficient Inference: The ability to control reasoning depth at inference time is valuable for real-world applications where latency and cost are critical, such as interactive assistants or edge devices.
- Data Curation for Alignment and RL: The insights on data selection are directly applicable to the alignment and RL post-training stages of LLM development, where the choice of demonstration data can significantly impact downstream capabilities.
Limitations and Future Directions
- The paper is limited to off-policy distillation (cross-entropy on teacher outputs). The generality of the findings to on-policy distillation or RL-based approaches remains to be validated.
- The impact of reasoning trace quality on subsequent RL or alignment stages is not explored and warrants further investigation.
- The approach could be extended to other modalities (e.g., vision-LLMs) or to multilingual settings.
Conclusion
This work provides a rigorous, data-centric analysis of reasoning distillation, demonstrating that scaling high-quality, diverse reasoning traces—especially those exhibiting a wide range of reasoning strategies and higher difficulty—yields substantial improvements in general reasoning tasks for LLMs. The findings have direct implications for the design of future reasoning datasets, distillation protocols, and efficient deployment strategies for LLMs.