NaturalThoughts: Dataset for Reasoning Transfer

Updated 4 July 2025
  • NaturalThoughts is a large-scale, curated dataset designed to distill general reasoning capabilities into language models through systematically selected chain-of-thought traces.
  • It draws on 2.8 million questions spanning 13 diverse domains from the NaturalReasoning corpus, with detailed annotations and filtering for strategy diversity, difficulty, and balanced verbosity.
  • Empirical evaluations show that models fine-tuned on NaturalThoughts achieve superior sample efficiency and transfer performance across STEM and professional benchmarks.

NaturalThoughts is a large-scale, systematically curated dataset designed for distilling and transferring general reasoning capabilities in LLMs. Developed to address questions of sample efficiency, strategy diversity, and scalability in reasoning trace distillation, it builds on the premise that the nature and selection of reasoning demonstrations from a teacher model significantly affect student model performance on open-domain reasoning tasks across STEM and professional benchmarks.

1. Composition and Structure

NaturalThoughts sources its questions from the NaturalReasoning dataset (Yuan et al., 2025), comprising 2.8 million questions across 13 domains: Engineering, Philosophy, Medicine, Economics, Science, Law, History, Education, Management, Literature and Arts, Agronomy, Sociology, and Military Science. Each item is constructed as a tuple:

  • Question: Drawn from one of the 13 domains, with topic/domain annotation via taxonomy from Du et al. (2025), implemented using Llama-3.1-70B-Instruct.
  • Reasoning Trace: Chain-of-thought (CoT) tokens generated by the teacher model and delimited by <think> ... </think> tags, representing “System-2” reasoning.
  • Final Answer: Concise “System-1” solution provided by the teacher.
  • Annotations: Each trace is labeled for reasoning strategy frequency and type (such as self-verification, backtracking, exploration), verbosity score (0–10), and domain/topic.

The teacher model employed is DeepSeek-R1, recognized for strong multi-step reasoning. Each item maintains explicit alignment between question, reasoning process, and answer, enabling supervised fine-tuning of student models.
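
A minimal sketch of one record and of how it might be packed into a fine-tuning example. The field names and the `to_sft_example` helper are illustrative assumptions rather than the released schema; only the `<think>`/`</think>` delimiters follow DeepSeek-R1's output convention.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NaturalThoughtsItem:
    """Illustrative record layout; field names are assumptions, not the released schema."""
    question: str          # drawn from one of the 13 NaturalReasoning domains
    reasoning_trace: str   # "System-2" chain of thought generated by DeepSeek-R1
    final_answer: str      # concise "System-1" solution from the teacher
    domain: str            # taxonomy label, e.g. "Medicine"
    strategies: List[str] = field(default_factory=list)  # e.g. ["self-verification", "backtracking"]
    verbosity: int = 5     # 0-10 verbosity score

def to_sft_example(item: NaturalThoughtsItem) -> dict:
    """Pack a record into a prompt/response pair for supervised fine-tuning,
    keeping the question, reasoning process, and answer explicitly aligned."""
    return {
        "prompt": item.question,
        "response": f"<think>\n{item.reasoning_trace}\n</think>\n{item.final_answer}",
    }
```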

2. Distillation and Curation Methodology

The dataset curation pipeline emphasizes both the diversity and difficulty of reasoning traces. The creation process comprises:

  1. Sampling: Questions are selected from NaturalReasoning based on coverage and distribution goals.
  2. Teacher Prompting: DeepSeek-R1 generates stepwise reasoning traces and final answers.
  3. Annotation: Automated labeling encodes domain/topic, strategy types, and verbosity. Reasoning strategies are extracted and classified according to a defined meta-reasoning schema.
  4. Filtering and Selection: Final dataset selection is determined by the following criteria, with a minimal selection sketch after the list:
    • Diversity of Reasoning Strategies: Priority is given to traces deploying multiple unique tactics (optimal range: 4–8 strategies per example).
    • Difficulty: Preference for longer, more complex questions or those on which strong models disagree.
    • Verbosity Control: Selection avoids both overly terse and excessively rambling explanations.
    • Topic/Domain Uniformity: Ensures balanced coverage across all domains.
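
Reusing the record sketch above, the selection step can be read as a single filtering pass. The verbosity band and per-domain quota below are illustrative assumptions; only the 4–8 strategy range comes from the paper, and the difficulty criterion (model disagreement) is omitted because it requires external model outputs.

```python
from collections import Counter

MIN_STRATEGIES, MAX_STRATEGIES = 4, 8   # optimal range reported above
MIN_VERBOSITY, MAX_VERBOSITY = 3, 7     # assumed band excluding terse/rambling traces
PER_DOMAIN_QUOTA = 40_000               # assumed cap enforcing domain balance

def select(items):
    """One greedy pass applying strategy-diversity, verbosity, and domain-balance filters."""
    kept, per_domain = [], Counter()
    for item in items:
        diverse = MIN_STRATEGIES <= len(set(item.strategies)) <= MAX_STRATEGIES
        measured = MIN_VERBOSITY <= item.verbosity <= MAX_VERBOSITY
        balanced = per_domain[item.domain] < PER_DOMAIN_QUOTA
        if diverse and measured and balanced:
            kept.append(item)
            per_domain[item.domain] += 1
    return kept
```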

Quantitatively, length-based filtering samples each trace with probability

$$ p = \left( \frac{l}{C} \right)^{\tau} $$

where $l$ is the trace length in tokens, $C = 5000$, and $\tau = 2.5$, penalizing short CoTs.
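
Read as a sampling rule, each trace is kept with probability $p$. A direct transcription follows; clipping $p$ to 1 for traces longer than $C$ tokens is an assumption.

```python
import random

C = 5000    # length scale in tokens (from the paper)
TAU = 2.5   # exponent; larger values penalize short CoTs more sharply

def keep_probability(trace_len: int) -> float:
    """p = (l / C)^tau, clipped to [0, 1]."""
    return min(1.0, (trace_len / C) ** TAU)

def accept(trace_len: int) -> bool:
    """Stochastically keep or drop a trace based on its length."""
    return random.random() < keep_probability(trace_len)

# A 1,000-token trace survives ~1.8% of the time; a 4,000-token trace ~57%.
```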

Systematic approaches ensure the dataset is not simply a random sample, but a curated selection that maximizes exposure to diverse, demanding, and varied reasoning patterns.

3. Systematic Analysis: Diversity, Difficulty, and Scaling

NaturalThoughts provides empirical analysis of the factors governing efficient reasoning transfer from teacher to student:

  • Reasoning Strategy Diversity: Training on traces exhibiting higher diversity of meta-reasoning strategies outperforms selection based only on topic or semantic variety.
  • Sample Difficulty: Including samples that are hard (e.g., where models disagree) or inherently complex (long CoTs) enhances transfer efficiency, especially in lower-data regimes.
  • Verbosity Optimization: Traces of moderate verbosity yield better student learning than extremes.
  • Scaling: Increasing dataset size via random sampling produces robust, nearly linear performance gains (tested up to 500k samples), with no observed saturation. However, targeted selection by strategy and difficulty remains most sample-efficient, particularly at lower scales.

Contrary to the "Less is More" hypothesis (LIMO, S1K), substantial improvements accrue even as data volume increases, though the marginal advantage of highly targeted selection narrows at larger sizes.

4. Benchmark Evaluation and Comparative Performance

NaturalThoughts is employed to fine-tune models including Llama-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, and Llama-3.3-70B-Instruct. Evaluation is conducted on diverse benchmarks:

  • GPQA-Diamond: Graduate-level science question answering (Diamond subset).
  • MATH-500: Mathematical reasoning.
  • MMLU-Pro: Multi-domain professional assessment.
  • SuperGPQA: Broad STEM/general reasoning.

The primary metric is pass@1 accuracy, averaged over several random seeds. Comparative baselines include OpenThoughts3, LIMO, S1K, and DeepSeek-R1-Distill. Notable empirical findings include:

  • With as few as 1,000 curated NaturalThoughts samples, models matched or surpassed the performance of LIMO/S1K-trained peers.
  • Scaling to 500k samples, NaturalThoughts outperformed OpenThoughts3, even when the latter was trained with 1.2M examples.
  • Subsets selected for “model disagreement” or “reasoning strategy diversity” provide the highest sample efficiency.
  • Models trained on NaturalThoughts excel across a broader spectrum, supporting generalized STEM and professional tasks, not limited to math/coding.
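
For concreteness, pass@1 here is the fraction of questions whose first sampled answer is correct, averaged over seeds; the boolean layout in this minimal computation is an assumption.

```python
from statistics import mean

def pass_at_1(results_per_seed):
    """results_per_seed: one list per seed, each entry a boolean marking
    whether the first sampled answer to a question was correct."""
    return mean(sum(run) / len(run) for run in results_per_seed)

# Three seeds over a four-question benchmark: (0.75 + 0.50 + 1.00) / 3 = 0.75
print(pass_at_1([[True, True, False, True],
                 [True, False, False, True],
                 [True, True, True, True]]))
```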

5. Key Findings and Implications for General Reasoning

The data-centric analyses establish that:

  • Both scale and targeted curation contribute critically to transferable reasoning capability.
  • Reasoning diversity (in strategies) and difficulty are more impactful than mere content/topic variety.
  • A mixed distillation strategy (fractional System-1/System-2 supervision) allows downstream models to adapt reasoning depth to inference-time demands, improving efficiency–accuracy trade-offs (sketched after this list).
  • The methodology supports robust cross-domain generalization, beyond prior hand-curated datasets primarily focused on mathematical and programming tasks.
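
A minimal sketch of such mixed supervision, reusing the record sketch from Section 1. Mixing per example at data-construction time and the probability `p_system2` are assumptions about how the fractional split could be realized.

```python
import random

def mixed_target(item, p_system2: float = 0.5) -> str:
    """Return a System-2 target (full <think> trace plus answer) with
    probability p_system2, else a System-1 target (the concise answer alone).

    Training on this mixture lets the student modulate reasoning depth at
    inference time, trading accuracy against token cost."""
    if random.random() < p_system2:
        return f"<think>\n{item.reasoning_trace}\n</think>\n{item.final_answer}"
    return item.final_answer
```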

A plausible implication is that future dataset construction for reasoning tasks should prioritize both qualitative diversity of reasoning patterns and challenging problem instances, alongside continued scaling.

6. Summary Table: NaturalThoughts Attributes

| Aspect | Details |
| --- | --- |
| Source | NaturalReasoning (2.8M questions, 13 domains) |
| Teacher Model | DeepSeek-R1 (reasoning trace + final answer) |
| Selection Criteria | Strategy diversity (4–8 unique strategies), length, difficulty, verbosity, domain balance |
| Curation Pipeline | Automated teacher outputs + annotation + strategic filtering |
| Size | Up to 500k examples in experiments; performance increases with scale |
| Benchmarks | GPQA-Diamond, MATH-500, MMLU-Pro, SuperGPQA |
| Comparative Advantage | Outperforms LIMO, S1K, OpenThoughts3, DeepSeek-R1-Distill |
| Generalization | Effective across STEM and knowledge tasks |

7. Significance and Prospective Impact

NaturalThoughts offers a new standard for curated reasoning datasets in supervised distillation for LLMs. Its design demonstrates that maximizing both the diversity and difficulty of reasoning traces achieves better generalization in student models than focusing on content variety or raw scale alone. Its principled curation pipeline, validated by benchmark results, supports transfer of complex reasoning behaviors to smaller models, with potential applications in alignment, RLHF, and deployment in real-world scenarios that demand stepwise, auditable inference. The findings encourage continued attention to the qualitative characteristics of training data, alongside scale, for sustained progress in machine reasoning.