- The paper introduces an open-source data curation pipeline, refined through more than 1,000 ablation experiments, that produces the OpenThoughts3-1.2M dataset used to train the OpenThinker3-7B model.
- It systematically evaluates various strategies for question sourcing, filtering, deduplication, and multiple answer sampling to optimize supervised finetuning performance.
- The study demonstrates that prioritizing question quality, increasing answer diversity (e.g., sampling 16 answers per question), and choosing the teacher model carefully yield state-of-the-art performance on reasoning benchmarks.
The "OpenThoughts: Data Recipes for Reasoning Models" paper (OpenThoughts: Data Recipes for Reasoning Models, 4 Jun 2025) addresses the challenge of developing state-of-the-art reasoning capabilities in LLMs due to the lack of publicly available datasets and training recipes. The OpenThoughts project aims to create open-source datasets and models to democratize research in this area. This paper details the creation of OpenThoughts3-1.2M, a large-scale, open reasoning dataset, and OpenThinker3-7B, a model trained on this dataset that achieves state-of-the-art performance among open-data models at its scale.
The core of the paper is a systematic empirical investigation into different stages of the data curation pipeline for supervised finetuning (SFT) reasoning models. Through over 1,000 ablation experiments across math, code, and science domains, the authors explored various strategies for generating question-answer pairs. The goal was to identify the most effective data recipes for enhancing reasoning. The pipeline experiments were conducted by generating 31,600 data points for each strategy and finetuning the Qwen2.5-7B-Instruct model, evaluating performance on a suite of reasoning benchmarks.
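To make the experimental protocol concrete, the sketch below outlines one such ablation run in Python. The `finetune` and `evaluate` helpers are hypothetical placeholders for an SFT trainer and a benchmark harness (the paper evaluates with Evalchemy); only the protocol itself (31,600 examples per strategy, the same base model, the same benchmark suite) follows the paper.

```python
# Sketch of the per-strategy ablation protocol (not the paper's actual code).
# Each candidate data strategy supplies a dataset generator; every run uses
# the same base model and the same benchmark suite so that only the data varies.
from typing import Callable

BASE_MODEL = "Qwen/Qwen2.5-7B-Instruct"
EXAMPLES_PER_STRATEGY = 31_600


def finetune(base_model: str, data: list[dict]) -> str:
    """Placeholder: run SFT of `base_model` on `data`, return the checkpoint path."""
    raise NotImplementedError("plug in an SFT trainer here")


def evaluate(checkpoint: str, benchmarks: list[str]) -> float:
    """Placeholder: return mean accuracy over the benchmark suite (e.g. via Evalchemy)."""
    raise NotImplementedError("plug in an evaluation harness here")


def run_ablation(strategies: dict[str, Callable[[int], list[dict]]],
                 benchmarks: list[str]) -> dict[str, float]:
    """Score each candidate data-generation strategy by downstream SFT performance."""
    scores: dict[str, float] = {}
    for name, generate_dataset in strategies.items():
        data = generate_dataset(EXAMPLES_PER_STRATEGY)   # 31,600 question-answer pairs
        checkpoint = finetune(BASE_MODEL, data)          # identical training setup per run
        scores[name] = evaluate(checkpoint, benchmarks)  # identical evaluation suite per run
    return scores
```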
The key stages of the OpenThoughts3 data pipeline investigated include:
- Question Sourcing: Evaluating numerous sources (synthetic, semi-synthetic, non-synthetic) for math, code, and science questions. The experiments showed significant variance in performance based on the source, with no single generation strategy consistently outperforming others across all domains. Top-performing sources for code included StackExchange CodeGolf and OpenCodeReasoning, for math included OpenMath-2-Math and NuminaMath, and for science included StackExchange Physics and LLM-extracted questions from organic chemistry texts.
- Mixing Questions: Investigating whether mixing questions from multiple sources improves performance. Counter-intuitively, mixing more than a few high-quality sources degraded performance. The results suggest that focusing on question quality from a limited number of top sources is more effective than increasing diversity by including lower-ranked sources.
- Question Filtering: Exploring various methods to select high-quality questions from large pools. LLM-based methods, such as filtering by LLM-labeled difficulty (best for code) or selecting questions that elicit long LLM responses (best for math and science), significantly outperformed classical methods like FastText classifiers or embedding-based filters; a sketch of the response-length filter appears after this list.
- Deduplication and Sampling Multiple Answers: Analyzing the interplay between question deduplication and generating multiple answers per question from the teacher model. Deduplication increases question diversity, while sampling multiple answers per question increases answer diversity over a potentially smaller set of unique questions. High sampling rates (16 answers per question) proved beneficial, sometimes even without strict deduplication, suggesting that answer diversity is a strong lever for performance; a sketch of this stage also follows the list.
- Answer Filtering: Testing various techniques to filter out low-quality or incorrect answers generated by the teacher. Surprisingly, none of the tested answer filtering strategies consistently outperformed simply using all generated answers. This suggests that for the evaluated scenarios, the benefits of filtering do not outweigh the reduction in dataset size or computational cost.
- Teacher Model Selection: Comparing different teacher models (DeepSeek-R1, Phi-4-Reasoning-Plus-14B, QwQ-32B) for generating reasoning traces. Despite itself scoring lower on the target benchmarks, QwQ-32B proved to be a stronger teacher than DeepSeek-R1 across all domains. This finding challenges the intuition that a higher-scoring model is always a better teacher for distillation.
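The strongest question filters are easy to sketch. Below is a minimal, illustrative implementation of the response-length filter used for math and science: each question is answered once by a labeling model, and only the questions eliciting the longest responses are kept. The endpoint, model name, and helper functions are assumptions for illustration, not the paper's exact code; the difficulty filter used for code follows the same pattern, with the scoring call replaced by a prompt asking the model to rate difficulty.

```python
# Illustrative response-length question filter (math/science). Assumes an
# OpenAI-compatible endpoint (e.g. a local vLLM server) serving the labeling
# model; the endpoint and model name below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
LABELER = "Qwen/Qwen2.5-7B-Instruct"  # placeholder labeling model


def response_length(question: str) -> int:
    """Answer the question once and return the length of the model's response."""
    resp = client.chat.completions.create(
        model=LABELER,
        messages=[{"role": "user", "content": question}],
        max_tokens=4096,
        temperature=0.7,
    )
    return len(resp.choices[0].message.content or "")


def keep_hardest(questions: list[str], keep: int) -> list[str]:
    """Keep the `keep` questions that elicit the longest responses."""
    return sorted(questions, key=response_length, reverse=True)[:keep]
```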
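The deduplication and answer-sampling stage is similarly compact in outline: hash question strings for exact deduplication, then query the teacher multiple times per surviving question. The sketch below again assumes an OpenAI-compatible endpoint; QwQ-32B is referenced by its Hugging Face identifier, but the serving setup is an assumption.

```python
# Illustrative exact deduplication plus 16x answer sampling from the teacher.
# The endpoint is a placeholder; QwQ-32B is the paper's chosen teacher model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
TEACHER = "Qwen/QwQ-32B"
NUM_SAMPLES = 16  # 16 answers sampled per question


def exact_dedup(questions: list[str]) -> list[str]:
    """Drop exact duplicate question strings, keeping first occurrences."""
    seen: set[str] = set()
    unique: list[str] = []
    for q in questions:
        key = q.strip()
        if key not in seen:
            seen.add(key)
            unique.append(q)
    return unique


def sample_answers(question: str, n: int = NUM_SAMPLES) -> list[str]:
    """Sample n reasoning traces from the teacher for one question."""
    resp = client.chat.completions.create(
        model=TEACHER,
        messages=[{"role": "user", "content": question}],
        n=n,                # n completions in a single request
        temperature=0.7,
        max_tokens=16384,
    )
    return [choice.message.content or "" for choice in resp.choices]


def build_sft_pairs(questions: list[str]) -> list[dict]:
    """Produce (question, answer) SFT pairs: dedup, then 16 answers per question."""
    pairs: list[dict] = []
    for q in exact_dedup(questions):
        pairs.extend({"question": q, "answer": a} for a in sample_answers(q))
    return pairs
```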
Based on these findings, the authors constructed the OpenThoughts3-1.2M dataset, comprising approximately 850,000 math, 250,000 code, and 100,000 science data points. This composition was based on the ratio found effective in the previous OpenThoughts2-1M dataset. The dataset was generated by applying the best strategies identified in the pipeline experiments: selecting high-quality question sources (OpenMath-2-Math for math, CodeGolf and OpenCodeReasoning for code, StackExchange Physics and OrganicChemistryPDFs for science), using LLM-based filtering (difficulty for code, response length for math/science), performing exact deduplication for math/science and no deduplication for code, sampling 16 answers per question, and using QwQ-32B as the teacher without applying answer filtering.
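Taken together, the final recipe can be summarized as a small configuration. The structure below is purely illustrative (the field names are invented for readability and do not mirror the released codebase), but the values follow the description above:

```python
# Illustrative summary of the OpenThoughts3-1.2M recipe; field names are
# invented for readability and do not mirror the released codebase.
OPENTHOUGHTS3_RECIPE = {
    "math": {
        "target_size": 850_000,
        "question_sources": ["OpenMath-2-Math"],
        "question_filter": "LLM response length",
        "deduplication": "exact",
    },
    "code": {
        "target_size": 250_000,
        "question_sources": ["StackExchange CodeGolf", "OpenCodeReasoning"],
        "question_filter": "LLM-labeled difficulty",
        "deduplication": "none",
    },
    "science": {
        "target_size": 100_000,
        "question_sources": ["StackExchange Physics", "OrganicChemistryPDFs"],
        "question_filter": "LLM response length",
        "deduplication": "exact",
    },
    "answers_per_question": 16,
    "teacher": "QwQ-32B",
    "answer_filtering": None,  # no answer filtering applied
}
```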
The resulting model, OpenThinker3-7B, trained on OpenThoughts3-1.2M, achieved state-of-the-art performance among open-data models at the 7B scale across various reasoning benchmarks, including held-out tasks like AIME 2025 and LiveCodeBench 06/24-01/25. The paper emphasizes that these results were achieved solely through innovations in SFT data curation, without employing additional techniques like reinforcement learning.
Key insights from the paper include:
- Scaling data through repeated sampling from a teacher is effective.
- A model's benchmark performance doesn't perfectly correlate with its effectiveness as a teacher.
- Extensive answer verification and filtering did not provide significant benefits in their setup.
- Quality and domain expertise of question sources are more critical than simple diversity from many sources.
- LLM-based methods are powerful tools for question filtering based on characteristics like difficulty or response length.
The paper acknowledges limitations, such as not exploring RL or staged SFT, and highlights open questions regarding cross-domain transfer effects when data is mixed, the potential for weak-to-strong generalization, and the interaction between question and answer diversity across different scales and domains.
For practical implementation, the paper provides details on the specific question sources used, the LLM prompts used for filtering, the deduplication methods, and the teacher models. The release of the OpenThoughts3-1.2M dataset and OpenThinker3-7B model, along with the codebase, gives practitioners and researchers a concrete foundation to build on, and the systematic experimental methodology serves as a recipe for empirically optimizing SFT data pipelines.

The appendices offer further practical insights: training hyperparameters, the evaluation setup (using Evalchemy), data decontamination methods, the importance of long reasoning traces (both their length and the presence of self-reflection), the finding that selecting the shortest of several sampled answers at inference is often both more accurate and computationally cheaper (a sketch follows below), and the computational resources required. The safety analysis indicates a trade-off: improved reasoning capabilities can come at the cost of increased harmfulness without specific safety alignment data.
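As a small illustration of the shortest-answer heuristic from the appendix, the sketch below samples several candidate responses and returns the shortest one. The client, the model identifier, and the sampling parameters are illustrative assumptions.

```python
# Illustrative shortest-answer selection at inference time: sample several
# candidates and keep the shortest, which the paper finds is often both
# cheaper and more accurate. Endpoint and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "open-thoughts/OpenThinker3-7B"  # assumed identifier for illustration


def shortest_answer(question: str, k: int = 8) -> str:
    """Sample k candidate responses and return the shortest one."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": question}],
        n=k,
        temperature=0.7,
        max_tokens=16384,
    )
    candidates = [choice.message.content or "" for choice in resp.choices]
    return min(candidates, key=len)
```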