- The paper introduces an open-source data curation pipeline, refined through more than 1,000 ablation experiments, that produces the OpenThoughts3-1.2M dataset used to train the OpenThinker3-7B model.
- It systematically evaluates various strategies for question sourcing, filtering, deduplication, and multiple answer sampling to optimize supervised finetuning performance.
- The study demonstrates that prioritizing question quality, increasing answer diversity (e.g., sampling 16 answers per question), and choosing the teacher model carefully yield state-of-the-art performance on reasoning benchmarks.
The "OpenThoughts: Data Recipes for Reasoning Models" paper (OpenThoughts: Data Recipes for Reasoning Models, 4 Jun 2025) addresses the challenge of developing state-of-the-art reasoning capabilities in LLMs due to the lack of publicly available datasets and training recipes. The OpenThoughts project aims to create open-source datasets and models to democratize research in this area. This paper details the creation of OpenThoughts3-1.2M, a large-scale, open reasoning dataset, and OpenThinker3-7B, a model trained on this dataset that achieves state-of-the-art performance among open-data models at its scale.
The core of the paper is a systematic empirical investigation into different stages of the data curation pipeline for supervised finetuning (SFT) reasoning models. Through over 1,000 ablation experiments across math, code, and science domains, the authors explored various strategies for generating question-answer pairs. The goal was to identify the most effective data recipes for enhancing reasoning. The pipeline experiments were conducted by generating 31,600 data points for each strategy and finetuning the Qwen2.5-7B-Instruct model, evaluating performance on a suite of reasoning benchmarks.
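To make the experimental protocol concrete, the sketch below outlines one such ablation run in Python. The `finetune` and `evaluate` helpers are hypothetical placeholders for an SFT trainer and a benchmark harness (the paper evaluates with Evalchemy); only the protocol itself (31,600 examples per strategy, the same base model, the same benchmark suite) follows the paper.

```python
# Sketch of the per-strategy ablation protocol (not the paper's actual code).
# Each candidate data strategy supplies a dataset generator; every run uses
# the same base model and the same benchmark suite so that only the data varies.
from typing import Callable

BASE_MODEL = "Qwen/Qwen2.5-7B-Instruct"
EXAMPLES_PER_STRATEGY = 31_600


def finetune(base_model: str, data: list[dict]) -> str:
    """Placeholder: run SFT of `base_model` on `data`, return the checkpoint path."""
    raise NotImplementedError("plug in an SFT trainer here")


def evaluate(checkpoint: str, benchmarks: list[str]) -> float:
    """Placeholder: return mean accuracy over the benchmark suite (e.g. via Evalchemy)."""
    raise NotImplementedError("plug in an evaluation harness here")


def run_ablation(strategies: dict[str, Callable[[int], list[dict]]],
                 benchmarks: list[str]) -> dict[str, float]:
    """Score each candidate data-generation strategy by downstream SFT performance."""
    scores: dict[str, float] = {}
    for name, generate_dataset in strategies.items():
        data = generate_dataset(EXAMPLES_PER_STRATEGY)   # 31,600 question-answer pairs
        checkpoint = finetune(BASE_MODEL, data)          # identical training setup per run
        scores[name] = evaluate(checkpoint, benchmarks)  # identical evaluation suite per run
    return scores
```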
The key stages of the OpenThoughts3 data pipeline investigated include:
- Question Sourcing: Evaluating numerous sources (synthetic, semi-synthetic, non-synthetic) for math, code, and science questions. The experiments showed significant variance in performance based on the source, with no single generation strategy consistently outperforming others across all domains. Top-performing sources for code included StackExchange CodeGolf and OpenCodeReasoning, for math included OpenMath-2-Math and NuminaMath, and for science included StackExchange Physics and LLM-extracted questions from organic chemistry texts.
- Mixing Questions: Investigating whether mixing questions from multiple sources improves performance. Counter-intuitively, mixing more than a few high-quality sources degraded performance. The results suggest that focusing on question quality from a limited number of top sources is more effective than increasing diversity by including lower-ranked sources.
- Question Filtering: Exploring various methods to select high-quality questions from large pools. LLM-based methods, such as filtering by LLM-labeled difficulty (best for code) or selecting questions that elicit long LLM responses (best for math and science), significantly outperformed classical methods like FastText classifiers or embedding-based filters; a sketch of the response-length filter appears after this list.
- Deduplication and Sampling Multiple Answers: Analyzing the interplay between question deduplication and generating multiple answers per question from the teacher model. Deduplication increases question diversity, while sampling multiple answers per question increases answer diversity over a potentially smaller set of unique questions. High sampling rates (16 answers per question) proved beneficial, sometimes even without strict deduplication, suggesting that answer diversity is a strong lever for performance; a sketch of this stage also follows the list.
- Answer Filtering: Testing various techniques to filter out low-quality or incorrect answers generated by the teacher. Surprisingly, none of the tested answer filtering strategies consistently outperformed simply using all generated answers. This suggests that for the evaluated scenarios, the benefits of filtering do not outweigh the reduction in dataset size or computational cost.
- Teacher Model Selection: Comparing different teacher models (DeepSeek-R1, Phi-4-Reasoning-Plus-14B, QwQ-32B) for generating reasoning traces. Despite itself scoring lower on the target benchmarks, QwQ-32B proved to be a stronger teacher than DeepSeek-R1 across all domains. This finding challenges the intuition that a higher-scoring model is always a better teacher for distillation.
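The strongest question filters are easy to sketch. Below is a minimal, illustrative implementation of the response-length filter used for math and science: each question is answered once by a labeling model, and only the questions eliciting the longest responses are kept. The endpoint, model name, and helper functions are assumptions for illustration, not the paper's exact code; the difficulty filter used for code follows the same pattern, with the scoring call replaced by a prompt asking the model to rate difficulty.

```python
# Illustrative response-length question filter (math/science). Assumes an
# OpenAI-compatible endpoint (e.g. a local vLLM server) serving the labeling
# model; the endpoint and model name below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
LABELER = "Qwen/Qwen2.5-7B-Instruct"  # placeholder labeling model


def response_length(question: str) -> int:
    """Answer the question once and return the length of the model's response."""
    resp = client.chat.completions.create(
        model=LABELER,
        messages=[{"role": "user", "content": question}],
        max_tokens=4096,
        temperature=0.7,
    )
    return len(resp.choices[0].message.content or "")


def keep_hardest(questions: list[str], keep: int) -> list[str]:
    """Keep the `keep` questions that elicit the longest responses."""
    return sorted(questions, key=response_length, reverse=True)[:keep]
```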
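The deduplication and answer-sampling stage is similarly compact in outline: hash question strings for exact deduplication, then query the teacher multiple times per surviving question. The sketch below again assumes an OpenAI-compatible endpoint; QwQ-32B is referenced by its Hugging Face identifier, but the serving setup is an assumption.

```python
# Illustrative exact deduplication plus 16x answer sampling from the teacher.
# The endpoint is a placeholder; QwQ-32B is the paper's chosen teacher model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
TEACHER = "Qwen/QwQ-32B"
NUM_SAMPLES = 16  # 16 answers sampled per question


def exact_dedup(questions: list[str]) -> list[str]:
    """Drop exact duplicate question strings, keeping first occurrences."""
    seen: set[str] = set()
    unique: list[str] = []
    for q in questions:
        key = q.strip()
        if key not in seen:
            seen.add(key)
            unique.append(q)
    return unique


def sample_answers(question: str, n: int = NUM_SAMPLES) -> list[str]:
    """Sample n reasoning traces from the teacher for one question."""
    resp = client.chat.completions.create(
        model=TEACHER,
        messages=[{"role": "user", "content": question}],
        n=n,                # n completions in a single request
        temperature=0.7,
        max_tokens=16384,
    )
    return [choice.message.content or "" for choice in resp.choices]


def build_sft_pairs(questions: list[str]) -> list[dict]:
    """Produce (question, answer) SFT pairs: dedup, then 16 answers per question."""
    pairs: list[dict] = []
    for q in exact_dedup(questions):
        pairs.extend({"question": q, "answer": a} for a in sample_answers(q))
    return pairs
```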
Based on these findings, the authors constructed the OpenThoughts3-1.2M dataset, comprising approximately 850,000 math, 250,000 code, and 100,000 science data points. This composition was based on the ratio found effective in the previous OpenThoughts2-1M dataset. The dataset was generated by applying the best strategies identified in the pipeline experiments: selecting high-quality question sources (OpenMath-2-Math for math, CodeGolf and OpenCodeReasoning for code, StackExchange Physics and OrganicChemistryPDFs for science), using LLM-based filtering (difficulty for code, response length for math/science), performing exact deduplication for math/science and no deduplication for code, sampling 16 answers per question, and using QwQ-32B as the teacher without applying answer filtering.
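Taken together, the final recipe can be summarized as a small configuration. The structure below is purely illustrative (the field names are invented for readability and do not mirror the released codebase), but the values follow the description above:

```python
# Illustrative summary of the OpenThoughts3-1.2M recipe; field names are
# invented for readability and do not mirror the released codebase.
OPENTHOUGHTS3_RECIPE = {
    "math": {
        "target_size": 850_000,
        "question_sources": ["OpenMath-2-Math"],
        "question_filter": "LLM response length",
        "deduplication": "exact",
    },
    "code": {
        "target_size": 250_000,
        "question_sources": ["StackExchange CodeGolf", "OpenCodeReasoning"],
        "question_filter": "LLM-labeled difficulty",
        "deduplication": "none",
    },
    "science": {
        "target_size": 100_000,
        "question_sources": ["StackExchange Physics", "OrganicChemistryPDFs"],
        "question_filter": "LLM response length",
        "deduplication": "exact",
    },
    "answers_per_question": 16,
    "teacher": "QwQ-32B",
    "answer_filtering": None,  # no answer filtering applied
}
```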
The resulting model, OpenThinker3-7B, trained on OpenThoughts3-1.2M, achieved state-of-the-art performance among open-data models at the 7B scale across various reasoning benchmarks, including held-out tasks like AIME 2025 and LiveCodeBench 06/24-01/25. The paper emphasizes that these results were achieved solely through innovations in SFT data curation, without employing additional techniques like reinforcement learning.
Key insights from the paper include:
- Scaling data through repeated sampling from a teacher is effective.
- A model's benchmark performance doesn't perfectly correlate with its effectiveness as a teacher.
- Extensive answer verification and filtering did not provide significant benefits in their setup.
- Quality and domain expertise of question sources are more critical than simple diversity from many sources.
- LLM-based methods are powerful tools for question filtering based on characteristics like difficulty or response length.
The paper acknowledges limitations, such as not exploring RL or staged SFT, and highlights open questions regarding cross-domain transfer effects when data is mixed, the potential for weak-to-strong generalization, and the interaction between question and answer diversity across different scales and domains.
For practical implementation, the paper provides details on the specific question sources used, the LLM prompts used for filtering, the deduplication methods, and the teacher models. The release of the OpenThoughts3-1.2M dataset and OpenThinker3-7B model, along with the codebase, gives practitioners and researchers a concrete foundation to build on, and the systematic experimental methodology serves as a recipe for empirically optimizing SFT data pipelines.

The appendices offer further practical insights: training hyperparameters, the evaluation setup (using Evalchemy), data decontamination methods, the importance of long reasoning traces (both their length and the presence of self-reflection), the finding that selecting the shortest of several sampled answers at inference is often both more accurate and computationally cheaper (a sketch follows below), and the computational resources required. The safety analysis indicates a trade-off: improved reasoning capabilities can come at the cost of increased harmfulness without specific safety alignment data.
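As a small illustration of the shortest-answer heuristic from the appendix, the sketch below samples several candidate responses and returns the shortest one. The client, the model identifier, and the sampling parameters are illustrative assumptions.

```python
# Illustrative shortest-answer selection at inference time: sample several
# candidates and keep the shortest, which the paper finds is often both
# cheaper and more accurate. Endpoint and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "open-thoughts/OpenThinker3-7B"  # assumed identifier for illustration


def shortest_answer(question: str, k: int = 8) -> str:
    """Sample k candidate responses and return the shortest one."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": question}],
        n=k,
        temperature=0.7,
        max_tokens=16384,
    )
    candidates = [choice.message.content or "" for choice in resp.choices]
    return min(candidates, key=len)
```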