OpenThinker3-7B Model
- OpenThinker3-7B is a 7-billion-parameter open-source reasoning model that uses a tailored dataset and systematic supervised fine-tuning to excel in benchmark tasks.
- The model reuses the Qwen-2.5-7B-Instruct transformer architecture (28 layers, hidden size 3584) without architectural changes; its gains come from the training data and fine-tuning recipe rather than from new model components.
- Benchmark results show a +12.4 percentage point average improvement over DeepSeek-R1-Distill-Qwen-7B, with 16-answer augmentation and careful question filtering identified as major contributors.
OpenThinker3-7B is a 7-billion-parameter open-source reasoning model, designed as a high-performing, fully auditable alternative to proprietary reasoning systems. Developed as part of the OpenThoughts initiative, it reports state-of-the-art results among open-data models on leading math, code, and science reasoning benchmarks using only public data. The model is a supervised fine-tune of Qwen-2.5-7B-Instruct; its performance is driven by a carefully engineered dataset generation pipeline and systematic distillation from a larger teacher model. All artifacts (datasets, weights, prompts, and training code) are released under open licenses.
1. Model Architecture and Capacity
OpenThinker3-7B adopts the unmodified Qwen-2.5-7B-Instruct transformer architecture. Per the published Qwen-2.5-7B configuration, this comprises 28 transformer layers with a hidden size of 3584, 28 query attention heads sharing 4 key-value heads (grouped-query attention), and rotary positional embeddings. No architectural innovations such as Mixture-of-Experts or retrieval augmentation are introduced. All architectural hyperparameters are retained from the base model, so the improvements come entirely from the training data and fine-tuning recipe (Guha et al., 4 Jun 2025).
| Attribute | OpenThinker3-7B Value | Notes |
|---|---|---|
| Parameter count | 7B (approx. 7.6B) | Inherited from Qwen-2.5-7B |
| Layers | 28 | Transformer decoder blocks |
| Hidden size | 3584 | |
| Attention heads | 28 query / 4 key-value | Grouped-query attention |
| RoPE | Yes | All layers |
2. Dataset Construction and Data Pipeline
The OpenThoughts3-1.2M dataset, central to OpenThinker3-7B’s performance, consists of 1,200,000 reasoning examples: 850,000 math questions, 250,000 code questions, and 100,000 science questions. Each dataset item is a triplet containing a prompt, a chain-of-thought (CoT), and an answer.
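As a minimal sketch of the record layout and domain mix described above, the following Python snippet models one dataset item and computes each domain's share of the total. The class name, field names, and the `domain` field are illustrative assumptions, not the released dataset schema; only the three counts come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningExample:
    """One dataset item: prompt, chain-of-thought, and final answer."""
    prompt: str
    chain_of_thought: str
    answer: str
    domain: str  # hypothetical extra field: "math", "code", or "science"


# Domain counts as stated in the text above.
DOMAIN_COUNTS = {"math": 850_000, "code": 250_000, "science": 100_000}


def domain_shares(counts):
    """Return each domain's fraction of the total number of examples."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


shares = domain_shares(DOMAIN_COUNTS)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The counts sum to 1,200,000, so math makes up roughly 71% of the mix, code roughly 21%, and science roughly 8%.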
The pipeline was tuned through more than 1,000 controlled ablation experiments, each run at a scale of 31.6k examples, which informed the choice made at each stage:
- Question sourcing: Selects the top one or two data sources per domain rather than maximizing diversity. Math examples use OpenMath-2-Math; code uses StackExchange CodeGolf and OpenCodeReasoning; science uses Physics.SE and organic chemistry PDFs.
- Filtering: Math and science questions undergo LLM-based "response-length" filtering using GPT-4.1-mini; code questions are filtered for difficulty by GPT-4o-mini based on an ICPC rubric.
- Deduplication and answer sampling: The math and science domains apply exact string deduplication; the code domain skips deduplication. In all domains, sixteen distinct answers per question are sampled from the teacher for SFT augmentation.
- Answer filtering: All samples from the teacher are used; empirical testing shows that strategies like LLM verification or unit tests do not improve performance.
- Teacher selection: QwQ-32B, a 32B RL-trained reasoning model, is used for generating all CoT traces and answers.
- Decontamination: To prevent evaluation data leakage, a question is excluded if it overlaps with any evaluation benchmark question at ≥75% Indel similarity or shares any 13-gram with it.
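The decontamination step in the list above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it takes Indel similarity to be the common normalized form, 1 minus the insert/delete edit distance divided by the combined string length, and it builds 13-grams from lowercased, whitespace-split words. The function names are hypothetical.

```python
def word_ngrams(text, n=13):
    """Set of word-level n-grams (whitespace tokenization is an assumption)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_similarity(a, b):
    """Normalized Indel similarity in [0, 1] (assumed definition)."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    # Insert/delete-only edit distance equals total length minus twice the LCS.
    indel_distance = total - 2 * lcs_length(a, b)
    return 1.0 - indel_distance / total


def is_contaminated(question, benchmark_questions, threshold=0.75, n=13):
    """Flag a question that matches any benchmark item on either criterion."""
    q_grams = word_ngrams(question, n)
    for ref in benchmark_questions:
        if q_grams & word_ngrams(ref, n):
            return True
        if indel_similarity(question, ref) >= threshold:
            return True
    return False


benchmark = ["What is 2 + 3?"]
print(is_contaminated("What is 2 + 2?", benchmark))
print(is_contaminated("Prove that the sum of two even integers is even.", benchmark))
```

The quadratic-time LCS is fine for a demonstration; at the scale of a million questions a production pipeline would more plausibly rely on an optimized string-matching library and an n-gram index.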
3. Distillation Strategy and Training Protocol
OpenThinker3-7B is trained via direct supervised fine-tuning of Qwen-2.5-7B-Instruct on the OpenThoughts3 dataset, using a standard cross-entropy loss over the teacher’s generated completions (CoT and answer tokens). No reinforcement learning or value-head losses are involved. Classic knowledge distillation with temperature scaling was considered, but it was outperformed by vanilla cross-entropy for CoT distillation.
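To make the objective concrete, here is a small pure-Python sketch of a cross-entropy loss that averages only over completion positions. Excluding prompt positions through a 0/1 `loss_mask` is an assumption based on the statement that the loss covers the teacher's completions; the function names and the toy logits are illustrative and do not come from the released training code.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def completion_cross_entropy(logits_per_position, target_ids, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1."""
    total, count = 0.0, 0
    for logits, target, keep in zip(logits_per_position, target_ids, loss_mask):
        if keep:
            total -= log_softmax(logits)[target]
            count += 1
    return total / count if count else 0.0


# Toy example: vocabulary of 4 tokens, 3 positions.
# Position 0 belongs to the prompt (masked out); positions 1-2 are completion.
logits = [[0.0, 0.0, 0.0, 0.0],
          [2.0, 0.5, 0.1, 0.1],
          [0.1, 0.1, 3.0, 0.2]]
targets = [1, 0, 2]
mask = [0, 1, 1]
print(completion_cross_entropy(logits, targets, mask))
```

In a real training stack the same effect is usually obtained by marking prompt positions with an ignore label so that the framework's cross-entropy skips them.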
Key training parameters for the full-scale run include the AdamW optimizer with no weight decay, cosine-decay learning rate scheduling with 10% linear warmup, a batch size of 512, five epochs, and greedy sequence packing; the β₁, β₂, and peak learning rate values are given in Guha et al. (4 Jun 2025). Training takes 25,000 A100 GPU hours, with an additional 22,000 H100 GPU hours spent on data annotation via QwQ-32B (Guha et al., 4 Jun 2025).
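The schedule and packing step named above can be illustrated with a short sketch. Both functions are assumptions about reasonable implementations rather than the released training code: `lr_at_step` takes the peak learning rate as a caller-supplied argument (no specific value is assumed here), and `greedy_pack` uses first-fit-decreasing, which is one common greedy packing heuristic and not necessarily the variant the authors used.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_fraction=0.10, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


def greedy_pack(lengths, max_len):
    """First-fit-decreasing packing of sequence lengths into bins of max_len.

    Returns a list of bins, each a list of indices into `lengths`.
    """
    bins = []  # each entry: [remaining_capacity, [sequence indices]]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        if lengths[i] > max_len:
            raise ValueError(f"sequence {i} is longer than max_len")
        for b in bins:
            if b[0] >= lengths[i]:
                b[0] -= lengths[i]
                b[1].append(i)
                break
        else:
            # No existing bin has room, so open a new one.
            bins.append([max_len - lengths[i], [i]])
    return [b[1] for b in bins]


# Schedule relative to a peak of 1.0, sampled at a few steps of a 1000-step run.
for step in (0, 99, 100, 550, 1000):
    print(step, round(lr_at_step(step, 1000, peak_lr=1.0), 4))

# Pack five toy sequences into windows of 8 tokens.
print(greedy_pack([5, 3, 4, 2, 6], max_len=8))
```

Packing shorter sequences into shared windows reduces padding, which matters when completions include long chains of thought of widely varying length.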
4. Benchmark Results and Comparative Performance
OpenThinker3-7B is evaluated with Evalchemy across 12 benchmarks, including advanced mathematical, coding, and science tasks (AIME 2025, LiveCodeBench 06/24-01/25, and GPQA Diamond). Reported metrics are mean accuracy with standard errors not exceeding 0.7 percentage points. The model achieves:
- 53.3% on AIME 2025
- 51.7% on LiveCodeBench 06/24-01/25
- 53.7% on GPQA Diamond
These results represent a +12.4 percentage point average improvement over DeepSeek-R1-Distill-Qwen-7B, and +2.1 points over the next-best open-data 8B model. Sampling multiple answers per question (16× augmentation) is identified as a driver of substantial downstream gains. Notably, supervised fine-tuning on QwQ-32B outputs outperforms using a nominally "stronger" teacher on held-out benchmarks (Guha et al., 4 Jun 2025).
| Model | AIME 25 | LCB 06/24–01/25 | GPQA-D | Avg. |
|---|---|---|---|---|
| OpenThinker3-7B | 53.3 | 51.7 | 53.7 | 55.3 |
| DeepSeek-R1-Distill-Qwen-7B | 39.7 | 30.7 | 24.6 | 24.0 |
| Next-best open-data 8B model | 50.7 | 44.3 | 52.9 | 52.9 |
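The "mean accuracy with standard error" reporting convention used in this section can be sketched as below, assuming the standard error is taken across repeated evaluation runs of the same benchmark. The per-run numbers are invented purely for illustration and are not results from the paper.

```python
import math


def mean_and_standard_error(values):
    """Sample mean and standard error of the mean (needs at least 2 values)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # unbiased estimate
    return mean, math.sqrt(variance / n)


# Hypothetical per-run accuracies in percent; NOT results from the paper.
runs = [52.0, 54.0, 53.0, 55.0, 51.0]
mean, sem = mean_and_standard_error(runs)
print(f"{mean:.1f} ± {sem:.2f}")
```

Averaging over several sampled runs matters most for small benchmarks, where a single question changes the score by several points.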
5. Key Methodological Findings and Ablation Insights
Several outcomes from pipeline ablations and controlled experiments delineate data-centric best practices for SFT-based reasoning models:
- Utilizing 16 sampled answers per question for SFT scaling produces large accuracy gains with minimal complexity.
- Restricting data mixing to top 1–2 sources per domain, rather than maximizing diversity, improves overall dataset quality.
- LLM-based question filtering strategies outperform classical text embedding or classifier (e.g., FastText) approaches.
- Contrary to prior art, further answer filtering (using LLM verification, unit tests, or majority-vote) does not yield performance benefits; all generated answers are used directly.
- The empirically optimal teacher for SFT is not necessarily the highest-scoring model on the target benchmarks; QwQ-32B produces better SFT results than the higher-ranked DeepSeek-R1 (Guha et al., 4 Jun 2025).
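The first and fourth findings above (sample 16 answers per question, keep all of them) amount to a very simple data-expansion loop, sketched here. The `teacher` callable and `toy_teacher` stub are hypothetical stand-ins for a real call to the teacher model; no actual inference API is assumed.

```python
def sample_answers(questions, teacher, samples_per_question=16):
    """Expand each question into several teacher completions, keeping all of them.

    `teacher` is any callable (question, sample_index) -> completion text.
    No verification or majority-vote filtering is applied.
    """
    dataset = []
    for question in questions:
        for sample_index in range(samples_per_question):
            dataset.append({
                "prompt": question,
                "sample_index": sample_index,
                "completion": teacher(question, sample_index),
            })
    return dataset


def toy_teacher(question, sample_index):
    """Hypothetical stub; a real pipeline would sample from the teacher model."""
    return f"[trace {sample_index}] reasoning and answer for: {question}"


rows = sample_answers(["Q1", "Q2"], toy_teacher)
print(len(rows))  # 2 questions x 16 samples = 32 rows
```

Because every sample is kept, the number of training rows grows linearly with `samples_per_question` at no extra cost in question sourcing or filtering.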
6. Limitations and Open Directions
The work on OpenThinker3-7B leaves several questions open and has known limitations:
- The effect of tailoring the data-generation recipe to each domain versus cross-domain averaging remains unexplored.
- It is uncertain whether further scaling of data and model size leads to diminishing returns, especially as student model performance converges toward that of the teacher.
- The interaction between dataset scale, answer diversity, and question diversity at larger scales is not fully understood.
- All OpenThinker generations to date show a degradation in safety alignment; approaches to reinforce safety without impairing reasoning capability remain an open challenge.
- The potential for reinforcement learning from human feedback (RLHF) or actor-critic distillation to further improve performance is a subject for future research (Guha et al., 4 Jun 2025).
7. Release, Reproducibility, and Community Impact
The full data artifacts (OpenThoughts3-1.2M), model weights, prompts, and training code for OpenThinker3-7B are publicly released at https://openthoughts.ai, enabling thorough reproducibility and facilitating targeted interventions in model and data recipes. This open release directly addresses the opacity of state-of-the-art reasoning research, providing a foundation for further benchmarking, scaling, and methodological innovation in large-scale, open-source reasoning systems (Guha et al., 4 Jun 2025).